1 Introduction
Imagine that a person walks into a hospital with a broken arm. The first question from healthcare personnel would be “How did you break your arm?” rather than “Do you have a cold?”, because the answer reveals information relevant to this patient. Human experts dynamically acquire information based on their current understanding of the situation. Automating this expertise of asking relevant questions is difficult. In other applications, such as online questionnaires, most existing systems either present exhaustive questions (Lewenberg et al., 2017; Shim et al., 2018)
or rely on extremely time-consuming manual labeling work to build a decision tree over a reduced number of questions
(Zakim et al., 2008). This wastes the valuable time of experts or users (patients). An automated solution for personalized, dynamic information acquisition has great potential to save much of this time in many real-life applications.

What are the technical challenges in building an intelligent information acquisition system? Missing data is a key issue: taking the questionnaire scenario as an example, at any point in time we only observe a small subset of answers, yet have to reason about possible answers to the remaining questions. We thus need an accurate probabilistic model that can perform inference given a variable subset of observed answers. Another key problem is deciding what to ask next
: this requires assessing the worth of each possible question or measurement, the exact computation of which is intractable. Moreover, unlike traditional active learning methods, which select instances to label, we select individual features; existing methods are therefore not directly applicable. In addition, these traditional methods often do not scale to the large volumes of data available in many practical cases
(Settles, 2012).

We propose the EDDI (Efficient Dynamic Discovery of high-value Information) framework as a scalable information acquisition system for any given task. We assume that information acquisition is always associated with a cost. Given a task, such as estimating customers’ experience or assessing population health status, we dynamically decide which piece of information to acquire next. The framework is very general, and the information can take any form, such as answers to questions or values of a lab test. Our contributions are:

We propose a novel efficient information acquisition framework, EDDI (Section 3). To enable EDDI, we contribute technically:

A partial amortized inference method with different specifications for the inference network (Section 3.2).
We extend a current amortized inference method, the variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014), to account for partial observations. The resulting method, which we call Partial VAE, is inspired by set formulations of data (Qi et al., 2017; Zaheer et al., 2017). The Partial VAE, as a probabilistic framework, is highly scalable and serves as the base for the EDDI framework. Moreover, the Partial VAE is generic and can be used on its own as a nonlinear probabilistic framework for missing-data imputation.

An information-theoretic acquisition function with an efficient approximation, yielding a novel variable-wise active learning method (Section 3.3).
Based on the Partial VAE, we select the unobserved variable that contributes most to the task, such as health assessment, evaluated using mutual information. This acquisition function has no analytical solution, and we derive an efficient approximation.


We demonstrate the performance of EDDI in various settings and apply it in real-life healthcare scenarios (Section 4).

We first show the superior performance of the Partial VAE framework on an image inpainting task (Section 4.1).
We then use 6 different datasets from the University of California, Irvine (UCI) Machine Learning Repository (Dheeru & Karra Taniskidou, 2017) to demonstrate the behavior of EDDI, comparing it with multiple baseline methods (Section 4.2).
Finally, we evaluate EDDI on two real-life healthcare applications: risk assessment in intensive care (Section 4.3) and public health assessment with a national survey (Section 4.4), where traditional methods without amortized inference do not scale. EDDI shows clear improvements in both applications.

2 Related work
EDDI requires a method that handles partially observed data to enable dynamic variable-wise active learning. We thus review related methods for handling partial observations and for active learning.
2.1 Partial Observation
Missing data entries are common in many real-life applications, which has led to a long history of research on dealing with missing data (Rubin, 1976; Dempster et al., 1977). We describe existing methods below.
Traditional methods without amortization
Prediction-based methods have shown advantages for missing-value imputation (Scheffer, 2002). Efficient matrix-factorization-based methods have recently been applied (Keshavan et al., 2010; Jain et al., 2010; Salakhutdinov & Mnih, 2008), where the observations are assumed to decompose as a product of low-dimensional matrices. In particular, many probabilistic frameworks with various distributional assumptions (Salakhutdinov & Mnih, 2008; Blei et al., 2003) have been used for missing-value imputation (Yu et al., 2016; Hamesse et al., 2018) and for recommender systems, where unlabeled items are predicted (Stern et al., 2009; Wang & Blei, 2011; Gopalan et al., 2014).
Probabilistic matrix factorization has been used in an active variable selection framework, the dimensionality reduction active learning model (DRAL) (Lewenberg et al., 2017). These traditional methods suffer from limited model capacity since they are commonly linear. Additionally, they do not scale to large volumes of data and are thus usually not applicable in real-world applications. For example, Lewenberg et al. (2017) test the performance of their method on a single user due to the heavy computational cost of traditional inference methods for probabilistic matrix factorization.
Utilizing Amortized Inference
Amortized inference (Kingma & Welling, 2014; Rezende et al., 2014; Zhang et al., 2017) has significantly improved the scalability of probabilistic models such as variational autoencoders (VAEs). In the case of partially observed data, amortized inference is of particular interest due to the need for fast test-time inference. Wu et al. (2018) employ traditional non-amortized inference to perform partial inference with a pre-trained VAE at test time. Amortized inference is only used during training, assuming the training dataset is fully observed; at test time, traditional inference is used to infer missing data entries from partial observations using the pre-trained model. In this way, only training time is reduced. The model is restrictive since it does not scale at test time, and a fully observed training set is not available in many applications.
Nazabal et al. (2018) use zero imputation for amortized inference for both training and test sets with missing data entries. Zero imputation is a generic and straightforward method that first fills the missing entries with zeros and then feeds the imputed data to the inference network. Its drawback is that it introduces bias when the data are not missing completely at random, which leads to a poorly learned model; we also observe artifacts when using it for the image inpainting task. Finally, independently of our work, Garnelo et al. (2018) explore interpreting variational autoencoders (amortized inference) as stochastic processes, which also handles partial observations per se.
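To make the zero-imputation idea concrete, a minimal sketch follows (the function name and signature are ours, not from Nazabal et al. (2018)); the ZIm variant used later in Section 4 additionally appends the observation mask as extra input features:

```python
import numpy as np

def zero_impute(x, mask, append_mask=False):
    """Fill unobserved entries (mask == 0) with zeros before feeding the
    vector to the inference network; optionally append the mask itself
    (the ZIm variant), so the network can tell zeros from missingness."""
    imputed = np.where(mask.astype(bool), x, 0.0)
    if append_mask:
        return np.concatenate([imputed, mask.astype(float)])
    return imputed
```

Appending the mask removes the ambiguity between an observed value of zero and a missing entry, which is precisely the failure mode that biases plain zero imputation.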
2.2 Active Learning
Traditional Active Learning
Active learning, also referred to as experimental design, aims to obtain optimal performance with fewer selected data points (or experiments) (Lindley, 1956; MacKay, 1992; Settles, 2012). Traditional active learning aims to select the next data point to label. Many information-theoretic approaches have shown promising results in various settings with different acquisition functions (MacKay, 1992; McCallumzy & Nigamy, 1998; Houlsby et al., 2011). These methods commonly assume that fully observed data exist, and the acquisition decision is instance-wise. Little work has dealt with missing values within instances. Zheng & Padmanabhan (2002) handle missing values by first imputing them with traditional non-probabilistic methods (Little & Rubin, 1987), but theirs is still an instance-wise active learning framework.
Different from traditional active learning, our proposed framework performs variable-wise active learning for each instance. In this setting, information-theoretic acquisition functions require a new design as well as non-trivial approximations. The most closely related work is the aforementioned DRAL (Lewenberg et al., 2017), which also performs variable-wise active learning for each instance.
Sequential variable selection
Active sequential feature selection is in great demand, especially in cost-sensitive applications, and several methods have been proposed.
Melville et al. (2004) and Saar-Tsechansky et al. (2009) have designed objectives to select any feature from any instance to minimize the cost of achieving high accuracy. This framework is very general; however, its realization relies on various heuristics and suffers from limited scalability. Most recently,
Shim et al. (2018) employ reinforcement learning for sequential feature selection, where an agent takes feature-selection actions or decides to stop.
3 Method
In this section, we first formalize the active variable selection problem that we aim to solve. We then present our Partial VAE to model and perform inference on partial observations. Finally, we complete the EDDI framework by presenting our acquisition function and its estimation method.
3.1 Problem formulation
The core problem that we address in this paper is the following active variable selection problem. Let $\mathbf{x} = (x_1, \dots, x_d)$ be a set of random variables with probability density $p(\mathbf{x})$. Furthermore, let a subset of the variables $\mathbf{x}_O$, $O \subset \{1, \dots, d\}$, be observed while the variables $\mathbf{x}_U$, $U = \{1, \dots, d\} \setminus O$, are unobserved. We assume that we can query the value of any variable $x_i$ for $i \in U$. The goal of active variable selection is to query a sequence of variables in $U$ so as to predict a quantity of interest $\mathbf{x}_\phi$ as accurately as possible while simultaneously performing as few queries as possible, where $\mathbf{x}_\phi$ can be any (random) function of $\mathbf{x}$. This problem, in the simplified myopic setting, can be formalized as that of proposing the next variable $x_{i^*}$ to be queried by maximizing a reward function $R$, i.e.

$$i^* = \arg\max_{i \in U} R(i, \mathbf{x}_O), \qquad (1)$$

where $R(i, \mathbf{x}_O)$ quantifies the merit of our prediction of $\mathbf{x}_\phi$ given $\mathbf{x}_O$ and $x_i$. Furthermore, the reward can quantify other properties important to the problem, e.g., the cost of acquiring $x_i$.
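The myopic selection rule above can be sketched as a greedy loop; the reward function and the query mechanics below are hypothetical placeholders for the quantities defined in Section 3.3:

```python
def myopic_selection(reward, observed, unobserved, budget):
    """Greedily query the variable maximizing the reward at each step.

    reward(i, observed) -> float is any reward estimate; `observed` maps
    variable index -> value. In a real system, the chosen variable would
    be queried and its value added to `observed`; here we only record
    the chosen order.
    """
    order = []
    observed = dict(observed)
    unobserved = set(unobserved)
    for _ in range(budget):
        if not unobserved:
            break
        best = max(unobserved, key=lambda i: reward(i, observed))
        order.append(best)
        unobserved.remove(best)
        observed[best] = None  # placeholder: value filled in by the query
    return order
```

For example, with a toy reward that prefers low-index variables, `myopic_selection(lambda i, obs: -i, {}, {0, 1, 2}, 2)` returns the order `[0, 1]`.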
3.2 Partial Amortization of Inference Queries
We first introduce how to establish a generative probabilistic model of the random variables $\mathbf{x}$ that is capable of handling a variable number of unobserved (missing) variables. Our approach, named the Partial VAE, is based on the variational autoencoder (VAE), which enables inference to scale to large volumes of data. A VAE uses a generative model (decoder) $p(\mathbf{x} \mid \mathbf{z})$ that generates observations $\mathbf{x}$ given latent variables $\mathbf{z}$, and an inference model (encoder) $q(\mathbf{z} \mid \mathbf{x})$ that infers the latent state given data. A VAE is trained by maximizing an evidence lower bound (ELBO), which is equivalent to minimizing the KL divergence between $q(\mathbf{z} \mid \mathbf{x})$ and $p(\mathbf{z} \mid \mathbf{x})$.
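For concreteness, a single-sample Monte Carlo estimate of the ELBO for a Gaussian-encoder, Bernoulli-decoder VAE can be sketched as follows (a simplified stand-in, not our actual implementation; the decoder interface is hypothetical):

```python
import numpy as np

def elbo_estimate(x, mu_z, var_z, decode, rng):
    """Single-sample ELBO = E_q[log p(x|z)] - KL(q(z|x) || N(0, I)).

    decode(z) -> per-dimension Bernoulli means p(x|z); the encoder is
    summarized by its diagonal-Gaussian statistics (mu_z, var_z).
    """
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    z = mu_z + np.sqrt(var_z) * rng.standard_normal(mu_z.shape)
    p = np.clip(decode(z), 1e-6, 1 - 1e-6)
    log_lik = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    # Closed-form KL between diagonal Gaussian q and the standard normal prior
    kl = 0.5 * np.sum(var_z + mu_z**2 - 1.0 - np.log(var_z))
    return log_lik - kl
```

When the approximate posterior equals the prior (`mu_z = 0`, `var_z = 1`), the KL term vanishes and the ELBO reduces to the expected log-likelihood.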
A VAE is not directly applicable to data with missing values. Consider a partitioning that divides the variables $\mathbf{x}$ into observed variables $\mathbf{x}_O$ and unobserved variables $\mathbf{x}_U$. In this setting, we would like to efficiently and accurately infer $p(\mathbf{z} \mid \mathbf{x}_O)$ and $p(\mathbf{x}_U \mid \mathbf{x}_O)$. One challenge is that there are many possible partitionings of $\mathbf{x}$; therefore, the classic approach of training a VAE with a variational bound and an amortized inference network is no longer directly applicable. We propose to extend amortization to this partial inference situation.
Partial VAE
In a VAE, we assume a factorized structure for $p(\mathbf{x} \mid \mathbf{z})$, i.e.

$$p(\mathbf{x}, \mathbf{z}) = p(\mathbf{z}) \prod_{i=1}^{d} p(x_i \mid \mathbf{z}). \qquad (2)$$
This implies that given $\mathbf{z}$, the observed variables $\mathbf{x}_O$ are conditionally independent of $\mathbf{x}_U$. Therefore,

$$p(\mathbf{x}_U \mid \mathbf{x}_O) = \int p(\mathbf{x}_U \mid \mathbf{z}) \, p(\mathbf{z} \mid \mathbf{x}_O) \, d\mathbf{z}, \qquad (3)$$
and inference about $\mathbf{x}_U$ reduces to inference about $\mathbf{z}$. The key object of interest in this setting is therefore $p(\mathbf{z} \mid \mathbf{x}_O)$, i.e., the posterior over the shared latent variables given the observed variables $\mathbf{x}_O$. Once knowledge about $p(\mathbf{z} \mid \mathbf{x}_O)$ is obtained, we can draw correct inferences about $\mathbf{x}_U$. To approximate $p(\mathbf{z} \mid \mathbf{x}_O)$ we introduce an auxiliary variational inference network $q(\mathbf{z} \mid \mathbf{x}_O)$ and define a partial variational lower bound,

$$\log p(\mathbf{x}_O) \ge \mathcal{L}_{\text{partial}} = \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z} \mid \mathbf{x}_O)} \big[ \log p(\mathbf{x}_O \mid \mathbf{z}) + \log p(\mathbf{z}) - \log q(\mathbf{z} \mid \mathbf{x}_O) \big]. \qquad (4)$$

This bound, $\mathcal{L}_{\text{partial}}$, depends only on the observation $\mathbf{x}_O$, which may vary between data points. We call the auxiliary distribution $q(\mathbf{z} \mid \mathbf{x}_O)$ the partial inference net, since it takes a set of partially observed variables whose length may vary. Specifying $q(\mathbf{z} \mid \mathbf{x}_O)$ requires a distribution over random partitionings $(O, U)$ of the variables.
Amortized Inference with partial observations
Inspired by the PointNet (PN) approach for point cloud classification (Qi et al., 2017; Zaheer et al., 2017), we specify the approximate distribution $q(\mathbf{z} \mid \mathbf{x}_O)$ through a permutation-invariant set function encoding:

$$c(\mathbf{x}_O) = g\big(h(\mathbf{s}_1), h(\mathbf{s}_2), \dots, h(\mathbf{s}_{|O|})\big), \qquad (5)$$
where $|O|$ is the number of observed variables, $\mathbf{s}_i = [\mathbf{e}_i, x_i]$, and $\mathbf{e}_i$ is the identity vector of the $i$-th feature. There are many ways to define $\mathbf{e}_i$. Naively, it could be the coordinates of points in a point cloud, or a one-hot embedding of the question index in a questionnaire. Depending on the problem setting, it can be beneficial to learn $\mathbf{e}_i$ as an embedding of the identity of the variable, either with or without a naive encoding as input. In this work, we treat $\mathbf{e}_i$ as an unknown embedding, which is optimized during the training process.

We use a neural network $h(\cdot)$
to map the input $\mathbf{s}_i$ from $\mathbb{R}^{M+1}$ to $\mathbb{R}^{K}$, where $M$ is the dimension of each $\mathbf{e}_i$, $x_i$ is a scalar, and $K$ is the latent space size. Key to the PN structure is the permutation-invariant aggregation operation $g(\cdot)$, such as max-pooling or summation. In this way, the mapping $c(\mathbf{x}_O)$ is invariant to permutations of the elements of $\mathbf{x}_O$, which can have arbitrary length. Finally, the fixed-size code $c$ is fed into an ordinary amortized inference net that transforms the code into the statistics of a multivariate Gaussian distribution to approximate
$p(\mathbf{z} \mid \mathbf{x}_O)$. The procedure is illustrated in the first dashed box in Figure 1; this is our basic Partial VAE method.

We further generalize this basic Partial VAE method in two ways. First, we generalize the specification of the input $\mathbf{s}_i$. Existing set-based methods construct the input as the concatenation of the feature value $x_i$ and the feature identity embedding $\mathbf{e}_i$. We propose to construct $\mathbf{s}_i = \mathbf{e}_i * x_i$ (an element-wise product) instead of the concatenation. This formulation generalizes both the naive zero-imputation (ZI) VAE (Nazabal et al., 2018) and the aforementioned PointNet (PN) parameterization of the Partial VAE. We call this approach the PointNet Plus (PNP) specification of the Partial VAE. The theoretical considerations relating ZI to PNP are presented in Appendix C.1.
Second, we generalize PN and PNP by recurrently reusing the code to enlarge the capacity of PN. Figure 1 shows the mechanism of the recurrent PN with two recurrent steps, using the concatenated $\mathbf{s}_i$ as an example. The first step is the same as the PN setting, producing the $K$-dimensional output $c^{(t)}$, where $t$ is the recurrent step index. For the second step, we concatenate the learned $c^{(1)}$ to $\mathbf{s}_i$ to form the new input for the next recurrent step. There can be an arbitrary number of recurrent steps. Within each recurrent step, the parameters of the neural network are shared across elements; however, different steps have different parameters. With a single recurrent step, we recover the basic PN Partial VAE setting. For the rest of the paper, when using PN, we always refer to this recurrent generalization.
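A minimal sketch of the permutation-invariant encoding in Equation (5), with sum aggregation and a one-layer $h$ (a stand-in for the multi-layer networks we actually train; all sizes, names, and the PNP padding below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 4, 8                      # embedding dimension M, code dimension K
E = rng.normal(size=(10, M))     # per-feature embeddings e_i (learned in practice)
W = rng.normal(size=(M + 1, K))  # weights of a one-layer h (a real h is an MLP)

def h(s):
    """Shared element-wise map h: R^{M+1} -> R^K (one ReLU layer here)."""
    return np.maximum(W.T @ s, 0.0)

def encode(obs, mode="PN"):
    """Permutation-invariant code c = g(h(s_1), ..., h(s_|O|)) with g = sum.

    obs: dict {feature index i: observed value x_i}. PN concatenates
    [e_i, x_i]; the PNP-style branch instead multiplies the embedding by
    the value (padded here to keep h's input size fixed).
    """
    parts = []
    for i, x in obs.items():
        if mode == "PN":
            s = np.concatenate([E[i], [x]])
        else:  # PNP-style input: element-wise product e_i * x_i
            s = np.concatenate([E[i] * x, [x]])
        parts.append(h(s))
    return np.sum(parts, axis=0)  # order-independent aggregation
```

Because the aggregation is a sum, feeding the observed variables in any order yields the same code, and any subset size is handled without architectural changes.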
3.3 Efficient Dynamic Discovery of Highvalue Information
We now cast the active variable selection problem (1) as adaptive Bayesian experimental design, utilizing $p(\mathbf{x}_U \mid \mathbf{x}_O)$ as inferred by the Partial VAE. Algorithm 1 summarizes the EDDI framework.
Information Reward
We design a variable selection acquisition function following information-theoretic Bayesian experimental design (Lindley, 1956; Bernardo, 1979). Lindley (1956) provides a generic formulation of Bayesian experimental design as maximization of the expected Shannon information. Bernardo (1979) generalizes it by considering the decision-task context.
For a given task, we may be interested in statistics of some variables $\mathbf{x}_\phi$, where $\phi \subseteq U$. Given a new instance (user), assume we have observed $\mathbf{x}_O$ so far for this instance and need to select the next variable $x_i$ (an element of $\mathbf{x}_U$) to observe. Following Bernardo (1979), we select $x_i$ by maximizing

$$R(i, \mathbf{x}_O) = \mathbb{E}_{x_i \sim \hat{p}(x_i \mid \mathbf{x}_O)} \, \mathrm{KL}\big(\hat{p}(\mathbf{x}_\phi \mid x_i, \mathbf{x}_O) \,\|\, \hat{p}(\mathbf{x}_\phi \mid \mathbf{x}_O)\big). \qquad (6)$$

In our paper, we mainly consider the case where a subset of observations of interest represents the statistics $\mathbf{x}_\phi$. Sampling from $p(x_i \mid \mathbf{x}_O)$ is approximated by $\hat{p}(x_i \mid \mathbf{x}_O)$, defined by the following process in the Partial VAE: first sample $\mathbf{z} \sim q(\mathbf{z} \mid \mathbf{x}_O)$, and then $x_i \sim p(x_i \mid \mathbf{z})$. The same applies to $\hat{p}(\mathbf{x}_\phi \mid x_i, \mathbf{x}_O)$ appearing in Equation 9.
Efficient approximation of the Information reward
The Partial VAE allows us to sample $x_i \sim \hat{p}(x_i \mid \mathbf{x}_O)$. However, the KL term in Equation 6,

$$\mathrm{KL}\big(\hat{p}(\mathbf{x}_\phi \mid x_i, \mathbf{x}_O) \,\|\, \hat{p}(\mathbf{x}_\phi \mid \mathbf{x}_O)\big) = \mathbb{E}_{\hat{p}(\mathbf{x}_\phi \mid x_i, \mathbf{x}_O)}\big[\log \hat{p}(\mathbf{x}_\phi \mid x_i, \mathbf{x}_O) - \log \hat{p}(\mathbf{x}_\phi \mid \mathbf{x}_O)\big], \qquad (7)$$

is intractable to evaluate, since both $\hat{p}(\mathbf{x}_\phi \mid x_i, \mathbf{x}_O)$ and $\hat{p}(\mathbf{x}_\phi \mid \mathbf{x}_O)$ are intractable. For high-dimensional $\mathbf{x}_\phi$, entropy estimation is difficult, and the entropy term depends on $x_i$ and hence cannot be ignored. In the following, we show how to approximate this expression.
Our proposal is based on the observation that analytic solutions of KL divergences are available for specific variational distribution families of $q(\mathbf{z} \mid \cdot)$ (such as the Gaussian distributions commonly used in VAEs). Instead of calculating the information reward in $\mathbf{x}$ space, one can equivalently perform the calculation in $\mathbf{z}$ space (cf. Appendix A.1):

$$R(i, \mathbf{x}_O) = \mathbb{E}_{x_i \sim p(x_i \mid \mathbf{x}_O)} \, \mathrm{KL}\big(p(\mathbf{z} \mid x_i, \mathbf{x}_O) \,\|\, p(\mathbf{z} \mid \mathbf{x}_O)\big) - \mathbb{E}_{\mathbf{x}_\phi, x_i \sim p(\mathbf{x}_\phi, x_i \mid \mathbf{x}_O)} \, \mathrm{KL}\big(p(\mathbf{z} \mid \mathbf{x}_\phi, x_i, \mathbf{x}_O) \,\|\, p(\mathbf{z} \mid \mathbf{x}_\phi, \mathbf{x}_O)\big). \qquad (8)$$

Note that Equation 8 is exact. We then apply the Partial VAE approximations $q(\mathbf{z} \mid \cdot) \approx p(\mathbf{z} \mid \cdot)$ and $\hat{p} \approx p$. This leads to the final approximation of the information reward:

$$\hat{R}(i, \mathbf{x}_O) = \mathbb{E}_{x_i \sim \hat{p}(x_i \mid \mathbf{x}_O)} \, \mathrm{KL}\big(q(\mathbf{z} \mid x_i, \mathbf{x}_O) \,\|\, q(\mathbf{z} \mid \mathbf{x}_O)\big) - \mathbb{E}_{\mathbf{x}_\phi, x_i \sim \hat{p}(\mathbf{x}_\phi, x_i \mid \mathbf{x}_O)} \, \mathrm{KL}\big(q(\mathbf{z} \mid \mathbf{x}_\phi, x_i, \mathbf{x}_O) \,\|\, q(\mathbf{z} \mid \mathbf{x}_\phi, \mathbf{x}_O)\big). \qquad (9)$$
With this approximation, the KL divergences in Equation 9 can often be computed analytically in the Partial VAE setting, for example under a Gaussian parameterization. The only Monte Carlo sampling required is a single set of samples $\mathbf{x}_\phi, x_i \sim \hat{p}(\mathbf{x}_\phi, x_i \mid \mathbf{x}_O)$, which can be shared across the different KL terms in Equation 9.
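A sketch of the resulting Monte Carlo estimator of Equation (9): the KL between diagonal Gaussians is computed analytically, and one shared set of imputation samples is reused across both KL terms. The `posterior` and sampler interfaces are hypothetical stand-ins for the trained Partial VAE:

```python
import numpy as np

def gauss_kl(mu1, var1, mu2, var2):
    """Analytic KL(N(mu1, diag var1) || N(mu2, diag var2))."""
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def info_reward(posterior, sample_xi, sample_xphi, x_o, n_samples=50, seed=0):
    """Monte Carlo estimate of the information reward (Eq. 9, sketch).

    posterior(obs) -> (mu, var) of the Gaussian q(z | obs);
    sample_xi / sample_xphi draw imputations from the generative model
    given the current observations (dicts of index -> value).
    """
    rng = np.random.default_rng(seed)
    mu_o, var_o = posterior(x_o)
    total = 0.0
    for _ in range(n_samples):
        xi = sample_xi(x_o, rng)
        xphi = sample_xphi({**x_o, **xi}, rng)      # shared across both KL terms
        mu_i, var_i = posterior({**x_o, **xi})
        total += gauss_kl(mu_i, var_i, mu_o, var_o)  # first KL in Eq. 9
        mu_a, var_a = posterior({**x_o, **xi, **xphi})
        mu_b, var_b = posterior({**x_o, **xphi})
        total -= gauss_kl(mu_a, var_a, mu_b, var_b)  # second (subtracted) KL
    return total / n_samples
```

As a sanity check, if observing $x_i$ never moves the posterior, both KL terms vanish and the estimated reward is zero, so the variable would (correctly) not be worth querying.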
4 Experiments
We evaluate our proposed EDDI framework in various settings. We first assess the Partial VAE component of EDDI alone on an image inpainting task, both qualitatively and quantitatively (Section 4.1). We compare our two proposed PN-based Partial VAEs with the zero-imputing (ZI) VAE (Nazabal et al., 2018). We report results using 5 recurrent steps for the PN setting and 1 recurrent step for the PNP setting, as these provide the best performance (cf. Appendix B.1); the same applies in the rest of the paper. Additionally, we modify the ZI VAE to take as additional input a mask matrix indicating which variables are currently observed. We name this method ZIm VAE.
We then demonstrate the performance of the entire EDDI framework on datasets from the UCI repository (Section 4.2), as well as in two real-life application scenarios: risk assessment in intensive care (Section 4.3) and public health assessment with a national health survey (Section 4.4). We compare the performance of EDDI, using four different Partial VAE settings, with two baselines. The first baseline is the random active feature selection strategy (denoted RAND), which randomly picks the next variable to observe. The second baseline is the single best strategy (denoted SING), which finds a single, globally optimal order of picking variables; this order is then applied to all data points. SING uses the objective function of Equation (9) to find the optimal ordering by averaging over all the data.
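The SING baseline can be sketched as a greedy construction of one global order using a reward averaged over all data points (all interfaces below are hypothetical placeholders for the per-point reward of Equation (9)):

```python
def single_best_order(reward, dataset, variables):
    """Build one global acquisition order by greedily picking, at each
    step, the variable with the highest reward averaged over all points.

    reward(i, point, observed) -> float; dataset is a list of dicts
    mapping variable index -> value; the same order is applied to all
    points at test time.
    """
    order, remaining = [], set(variables)
    observed = [dict() for _ in dataset]
    while remaining:
        avg = {i: sum(reward(i, p, o) for p, o in zip(dataset, observed)) / len(dataset)
               for i in remaining}
        best = max(avg, key=avg.get)
        order.append(best)
        remaining.remove(best)
        for p, o in zip(dataset, observed):
            o[best] = p.get(best)   # reveal the chosen variable everywhere
    return order
```

Unlike EDDI, the resulting order cannot adapt to an individual instance, which is why SING serves as a non-personalized reference point.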
Table 1: Train and test ELBOs on partially observed MNIST for image inpainting with Partial VAE.

| Method | VAE-full | ZI | ZIm | PN | PNP |
|---|---|---|---|---|---|
| Train ELBO | 95.05 | 113.64 | 113.64 | | |
| Test ELBO (Rnd) | 101.46 | 116.01 | 118.61 | 125.37 | 114.01 |
| Test ELBO (Reg) | 101.46 | 130.61 | 123.87 | 128.02 | 113.19 |
4.1 Image inpainting with Partial VAE
Inpainting Random Missing Values
We demonstrate our method on a partially observed MNIST dataset (LeCun, 1998). The same settings are used for all methods (see Appendix B.1 for details). During training, we remove a random portion (uniformly sampled between 0% and 70%) of the pixels. The first two rows of Table 1 show training and test ELBOs for all algorithms on this partially observed dataset. Additionally, we show an ordinary VAE (VAE-full) trained on the fully observed dataset as an ideal reference. Among all Partial VAE methods, the PNP approach performs best.
Inpainting Regions
We then consider inpainting large contiguous regions of images, which evaluates the capability of the Partial VAEs to produce all possible outcomes with well-calibrated uncertainties. With the same trained models as before, we remove the upper region of pixels from each image in the test set and evaluate the average likelihoods of the models. The last row of Table 1 shows the test ELBOs in this case; the PNP-based Partial VAE again performs better than the other settings. Given only the lower half of a digit, the number cannot be identified uniquely. ZI (Figure 2(b)) fails to cover the different possible modes due to its limitations in posterior inference. ZIm (Figure 2(c)) is capable of producing multiple modes; however, some of the generated samples are not consistent with the given part (e.g., some digits of 2 are generated). Our proposed PN (Figure 2(d)) and PNP (Figure 2(e)) are capable of recovering the different modes while remaining consistent with the observations.
4.2 EDDI on UCI datasets
Table 2: Average ranking of AUIC over all UCI datasets (standard errors in parentheses; lower is better).

| Method | ZI | ZIm | PNP | PN |
|---|---|---|---|---|
| EDDI | 4.75 (0.02) | 4.60 (0.02) | 4.42 (0.02) | 4.55 (0.02) |
| Random | 7.12 (0.03) | 7.22 (0.03) | 6.97 (0.03) | 7.08 (0.03) |
| Single best | 8.00 (0.02) | 7.78 (0.02) | 7.68 (0.02) | 7.77 (0.02) |
Given the effectiveness of our proposed Partial VAE, we now demonstrate the performance of the full EDDI framework in comparison with random selection (RAND) and single optimal ordering (SING). We first apply EDDI to 6 different UCI datasets (cf. Appendix B.2) (Dheeru & Karra Taniskidou, 2017). We report results for EDDI with all four different specifications of the Partial VAE (ZI, ZIm, PN, PNP).
All Partial VAE models are first trained on partially observed UCI datasets where a random portion of the variables is removed. We then actively select variables for each test point, starting with no observations. In all UCI datasets, we randomly sample a portion of the data as the test set. All experiments are repeated ten times.
Taking the PNP-based setting as an example, Figure 3(c) shows the negative test log-likelihood on $\mathbf{x}_\phi$ at each variable selection step for three different datasets, where $\mathbf{x}_\phi$ is defined by the UCI task. We call this curve the information curve (IC). We see that EDDI obtains information efficiently: it achieves the same negative test log-likelihood with less than half of the variables. Single optimal ordering also improves upon random ordering; however, it is less efficient than EDDI, since EDDI performs active learning for each data instance and is thus “personalized”. Figure 5 shows an example of the decision processes of EDDI and SING. The first step of EDDI overlaps largely with SING; from the second step onward, EDDI makes “personalized” decisions.
We also present the average performance over all datasets with different settings. The area under the information curve (AUIC) can be used to compare performance across models and strategies; a smaller AUIC value (which may be positive or negative) indicates better performance. We present the AUIC for each dataset in Appendix B.2.2. However, because different datasets have different scales of test likelihood and different numbers of variables (steps), it is not fair to average AUIC across datasets to compare overall performance. We therefore compare the 12 methods (cross-combinations of the four Partial VAE models with the three variable selection strategies) by their average AUIC ranking: for each test data point in each UCI dataset, the 12 methods are ranked by AUIC value, and these rankings are averaged over all test points and all datasets. Table 2 summarizes the average ranking results. EDDI outperforms the other variable selection strategies for all Partial VAE settings. Among the Partial VAE settings, the PN-based settings perform better than the ZI-based settings.
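The AUIC and the ranking comparison can be sketched as follows (illustrative only; our experiments rank per test point within each dataset, as described above):

```python
import numpy as np

def auic(neg_log_likelihoods):
    """Area under the information curve: negative test log-likelihood
    after each acquisition step, integrated over steps with the
    trapezoid rule. Curves that drop faster enclose less area."""
    y = np.asarray(neg_log_likelihoods, dtype=float)
    return float(np.sum((y[:-1] + y[1:]) / 2.0))

def average_ranking(auic_matrix):
    """Rank methods (columns) per test point (rows) by AUIC, then average.

    auic_matrix: shape (n_points, n_methods); rank 1 = smallest AUIC.
    Returns the mean rank of each method across points.
    """
    ranks = np.argsort(np.argsort(auic_matrix, axis=1), axis=1) + 1
    return ranks.mean(axis=0)
```

Averaging ranks rather than raw AUIC values makes the comparison insensitive to each dataset's likelihood scale and number of steps.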
Comparison with nonamortized method
Additionally, we compare EDDI with DRAL (Lewenberg et al., 2017), the state-of-the-art method for the same problem setting. As discussed in Section 2, DRAL is linear and computationally expensive; the DRAL paper tested the method on only a single test data point due to this limitation. We compare DRAL with EDDI on the Boston Housing dataset with ten randomly selected test points. Results are shown in Figure 5, where EDDI significantly outperforms DRAL thanks to its more flexible Partial VAE model. Additionally, EDDI is 1000 times more efficient than DRAL, as shown in Table 3.
4.3 Risk assessment with MIMICIII
We now apply EDDI to risk assessment tasks using the Medical Information Mart for Intensive Care (MIMIC-III) database (Johnson et al., 2016). MIMIC-III is the most extensive publicly available clinical database, containing real-world records from over 40,000 critical care patients with 60,000 ICU stays. The risk assessment task is to predict final mortality. We preprocess the data for this task following Harutyunyan et al. (2017) (https://github.com/yerevann/mimic3benchmarks), resulting in a dataset of 21,139 patients. We treat the final mortality of a patient as a Bernoulli variable. For our task, we focus on variable selection, which corresponds to medical instrument selection; we thus further process the time-series variables into static variables by temporal averaging.
Figure 7 shows the information curves of the different strategies, using the PNP-based Partial VAE as an example (more results in Appendix B.3). Table 4 shows the average ranking of AUIC with different settings. In this application, EDDI significantly outperforms the other variable selection strategies for all Partial VAE settings, and the PNP-based setting performs best.
4.4 Public Health Assessment with NHANES
Finally, we apply our methods to public health assessment using NHANES 2015–2016 data (cdc, 2005). NHANES is a program with adaptable measurement components that assesses the health and nutritional status of adults and children in the United States. Every year, thousands of individuals of all ages are interviewed in their homes and complete the health examination component of the survey. The 2015–2016 NHANES data contains three major sections, the questionnaire interview, examinations, and lab tests, for 9,971 subjects in the publicly available version of this cycle. In our setting, we consider the whole set of lab test results (139 variables) as the target of interest, since they are expensive to obtain and reflect the subject’s health status, and we actively select questions from the extensive questionnaire (665 variables).
In NHANES, the entire questionnaire is divided into 73 groups. In practice, questions in the same group are often examined together; we therefore perform active variable selection at the group level: at each step, the algorithm selects one group to observe. This is more challenging than the experiments in previous sections, since it requires the generative model to simulate a whole group of unobserved variables in Equation (9) at once. When evaluating the test likelihood on the target variables of interest, we treat the variables in each group equally. For a fair comparison, the calculation of the area under the information curve (AUIC) is weighted by the size of the group chosen by the algorithm; specifically, AUIC is calculated after spline interpolation. The information curves in Figure 7, together with the AUIC statistics in Table 5, show that EDDI outperforms the other baselines. This experiment shows that EDDI is capable of performing active selection over a large pool of grouped variables to estimate a high-dimensional target.

5 Conclusion
In this paper, we present EDDI, a novel and efficient framework for dynamic, instance-wise active variable selection. Within the EDDI framework, we propose the Partial VAE, which performs amortized inference to handle missing data. The Partial VAE alone can be used as a nonlinear, computationally efficient probabilistic imputation method. Based on the Partial VAE, we design a variable-wise acquisition function for EDDI and derive a corresponding approximation method. EDDI demonstrates its effectiveness on active variable selection tasks across multiple real-world applications. In the future, we will extend the EDDI framework to handle more complicated scenarios, such as time series or the cold-start situation.
References
 cdc (2005) National health and nutrition examination survey, 2005. URL https://www.cdc.gov/nchs/nhanes/.
 Bernardo (1979) José M Bernardo. Expected information as expected utility. The Annals of Statistics, pp. 686–690, 1979.
 Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
 Dempster et al. (1977) Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (methodological), pp. 1–38, 1977.
 Dheeru & Karra Taniskidou (2017) Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
 Garnelo et al. (2018) Marta Garnelo, Dan Rosenbaum, Chris J Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J Rezende, and SM Eslami. Conditional neural processes. arXiv preprint arXiv:1807.01613, 2018.
 Gopalan et al. (2014) Prem K Gopalan, Laurent Charlin, and David Blei. Contentbased recommendations with poisson factorization. In Advances in Neural Information Processing Systems, pp. 3176–3184, 2014.

 Hamesse et al. (2018) Charles Hamesse, Paul Ackermann, Hedvig Kjellström, and Cheng Zhang. Simultaneous measurement imputation and outcome prediction for achilles tendon rupture rehabilitation. In ICML/IJCAI Joint Workshop on Artificial Intelligence in Health, 2018.
 Harutyunyan et al. (2017) Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017.
 Houlsby et al. (2011) Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.

 Jain et al. (2010) Prateek Jain, Raghu Meka, and Inderjit S Dhillon. Guaranteed rank minimization via singular value projection. In Advances in Neural Information Processing Systems, 2010.
 Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
 Keshavan et al. (2010) Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 2010.
 Kingma & Ba (2015) Diederik P. Kingma and Jimmy Lei Ba. Adam: a method for stochastic optimization. In International Conference on Learning Representations, pp. 1–13, 2015.
 Kingma & Welling (2014) Diederik P Kingma and Max Welling. Autoencoding variational bayes. In International Conference on Learning Representation, 2014.

 LeCun (1998) Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 Lewenberg et al. (2017) Yoad Lewenberg, Yoram Bachrach, Ulrich Paquet, and Jeffrey S Rosenschein. Knowing what to ask: A bayesian active learning approach to the surveying problem. In AAAI, pp. 1396–1402, 2017.
 Lindley (1956) Dennis V Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, pp. 986–1005, 1956.
 Little & Rubin (1987) RJA Little and DB Rubin. Statistical Analysis with Missing Data. J. Wiley, 1987.
 MacKay (1992) David JC MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590–604, 1992.
 McCallum & Nigam (1998) Andrew Kachites McCallum and Kamal Nigam. Employing EM and pool-based active learning for text classification. In International Conference on Machine Learning, pp. 359–367. Citeseer, 1998.

Melville et al. (2004) Prem Melville, Maytal Saar-Tsechansky, Foster Provost, and Raymond Mooney. Active feature-value acquisition for classifier induction. In International Conference on Data Mining, pp. 483–486. IEEE, 2004.
Nazabal et al. (2018) Alfredo Nazabal, Pablo M Olmos, Zoubin Ghahramani, and Isabel Valera. Handling incomplete heterogeneous data using VAEs. arXiv preprint arXiv:1807.03653, 2018.

Qi et al. (2017) Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660, 2017.
Ranganath et al. (2016) Rajesh Ranganath, Dustin Tran, and David Blei. Hierarchical variational models. In International Conference on Machine Learning, pp. 324–333, 2016.

Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
Rubin (1976) Donald B Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
 Saar-Tsechansky et al. (2009) Maytal Saar-Tsechansky, Prem Melville, and Foster Provost. Active feature-value acquisition. Management Science, 55(4):664–684, 2009.

Salakhutdinov & Mnih (2008) Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning, pp. 880–887. ACM, 2008.
Scheffer (2002) Judi Scheffer. Dealing with missing data. 2002.
 Settles (2012) Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
 Shim et al. (2018) Hajin Shim, Sung Ju Hwang, and Eunho Yang. Joint active feature acquisition and classification with variable-size set encoding. In Advances in Neural Information Processing Systems, 2018.
 Stern et al. (2009) David Stern, Ralf Herbrich, and Thore Graepel. Matchbox: Large scale Bayesian recommendations. In International World Wide Web Conference, 2009.
 Wang & Blei (2011) Chong Wang and David M Blei. Collaborative topic modeling for recommending scientific articles. In International Conference on Knowledge Discovery and Data Mining, pp. 448–456. ACM, 2011.
 Wu et al. (2018) Ga Wu, Justin Domke, and Scott Sanner. Conditional inference in pre-trained variational autoencoders via cross-coding. arXiv preprint arXiv:1805.07785, 2018.
 Yu et al. (2016) Hsiang-Fu Yu, Nikhil Rao, and Inderjit S Dhillon. Temporal regularized matrix factorization for high-dimensional time series prediction. In Advances in Neural Information Processing Systems, pp. 847–855, 2016.
 Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, pp. 3394–3404, 2017.
 Zakim et al. (2008) David Zakim, Niko Braun, Peter Fritz, and Mark Dominik Alscher. Underutilization of information and knowledge in everyday medical practice: Evaluation of a computer-based solution. BMC Medical Informatics and Decision Making, 8(1):50, 2008.
 Zhang et al. (2017) Cheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. Advances in variational inference. arXiv preprint arXiv:1711.05597, 2017.
 Zheng & Padmanabhan (2002) Zhiqiang Zheng and Balaji Padmanabhan. On active learning for data acquisition. In International Conference on Data Mining, pp. 562–569. IEEE, 2002.
Appendix A Additional Derivations
a.1 Information reward approximation
In our paper, given the VAE model and a partial inference network $q(\mathbf{z}|\mathbf{x}_O)$, the experimental design problem is formulated as maximization of the information reward

$$R(i, \mathbf{x}_O) = \mathbb{E}_{\mathbf{x}_i \sim \hat{p}(\mathbf{x}_i|\mathbf{x}_O)}\, \mathrm{KL}\left[\hat{p}(\mathbf{x}_\phi|\mathbf{x}_i, \mathbf{x}_O) \,\|\, \hat{p}(\mathbf{x}_\phi|\mathbf{x}_O)\right],$$

where $\hat{p}(\mathbf{x}_i|\mathbf{x}_O)$, $\hat{p}(\mathbf{x}_\phi|\mathbf{x}_i,\mathbf{x}_O)$ and $\hat{p}(\mathbf{x}_\phi|\mathbf{x}_O)$ are approximate conditional distributions given by Partial VAE models. Now we consider the problem of directly approximating $\mathrm{KL}\left[\hat{p}(\mathbf{x}_\phi|\mathbf{x}_i,\mathbf{x}_O) \,\|\, \hat{p}(\mathbf{x}_\phi|\mathbf{x}_O)\right]$.
Applying the chain rule of the KL-divergence to the joint distributions over $(\mathbf{x}_\phi, \mathbf{z})$, we have:

$$\mathrm{KL}\left[p(\mathbf{x}_\phi, \mathbf{z}|\mathbf{x}_i, \mathbf{x}_O) \,\|\, p(\mathbf{x}_\phi, \mathbf{z}|\mathbf{x}_O)\right] = \mathrm{KL}\left[p(\mathbf{x}_\phi|\mathbf{x}_i, \mathbf{x}_O) \,\|\, p(\mathbf{x}_\phi|\mathbf{x}_O)\right] + \mathbb{E}_{p(\mathbf{x}_\phi|\mathbf{x}_i, \mathbf{x}_O)} \mathrm{KL}\left[p(\mathbf{z}|\mathbf{x}_\phi, \mathbf{x}_i, \mathbf{x}_O) \,\|\, p(\mathbf{z}|\mathbf{x}_\phi, \mathbf{x}_O)\right].$$

Using again the KL-divergence chain rule, this time factorizing through $\mathbf{z}$ first (and noting that $\mathbf{x}_\phi$ is conditionally independent of $\mathbf{x}_i$ and $\mathbf{x}_O$ given $\mathbf{z}$), we have:

$$\mathrm{KL}\left[p(\mathbf{x}_\phi, \mathbf{z}|\mathbf{x}_i, \mathbf{x}_O) \,\|\, p(\mathbf{x}_\phi, \mathbf{z}|\mathbf{x}_O)\right] = \mathrm{KL}\left[p(\mathbf{z}|\mathbf{x}_i, \mathbf{x}_O) \,\|\, p(\mathbf{z}|\mathbf{x}_O)\right].$$

The KL-divergence term in the reward formula is now rewritten as follows:

$$\mathrm{KL}\left[p(\mathbf{x}_\phi|\mathbf{x}_i, \mathbf{x}_O) \,\|\, p(\mathbf{x}_\phi|\mathbf{x}_O)\right] = \mathrm{KL}\left[p(\mathbf{z}|\mathbf{x}_i, \mathbf{x}_O) \,\|\, p(\mathbf{z}|\mathbf{x}_O)\right] - \mathbb{E}_{p(\mathbf{x}_\phi|\mathbf{x}_i, \mathbf{x}_O)} \mathrm{KL}\left[p(\mathbf{z}|\mathbf{x}_\phi, \mathbf{x}_i, \mathbf{x}_O) \,\|\, p(\mathbf{z}|\mathbf{x}_\phi, \mathbf{x}_O)\right].$$

One can then plug in the Partial VAE inference approximation $q(\mathbf{z}|\cdot) \approx p(\mathbf{z}|\cdot)$. Finally, the information reward is approximated as:

$$\hat{R}(i, \mathbf{x}_O) = \mathbb{E}_{\mathbf{x}_i \sim \hat{p}(\mathbf{x}_i|\mathbf{x}_O)}\left[ \mathrm{KL}\left[q(\mathbf{z}|\mathbf{x}_i, \mathbf{x}_O) \,\|\, q(\mathbf{z}|\mathbf{x}_O)\right] - \mathbb{E}_{\mathbf{x}_\phi \sim \hat{p}(\mathbf{x}_\phi|\mathbf{x}_i, \mathbf{x}_O)} \mathrm{KL}\left[q(\mathbf{z}|\mathbf{x}_\phi, \mathbf{x}_i, \mathbf{x}_O) \,\|\, q(\mathbf{z}|\mathbf{x}_\phi, \mathbf{x}_O)\right] \right].$$
This new objective tries to maximize the shift of belief on the latent variables $\mathbf{z}$ caused by observing $\mathbf{x}_i$, while penalizing the information that cannot be absorbed by $\mathbf{x}_\phi$ (the penalty term). Moreover, it is more computationally efficient, since one set of samples can be shared across the different terms, and the KL-divergence between common parameterizations of the encoder (such as Gaussians and normalizing flows) can be computed exactly, without the need for approximate integrals. Note also that under this approximation, sampling $\mathbf{x}_i \sim \hat{p}(\mathbf{x}_i|\mathbf{x}_O)$ is implemented in the Partial VAE by first sampling $\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}_O)$ and then $\mathbf{x}_i \sim p(\mathbf{x}_i|\mathbf{z})$. The same applies for $\mathbf{x}_\phi$.
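As a concrete illustration, the leading KL term of this estimator has a closed form for Gaussian encoders and can be averaged over Monte Carlo samples of the candidate variable. The sketch below is a minimal illustration, not the paper's implementation: `encode` and `decode_sample` are hypothetical interfaces, and the penalty term is omitted for brevity.

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(q || p) between diagonal Gaussians, summed over dims."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def information_reward(encode, decode_sample, x_o, i, n_samples=50):
    """Monte Carlo estimate of the first term of the approximate reward:
    the expected KL between q(z | x_i, x_o) and q(z | x_o).

    encode(x_obs) -> (mu, var) of the Gaussian q(z | x_obs)   [hypothetical]
    decode_sample(x_obs, i) -> one sample of variable i given x_obs
    x_o: dict mapping observed variable indices to their values
    """
    mu_o, var_o = encode(x_o)                     # q(z | x_o)
    kls = []
    for _ in range(n_samples):
        x_i = decode_sample(x_o, i)               # x_i ~ p_hat(x_i | x_o)
        mu_oi, var_oi = encode({**x_o, i: x_i})   # q(z | x_i, x_o)
        kls.append(gaussian_kl(mu_oi, var_oi, mu_o, var_o))
    return float(np.mean(kls))
```

In practice one would score every unobserved variable this way and query the one with the largest estimated reward.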
Appendix B Additional Experimental Results
b.0.1 PN/PNP model structure determination using MNIST
In this section, we briefly discuss how we specify the model structure of our two PN-based models, PN and PNP. We focus on choosing the number of recurrent steps for recurrent PN and recurrent PNP (shown in Figure 1). For this purpose, we perform pilot runs of PN1, PN2, PN5, PNP1, PNP2 and PNP5 (PNx stands for a PN model with x recurrent steps) on the MNIST dataset for 300 iterations. Other model settings are consistent with Section B.1.1. The resulting negative test log-likelihoods are shown in Figure 8.
Based on Figure 8, the conclusion is clear: increasing the number of recurrent steps of PN improves performance. Therefore, for the rest of the paper, we choose PN5 as the PN model structure. Meanwhile, increasing the recurrent steps does not significantly improve PNP, so we use PNP1 as our PNP structure in the rest of the paper.
b.1 Image inpainting
b.1.1 Preprocessing and model details
For our MNIST experiment, we randomly draw 10% of the whole data to be our test set. Partial VAE models (ZI, ZIm, PNP and PNs) share the same architecture size with 20-dimensional diagonal Gaussian latent variables: the generator (decoder) is a 20-200-500-500-D fully connected neural network with ReLU activations (where D is the data dimension). The inference nets (encoder) share the same structure of D-500-500-200-40, mapping the observed data into distributional parameters of the latent space. For the PN-based parameterizations, we use a 500-dimensional feature mapping parameterized by a single-layer neural network, and 20-dimensional ID vectors (see Section 3.2) for each variable. We choose the symmetric operator to be the basic summation operator.
During training, we apply Adam optimization (Kingma & Ba, 2015) with the default hyperparameter settings, a learning rate of 0.001 and a batch size of 100. We generate a partially observed MNIST dataset by artificially introducing missingness at random during training: we first draw a missing rate parameter from a uniform distribution and randomly choose that fraction of variables to be unobserved. This step is repeated at each iteration. We train our models for 3K iterations.
b.1.2 Image generation of partial VAEs
b.2 UCI datasets
We applied EDDI to six UCI datasets: Boston Housing, Concrete Compressive Strength, Energy Efficiency, Wine Quality, Kin8nm, and Yacht Hydrodynamics. The variables of interest are chosen to be the target variables of each UCI dataset in the experiment.
b.2.1 Preprocessing and model details
All data are normalized and then scaled between 0 and 1. For each of the 10 repetitions in total, we randomly draw 10% of the data to be our test set. Partial VAE models (ZI, ZIm, PNP and PNs) share the same architecture size with 10-dimensional diagonal Gaussian latent variables: the generator (decoder) is a 10-50-100-D neural network with ReLU activations (where D is the data dimension). The inference nets (encoder) share the same structure D-100-50-20, mapping the observed data into distributional parameters of the latent space. For the PN-based parameterizations, we further use a 20-dimensional feature mapping parameterized by a single-layer neural network and 10-dimensional ID vectors (please refer to Section 3.2) for each variable. We choose the symmetric operator to be the basic summation operator.
As in the image inpainting experiment, we apply Adam optimization during training with the default hyperparameter settings and a batch size of 100, and inject random missingness as before. We train our models for 3K iterations.
During active learning, we draw 50 samples in order to estimate the expectation in Equation (8). Negative log likelihoods of the target variable are also estimated using 50 samples.
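A minimal sketch of the random-missingness scheme used during training (here and in the image experiments). The uniform range [0, 1] for the missing-rate parameter is an assumption, since the exact range is not restated in this section:

```python
import numpy as np

def random_missingness(batch, rng):
    """Mask variables at random: draw a missing-rate parameter from a uniform
    distribution, then mark roughly that fraction of entries as unobserved.
    Returns (masked_batch, mask); unobserved entries are zeroed out."""
    n, d = batch.shape
    rate = rng.uniform(0.0, 1.0)                  # redrawn at each iteration
    mask = (rng.uniform(size=(n, d)) >= rate).astype(batch.dtype)
    return batch * mask, mask
```

Redrawing the rate at every iteration exposes the model to all levels of missingness, which matters because at test time EDDI starts from an almost empty observation set.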
b.2.2 Tables on Area under the information curve (AUIC)
The area under the information curve (AUIC) on each dataset can then be used to compare performance across models and strategies. AUIC is defined as the sum, over acquisition steps $t$, of the average negative test log-likelihood of the target variable given $\mathbf{x}_{O_t}$, where $\mathbf{x}_{O_t}$ is the basket of variables observed at step $t$. By definition, a smaller AUIC value (which may be positive or negative) indicates better performance. We present the AUIC for each dataset in Tables 6, 7, 8, 9, 10, and 11.
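This summation can be sketched directly; the curve values below are hypothetical, not taken from the tables:

```python
import numpy as np

def auic(avg_neg_log_likelihoods):
    """Area under the information curve: the sum, over acquisition steps, of
    the average negative test log-likelihood of the target variable.
    Smaller is better; the value may be positive or negative."""
    return float(np.sum(avg_neg_log_likelihoods))

# A strategy whose curve drops faster accumulates less area.
fast_curve = [1.0, 0.6, 0.4]   # hypothetical per-step values
slow_curve = [1.0, 0.9, 0.7]
assert auic(fast_curve) < auic(slow_curve)
```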
Method  ZI  ZIm  PNP  PN 

EDDI  25.03 (0.09 )  24.74 (0.15 )  24.49 (0.24 )  24.54 (0.10) 
Random  23.85 (0.14 )  24.52 (0.08 )  23.36 (0.18 )  23.43 (0.14 ) 
Single best  24.77 (0.12 )  23.62 (0.20 )  23.71 (0.15 )  23.82 (0.17 ) 
Method  ZI  ZIm  PNP  PN 

EDDI  12.07 (0.04 )  12.07 (0.05 )  12.09 (0.07 )  12.15 (0.06) 
Random  11.00 (0.09 )  12.03 (0.03 )  11.17 (0.07 )  12.07 (0.06 ) 
Single best  12.03 (0.06 )  11.13 (0.10 )  12.07 (0.06 )  12.11 (0.04 ) 
Method  ZI  ZIm  PNP  PN 

EDDI  13.63 (0.06 )  14.56 (0.07 )  14.49 (0.06 )  14.65 (0.08) 
Random  9.89 (0.15 )  14.53 (0.06 )  11.49 (0.16 )  11.67 (0.17 ) 
Single best  12.79 (0.07 )  11.62 (0.08 )  14.36 (0.09 )  14.56 (0.08 ) 
Method  ZI  ZIm  PNP  PN 

EDDI  14.24 (0.06 )  15.04 (0.02 )  15.04 (0.05 )  15.13 (0.05) 
Random  11.07 (0.20 )  15.10 (0.05 )  12.38 (0.9 )  12.57 (0.14 ) 
Single best  13.85 (0.10 )  12.55 (0.10 )  15.02 (0.05 )  14.99 (0.03 ) 
Method  ZI  ZIm  PNP  PN 

EDDI  20.31 (0.02 )  20.25 (0.01 )  20.18 (0.04 )  20.20 (0.02) 
Random  19.40 (0.04 )  20.24 (0.02 )  19.29 (0.04 )  19.41 (0.02 ) 
Single best  20.28 (0.02 )  19.35 (0.04 )  20.23 (0.03 )  20.19 (0.01 ) 
Method  ZI  ZIm  PNP  PN 

EDDI  14.37 (0.02 )  14.50 (0.02 )  14.57 (0.02 )  14.53 (0.02) 
Random  12.83 (0.03 )  14.50 (0.02 )  13.03 (0.04 )  12.93 (0.04 ) 
Single best  14.43 (0.03 )  12.91 (0.03 )  14.57 (0.02 )  14.50 (0.03 ) 
b.2.3 Additional plots of PN, ZI and ZIm on UCI datasets
Here we present additional plots of the information curve as we actively choose the variables to consider. Figure 10 presents the results for the Boston Housing, Energy and Wine datasets for the three approaches, i.e. PN, ZI and masked ZI.
b.2.4 Illustration of decision process of EDDI (Boston Housing as example)
The decision process driven by the active selection of variables in the EDDI framework is illustrated in Figure 11 and Figure 12 for the Boston Housing dataset, for the PNP and PNP-with-single-best-ordering approaches, respectively.
For completeness, we provide details regarding the abbreviations of the variables used in the Boston Housing dataset that appear in both figures.

CR  per capita crime rate by town

PRD  proportion of residential land zoned for lots over 25,000 sq.ft.

PNB  proportion of nonretail business acres per town.

CHR  Charles River dummy variable (1 if tract bounds river; 0 otherwise)

NOC  nitric oxides concentration (parts per 10 million)

ANR  average number of rooms per dwelling

AOUB  proportion of owneroccupied units built prior to 1940

DTB  weighted distances to five Boston employment centres

ARH  index of accessibility to radial highways

TAX  fullvalue propertytax rate per $10,000

OTR  pupil-teacher ratio by town

PB  proportion of blacks by town

LSP  % lower status of the population
b.3 MIMIC-III
Here we provide additional results of our approach on the MIMICIII dataset.
b.3.1 Preprocessing and model details
For our active learning experiments on the MIMIC-III dataset, we chose the variable of interest to be the binary mortality indicator of the dataset. All data (except the binary mortality indicator) are normalized and then scaled between 0 and 1. We transformed the categorical variables into real-valued variables using the dictionary deduced from (Johnson et al., 2016), which makes use of the actual medical implications of each possible value. The binary mortality indicator is treated as a Bernoulli variable and a Bernoulli likelihood function is applied. For each repetition (of the 5 in total), we randomly draw 10% of the whole data to be our test set. Partial VAE models (ZI, ZIm, PNP and PNs) share the same architecture size with 10-dimensional diagonal Gaussian latent variables: the generator (decoder) is a 10-50-100-D neural network with ReLU activations (where D is the data dimension). The inference nets (encoder) share the same structure of D-100-50-20, mapping the observed data into distributional parameters of the latent space. Additionally, for PN-based parameterizations, we further use a 20-dimensional feature mapping parameterized by a single-layer neural network, and 10-dimensional ID vectors (please refer to Section 3.2) for each variable. We choose the symmetric operator to be the basic summation operator.
Adam optimization and random missingness are applied as in the previous experiments. We train our models for 3K iterations. During active learning, we draw 50 samples in order to estimate the expectation in Equation (8). Negative log likelihoods of the target variable are also estimated using 50 samples.
b.3.2 Additional plots of ZI, PN and ZIm on MIMIC III
Figure 13 shows the information curves of active variable selection on the risk assessment task for MIMICIII as produced by the three approaches, i.e. ZI, PN and masked ZI.
b.4 NHANES
b.4.1 Preprocessing and model details
For our active learning experiments on the NHANES dataset, we chose the variables of interest to be the lab test result section of the dataset. All data are normalized and scaled between 0 and 1. Categorical variables are transformed into real-valued variables using the code that comes with the dataset, which makes use of the actual ordering of variables in the questionnaire. Then, for each repetition (of the 5 in total), we randomly draw 8000 data points as the training set and 100 data points as the test set. All Partial VAE models (ZI, ZIm, PNP and PNs) use Gaussian likelihoods, with a diagonal Gaussian inference model (encoder). Partial VAE models share the same architecture size with 20-dimensional diagonal Gaussian latent variables: the generator (decoder) is a 20-50-100-D neural network. The inference nets (encoder) share the same structure of D-100-50-20, mapping the observed data into distributional parameters of the latent space. Additionally, for PN-based parameterizations, we further use a 20-dimensional feature mapping parameterized by a single-layer neural network, and 100-dimensional ID vectors (please refer to Section 3.2) for each variable. We choose the symmetric operator to be the basic summation operator.
Adam optimization and random missingness are applied as in the previous experiments. We train all models for 1K iterations. During active learning, 10 samples are drawn to estimate the expectation in Equation (9). Negative log likelihoods of the target variable are also estimated using 10 samples.
Appendix C Additional Theoretical Contributions
c.1 Zero imputing as a PointNet
Here we present how the zero imputing (ZI) and PointNet (PN) approaches relate.
Zero imputation with inference net
In ZI, the natural parameter of the approximate posterior (e.g., the Gaussian parameters in variational autoencoders) is approximated using a fully connected neural network whose first layer takes the form

$$h_j = \sigma\Big(\sum_{i=1}^{D} w_{ij} x_i + b_j\Big), \quad j = 1, \dots, K,$$

where $K$ is the number of hidden units, $\mathbf{x}$ is the input image and $x_i$ is the value of the $i$-th pixel. To deal with partially observed data $\mathbf{x}_O$, ZI simply sets all unobserved $x_i$ to zero, and uses the full inference model to perform approximate inference.
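A minimal numerical sketch of zero imputation feeding the first layer of the full inference net (ReLU is chosen here purely as an illustrative activation):

```python
import numpy as np

def zi_first_layer(x, observed_mask, W, b):
    """Zero imputing: unobserved inputs are replaced by zeros, then the full
    inference network is applied as if every pixel were observed."""
    x_imputed = x * observed_mask               # zeros where unobserved
    return np.maximum(0.0, W @ x_imputed + b)   # first hidden layer (ReLU)
```

Note that an observed value of exactly zero and a missing value become indistinguishable under this scheme, which is precisely the weakness the PN view addresses.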
PointNet parameterization
The PN approach approximates the natural parameter by a permutation-invariant set function

$$c(\mathbf{x}_O) = g\big(\sigma(\{h(\mathbf{s}_i)\}_{i \in O})\big),$$

where $\mathbf{s}_i = [\mathbf{e}_i, x_i]$, $\mathbf{e}_i$ is the $K$-dimensional embedding/ID/location vector of the $i$-th pixel, $\sigma$ is a symmetric operation such as max-pooling or summation, and $h$ is a nonlinear feature mapping from $\mathbb{R}^{K+1}$ to $\mathbb{R}^{M}$ (we will always refer to the $h(\mathbf{s}_i)$ as feature maps). In the current version of the Partial VAE implementation, where a Gaussian approximation is used, we set $M$ according to the dimension of the latent variables. We set $\sigma$ to be the element-wise summation operator, i.e. a mapping defined by

$$\sigma(\{\mathbf{h}_i\}_{i \in O}) = \sum_{i \in O} \mathbf{h}_i.$$

This parameterization corresponds to products of multiple ExpFam factors.
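A sketch of this permutation-invariant encoding with summation as the symmetric operator; `h` below is an arbitrary stand-in feature map, not the paper's trained network:

```python
import numpy as np

def pn_encode(values, embeddings, h, observed):
    """PN encoding of a partial observation: map each observed (embedding,
    value) pair through the shared feature map h, then aggregate with
    element-wise summation, a symmetric (permutation-invariant) operator."""
    feats = [h(np.concatenate([embeddings[i], [values[i]]])) for i in observed]
    return np.sum(feats, axis=0)
```

Because summation is symmetric, presenting the observed variables in any order yields the same code, which is what lets the encoder consume arbitrary subsets of variables.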
From PN to ZI
To derive the PN correspondence of the above ZI network, we define the PN feature map $h([\mathbf{e}_i, x_i]) = x_i \mathbf{e}_i$, where $x_i \mathbf{e}_i$ is the output feature of pixel $i$. The above PN parameterization is also permutation invariant; setting $M = K$, $\mathbf{e}_i = \mathbf{w}_{i,:}$ (the $i$-th row of the ZI weight matrix) and $g(\cdot) = \sigma(\cdot + \mathbf{b})$, the resulting PN model is equivalent to the ZI neural network.
Generalizing ZI from PN perspective
In the ZI approach, the missing values are replaced with zeros. However, this ad hoc approach does not distinguish missing values from actually observed zero values. In practice, being able to distinguish between these two is crucial for improving uncertainty estimation during partial inference. On the other hand, we have found that the PN-based Partial VAE experiences difficulties in training. To alleviate both issues, we propose a generalization of the ZI approach from the PN perspective. One advantage of PN is that it sets the feature maps of the unobserved variables to zero instead of the related weights. As discussed before, these two approaches are equivalent only if the factors are linear. More generally, we can parameterize the PN by

$$c(\mathbf{x}_O) = g\Big(\sum_{i \in O} h([\mathbf{e}_i, x_i])\Big),$$

where $h$ is a mapping from $\mathbb{R}^{K+1}$ to $\mathbb{R}^{M}$ defined by a neural network, and $g$ is a mapping from $\mathbb{R}^{M}$ to the natural-parameter space defined by another neural network.
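A toy sketch of this generalized parameterization with stand-in functions for the two networks (the real `h` and `g` would be trained neural networks):

```python
import numpy as np

def pnp_encode(values, embeddings, h_net, g_net, observed):
    """Generalized ZI from the PN perspective: a (possibly nonlinear) feature
    map h over [e_i, x_i] is summed across observed variables, and a second
    network g maps the aggregate to the posterior's natural parameters.
    Unobserved variables contribute nothing to the sum, so a missing value
    is no longer confused with an observed zero."""
    agg = sum(h_net(np.concatenate([embeddings[i], [values[i]]]))
              for i in observed)
    return g_net(agg)
```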
c.2 Approximation Difficulty of the Acquisition Function
Traditional variational approximation approaches provide the wrong approximation direction when applied in this case (resulting in an upper bound of the objective, which we maximize). Justification issues aside, (black box) variational approximation requires sampling from the approximate posterior, which leads to extra uncertainty and computation. Common approximation proposals suffer from the following problems:

Directly estimating the entropy via sampling is problematic for high-dimensional target variables.

Using the reversed information reward and then applying the ELBO (KL-divergence) does not make sense mathematically, since it results in an upper bound approximation of the (reversed) information objective; this is the wrong direction.

Ranganath's bound (Ranganath et al., 2016) on estimating the entropy also gives an upper bound of the objective, again the wrong direction.

All of the above methods additionally need samples from the latent space (therefore a second level of approximation is needed).
c.3 Connection of EDDI information reward with BALD
We briefly discuss the connection of the EDDI information reward with BALD (Houlsby et al., 2011) and MacKay's work (MacKay, 1992). Assuming the model is correct, the EDDI reward equals the conditional mutual information between the candidate variable and the target variables given the current observations, which is the quantity underlying both criteria. Note that, based on MacKay's relationship between entropy and KL-divergence reduction, the expected KL-divergence reward can equivalently be written as an expected reduction in posterior entropy; the same relationship holds for the latent-space form of the reward.