1 Introduction
Missing values pervade real-world machine learning (ML) applications, due to the noisy and uncertain nature of our environment. Consider autonomous driving as an example, where some sensors may be blocked at some point, leaving observations incomplete. Yet even a small portion of missing data can severely degrade the performance of well-trained models, leading to biased predictions
[Dekel et al.2010, Graham2012]. In this paper, we focus on the classification task where features are missing at prediction time. To alleviate the impact of missing data, it is common practice to substitute the missing values with plausible ones [Schafer1999, Little and Rubin2014]. As we will argue later, a drawback of such an approach is that it sometimes makes overly strong assumptions about the feature distribution. Furthermore, to be compatible with multiple types of classifiers, imputation methods tend to overlook how the multitude of possible imputed values would in turn affect the trained classifier, which can result in biased predictions.

To better address this issue, we propose a principled framework for handling missing features by reasoning about the classifier's expected output given the feature distribution. One obvious advantage is that it can be tailored to the given family of classifiers and feature distributions. In contrast, the popular mean imputation approach only coincides with the expected prediction for simple models (e.g., linear functions) under very strong independence assumptions about the feature distribution. We later show that calculating expected predictions with respect to arbitrary feature distributions is computationally highly intractable. To make our framework feasible in practice, we leverage generative-discriminative counterpart relationships to learn a joint distribution that can take expectations of its corresponding discriminative classifier. We call this problem
conformant learning. We then develop an algorithm, based on geometric programming, for a well-known example of such a relationship: naive Bayes and logistic regression. We call this specific algorithm naive conformant learning (NaCL).

Through an extensive empirical evaluation over five characteristically distinct datasets, we show that NaCL consistently achieves better estimates of the conditional probability, as measured by average cross entropy and classification accuracy under different percentages of missing features, than commonly used imputation methods. Lastly, we conduct a short case study on how our framework can be applied to generating classifier explanations.
2 The Expected Prediction Task
In this section, we describe our intuitive approach to making a prediction when features are missing and discuss how it relates to existing imputation methods. Then we study the computational hardness of our expected prediction task.
We use uppercase letters to denote features/variables and lowercase letters for their assignments. Sets of variables and their joint assignments are written in bold. Concatenation (e.g., $\mathbf{y}\mathbf{z}$) denotes the union of disjoint sets.
Suppose we have a model trained with features $\mathbf{X}$ but are now faced with the challenge of making a prediction without knowing all values $\mathbf{x}$. In this situation, a common solution is to impute certain substitute values for the missing data (for example, their means) [Little and Rubin2014]. However, the observed features provide information not only about the class but also about the missing features, yet this information is typically not taken into account by popular methods such as mean imputation.
We propose a very natural alternative solution: to utilize the feature distribution to reason about what a predictor is expected to return if it could observe the missing features.
Definition 1.
Let $f$ be a predictor and $P$ be a distribution over features $\mathbf{X}$. Given a partitioning of the features $\mathbf{X} = \mathbf{Y} \cup \mathbf{Z}$ and an assignment $\mathbf{y}$ to some of the features, the expected prediction task is to compute

$$\mathbb{E}_{\mathbf{z} \sim P(\mathbf{Z} \mid \mathbf{y})} \left[ f(\mathbf{y}\mathbf{z}) \right].$$
Nevertheless, (mean) imputation and the expected prediction task are related, but only under restrictive assumptions.
Example 1.
Let $f$ be a linear function; that is, $f(\mathbf{x}) = \sum_i w_i x_i$ for some weights $w_i$. Suppose $P$ is a distribution over $\mathbf{X}$ that assumes independence between features: $P(\mathbf{X}) = \prod_i P(X_i)$. Then, using linearity of expectation, the following holds for any partial observation $\mathbf{y}$:

$$\mathbb{E}_{\mathbf{z} \sim P(\mathbf{Z} \mid \mathbf{y})} \left[ f(\mathbf{y}\mathbf{z}) \right] = \sum_{X_i \in \mathbf{Y}} w_i y_i + \sum_{X_i \in \mathbf{Z}} w_i \, \mathbb{E}_{P}[X_i].$$
Hence, substituting the missing features with their means effectively computes the expected prediction of a linear model, provided the independence assumption holds. Furthermore, if $f$ is the true conditional probability of the labels and the features are generated by a fully factorized $P$, then classifying by comparing the expected prediction to 0.5 is Bayes optimal on the observed features.
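The equivalence above can be checked numerically. The following minimal sketch (all numbers here are illustrative, not from the paper) enumerates every completion of the missing binary features and verifies that the expected prediction of a linear model equals its mean-imputed prediction under a fully factorized distribution:

```python
import itertools

# Hypothetical fully factorized distribution over three binary features:
# p[i] is P(X_i = 1); features are assumed mutually independent.
p = [0.2, 0.7, 0.5]
w = [1.5, -2.0, 0.4]  # weights of the linear predictor f(x) = sum_i w_i x_i

def f(x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Observe X_0 = 1; features X_1 and X_2 are missing.
observed = {0: 1}

# Expected prediction: average f over all completions, weighted by P(z | y).
exp_pred = 0.0
for z in itertools.product([0, 1], repeat=2):
    x = [observed[0], z[0], z[1]]
    prob = (p[1] if z[0] else 1 - p[1]) * (p[2] if z[1] else 1 - p[2])
    exp_pred += prob * f(x)

# Mean imputation: substitute the missing features by their means E[X_i] = p[i].
mean_imputed = f([observed[0], p[1], p[2]])

assert abs(exp_pred - mean_imputed) < 1e-12
```

By linearity, both quantities equal $w_0 y_0 + w_1 \mathbb{E}[X_1] + w_2 \mathbb{E}[X_2] = 0.3$ here; the equivalence breaks as soon as the model is nonlinear, as the next example shows.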
Example 2.
Consider a logistic regression model $g(\mathbf{x}) = \sigma(f(\mathbf{x}))$, where $f$ is a linear function and $\sigma$ is the sigmoid function. Now, mean imputation no longer computes the expected prediction, even when the independence assumption of the previous example holds. In particular, if $\mathbf{y}$ is a partial observation such that $f(\mathbf{y}\mathbf{z})$ is positive for all $\mathbf{z}$, then the mean-imputed prediction is an over-approximation of the expected prediction:

$$\sigma\!\left( \mathbb{E}_{\mathbf{z} \sim P(\mathbf{Z} \mid \mathbf{y})} \left[ f(\mathbf{y}\mathbf{z}) \right] \right) \;\ge\; \mathbb{E}_{\mathbf{z} \sim P(\mathbf{Z} \mid \mathbf{y})} \left[ \sigma(f(\mathbf{y}\mathbf{z})) \right].$$
This follows from Jensen's inequality and the concavity of the sigmoid function on the positive reals; conversely, it is an under-approximation in the negative case.
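The Jensen gap is easy to exhibit numerically. This sketch (again with illustrative numbers) enumerates all completions of two missing binary features whose linear score is positive everywhere, and checks that mean imputation over-approximates the expected prediction:

```python
import itertools, math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

p = [0.6, 0.8]  # P(X_i = 1) for two missing binary features, independent
w = [1.0, 2.0]
bias = 0.5      # contribution of the observed part; f > 0 for every completion

def f(z):
    return bias + sum(wi * zi for wi, zi in zip(w, z))

# Expected prediction E[sigmoid(f(z))] under the factorized distribution.
exp_pred = sum(
    (p[0] if z0 else 1 - p[0]) * (p[1] if z1 else 1 - p[1]) * sigmoid(f((z0, z1)))
    for z0, z1 in itertools.product([0, 1], repeat=2)
)

# Mean-imputed prediction: sigmoid(E[f(z)]).
mean_imputed = sigmoid(bias + w[0] * p[0] + w[1] * p[1])

# f is positive for every completion, where the sigmoid is concave,
# so Jensen's inequality gives sigmoid(E[f]) >= E[sigmoid(f)].
assert mean_imputed >= exp_pred
```

Flipping the sign of the bias and weights would make $f$ negative everywhere and reverse the inequality, matching the under-approximation case.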
Example 1 showed how to efficiently take the expectation of a linear function w.r.t. a fully factorized distribution. Unfortunately, the expected prediction task is in general computationally hard, even on simple classifiers and distributions.
Proposition 1.
Taking the expectation of a sufficiently complex but efficient classifier w.r.t. a uniform distribution is #P-hard.

Suppose our classifier tests whether a logical constraint holds between the input features. Then asking whether there exists a positive example is NP-hard. The expected classification under a uniform distribution solves an even harder task, counting the solutions to the constraint, which is #P-hard.
Obviously, computing expectations, even of trivial classifiers, is as hard as probabilistic reasoning in the feature distribution, which is #P-hard for graphical models [Roth1996].
Proposition 2.
The expectation of a classifier that returns the value of a single feature, w.r.t. a distribution represented by a probabilistic graphical model, is #P-hard.
The previous propositions showed that the expected prediction task stays intractable even when we allow either the distribution or the classifier to be trivial.
Our next theorem states that the task is hard even for a simple linear classifier and a tractable distribution. (All proofs can be found in Appendix A.)
Theorem 1.
Computing the expectation of a logistic regression classifier over a naive Bayes distribution is NP-hard.
That is, the expected prediction task is hard even though logistic regression classification and probabilistic reasoning on naive Bayes models can both be done in linear time.
In summary, while the expected prediction task appears natural for dealing with missing data, its vast intractability provides a serious challenge, in particular compared to efficient alternatives such as imputation. Next, we investigate specific ways of practically overcoming this challenge.
3 Joint Distributions as Classifiers
While the expected prediction task is designed to address missing features in arbitrary predictors, there exists a family of classifiers that inherently supports missing features. Given a joint distribution $P(C, \mathbf{X})$ over the features $\mathbf{X}$ and class label $C$, we can classify a partial observation $\mathbf{y}$ simply by computing the conditional probability $P(C \mid \mathbf{y})$. In some sense, a joint distribution embeds a classifier for each subset of observed features. In fact, computing $P(C \mid \mathbf{y})$ is equivalent to computing the expected prediction where $f = P(C \mid \mathbf{X})$ and the feature distribution is given by the joint:

(1)  $P(c \mid \mathbf{y}) = \mathbb{E}_{\mathbf{z} \sim P(\mathbf{Z} \mid \mathbf{y})} \left[ P(c \mid \mathbf{y}\mathbf{z}) \right].$
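Equation 1 can be verified directly on a small model. The sketch below (with an illustrative two-feature naive Bayes distribution) computes $P(C{=}1 \mid \mathbf{y})$ once by marginalizing the joint, and once as the expected prediction of the full-feature classifier, and checks that they coincide:

```python
# A hypothetical naive Bayes model over two binary features.
prior = {1: 0.4, 0: 0.6}                 # P(C)
theta = {1: [0.9, 0.3], 0: [0.2, 0.6]}   # theta[c][i] = P(X_i = 1 | C = c)

def joint(c, x):
    pr = prior[c]
    for i, xi in enumerate(x):
        pr *= theta[c][i] if xi else 1 - theta[c][i]
    return pr

def conditional(c, x):
    return joint(c, x) / (joint(0, x) + joint(1, x))

# Classify the partial observation y = {X_0 = 1} by marginalizing X_1:
num = sum(joint(1, (1, z)) for z in (0, 1))
den = sum(joint(c, (1, z)) for c in (0, 1) for z in (0, 1))
direct = num / den                        # P(C=1 | X_0 = 1)

# Equation 1: the same value as the expected prediction of the
# full-feature classifier P(C=1 | x) under P(Z | y).
expected = sum(
    (sum(joint(c, (1, z)) for c in (0, 1)) / den)  # P(z | y)
    * conditional(1, (1, z))                       # P(C=1 | y, z)
    for z in (0, 1)
)

assert abs(direct - expected) < 1e-12
```

The identity holds because $\sum_{\mathbf{z}} P(\mathbf{z} \mid \mathbf{y}) \, P(c \mid \mathbf{y}\mathbf{z}) = \sum_{\mathbf{z}} P(c, \mathbf{z} \mid \mathbf{y}) = P(c \mid \mathbf{y})$.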
Nevertheless, the prevailing consensus is that, in practice, discriminatively training a classifier $P(C \mid \mathbf{X})$ should be preferred to generatively learning the joint $P(C, \mathbf{X})$, because it tends to achieve higher classification accuracy [Bouchard and Triggs2004].
There are many generative-discriminative pairs obtained by fitting the same family of probabilistic models to optimize either the joint or the conditional likelihood [Ng and Jordan2002]. A well-known example is naive Bayes and logistic regression. Naive Bayes models assume that the features are mutually independent given the class; the joint distribution is $P(C, \mathbf{X}) = P(C) \prod_i P(X_i \mid C)$. Under this assumption, marginal inference is efficient, and so is taking expectations in the case of missing features as in Equation 1.
Logistic regression is the discriminative counterpart of naive Bayes. It has parameters $\mathbf{w}$ and posits that $P(C{=}1 \mid \mathbf{x}) = \sigma\!\left( \sum_i w_i x_i \right)$. (Here, $\mathbf{x}$ also includes a dummy feature $x_0$ that is always 1, to correspond with the bias parameter $w_0$.)
Any naive Bayes classifier can be translated to an equivalent logistic regression classifier on fully observed features.
Definition 2.
We say a joint distribution $P(C, \mathbf{X})$ conforms with a classifier $g$ if their classifications agree: $P(c \mid \mathbf{x}) = g_c(\mathbf{x})$ for all $\mathbf{x}$ and $c$.
Lemma 1.
Given a naive Bayes distribution $P$, there is a unique logistic regression model that it conforms with. Additionally, this logistic regression model has the following weights:

$$w_0 = \log \frac{P(c_1)}{P(c_0)} + \sum_i \log \frac{1 - \theta_{i \mid c_1}}{1 - \theta_{i \mid c_0}}, \qquad w_i = \log \frac{\theta_{i \mid c_1} \left( 1 - \theta_{i \mid c_0} \right)}{\theta_{i \mid c_0} \left( 1 - \theta_{i \mid c_1} \right)}.$$

Here, $\theta_{i \mid c_1}$ and $\theta_{i \mid c_0}$ denote $P(X_i{=}1 \mid c_1)$ and $P(X_i{=}1 \mid c_0)$, respectively.


Consider, for example, the naive Bayes distribution in Figure 0(b). For all possible feature observations, its classification is equal to that of the logistic regression in Figure 0(a), whose weights are given by the above lemma (i.e., the distribution conforms with the logistic regression). Furthermore, a different naive Bayes distribution can translate into the same logistic regression; in fact, there can be infinitely many such naive Bayes distributions.
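The translation in Lemma 1 is straightforward to implement. The following sketch (with an illustrative three-feature model) converts naive Bayes parameters into logistic regression weights of the stated form and verifies that both classifiers agree on every complete feature vector:

```python
import itertools, math

# Hypothetical binary naive Bayes model over n = 3 features.
prior1 = 0.3                 # P(C = 1)
t1 = [0.8, 0.4, 0.6]         # P(X_i = 1 | C = 1)
t0 = [0.5, 0.7, 0.1]         # P(X_i = 1 | C = 0)

def nb_posterior(x):
    j1, j0 = prior1, 1 - prior1
    for i, xi in enumerate(x):
        j1 *= t1[i] if xi else 1 - t1[i]
        j0 *= t0[i] if xi else 1 - t0[i]
    return j1 / (j1 + j0)

# NB-to-LR translation of the form stated in Lemma 1:
#   w_0 = log P(C=1)/P(C=0) + sum_i log (1 - t1_i)/(1 - t0_i)
#   w_i = log [ t1_i (1 - t0_i) ] / [ t0_i (1 - t1_i) ]
w0 = math.log(prior1 / (1 - prior1)) + sum(
    math.log((1 - t1[i]) / (1 - t0[i])) for i in range(3))
w = [math.log(t1[i] * (1 - t0[i]) / (t0[i] * (1 - t1[i]))) for i in range(3)]

def lr_posterior(x):
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

# The two classifiers agree on every complete feature vector.
for x in itertools.product([0, 1], repeat=3):
    assert abs(nb_posterior(x) - lr_posterior(x)) < 1e-10
```

The agreement follows from writing the NB log-odds $\log \frac{P(c_1 \mid \mathbf{x})}{P(c_0 \mid \mathbf{x})}$ as a linear function of the binary $x_i$'s.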
Lemma 2.
Given a logistic regression model $g$ and fixed values for some of the naive Bayes parameters (one per feature), there exists a unique naive Bayes model that conforms with $g$ and agrees with those fixed values.
That is, given a logistic regression model there are infinitely many naive Bayes models that conform with it. Moreover, after fixing values for some of the parameters of the NB model, the conforming naive Bayes model is unique.
We can expect this phenomenon to generalize to other generative-discriminative pairs: given a conditional distribution, there are many possible feature distributions with which to define a joint distribution. For instance, the two naive Bayes distributions in Figure 0(b) conform with the same logistic regression yet assign different probabilities to feature observations. Hence, we wish to define which of these models is the "best". Naturally, we choose the one that best explains a given dataset of feature observations. (We assume i.i.d. sampled data. If a true distribution is known, we can equivalently minimize the KL divergence to it.)
Definition 3.
Let $g$ be a discriminative classifier and $\mathcal{D}$ be a dataset where each example is a joint assignment $\mathbf{x}$ to the features $\mathbf{X}$. Given a family $\mathcal{P}$ of distributions over $C$ and $\mathbf{X}$, let $\mathcal{C}(g) \subseteq \mathcal{P}$ denote the subset of distributions that conform with $g$. Then conformant learning on $\mathcal{D}$ is to solve the following optimization:

(2)  $\max_{P \in \mathcal{C}(g)} \; \sum_{\mathbf{x} \in \mathcal{D}} \log P(\mathbf{x}).$
The learned model thus conforms with $g$ and defines a feature distribution; therefore, we can take the expectation of $g$ via probabilistic inference. In other words, it attains the high classification performance of the given discriminative model while also returning sophisticated predictions under missing features. Specifically, conformant naive Bayes models can be used to efficiently take expectations of logistic regression classifiers. Note that this does not contradict Theorem 1, which considers arbitrary pairs of LR and NB models.
4 Naive Conformant Learning
In this section, we study a special case of conformant learning, naive conformant learning, and show how it can be solved as a geometric program.
A naive Bayes distribution is defined by a parameter set consisting of the class prior and $\theta_{i \mid c} = P(X_i{=}1 \mid c)$ for all features $X_i$ and classes $c$. Naive conformant learning outputs the naive Bayes distribution that maximizes the (marginal) likelihood of a given dataset $\mathcal{D}$ while conforming with a given logistic regression model $g$.
We will next show that the above problem can be formulated as a geometric program, an optimization problem of the form

min  $f_0(\mathbf{x})$
s.t.  $f_i(\mathbf{x}) \le 1$,  $h_j(\mathbf{x}) = 1$,

where each $f_i$ is a posynomial and each $h_j$ is a monomial. A monomial is a function of the form $c\, x_1^{a_1} \cdots x_n^{a_n}$ defined over positive real variables, where $c > 0$ and $a_i \in \mathbb{R}$. A posynomial is a sum of monomials. Every geometric program can be transformed into an equivalent convex program through a change of variables, and thus its global optimum can be found efficiently [Boyd et al.2007].
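The key to the convex reformulation is the logarithmic change of variables $u_i = \log x_i$, under which the log of a monomial becomes affine (and the log of a posynomial becomes a log-sum-exp of affine functions, which is convex). A tiny numeric check of that fact, not part of the paper's pipeline:

```python
import math, random

# A monomial g(x) = c * x1^a1 * x2^a2 over positive variables.
c, a1, a2 = 2.5, 1.7, -0.3

def g(x1, x2):
    return c * x1**a1 * x2**a2

# Under the change of variables u_i = log x_i, the log of the monomial
# is affine in u: log g = log c + a1*u1 + a2*u2.
random.seed(0)
for _ in range(100):
    u1, u2 = random.uniform(-3, 3), random.uniform(-3, 3)
    affine = math.log(c) + a1 * u1 + a2 * u2
    assert abs(math.log(g(math.exp(u1), math.exp(u2))) - affine) < 1e-9
```

Monomial equality constraints therefore become linear equalities after the transformation, which is why the conformity constraints derived below fit the geometric programming framework.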
As we want to maximize the likelihood, we instead minimize its inverse. Let $n_{\mathbf{x}}$ denote how many times the joint assignment $\mathbf{x}$ appears in dataset $\mathcal{D}$. Then we can express the objective function as $\prod_{\mathbf{x}} P(\mathbf{x})^{-n_{\mathbf{x}}}$.
The above formula, directly expanded, is not a posynomial. To express it as a posynomial, we consider an auxiliary dataset constructed from $\mathcal{D}$ as follows: for each data point $\mathbf{x}$, the auxiliary dataset contains the completed points $(\mathbf{x}, c)$ with weight $n_{\mathbf{x}, c}$ for all values $c$ of the class. If the weights are such that $\sum_c n_{\mathbf{x}, c} = n_{\mathbf{x}}$ and $n_{\mathbf{x}, c} = n_{\mathbf{x}} \, g_c(\mathbf{x})$ for all $\mathbf{x}$, then the inverse of the expected joint likelihood given the new dataset is

(3)  $\prod_{\mathbf{x}} \prod_{c} P(\mathbf{x}, c)^{-n_{\mathbf{x}, c}} = \prod_{\mathbf{x}} P(\mathbf{x})^{-n_{\mathbf{x}}} \prod_{\mathbf{x}} \prod_{c} P(c \mid \mathbf{x})^{-n_{\mathbf{x}, c}}.$
For any $\mathbf{x}$, the conditional distribution $P(c \mid \mathbf{x})$ is fixed by the logistic regression model; in other words, the last product term in Equation 3 is a constant. Therefore, maximizing the expected (joint) likelihood on the completed dataset also maximizes the marginal likelihood, which is our original objective. Intuitively, maximizing the joint likelihood on any dataset is equivalent to maximizing the marginal likelihood whenever the conditional distribution is fixed. Now our objective function can be written as a monomial in terms of the naive Bayes parameters:

(4)  $\prod_{c} \theta_{c}^{-m_{c}} \prod_{i, c} \theta_{i \mid c}^{-m_{i, c}} \, \bar\theta_{i \mid c}^{\,-\bar{m}_{i, c}},$

where the exponents $m$ are the corresponding weighted counts in the completed dataset, and $\bar\theta_{i \mid c}$ denotes $P(X_i{=}0 \mid c)$, treated as a separate positive variable.
Next, we express the set of conformant naive Bayes distributions as monomial equality constraints over the parameters $\theta$. An NB model conforms with an LR model $g$ if and only if its corresponding logistic regression weights, according to Lemma 1, match those of $g$. Hence, conformity holds precisely when

(5)  $e^{w_i} \, \theta_{i \mid c_0} \, \bar\theta_{i \mid c_1} \, \theta_{i \mid c_1}^{-1} \, \bar\theta_{i \mid c_0}^{-1} = 1$  for all $i$,
(6)  $e^{w_0} \, \theta_{c_0} \Big( \prod_i \bar\theta_{i \mid c_0} \Big) \theta_{c_1}^{-1} \Big( \prod_i \bar\theta_{i \mid c_1} \Big)^{-1} = 1.$
We also need to ensure that the parameters define a valid probability distribution (e.g., $\theta_{i \mid c} + \bar\theta_{i \mid c} = 1$). Because such posynomial equalities are not valid geometric program constraints, we instead relax them to posynomial inequalities:

(7)  $\theta_{i \mid c} + \bar\theta_{i \mid c} \le 1$,  $\theta_{c_1} + \theta_{c_0} \le 1.$

(The learned parameters may not sum to 1; they can still be interpreted as a multi-valued NB with the same likelihood that conforms with $g$. These constraints were always active in our experiments.)
Putting everything together, naive conformant learning can be solved as a geometric program whose objective function is given by Equation 4 and whose constraints are given by Equations 5 to 7. (We have assumed a binary class for conciseness, but the approach easily generalizes to multi-class; details can be found in Appendix B.)
Table 1: Summary of the five datasets.

Datasets | Size | # Classes: Dist. | # Features | Feature Types
---------|------|------------------|------------|--------------------------
MNIST    | 60K  | 10: Balanced     | 784        | Integer: pixel value
Fashion  | 60K  | 10: Balanced     | 784        | Integer: pixel value
CovType  | 581K | 7: Unbalanced    | 54         | Continuous & Categorical
Adult    | 49K  | 2: Unbalanced    | 14         | Integer & Categorical
Splice   | 3K   | 3: Unbalanced    | 61         | Categorical
Table 2: Average cross entropy under different percentages of missing features (lower is better).

Cross Entropy             | CovType               | Adult                  | Splice
% Missing                 | 20   40   60    80    | 20   40    60    80    | 20   40   60   80
Min Imputation            | 12.8 15.5 20.7  29.4  | 41.0 49.2  55.4  59.6  | 71.3 81.8 97.2 117.2
Max Imputation            | 52.6 89.1 133.5 187.7 | 84.2 114.6 125.3 114.5 | 70.5 78.3 89.3 103.2
Mean Imputation           | 12.8 15.6 21.2  30.5  | 34.1 38.7  44.8  52.6  | 69.2 74.7 82.4 92.3
Median Imputation         | 12.8 15.7 21.4  30.8  | 35.3 41.2  48.6  57.8  | 70.0 75.7 83.0 92.0
Naive Conformant Learning | 12.6 14.8 18.9  25.8  | 33.6 37.0  41.2  46.6  | 69.1 74.7 82.8 94.0
Table 3: Weighted F1 scores under different percentages of missing features (higher is better).

Weighted F1               | CovType             | Adult               | Splice
% Missing                 | 20   40   60   80   | 20   40   60   80   | 20   40   60   80
Min Imputation            | 64.0 58.1 52.2 46.1 | 81.7 79.3 77.5 76.0 | 86.9 69.8 49.2 38.8
Max Imputation            | 49.8 44.4 41.6 37.3 | 81.7 79.3 77.4 76.0 | 86.9 69.8 49.1 38.8
Mean Imputation           | 64.0 58.0 52.2 46.3 | 82.9 79.8 75.3 70.7 | 91.8 82.3 66.2 45.7
Median Imputation         | 64.0 58.1 52.2 46.1 | 82.7 79.2 74.8 70.5 | 89.4 77.6 59.5 42.5
Naive Conformant Learning | 66.1 61.7 56.9 51.7 | 83.4 81.2 77.9 73.5 | 93.3 87.2 76.6 59.1
5 Empirical Evaluation
In this section, we empirically evaluate the performance of our proposed naive conformant learning (NaCL) and provide a detailed discussion of our method’s advantages over existing imputation approaches in practice. More specifically, we want to answer the following questions:
Q1: Does NaCL reliably estimate the probabilities of the original logistic regression with full data? How do these estimates compare to those from imputation techniques, including ones that also use the feature distribution?

Q2: Do higher-quality expectations of a logistic regression classifier result in higher accuracy on test data?

Q3: Does NaCL retain logistic regression's higher predictive accuracy over unconstrained naive Bayes?
Experimental Setup
To demonstrate the generality of our method, we construct a 5-dataset testbed suite that covers assorted configurations [Yann et al.2009, Xiao et al.2017, Blackard and Dean1999, Dua and Karra Taniskidou2017, Noordewier et al.1991]; see Table 1. The suite ranges from image classification to DNA sequence recognition; from fully balanced labels to highly unbalanced ones where most samples belong to a single class; from continuous to categorical features with up to 40 different values. For datasets with no directly provided test set, we construct one by a random split. As our method assumes binary inputs, we transform categorical features through one-hot encoding and binarize continuous ones based on whether they are larger than their respective mean plus 0.05 standard deviations.
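The binarization step can be sketched as follows (a pure-Python illustration of the rule stated above; the function name and example values are ours, not from the paper):

```python
import math

def binarize_column(values, eps=0.05):
    """Map a continuous column to binary: 1 iff the value exceeds the
    column mean plus eps standard deviations (computed over the column)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    threshold = mean + eps * std
    return [1 if v > threshold else 0 for v in values]

col = [0.1, 0.2, 0.9, 1.1, 0.15]
binary = binarize_column(col)   # threshold ~ 0.49 + 0.05 * 0.42
```

In practice the threshold would be computed on the training split only and then applied unchanged to the test split, so that no test statistics leak into preprocessing.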
5.1 Conformity with the Original Predictions
The optimal method for dealing with missing values would be one that enables the original classifier to act as if no features were missing. In other words, we want the predictions to be affected as little as possible by the missingness. As such, we evaluate the similarity between predictions made with and without missingness, measured by average cross entropy. The results are reported in Figure 1(a) and Table 2. (The results for max imputation are omitted from the figure as they are orders of magnitude worse than the rest.) Our method outperforms all baselines by a significant margin, demonstrating the quality of the expected predictions it produces.
We also compare NaCL with two imputation methods that consider the feature distribution, namely EM [Dempster et al.1977] and MICE [Buuren and GroothuisOudshoorn2010]. EM imputation reports the second-worst average cross entropies, and MICE's results are very similar to mean imputation's across the missingness levels. Because both EM and MICE are excessively time-consuming to run and their imputed values are no better than those of more lightweight alternatives, we do not compare with them in the rest of the experiments. We would like to especially emphasize this comparison: it demonstrates that directly leveraging feature distributions without also considering how the imputed values affect the classifier may lead to unsatisfactory predictions, further justifying the need for solving the expected prediction task and performing conformant learning. This concludes our answer to Q1.
5.2 Classification Accuracy
Encouraged by the fact that NaCL produces more reliable estimates of the conditional probability of the original logistic regression, we further investigate how much it helps achieve better classification accuracy with different percentages of missing features (i.e., Q2). As shown in Figure 1(b) and Table 3 (we report weighted F1 scores as the datasets are unbalanced), NaCL consistently outperforms all other methods except on the Adult dataset with 80% of the features missing.
Lastly, we compare NaCL to naive Bayes to answer Q3. (Since naive Bayes optimizes a different loss and effectively solves a different task than NaCL and the reported imputation methods, we do not include its results in the tables.) In all datasets except Splice, logistic regression classifies better than naive Bayes with fully observed features. NaCL maintains this advantage until a large fraction of the features go missing, further demonstrating the effectiveness of our method. Note that these four datasets have a large number of samples, which is consistent with the prevailing consensus that discriminative learners are better classifiers given sufficiently many samples [Ng and Jordan2002].
6 Case Study: Sufficient Explanations
In this section we briefly discuss utilizing conformant learning to explain classifications and show some empirical examples as a proof of concept.
At a high level, the task of explaining a particular classification can be thought of as quantifying the "importance" of each feature and choosing a small subset of the most important features as the explanation. Linear models are widely considered easy to interpret, and thus many explanation methods learn a linear model that is closely faithful to the original one, then use the learned model to assign importance to features [Ribeiro et al.2016, Lundberg and Lee2017, Shrikumar et al.2017]. These methods often assume a black-box setting, and to generate explanations they internally evaluate the predictor on multiple perturbations of the given instance. A caveat is that the perturbed values may have very low probability under the distribution the classifier was trained on. This can lead to unexpected results, as machine learning models typically only guarantee generalization when train and test data are drawn from the same distribution.
Instead we propose to leverage the feature distribution in producing explanations. To explain a given binary classifier, we consider a small subset of feature observations that is sufficient to get the same classification, in expectation w.r.t. a feature distribution. Next, we formally define our method:
Definition 4.
(Support and Opposing Features) Given a classifier $g$, a feature distribution $P$, and a partial observation $\mathbf{y}$, we partition the observed features into two sets. The first set consists of the support features, those whose observed values contribute towards the current classification of $\mathbf{y}$. The rest are the opposing features, which provide evidence against the current classification.
Definition 5.
The sufficient explanation of $\mathbf{y}$ with respect to $g$ and $P$ is defined as the smallest subset of the support features that, together with all the opposing features, yields the same expected classification as $\mathbf{y}$ itself.
Intuitively, this is the smallest set of support features that, in expectation, results in the same classification despite all the evidence to the contrary. In other words, we explain a classification using the strongest evidence towards it.
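For intuition, the search can be sketched as a brute-force procedure over a toy naive Bayes feature distribution. This is only an illustration, not the paper's algorithm: the numbers are invented, and the support-feature test used here (does observing the feature alone raise the posterior above the prior?) is a simple stand-in for the definition above.

```python
import itertools

# Toy naive Bayes model over three binary features (illustrative numbers).
prior1 = 0.5
t1 = [0.9, 0.8, 0.3]     # P(X_i = 1 | C = 1)
t0 = [0.2, 0.4, 0.6]     # P(X_i = 1 | C = 0)

def posterior_given(obs):
    """P(C = 1 | observed features), marginalizing the missing ones."""
    j1 = j0 = 0.0
    missing = [i for i in range(3) if i not in obs]
    for z in itertools.product([0, 1], repeat=len(missing)):
        x = {**obs, **dict(zip(missing, z))}
        p1, p0 = prior1, 1 - prior1
        for i in range(3):
            p1 *= t1[i] if x[i] else 1 - t1[i]
            p0 *= t0[i] if x[i] else 1 - t0[i]
        j1 += p1
        j0 += p0
    return j1 / (j1 + j0)

x = {0: 1, 1: 1, 2: 1}   # the instance, classified positive
assert posterior_given(x) > 0.5

# Simple stand-in criterion: a feature supports the classification
# if its observed value alone raises the posterior above the prior.
support = [i for i in x if posterior_given({i: x[i]}) > prior1]
opposing = [i for i in x if i not in support]

# Smallest subset of support features that, together with all opposing
# features, keeps the expected classification positive.
explanation = None
for k in range(len(support) + 1):
    for s in itertools.combinations(support, k):
        obs = {i: x[i] for i in list(s) + opposing}
        if posterior_given(obs) > 0.5:
            explanation = set(s)
            break
    if explanation is not None:
        break
```

On this toy model, a single support feature already outweighs the opposing evidence. Real instances would require a smarter search than full enumeration, since the number of subsets grows exponentially.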
For a qualitative evaluation, we generate sufficient explanations on instances of a binary logistic regression task for MNIST digits 5 and 3; see the last column of Figure 3. Take the first example in Figure 2(a): the white pixels selected as sufficient explanation show that the digit should be a 5. Also notice the black pixels in the explanation: they express how the absence of white pixels significantly contributes to the classification, especially in parts of the image where they would be expected for the opposing class. Similarly, the black pixels in the first example in Figure 2(b) look like a 3, and the white pixels in the explanation look like a 5, explaining why this 3 was misclassified as a 5. We further compare our approach to an alternative one that selects a subset of support features based on their logistic regression weights; see the third column of Figure 3. It selects features that will cause a large difference in prediction if the value was flipped, as opposed to missing, which is what sufficient explanation considers.


7 Related Work
There have been many approaches developed to classify with missing values, which can broadly be grouped into two different types. The first one focuses on increasing classifiers’ inherent robustness to feature corruption, which includes missingness. A common way to achieve such robustness is to spread the importance weights more evenly among features [Globerson and Roweis2006, Dekel et al.2010, Xia et al.2017]. One downside of this approach is that the trained classifier may not achieve its best possible performance if no features go missing.
The second group investigates how to impute the missing values. In essence, imputation is a form of reasoning about missing values from the observed ones [Sharpe and Solly1995, Batista et al.2002, McKnight et al.2007]. An iterative process is commonly used for this reasoning [Buuren and GroothuisOudshoorn2010]. Some recent works also adapt autoencoders and GANs for the task [Gondara and Wang2018, Zhang et al.2018]. Several of these methods can be incorporated into a framework called multiple imputation to reflect and better bound one's uncertainty [Schafer1999, Azur et al.2011]. These existing methods focus on substituting missing values with ones closer to the ground truth, but do not model how the imputed values interact with the trained classifier. In contrast, our proposed method explicitly reasons about what the classifier is expected to return.
We are among the first to incorporate the feature distribution into generating explanations. A notable recent work along this line is [Chen et al.2018], which proposes to maximize the mutual information between the selected features and the class. To leverage the feature distribution more explicitly, one can marginalize over the universe of plausible alternative values for the selected features by conditioning a generative model of the input distribution on the remaining features [Chang et al.2019].
8 Conclusion & Future Work
In this paper we introduced the expected prediction task, a principled approach to predicting with missing features. It leverages a feature distribution to reason about what a classifier is expected to return if it could observe all features. We then proposed conformant learning, which learns joint distributions that conform with, and can take expectations of, discriminative classifiers. A special instance of it, naive conformant learning, was shown empirically to outperform many existing imputation methods. For future work, we would like to explore conformant learning for other generative-discriminative pairs of models and extend NaCL to real-valued features.
Acknowledgements
This work is partially supported by NSF grants #IIS1657613, #IIS1633857, #CCF1837129, DARPA XAI grant #N660011724032, NEC Research and a gift from Intel.
References
 [Azur et al.2011] Melissa J Azur, Elizabeth A Stuart, Constantine Frangakis, and Philip J Leaf. Multiple imputation by chained equations: what is it and how does it work? International journal of methods in psychiatric research, 2011.
 [Batista et al.2002] Gustavo EAPA Batista, Maria Carolina Monard, et al. A study of knearest neighbour as an imputation method. HIS, 2002.

[Blackard and
Dean1999]
Jock A Blackard and Denis J Dean.
Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables.
Computers and electronics in agriculture, 1999.  [Bouchard and Triggs2004] Guillaume Bouchard and Bill Triggs. The tradeoff between generative and discriminative classifiers. In COMPSTAT, 2004.
 [Boyd et al.2007] Stephen Boyd, SeungJean Kim, Lieven Vandenberghe, and Arash Hassibi. A tutorial on geometric programming. Optimization and engineering, 8(1):67, 2007.
 [Buuren and GroothuisOudshoorn2010] S van Buuren and Karin Groothuis-Oudshoorn. MICE: Multivariate imputation by chained equations in R. Journal of statistical software, 2010.
 [Chang et al.2019] ChunHao Chang, Elliot Creager, Anna Goldenberg, and David Duvenaud. Explaining image classifiers by counterfactual generation. 2019.
 [Chen et al.2013] Suming Jeremiah Chen, Arthur Choi, and Adnan Darwiche. An exact algorithm for computing the samedecision probability. In IJCAI, 2013.
 [Chen et al.2018] Jianbo Chen, Le Song, Martin Wainwright, and Michael Jordan. Learning to explain: An informationtheoretic perspective on model interpretation. In ICML, 2018.
 [Dekel et al.2010] Ofer Dekel, Ohad Shamir, and Lin Xiao. Learning to classify with missing and corrupted features. Machine learning, 2010.
 [Dempster et al.1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 1977.
 [Dua and Karra Taniskidou2017] Dheeru Dua and Efi Karra Taniskidou. UCI machine learning repository, 2017.
 [Globerson and Roweis2006] Amir Globerson and Sam Roweis. Nightmare at test time: Robust learning by feature deletion. In ICML, 2006.

[Gondara and Wang2018]
Lovedeep Gondara and Ke Wang.
Mida: Multiple imputation using denoising autoencoders.
In PacificAsia Conference on Knowledge Discovery and Data Mining. Springer, 2018.  [Graham2012] John W Graham. Missing data: Analysis and design. Springer Science & Business Media, 2012.
 [Little and Rubin2014] Roderick JA Little and Donald B Rubin. Statistical analysis with missing data, volume 333. John Wiley & Sons, 2014.
 [Lundberg and Lee2017] Scott M Lundberg and SuIn Lee. A unified approach to interpreting model predictions. In NeurIPS. 2017.
 [McKnight et al.2007] Patrick E McKnight, Katherine M McKnight, Souraya Sidani, and Aurelio Jose Figueredo. Missing data: A gentle introduction. Guilford Press, 2007.
 [Ng and Jordan2002] Andrew Y Ng and Michael I Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in neural information processing systems, pages 841–848, 2002.
 [Noordewier et al.1991] Michiel O Noordewier, Geoffrey G Towell, and Jude W Shavlik. Training knowledgebased neural networks to recognize genes in dna sequences. In NeurIPS, 1991.
 [Ribeiro et al.2016] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016.
 [Roth1996] Dan Roth. On the hardness of approximate reasoning. Artificial Intelligence, 1996.
 [Schafer1999] Joseph L Schafer. Multiple imputation: a primer. Statistical methods in medical research, 1999.
 [Sharpe and Solly1995] Peter K. Sharpe and RJ Solly. Dealing with missing values in neural networkbased diagnostic systems. Neural Computing & Applications, 1995.
 [Shrikumar et al.2017] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In ICML, 2017.

[Xia et al.2017]
Jing Xia, Shengyu Zhang, Guolong Cai, Li Li, Qing Pan, Jing Yan, and Gangmin
Ning.
Adjusted weight voting algorithm for random forests in handling missing values.
Pattern Recognition, 2017.  [Xiao et al.2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. FashionMNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.

[Yann et al.2009]
LeCun Yann, Cortes Corinna, and Christopher JC Burges.
The mnist database of handwritten digits, 2009.
 [Zhang et al.2018] Hongbao Zhang, Pengtao Xie, and Eric Xing. Missing value imputation based on deep generative models. arXiv preprint arXiv:1808.01684, 2018.
Appendix A Proofs
a.1 Proof of Theorem 1
The proof is by reduction from computing the same-decision probability, whose decision problem D-SDP was shown to be NP-hard [Chen et al.2013].
D-SDP: Given a naive Bayes distribution $P$ over variables $\mathbf{Z}$ and class $C$, a threshold $T$, and a probability $p$, is the same-decision probability

(1)  $\sum_{\mathbf{z}} \mathbb{1}\!\left[ P(c \mid \mathbf{z}) \ge T \right] P(\mathbf{z})$

greater than $p$?
Here, denotes an indicator function which returns 1 if the enclosed expression is true, and 0 otherwise.
Using Lemma 1, we can efficiently translate the naive Bayes model into a logistic regression with a linear weight function $f$ such that $\sigma(f(\mathbf{z})) = P(c \mid \mathbf{z})$. Note that $P(c \mid \mathbf{z}) \ge T$ iff $f(\mathbf{z}) \ge \sigma^{-1}(T)$. Then we construct another logistic regression whose weight function is $f'(\mathbf{z}) = \gamma \left( f(\mathbf{z}) - \sigma^{-1}(T) \right)$
for some positive constant $\gamma$. As $f$ is a linear model, $f'$ is also linear, and $f'(\mathbf{z}) \ge 0$ iff $P(c \mid \mathbf{z}) \ge T$. As $\gamma$ grows, $\sigma(f'(\mathbf{z}))$ approaches 1 and 0 for positive and negative examples, respectively. Hence, this logistic regression model outputs 1 if $P(c \mid \mathbf{z}) \ge T$ and 0 otherwise, effectively acting as an indicator function. Therefore, the expectation of such a classifier over $P(\mathbf{Z})$ equals the same-decision probability of $P$. ∎
a.2 Proof of Lemma 1
We want to prove that there is a unique solution to the following equations:
Since the classification is binary, it suffices to solve the first equation.
(2)  
We want
(3) 
Using the naive Bayes assumption and some simplifications, we have:
(4) 
Now we want the RHS of Equation 4 to be a linear function of the $x_i$'s, so we make the following substitutions:

(5)

By combining Equations 4 and 5, we obtain the final result of the lemma by simple algebraic manipulations. To solve for $w_0$, we plug in $x_i = 0$ for all $i \ge 1$ (with $x_0 = 1$, since it is the dummy feature for the bias term). To compute each $w_i$, we take the coefficient of $x_i$ in Equation 5.
The last remaining task is to prove that, given the above weights, the conditional probabilities agree for all possible instantiations of $\mathbf{x}$. It suffices to show that Equation 4 always holds. Suppose $k$ of the $x_i$'s are set to 1 and the others to 0; without loss of generality, assume they are the first $k$ variables. We can show Equation 4 holds as follows:
∎
a.3 Proof of Lemma 2
Through the same algebraic manipulation as before, we obtain the same equations as in Lemma 1, with the only difference that the unknown variables are now the parameters of the naive Bayes model rather than the weights of the logistic regression. Intuitively, because the NB model has $2n+1$ free parameters while the LR model has only $n+1$, we have some degrees of freedom. To remove this freedom, we fix one parameter per feature:

(6)
We can fix the parameters in other ways as long as we fix one parameter per feature. We show this one for notational simplicity.
Now there is a unique naive Bayes model that matches the logistic regression classifier and also agrees with Equation 6. The parameters of that naive Bayes model are as follows:
After this, we can uniquely determine the remaining parameters from the previous ones.
∎
Appendix B Beyond Binary Classification: MultiClass
In the paper we gave the constraints and the objective for the case of binary classification; in this section we show that our method easily extends to multi-class classification. The flow is similar to the binary case, with the following modifications:
Definition 6.
(Multi-class Classifiers) Suppose we have a classifier with $k$ classes, each denoted $c_j$ ($1 \le j \le k$). The conditional probabilities for the logistic regression and naive Bayes classifiers are defined as:
Definition 7.
(Multiclass Substitutions) We define the following notation to make the equations simpler to read.
Now we want the two models to yield the same classifier. One way to simplify solving the equations is to set the ratios of the class probabilities equal to each other, rather than the probabilities directly (as we did in the binary case). Moreover, to obtain the same classifier it suffices to divide by the probability of a single reference class, so we want the following to hold for all classes:
This leads to:
Finally, after doing some algebra similar to the binary case, we obtain the following constraints:
(7)  
(8)  