RIFLE: Robust Inference from Low Order Marginals

09/01/2021 ∙ by Sina Baharlouei, et al. ∙ University of Southern California 10

The ubiquity of missing values in real-world datasets poses a challenge for statistical inference and can prevent similar datasets from being analyzed in the same study, precluding many existing datasets from being used for new analyses. While an extensive collection of packages and algorithms have been developed for data imputation, the overwhelming majority perform poorly if there are many missing values and low sample size, which are unfortunately common characteristics in empirical data. Such low-accuracy estimations adversely affect the performance of downstream statistical models. We develop a statistical inference framework for predicting the target variable without imputing missing values. Our framework, RIFLE (Robust InFerence via Low-order moment Estimations), estimates low-order moments with corresponding confidence intervals to learn a distributionally robust model. We specialize our framework to linear regression and normal discriminant analysis, and we provide convergence and performance guarantees. This framework can also be adapted to impute missing data. In numerical experiments, we compare RIFLE with state-of-the-art approaches (including MICE, Amelia, MissForest, KNN-imputer, MIDA, and Mean Imputer). Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small. RIFLE is publicly available.



There are no comments yet.


page 2

page 21

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Machine learning algorithms have shown promise when applied to a wide range of problems, including healthcare, finance, social data analysis, image processing, and speech recognition. However, this success mainly relied on the availability of large-scale, high-quality datasets, which may be scarce in many practical problems, especially in medical and health applications (Pedersen et al., ; Sterne et al., 2009; Beaulieu-Jones et al., 2018). Moreover, in such applications, many experiments and datasets suffer from small sample size. At the same time, an increasingly large number of datasets are publicly available. To fully and effectively utilize information on related research questions from diverse datasets, information across various datasets needs to be combined in a reliable fashion. After appending these datasets together, the obtained dataset can contain large blocks of missing values as they may not share exactly the same features (Figure 1).

For example, similar data may be collected from different health centers or across waves of a dataset, although questions may not be exactly the same and some questions be asked while others omitted. Appending two or more of these datasets may then lead to many missing values in the resulting dataset. Thus, a pre-processing step should be performed to remove or impute missing values. However, removing rows containing missing values is not an option when the percentage of missingness in a dataset is high, or the distribution of missing values is MNAR (Missing Not At Random). For instance, as it is demonstrated in Figure 1, if we eliminate the rows that have at least one missing entry, the entire dataset will be discarded.

Figure 1: Consider the problem of predicting the trait

from feature vector

. Suppose that we have access to three data sets: The first dataset includes the measurements of for individuals. The second dataset collects data from another individuals by measuring with no measurements of the target variable in it; and the third dataset contains the measurements from the variables for number of individuals; How one should learn the predictor from these three datasets?

The most common approach for handling missing values in a learning task is to impute them in a pre-processing stage. The general idea behind data imputation approaches is that the missing values can be predicted using the other, available data points and correlated features. The imputation of data allows practitioners to run standard statistical algorithms requiring complete data. However, the performance of the prediction model can be highly reliant on the accuracy of the imputer. High error rates in prediction of missing values by the imputer can lead to catastrophic performance of the statistical methods executed on the imputed data.

An alternative approach is to learn the parameters of the model directly (without imputation) based on the available data. In this paper, we propose RIFLE (Robust InFerence via Low-order moment Estimations) for the direct inference of a target variable based on a set of features containing missing values. The proposed framework does not require the data to be imputed in a pre-processing stage. However, if needed, it can be used as a pre-processing tool for imputing data as well. The main idea of the proposed framework is to estimate the first and second-order moments of the data and their confidence intervals by bootstrapping on the available data matrix entries. Then, RIFLE finds the optimal parameters of the statistical model for the worst-case distribution with the low-order moments (mean and variance) within the estimated confidence intervals (See Figure 


Figure 2: Prediction of the target variable without imputation. RIFLE estimates confidence intervals for low-order (first and second order) marginals from the input data containing missing values. Then, it solves a distributionally robust problem over all distributions whose low-order marginals are within the estimated confidence intervals.

1.1 Related Work

The fundamental idea behind many data imputation approaches is that the missing values can be predicted based on the available data of other data points and correlated features. One of the most straightforward imputation techniques is to replace missing values by the mean (or median) of that feature calculated from what data is available see (Little and Rubin, 2019, Chapter 3). However, this naïve approach ignores the correlation between features and does not preserve the variance of features. Another class of imputers have been developed based on the least-square methods (Raghunathan et al., 2001; Kim et al., 2005; Zhang et al., 2008; Cai et al., 2006). Raghunathan et al. (2001) learns a linear model with multivariate Gaussian noise for the feature with the least amount of missing values. It repeats the same procedure on the updated data to impute the next feature with the least amount of missing values until all features are completely imputed. One drawback of this approach is that the error from the imputation of previous features can be propagated to subsequent features. To impute entries of a given feature in a dataset, Kim et al. (2005) learns several univariate regression models that consider that feature as the response, and then it takes the average of these predictions as the final value of imputation. This approach fails to learn the correlations involving more than two features.

Many more complex algorithms have been developed for imputation, although many are sensitive to initial assumptions and may not converge. For instance, KNN-Imputer imputes a missing feature of a data point by taking the mean value of the closest complete data points (Troyanskaya et al., 2001)

. MissForest, on the other hand, imputes the missing values of each feature by learning a random forest classifier using other features of the training data

(Stekhoven and Bühlmann, 2012). MissForest does not need to assume that all features are continuous (Honaker et al., 2011), or all categorical (Schafer, 1997). However, both KNN-imputer and MissForest do not provide statistical or computational convergence guarantees for their algorithm. Moreover, when the proportion of missing values is high, both are likely to have a severe drop in performance as it is demonstrated in Section 5

. The Expectation Maximization (EM) algorithm is another popular approach that learns the parameters of a prior distribution on the data using available values based on the EM algorithm of

(Dempster et al., 1977); see also Ghahramani and Jordan (1994) and Honaker et al. (2011)

. The EM algorithm is also used in Amelia, which fits a jointly normal distribution to the data using EM and bootstrap technique

(Honaker et al., 2011). While Amelia demonstrates a superior performance on datasets following a normal distribution, it is highly sensitive to the violation of the normality assumption (as discussed in Bertsimas et al. (2017)). Ghahramani and Jordan (1994)

adopt the EM algorithm to learn a joint Bernoulli distribution for the categorical data and a joint Gaussian distribution for the continuous variables independently. While those algorithms can be viewed as inference methods based on low-order estimates of moments, they do not consider uncertainty in such low-order moments estimates. By contrast, our framework utilizes

robust optimization to consider the uncertainty around the estimated moments. Moreover, our optimization procedure for imputation and prediction is guaranteed to converge unlike some of the aforementioned algorithms.

Another popular method for data imputation is multiple imputations by chained equations (MICE). MICE learns a parametric distribution for each feature conditional on the remaining features. For instance, it assumes that the current target variable is a linear function of other features with a zero-mean Gaussian noise. Each feature can have its own distinct distribution and parameters (e.g., Poisson regression, logistic regression, etc). Based on the learned parameters of conditional distributions, MICE can generate one or more imputed datasets

(Buuren and Groothuis-Oudshoorn, 2010)

. More recently, several neural network-based imputers have been proposed. GAIN (Generative Adversarial Imputation Network) adopts the idea of learning a generative adversarial network based on the available data, and then imputing the missing values using the trained generator

(Yoon et al., 2018)

. One advantage of GAIN over other existing GAN imputers is that it does not need a complete dataset during the training phase. MIDA (Multiple Imputation using Denoising Autoencoders) is an auto-encoder based approach that trains a denoising auto-encoder on the available data considering the missing entries as noise. Similar to other neural network-based methods, these algorithms suffer from their black-box nature. They are challenging to interpret/explain, which makes them not popular in mission-critical approaches such as healthcare. In addition, no statistical or computational guarantees are provided for these algorithms.

Bertsimas et al. (2017) formulates the imputation task as a constrained optimization problem where the constraints are determined by the underlying classification model such as KNN (

-nearest neighbors), SVM (Support Vector Machine), and Decision Trees. Their general framework is non-convex, and the authors relax the optimization for each choice of the cost function using first-order methods. The relaxed problem is then optimized by the block coordinate descent algorithm. They show the convergence and accuracy of their proposed algorithm numerically, while a theoretical analysis which guarantees the convergence of the algorithm is absent in their work.

Beside these imputation methods, Shivaswamy et al. (2006) and Xu et al. (2009) adopt robust optimization approaches to learn the parameters of a support vector machine model. They consider uncertainty sets for the missing entries in the dataset, and solve a min-max problem over that set. The obtained classifiers are robust to the uncertainty of missing entries within that region. In contrast to the imputation based approaches, the robust classification formulation does not carry the imputation error to the classification phase. However, finding appropriate intervals for each missing entry is a challenging problem by itself, and it might not be available apriori in many real datasets. Moreover, their proposed algorithms are limited to the SVM classifier. On the other hand, as we will see in this work, estimating uncertainty regions for estimated low-order marginals can be done naturally. Furthermore, our framework is not restricted to any particular machine learning module.

1.2 Contributions

We list our main contributions in this report:

  1. We present a general robust optimization framework for learning the parameters of a learning model in the presence of missing values. The proposed framework does not require data imputation as a pre-processing stage. We specialize the framework to linear regression and normal discriminant analysis models in two case studies. The proposed framework provides a novel strategy for inference in the presence of missing values, especially for datasets with large proportions of missing values.

  2. We provide theoretical convergence guarantees and analyze the iteration complexity of the presented algorithms for robust formulations of linear regression and normal discriminant analysis. Moreover, we analyze the asymptotic statistical properties of the solutions found by the algorithms.

  3. While the robust inference framework is primarily designed for direct statistical inference in the presence of missing values without performing data imputation, it can be adopted as an imputation tool as well. To demonstrate the quality of the proposed imputer, we compare its performance with several widely used imputation packages such as MICE, Amelia, MissForest, KNN-Imputer and MIDA on real and synthetic datasets. Generally speaking, our method outperforms all of the mentioned packages when the number of missing entries is large.

  4. Based on our developments in this work, we created a publicly available package for inference in the presence of missing data. Our implementation and the code is available at https://github.com/optimization-for-data-driven-science/RIFLE.

1.3 Organization of the Paper

In Section 2, we present the general robust inference framework in the presence of missing data used in this work. In Section 3, we specialize the proposed framework for the robust linear regression problem. The algorithms in Section 3 are analyzed from both optimization and statistical standpoints. Section 4 is dedicated to the robust classification task. In particular, we demonstrate how to derive an algorithm for the robust normal discriminant analysis as a case study of robust classification framework. Finally, in Section 5, we provide numerical experiments to evaluate the performance of the proposed algorithms.

2 Robust Inference via Estimating Low-order Moments

RIFLE is a distributionally robust optimization (DRO) framework based on estimated low-order marginals. Assume that

follows a joint probability distribution 

. A standard objective in parametric statistical learning is to find the parameter 

that minimizes the population risk with respect to a given loss function



Since the underlying distribution of data is not usually available in practice, the above problem cannot be solved directly. The most common approach for approximating the population risk problem (1) is to minimize the empirical risk with respect to given i.i.d samples

drawn from the joint distribution


In the presence of missing entries in the data points , the above formulation cannot be used. A natural alternative approach in the presence of uncertainties and missing data is training via robust optimization. For example, (Shivaswamy et al., 2006; Xu et al., 2009) suggest the formulation:


where represents the uncertainty region of data point . (Shivaswamy et al., 2006) obtains such an uncertainty set by assuming a known distribution on the missing entries of data point . (Xu et al., 2009) defines as a “box” constraint around the missing entries of data point . However, in practice, it is difficult to find such “per data point” constraint sets. In this work, we instead take distributionally robust approach to estimate the uncertainty set for the data distribution. In particular, we aim to fit the best parameters of a statistical learning model for the worst distribution in a given uncertainty set by solving:


where is an uncertainty set over the underlying distribution of data. A key observation is that defining the uncertainty set  in (3) is easier (and can be done more naturally) than defining the uncertainty sets  in (2). In particular, the uncertainty set  can be obtained naturally by estimating low-order moments of data distribution using available data. To explain this idea and to simplify the notations, let , , and While and are typically not known exactly, one can estimate them (within certain confidence intervals) from the available data despite having missing data points (see Subsection 2.1). In particular, we can estimate the values , and from data such that and

with high probability (where the inequalities for matrices and vectores denote component-wise relations). Given these estimated confidence intervals from data, problem (

3) can be reformulated as


In Section 3 and Section 4, we simplify this formulation and develop efficient algorithms for special popular forms of the loss function . However, before proceeding further, we will first explain how we estimate the parameters , and from data in the next subsection.

2.1 Estimating Confidence Intervals of Low-order Moments

In this section, we explain the methodology of estimating confidence intervals for and . Let be given training data points, where . We define where the symbol represents a missing entry. Assume that is the data matrix whose i-th row is , and is the vector of labels. Moreover, assume that represents the -th column (feature) of matrix . We define:

Thus, is obtained by replacing the missing values with . We estimate the confidence intervals for mean and covariance of features using multiple bootstrap samples on the available data. Let and be the center and the radius of the confidence interval for , respectively. We compute the center of the confidence interval for as follows:


where and . This estimator is obtained from the rows where both features are available.

To find the radius of confidence intervals for each given pair of features, we choose different bootstrap samples with length on the rows where both features and are available. Then, we compute

of two features in each bootstrap sample. The standard deviation of these estimations determines the radius of the corresponding confidence interval. Algorithm 

1 summarizes the required steps for computing the radius of confidence interval for the -th entry of covariance matrix . Note that the confidence intervals for can be computed in the same manner.

1:Input: Number of estimations
2:for  do
3:   Pick samples with replacement from the rows where both -th and -th are available.
4:   Let be the -th and -th features of the selected samples
Algorithm 1 Estimating Confidence Interval Length for Feature and Feature .

Having and , the confidence interval for the matrix is computed as follows:


where the constant is a hyper-parameter determining the level of robustness. In Section 3 and Section (4) we specialize the general distributionally robust problem described in (4) to the linear regression and normal discriminant analysis models.

Remark 1

Since the computation of confidence intervals for different entries of the covariance matrix are independent of each other, they can be computed in parallel. In particular, if cores are available, features (columns of covariance matrix) can be assigned to each one of the available cores.

3 Robust Linear Regression in the Presence of Missing Values

In a linear regression model, we aim to minimize the following objective function:

The optimal for the linear regression model can be written as


Thus, for obtaining the optimal solution, we only need to compute the second-order moments of the data and . In other words, It suffices to compute the inner product of any two column vectors , of , and the inner product of any column vector of with vector . Since the matrix and vector are not fully observed due to the existence of missing values, use the same approach as (5) to compute the point estimators and

. These point estimators can be highly inaccurate, especially when the number of non-missing rows for two given columns is small. In addition, if the pattern of missing entries does not follow the MCAR assumption, the point estimators are not unbiased estimators of

and . To learn a linear regression model which is robust against large proportion of missing values, and MNAR missing patterns, we specialize the framework presented in Section 2 to the linear regression model.

3.1 A Distributionally Robust Formulation of Linear Regression

As we mentioned above, for solving the linear regression problem, we only need to estimate the second-order moments of the data ( and ). Thus, the distributionally robust formulation described in (4) is equivalent to the following optimization problem for the linear regression model:


where the last constraint guarantees that the covariance matrix is positive semi-definite. To compute confidence intervals in the above formulation we specialize Algorithm 1 to the linear regression problem. Let be the mask of the input data matrix defined as:

Assume that , which is the number of rows in the dataset where both features and are available. To estimate the confidence intervals for , we select multiple () samples of size from the rows where both features are available. Each one of these samples with size are obtained by applying a bootstrap sampler (sampling with replacement) on the rows where both features are available. Then, for each sample we compute the second-order moment of two features. Let be the variance of the these estimations. We set , where is a hyper-parameter determining the level of robustness. A larger corresponds to bigger confidence intervals and thus a more robust estimator. However, leads to very large confidence intervals that can adversely affect the performance of the trained model. Algorithm 2 summarizes this procedure of computing confidence intervals.

1:Input: Number of estimations, feature indices and .
2:for  do
3:   Pick samples with replacement from the rows where both -th and -th are available.
4:   Let be the -th and -th features of the selected samples
Algorithm 2 Estimating Confidence Interval Length for Feature and Feature .

Fixing and , the minimization problem in (9) has a closed form solution with respect to . To solve the inner maximization problem, we apply the projected gradient ascent to and . Note that the projection to the box constraint can be done entriwise and has the following closed form:

However, the projection to the set of positive semi-definite matrices requires a singular value decomposition (SVD) of

at each iteration. Moreover, satisfying both constraints simultaneously is a challenging problem and cannot be done in closed-form. To avoid this complexity and time-consuming singular value decomposition at each iteration, we relax the problem by removing the PSD constraint on

. This relaxation does not drastically change the performance of the algorithm as our experiments show in Section 5. In Section 3.4 we propose an algorithm for solving the original problem efficiently. Algorithm 3, demonstrates the steps of optimizing problem (9). At each iteration of the algorithm, we first perform one step of projected gradient ascent on matrix and , then we solve the minimization problem with respect to for the updated and exactly.

2:Initialize: .
3:for  do
4:   Update
5:   Update
6:   Set
Algorithm 3 Robust Linear Regression

Since, the objective function is convex in and concave in and , the minimization and maximization sub-problems are interchangeable (Sion and others, 1958). Thus, we can equivalently rewrite Problem (9) as:


where . Algorithm 3 describes the procedure of applying projected gradient ascent on function to find an optimal solution of Problem (10). We use and obtained from Equation (5) as the initialization points for and .

3.2 Convergence Guarantee of Robust Linear Regression Algorithm

Since the minimization problem in definition of function is strongly convex with respect to , the minimizer is unique and can be found in closed form. Moreover, function is concave with respect to and . Thus, applying the projected gradient ascent algorithm on with appropriate choice of step-size will converge to the global maximizer of function . Note that, the convergence rate of algorithm 3 depends on the Lipschitz constant of function . We use Lemma 1 in (Barazandeh and Razaviyayn, 2020) to compute the Lipschitz constant of function . Without loss of generality, by rescaling of the data, if it is necessary, we can assume that . Having the Lipschitz constant of function , Theorem 3 establishes the convergence rate of Algorithm 3 to an -optimal saddle point of Problem (10).

Lemma 2

The Lipschitz constant of the gradient of the function defined in Problem (10) is equal to .

Proof  The proof is relegated to Appendix B.  

Theorem 3

Assume that . Then Algorithm 3 computes an -stationary solution of the objective function in (10) in iterations, where .

Proof  The proof is relegated to Appendix B.  

3.3 Obtaining Faster Convergence and Acceleration

The convergence rate of Algorithm 3 to a saddle point of optimization problem (9) can be slow in practice, since the algorithm requires to do a matrix inversion for updating and applying the box constraint to and at each iteration. While we update the minimization problem in closed-form with respect to , we can speed up the convergence rate of the maximization problem by applying Nesterov’s acceleration method to function in (10). Note that, we can compute the gradient of function with respect to and using Danskin’s theorem.

Algorithm 4 describes the steps to optimize Problem (10) using Nesterov’s acceleration method.

2:Initialize: .
3:for  do
9:   Set
Algorithm 4 Applying the Nesterov’s Acceleration Method to Robust Linear Regression
Theorem 4

Assume that . Then Algorithm 4 computes an -stationary solution of the objective function in (10) in iterations, where .

Proof  The proof is relegated to Appendix B.  

3.4 Solving the Dual Problem of Robust Linear Regression via ADMM

Alternating Direction Method of Multipliers (ADMM) is a popular algorithm for efficiently solving linearly constrained optimization problems (Gabay and Mercier, 1976; Hong et al., 2016). It has been extensively applied to large-scale optimization problems in machine learning and statistical inference in recent years (Assländer et al., 2018; Zhang et al., 2018). Consider the following optimization problem consisting of two blocks of variables and that are linearly coupled:


The augmented Lagrangian of the above problem can be written as:


ADMM schema updates the primal and dual variables iteratively as it is depicted in Algorithm 5.

1:for  do
Algorithm 5 Schema of ADMM algorithm Applied to the Augmented Lagrangian Problem

As we mentioned earlier, simultaneous projection of to the set of positive semi-definite matrices and to the box constraint is difficult. To deal with this problem, we eliminate the PSD constraint in Algorithm 3 and Algorithm 4. Moreover, in those algorithms, we need to carefully tune the step-size to avoid inconsistency and guarantee convergence.

An alternative approach for solving Problem (9) that avoids these issues is to solve the dual of the inner maximization problem. Since the maximization problem is concave with respect to and , and the relative interior of the feasible set of constraints is non-empty, the duality gap is zero. Hence, instead of solving the inner maximization problem, we can solve its dual which is a minimization problem. Theorem 5 describes the dual problem of the inner maximization problem in  (9). Thus, Problem (9) can be alternatively formulated as a minimization problem rather than a min-max problem. We can solve such a constrained minimization problem efficiently via ADMM algorithm. As we will show, the ADMM algorithm applied to the dual problem does not need tuning of step-size or doing simultaneous projections to the box constraints and positive semi-definite (PSD) constraints.

Theorem 5

(Dual Problem) The inner maximization problem described in (9) can be equivalently formulated as:

Therefore, Problem (9) can be alternatively written as:


Proof  The proof is relegated to Appendix B.  

To apply the ADMM method to the dual problem, we require to divide the optimization variables into two blocks as in (11) such that both sub-problems in Algorithm 5 can be efficiently solved. To do so, first, we introduce the auxiliary variables and to the dual problem. Also, let . Therefore, Problem (40) is equivalent to:


Since handling both constraints on in Problem (13) is difficult, we interchange with in the first constraint. Moreover, the non-negativity constraints on and are exchanged with non-negativity constraints on and . For the simplicity of presentation, assume that , , , , and . Algorithm 6 describes the ADMM algorithm applied to Problem (14) (see Appendix C for the derivation).

2:Initialize: .
3:for  do
Algorithm 6 Applying ADMM to the Dual Problem of Robust Linear Regression
Corollary 6

If the feasible set of Problem (9) has non-empty interior, then Algorithm 6 converges to an -optimal solution of Problem (14) with a linear rate .

Proof  Since the inner maximization problem in (9) is convex, and its interior feasible set is not empty, the duality gap is zero by the Slater’s condition. Thus, according to Theorem  in (He and Yuan, 2015), Algorithm 6 converges to an optimal solution of the primal-dual problem with a linear rate. Moreover, the sequence of constraint residuals converges to zero with a linear rate as well.  

Remark 7

The optimal solution obtained from the ADMM algorithm can be different from the one given by Algorithm 3 because we remove the positive semi-definite constraint on in the latter. We investigate the difference between solutions of two algorithms in three cases: First, we generate a small positive semi-definite matrix and the matrix of confidence intervals () as follows:

Moreover, let and are generated as follows:

Initializing both algorithms with a random matrix within

and , and a random vector within and , ADMM algorithm returns a different solution from Algorithm 3.

Besides, the difference in the performance of algorithms during the test phase can be observed in the experiments on synthetic datasets depicted in Figure 6 as well, especially when the number of samples is smaller.

3.5 Performance Guarantees for RIFLE

In the previous section, we discussed how to efficiently solve the robust linear regression problem in the presence of missing values up to optimality. A natural question in this context is the performance of the proposed algorithms in the previous section on the unseen test data points. Theorem 8 investigates such question from two perspectives: Under the assumption that the missing values are distributed completely at random (MCAR), the optimal solution of Problem (3) is consistent, meaning as the number of samples goes to infinity, the optimal solution converges to the ground truth parameter . Moreover, for the finite case, Theorem 8 part (b) states that with the proper choice of confidence intervals, with high probability, the population (test) loss is bounded by the training loss.

Theorem 8

(a) Consistency of the Covariance Estimator: If the missing pattern of the data follows MCAR, the estimator proposed in Algorithm 2 converges to its actual value. Thus, when goes to infinity, the estimated confidence intervals for each entry of the covariance matrix converges to .


(b) Let , be the matrix of data, and the corresponding assigned labels to each data point. Assume that each entry of and is missing with probability of . Moreover, Let be the ground-truth parameters, and be a saddle point of Problem (8). Then with probability of at least , we have:


Proof  The proof is relegated to Appendix B.  

3.6 How to Handle Missing Values in the Test Phase

Let be the optimal parameter of the robust linear regression problem (3) obtained by running one of the algorithms described in the previous section on the training dataset. If, a test data point does not contain missing values, then can be easily computed by . There are several options for handling data points with several missing entries. A naïve idea is to make all of the missing entries zero. However, this approach can affect the performance of the model drastically. On the other hand, it is possible to restrict the training dataset to the features that are available in the given test data point. The main problem of this approach is that for any test data point we must solve the entire min-max problem that can be extremely inefficient unless the set of test data points is small. In practice, we keep the optimal and obtained from optimizing problem (9), and find a for each one of the testing points, by restricting the optimal and to the features that are available for the testing point. Thus, it suffices to compute , where and are obtained by eliminating the features that are not available for the current test point from the optimal and .

3.7 Imputation of Missing Values via Robust Linear Regression

We impute a given dataset containing missing values feature by feature. In other words, at each iteration, we consider one feature as the target variable , and the rest as input data features (). Then, we learn a robust linear regression model by solving the min-max problem (8). Let the obtained optimal solutions be and . For a given row, we restrict and to available features on that row. Then, we find the corresponding optimal in closed-form for the restricted and (similar to what we explained for handling test data points containing missing values). Thus, to impute each feature of the dataset, we only solve the min-max problem (8) once, and for each row we find its optimal by restricting and to available entries of that row. Note that, if the dataset only contains a few missing patterns for different rows (for instance we have different missing patterns in Figure 1) we can find the optimal only for the distinct patterns instead of solving the problem with respect to for each row separately. Since imputation of each feature is completely independent of the others, features can be distributed to multiple cores (or computers) without losing performance.

4 Robust Classification Framework

In this section we specialize the proposed framework in Section 2 to the classification task in the presence of missing values. For simplicity of presentation, assume that

, the target feature, is a binary variable. A basic idea for learning a robust classifier in this context is to apply the proposed robust linear regression algorithm presented in (

3) to the given dataset assuming that is continuous. Then, the assigned labels can be obtained by thresholding on the output of the robust regression model. As it is discussed earlier, since the linear regression loss depends on data through and , it suffices to estimate first and second order moments of features (means and covariances) using the available data.

For the classification task, the most commonplace measure of accuracy is 0-1 loss. Since 0-1 loss is highly non-convex, and intractable in general, different classification algorithms minimize an approximation of 0-1 loss. A relevant question in this context is whether we can guarantee a bound on 0-1 loss by minimizing the least-square loss proposed in (8). The following theorem addresses such a question.

Beside the idea of thresholding on the robust linear regression output, one can adapt the general framework described in Equation (4) to a classifier model such as logistic regression or support vector machines. Let the data points  come from a parametric distribution . Assume that is the estimation of based on the given incomplete data points. The general robust classification formulation problem in the presence of missing values can be described as follows:


where and are the estimated confidence intervals for the difference of and . In the following section, we present the robust logistic regression in the presence of missing values as a case study of the above framework.

4.1 Robust Linear Discriminant Analysis

The problem of training a logistic regression model on datasets containing missing values has been studied extensively in the literature. Beside the deletion of missing values, and imputation based approaches, (Fung and Wrobel, 1989) formulates the logistic regression as an equivalent linear discriminant analysis problem. Mathematically speaking, they assume that the data points assigned to a specific label follow a Gaussian distribution . They use the available data to estimate the parameters of each Gaussian distribution. Therefore, the parameters of the logistic regression model can be assigned based on the estimated parameters of the Gaussian distributions for different classes. Similar to the linear regression case, the estimations of means and covariances are unbiased only when the data satisfies MCAR condition. Moreover, when the number of data points in the dataset is very small the variance of the estimations can be very high. Thus, to train a logistic regression model which is robust to the percentage and different types of missing values, we specialize the general robust classification framework formulated in (17) to the logistic regression model. Assume that . We aim to find the optimal solution of the following problem:


4.2 No Missing Values in Target Variable

To solve Problem (18), first we focus on the case when the target variable has no missing values. In this case, each data point contributes to the estimation of either or . Similar to the robust linear regression case, we can apply Algorithm 2 to estimate the confidence intervals for using data points whose target variable equals ().

Obviously, the objective function is convex in , since the logistic regression loss is convex, and the expectation of loss can be seen as a weighted summation, which is convex. Thus, fixing the outer minimization problem can be solved with respect to using standard first order methods such as the gradient descent.

Although, the robust reformulation of logistic regression stated in (18) is convex in and concave in and , the inner maximization problem is intractable with respect to . We approximate Problem (18) in the following manner:


where and . To compute optimal and , we have:

Theorem 9

Let be the -th element of vector . The optimal solution of Problem (20) has the following form:


Note that we relaxed (18) by taking the maximization problem over a finite set of estimations. We estimate each by bootstraping on the available data using Algorithm 2. Define as:


Similarly we can define: