We encounter multi-source or multi-modality data frequently in many real data applications. For example, the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data involve multisite longitudinal observational data from elderly individuals with normal cognition (NC), mild cognitive impairment (MCI), or Alzheimer’s Disease (AD) [36, 35]. The ADNI data contain multi-source measurements: magnetic resonance imaging (MRI), florbetapir-fluorine- (AV-) positron emission tomography (PET) imaging, fludeoxyglucose F (FDG) PET imaging, biosamples, gene expression, and demographic information. Such multi-source data are also common for electronic medical record (EMR) systems adopted by most health care and medical facilities nowadays, which contain diverse-source patient information, e.g., demographics, medication status, laboratory tests, medical imaging and text notes.
However, blocks of variable information could be completely missing as there might be no need or it might be infeasible to collect certain sources of information given other known variables. E.g., patients might be either too healthy or too ill. For EMR systems, it could be due to lack of information exchange or common practice between different medical facilities 
. Block missing variables cause a large fraction of subjects with certain sources missing, which could lead to biased parameter estimation and inconsistent feature selection. Therefore, it is important to fully integrate data from all complementary sources to improve model prediction and variable selection.
The most common approach for handling missing data is to perform complete-case analysis which removes observations with missing values and only utilizes the complete cases. However, the complete-case method produces biased estimates when the missing is not completely at random. The inverse probability weighting method  is able to reduce this bias via reweighting the complete observations ; nevertheless, incomplete observations are still not fully utilized. In real applications, such as the ADNI data, removing incomplete cases could incur a great loss of information since complete cases only account for a small fraction of the data. Alternatively, likelihood-based methods [24, 29, 8] can incorporate all observations. However, this relies on specifying a known distribution which might not be available, and could be computationally intractable if the number of missing variables is large.
] propose a structured matrix completion (SMC) method through singular value decomposition to recover a missing block under a low rank approximation assumption. However, the SMC imputes only one missing block at a time.  is capable of imputing all missing values through matrix completion and then apply the adaptive Lasso [28, 59] to select variables. However, this approach does not guarantee estimation consistency. Alternatively, multiple imputation  (MI) is applicable for conducting variable selection, e.g.,  propose a multiple imputation-least absolute shrinkage and selection operator (MI-LASSO), and adopt the group Lasso  to detect nonzero covariates. Furthermore,  and  select variables on combined multiple imputed data. In addition, MI can be combined with bootstrapping techniques [26, 31, 33]. However, these imputation methods are not effective for block-wise missing data.
Recently, several methods have been developed to target block-wise missing data. E.g., [54
] propose an incomplete multi-source feature learning (iMSF) method, which models different missing patterns separately and minimizes a combined loss function. In addition,  introduce an incomplete source-feature selection (iSFS) model, utilizing shared parameters across all missing patterns and imposing different weights on different data sources. However, the iSFS is unable to provide coefficient estimation for all samples due to the different weighting strategy. Alternatively, the direct sparse regression procedure using covariance from multi-modality data (DISCOM)  estimates the covariance matrices among predictors and between the response and predictors. However, the DISCOM only considers missing completely at random, which could be restrictive for missing not completely at random data.
The single regression imputation (SI) method [2, 57, 6, 19, 39] is another popular approach which predicts missing values through regression using observed variables as predictors. Suppose that the subjects from multi-source data are divided into groups according to their missing patterns. For a group with a given missing block, the SI estimates association between missing variables and observed variables within the group based on complete observations. However, in practice, the complete observations might only account for a small fraction of the entire data.
To integrate information from the multi-source observed data we propose a Multiple Block-wise Imputation (MBI) approach, incorporating not only the SI based on complete observations but also imputations from incomplete observations. The additional imputations in MBI involve fewer observed variables within a given missing group, but are able to integrate more observations from multiple groups than the SI. Thus, the MBI can improve estimation and model selection especially when the missing rate is high. In addition, the proposed method aggregates more groups with different missing patterns to impute missing variables, which does not rely on the missing completely at random assumption, and is capable of handling missing at random data.
Furthermore, we propose a new multiple block-wise imputation model selection method. Specifically, we propose to construct estimating equations based on all possible missing patterns and imputations, and integrate them through the generalized methods of moments (GMM) . In theory, we show that the proposed method has estimation and model selection consistency under both fixed-dimensional and high-dimensional settings. Moreover, our estimator is asymptotically more efficient than the SI estimator. Numerical studies and the ADNI data application also confirm that the proposed method outperforms existing variable selection methods for block-wise missing data in missing completely at random, missing at random, and informative missing scenarios.
In general, our work has the following major advantages. First, we are able to optimally combine block-wise imputations of missing values from all missing pattern groups to improve estimation efficiency and model selection consistency. Second, the proposed method is capable of handling block-wise missing data which might not contain any complete observations, while most traditional methods, including the matrix completion , require partial subjects to have fully completed observations.
The remainder of the paper is organized as follows. Section 2 introduces the background and framework for the block-wise missing problem. In Section 3, we propose the MBI approach incorporating all missing patterns. In Section 4, the implementation and algorithm are illustrated. In Section 5, we establish the theoretical properties of the proposed method. Sections 6 and 7 provide numerical studies through simulations and the ADNI data application.
2 Background and Motivation
In this section, we introduce the framework for the block-wise missing problem. Let
be the response variable, andbe the
design matrix. Suppose that all the samples are drawn independently from a random vector, whose covariance matrix is positive definite. Then, for any and , represents the -the sample of the -th covariate. Suppose that all the covariates in are from sources where the -th source contains covariates for . Figure 1 illustrates a setting with three sources.
We divide samples into disjoint groups based on the missing patterns across all sources, where , the -th row of , is in the -th group if , and is an index set of samples. For any , let and be the index sets of the observed covariates and missing covariates corresponding to the -th group, respectively, and obviously, . Then, and represent observed variables and missing variables in the -th group, respectively. In addition, let be the index set of the groups where missing variables and variables in at least one of the other sources are observed. If there are no missing values in the -th group, let , a completely observed dataset. We assume that is nonempty containing elements for . For illustration, the design matrix on the left of Figure 1 consists of sources which are partitioned into missing pattern groups, where each white area represents a missing block. For example, refers to samples in Group and refers to missing covariates in Group . Since Groups , and contain observed values of and covariates in Source or , and .
We consider the following linear model
where is the true coefficient vector corresponding to all covariates and represents an error term independent of . Let denote the regression term in the model (1) for . We assume that the model is sparse; that is, most of the true coefficients are zero, where and correspond to relevant and irrelevant covariates, respectively, and the total number of relevant covariates is .
The likelihood-based approaches  typically formulate likelihood based on completely observed variables. However, it is likely that no covariate is completely observed under the block-wise missing structure. Alternatively,  construct a model for each missing pattern separately and use observed variables within each missing pattern as predictors. For instance, for Group in Figure 1, the above method treats the covariates in Sources and as predictors and ignores information from Source . However, Source covariates could be incorporated as well, since they are relevant to the response variable.
Traditional imputation methods [57, 6, 19] impute missing values in Group based on the associations between missing and observed variables obtained from complete observations in Group , while samples in Groups and are not utilized. However, Groups and , also containing values from Source , can provide additional information in imputing missing variables through correlations with other covariates. This is especially useful when completely observed subjects are scarce. In the following section, we propose a new imputation approach to fully utilize information not only from the group with complete cases but also from other groups.
3.1 Multiple Block-wise Imputation
In this subsection, we propose a multiple block-wise imputation approach which can utilize more observed information from incomplete case groups than traditional imputation methods. Specifically, for a given Group with missing values of , each of the groups contains observed values corresponding to missing , and also observed values corresponding to a subset of observed . Therefore, we can predict missing values in the -th group with ways to borrow information from all the groups in , instead of using a complete case group only.
More specifically, for each , let be an index set of covariates which are observed in Groups and . For each , we estimate utilizing all the groups containing observed values of both and , and then impute missing values for in the -th group using association information in the conditional expectation. Let represent the imputation for all missing values in Group . The proposed multiple imputations approach is referred to as Multiple Block-wise Imputation (MBI). We illustrate the MBI with an example in Figure 1. For Group , covariates observed in both Group and a group in are indexed by , , and , respectively, where is an index set of covariates from Source for .
The traditional imputation methods, such as the SI, only utilize observed values in Group and Group to impute the missing values in Group , namely , as shown on the top right of Figure 1. In contrast, the proposed method can incorporate more information from Groups and in addition to Groups and , and impute the missing values in Group using three different blocks of observed variables. Namely, we estimate , and based on Group , Groups and , and Groups and , respectively. We then impute the missing values via the above three estimated conditional expectations and the observed information in Group . Compared with the SI, the proposed MBI incorporates additional imputed values and via and , where the estimation involves more observed samples than the SI approach. In particular, when estimating conditional expectation for the imputations, we aggregate subjects from different missing pattern groups, which can diminish the influence of specific missing patterns of covariates.
3.2 Integration of MBI
In this subsection, we propose to integrate information from all available sources and multiple block-wise imputations. Specifically, we construct estimating functions for each group according to its missing pattern. For a given Group containing missing values and , since missing are estimated through which is a projection onto , the covariates are uncorrelated with residuals of the projection . Therefore, for each ,
where and denote the true coefficients of and , respectively. In addition, for any , since is a function of ,
Note that since for .
Thus, we construct estimating functions corresponding to observed covariates in the -th group using imputed values . In general, for each , let be the -th imputed sample based on Group , where if the -th covariate is observed in the sample , otherwise is an imputed value of in . The estimating functions for the imputed samples in Group are
where is a sub-vector of consisting of for , is the derivative of with respect to , and is the coefficient vector corresponding to .
To integrate information from all available missing patterns and imputations, we propose an aggregated vector of estimating functions:
is the number of samples from the -th group, and is a vector consisting of for . If the -th group only has complete observations, then , and
Note that the total number of equations exceeds the number of coefficient parameters, and estimating functions from groups with fewer missing variables or more accurate imputations tend to have smaller variance. To optimally combine all the estimating functions in, we estimate coefficients through the penalized generalized method of moments  which minimizes
is the sample covariance matrix of , and is a penalty function with tuning parameter . Here we can choose the SCAD penalty due to its oracle properties . The sample covariance matrix is a block diagonal matrix since estimating functions are formulated based on different missing patterns. However, could be singular or close to singular due to overlapping information in imputations, or due to a large number of estimating functions compared to a relatively small sample size. For example, as illustrated in Figure 1, the observed values of Source covariates in Group are utilized in the estimation of both and .
3.3 Solving the singularity issue of estimating equations
To solve the singularity issue of estimating equations, we reduce the dimension of for , through combining informative estimating equations, e.g., utilizing the first several largest principle components (PCs) [50, 10]. Specifically, we divide the estimating functions in into two parts and , where consists of the functions with the imputation based on complete observations, and contains the remaining estimating functions in . We proceed to extract informative principle components from and separately. Let Group be the complete case group, and and be the sample covariance matrices of and , respectively. If the dimension of is too large such that is singular or close to singular, we extract the first principle components from , where contains eigenvectors of corresponding to the largest
nonzero eigenvalues, andcan be selected to retain sufficient information. If is neither singular nor close to singular, we retain all the estimating functions in , and let
be an identity matrix, that is,.
We orthogonalize against the to store additional information beyond , where consists of orthogonalized estimating functions, , and is the sample covariance matrix between and . Similarly, if the sample covariance of is singular or close to singular, we select the first principle components from the orthogonalized , where contains eigenvectors of the sample covariance matrix of corresponding to the largest nonzero eigenvalues. Otherwise, we retain all the , and let be an identity matrix.
If there is no complete case group or , then either or is null, and is either or . Thus, contains all the essential information from the estimating functions of the -th group, while solving the singularity issue of the sample covariance matrix. The numbers of principle components and can be tuned through the Bayesian information type of criterion proposed by  to capture sufficient information from the estimating functions in (2). Consequently, the proposed estimator is obtained via minimizing
where . In the following section, we also provide an algorithm and implementation strategy of the proposed method.
In this section, we provide the detailed algorithm for the proposed method. The conditional expectations of missing covariates in MBI can be estimated via linear regression models, generalized linear models (GLM) or non-parametric models. In this paper, we utilize the GLM to accommodate not only continuous covariates but also discrete covariates. Specifically, for each group , , and , we adopt the GLM to predict if groups containing observed values of both and have a larger sample size than the number of observed variables , or adopt the -regularized GLM  otherwise. To obtain the -regularized GLM estimator, we apply the glmnet package ( https://cran.r-project.org/web packages/glmnet/index.html ) in R. The imputed values in MBI are then computed based on the estimated conditional expectation.
If the sample covariance matrix of from Group is singular or close to singular, we select the numbers of principle components and corresponding to , which is or , through minimizing the BIC-type of criterion 
where is the dimension of . Here, the is an approximation of based on the largest eigenvectors, where is the -th largest eigenvalue of , and is the eigenvector of corresponding to . Since , the minimizer of is indeed the number of eigenvalues which are larger than .
We plot an example of the objective function in Figure 2 to illustrate the objective function near true coefficients. In this example, there are three sources and four groups with and , where each source contains one relevant predictor with a signal strength of and the missing patterns are the same as in Groups – in Figure 1. The true coefficients of and are and , respectively. Figure 2 shows that has a unique minimizer around the true coefficients.
To obtain the minimizer, we propose to iteratively decrease via the nonlinear conjugate gradient algorithm  which converges quadratically  without requiring the second derivative of the objective function. At the -th iteration, the conjugate direction is
where is the gradient of at ,
and . Here, the initial values are obtained by performing the Lasso method  on complete observations, and the gradient is numerically calculated via central differences. We determine the step size in the conjugate direction through a line search:
We summarize the whole procedure for the implementation of the proposed method in Algorithm 1. Note that estimation of MBI is carried out in Step 2, and the nonlinear conjugate gradient method is performed in Step 3.
To select the tuning parameter in the penalty function , we propose a BIC-type criterion (MBI-BIC) as follows:
where is the proposed estimator for a given , is the number of non-zero estimated coefficients in , and is the residual sum of squares from all the missing pattern groups with the -th group
Compared with the traditional Bayesian information criterion (BIC) , the proposed MBI-BIC incorporates additional information from incomplete observations via the MBI. We select the optimal tuning parameter corresponding to the lowest MBI-BIC.
In this section, we provide the theoretical foundation of the proposed method under regularity conditions. In particular, we establish estimation consistency, selection consistency, and asymptotic normality of the proposed estimator. We also show that the proposed MBI leads to more efficient estimation than a single imputation method. Throughout this section, we assume that sources of covariates for each subject are missing at random.
5.1 Asymptotic properties for fixed and
In this subsection, we assume that and are both fixed as , where . Let be the estimating functions from samples, where is a column vector consisting of the -th sample of all available estimating functions with if for . We require the following regularity conditions:
For any , , , and , we have
where is a vector consisting of samples for all , and is an estimator of .
For any , with a sufficiently large ,
where is a submatrix of with columns representing estimating functions of the -th group.
For each , there exists such that .
Assume that exists and is finite for any .
Note that, equations (8) and (9) in Condition 1 are satisfied if the consistency of a coefficient estimator for holds under a linear model in predicting and , which can be obtained through the least squares or GLM  estimator. Moreover, Condition 1 requires the existence of the fourth moments of covariates and the error term. Condition 2 holds when the block-wise imputations based on different missing pattern groups are distinct with probability . Condition 3 is satisfied when the block-wise missing data contain complete cases, while for data with no complete cases, it requires that each covariate is observed from at least one group and utilized in the MBI to predict missing values. Additionally, Condition 4 assumes that ratios between sample sizes of different missing pattern groups are finite as goes to infinity.
To simplify expression of the following theorem, we define some notations. Let and be vectors of for or , respectively. Let be the sample matrix for transformed estimating functions with linearly independent columns at . Denote an index set for estimating functions of the -th group in by . In addition, we let , be the first derivative of with respect to , and , where and are expectations of and , , and and are diagonal matrices with if , and if , respectively.
(i) Estimation consistency: .
(ii) Sparsity recovery: as .
(iii) Asymptotic normality: If Condition 4 holds, then .
Theorem 1 states that the proposed estimator is almost root- consistent and selects the true model with probability approaching . In addition, the estimator of nonzero coefficients is asymptotically normal under Condition 4. The empirical covariance matrix of is similarly defined as but replacing , , and by , , and , respectively, where , and is a diagonal matrix with if .
If only a single regression imputation based on complete observations is utilized, the sample estimating functions are , where selects estimating functions corresponding to the single imputation. Then, the empirical covariance matrix of the estimator induced by is , where , , , and are similarly defined as , , , and , respectively, except that and are replaced by and , and is a diagonal matrix with if the -th estimating function in corresponds to the -th group.
In the following proposition, we show that utilizing the MBI improves the empirical efficiency of the parameter estimation with a smaller asymptotic variance compared with a single imputation approach. Let denote the largest eigenvalue of a matrix.
Under the conditions in Theorem 1, is positive semi-definite if , where and .
Proposition 1 indicates that the proposed estimator with MBI gains efficiency through incorporating additional information from incomplete case groups. Since the eigenvalues of are all smaller than based on the proof of Proposition 1, holds for sufficiently large when subjects are almost evenly assigned to different groups as goes to infinity. In the following, we will establish consistency of the proposed estimator for diverging and .
5.2 Consistency for diverging and
In this subsection, we consider cases when and increase as increases, that is, and . We assume that the number of sources does not diverge as goes to infinity. Let , and be sub-matrices of consisting of rows corresponding to covariates indexed by and , respectively, where denotes the first derivative of with respect to for . We also let be an estimator of the weighting matrix for all estimating functions, , and