Bootstrapping and Multiple Imputation Ensemble Approaches for Missing Data

02/01/2018 ∙ by Shehroz S. Khan, et al. ∙ 0

The presence of missing values in a dataset can adversely affect the performance of a classifier; performance deteriorates rapidly as missingness increases. Single and Multiple Imputation (MI) are normally performed to fill in the missing values. In this paper, we present several variants of combining MI and bootstrapping to create ensembles that can model uncertainty and diversity in the data and that are robust to high missingness in the data. We present three ensemble strategies: bootstrapping on incomplete data followed by single imputation and MI, and MI ensemble without bootstrapping. We use mean imputation, Gaussian random imputation and expectation maximization as the base imputation methods in these ensemble strategies. We perform an extensive evaluation of the proposed ensemble strategies on 8 datasets by varying the missingness ratio. Our results show that bootstrapping followed by average of MIs using expectation maximization is the most robust method that prevents the classifier's performance from degrading, even at high missingness ratio (30%). [...] perform equivalently but better than their single imputation counterparts. Kappa-error plots suggest that accurate classifiers with reasonable diversity are the reason for this behaviour. A consistent observation in all the datasets suggests that for small missingness (up to 10%), bootstrapping on incomplete data without any imputation produces equivalent results to other ensemble methods with imputations.


1 Introduction

Predictive models assume that the data they use are complete, i.e., that no values are missing. However, missingness in data is common and difficult to deal with [17, 10, 6]. Data with missing attribute values is called incomplete data. Many predictive algorithms, such as support vector machines, neural networks and logistic regression, cannot handle incomplete data. Some classification algorithms, such as decision trees (C4.5 [26]) and their variants, can handle missingness in the data. However, a large amount of missingness can deteriorate the performance of those classifiers.

There are several strategies to deal with incomplete data. A naive method is to remove any data object (or observation) with missing values. This strategy reduces the training data size; if the missingness ratio is high, then generalizable models are difficult to learn. A better strategy is to replace a missing attribute value with some value; this is called imputation. Imputation can be single or multiple. In single imputation, a missing value is replaced by one value, whereas in multiple imputation (MI), several values are imputed. MI performs better than single imputation in terms of modelling the uncertainty and variation due to the missing value [27]. Some common methods for imputation are fixed-value imputation, random imputation, nearest neighbour imputation, mean imputation [12, 28] and expectation maximization imputation [23] (see Section 3.1 for more details).

MI methods generate multiple values corresponding to a missing value. To use multiple imputed data for training a classification algorithm, one option is to average the multiple imputed values and replace the missing attribute values with a single value. The other option is to train different classifiers on different copies of imputed complete data and create an ensemble [18]. It has been shown that combining bootstrapping with MI can result in accurate classifiers [36]. The reason is that MI accounts for the uncertainty due to the missing data, whereas bootstrapping accounts for the uncertainty due to sampling fluctuations [36]. Combining both ideas results in more diverse classifiers, which helps the ensemble to perform better than a base classifier. In this paper, we discuss several ideas for creating ensembles to handle missing data. The ensemble techniques we test are: (i) bootstrapping with single imputation and average of MI, (ii) MI on bootstrap samples of incomplete data, and (iii) an ensemble of multiple imputed data. We use three popular data imputation techniques to validate the ensemble methods and discuss their relative performance. We systematically increase the amount of missingness in the datasets (from 0% to 30%) and evaluate the performance of each of these methods. We also show a comparison of kappa-error graphs to explain the diversity and accuracy of the different ensemble methods at different levels of missingness. The results on UCI datasets [22] show that the performance of MI after bootstrapping with the expectation maximization imputation technique remains very robust despite increasing the missingness to a large value (up to 30%); however, it can be computationally expensive. Kappa-error graphs show that bootstrapping with the expectation maximization imputation technique creates accurate and diverse classifiers. MI after bootstrapping with mean imputation emerged as a robust and faster alternative when the missingness is low. We obtained a consistent observation on all the datasets that for low missingness (up to 10%), bagging ensembles on incomplete data perform equivalently to other imputation methods.

The rest of the paper is organized as follows. In Section 2, we present the literature survey on data imputation using ensemble learning techniques. In Section 3, we present the different imputation methods used in the paper. Section 4 discusses the different bootstrapping and multiple imputation ensemble methods for handling incomplete data. Section 5 describes the experimental set up, datasets and results. We conclude the paper in Section 6.

2 Literature Review

MI for missing data has been studied extensively in the literature (e.g., [16, 10, 14]). In this literature review, we survey research papers that use ensemble learning with either multiple or single imputation to deal with missing data.

Feelders [11] compares surrogate splits in a decision tree with single imputation and MI based on the expectation maximization (EM) method. In the MI case, they compute the average over different imputations. Both imputation methods perform better than surrogate splits. They comment that averaging over MI and replacing with one value reduces the variance, in the same way as bagging, which improves the performance. Twala and Cartwright [34] propose an ensemble approach that creates sub-samples of incomplete data using bootstrap sampling. Each incomplete sample is fed to a decision tree classifier. The resulting ensemble is optimized in size by choosing only de-correlated decision trees, and their outputs are combined to take a decision. In this method, direct imputation does not happen; however, they later incorporated additional MI techniques. The paper does not clearly state at what stage MI was used in the ensemble. Although the proposed techniques are for classification, the results are shown on regression problems by discretizing the response attribute. Wu and Jian [36] present a procedure that performs MI on the incomplete dataset followed by non-parametric bootstrapping, which is much faster than performing bootstrapping followed by MI. Baneshi and Talei [2] propose to perform MI using the multiple imputation by chained equations (MICE) method on incomplete data followed by bootstrapping; the results are aggregated using statistical techniques. Tran et al. [32] perform MI using MICE followed by bootstrapping with a C4.5 decision tree as the base classifier. Their results show better performance in comparison to using MI to generate a single imputed dataset and to using three other single imputation methods to generate a complete dataset. Valdiviezo and Van Aelst [35] combine missing data procedures with tree-based prediction methods after single and MI methods (MICE, MIST). They comment that if missingness is small, then single imputation is sufficient. However, if the missingness is moderate to large, then MI followed by tree-bagging is useful. Schomaker and Heumann [29] comment that MI on bootstrapped samples and bootstrapped samples on multiple imputed datasets are the best options to calculate randomization valid confidence intervals when combining bootstrapping with MI. They further suggest that MI of bootstrap samples may be preferred for large imputation uncertainty (or low missingness) and bootstrapping of MI may be preferred for smaller imputation uncertainty (high missingness).

Other types of classifier fusion techniques have also been explored by researchers to handle incomplete data. Su et al. [31] propose a classifier ensemble method to handle missing data. They start with incomplete data and then remove a further fixed percentage of attribute values to create different versions of the original incomplete data. They impute these datasets separately, present them to separate classifiers and combine their classification results. Their results suggest that ensemble learning with (Bayesian) expectation maximization performs better than several single classifiers on many datasets. An issue with this approach is that it removes additional values from already incomplete data to create the different datasets, which can compromise the accuracy of the methods. Twala and Cartwright [33] present an ensemble method that imputes incomplete data using Bayesian MI and nearest neighbour imputation separately. These two imputations are fed to decision trees and their results are combined. Nanni et al. [25] propose an MI approach that uses the random subspace method. Their general idea is to cluster incomplete data into a fixed number of clusters and then replace the missing values of the data objects within a cluster with its center (or the mean of the cluster). This can reduce the information loss that mean imputation introduces when the missing values are replaced by the mean vector of the full data. Several runs of random subspace are then performed on the imputed data to create an ensemble. Their method shows high performance on several health datasets, and the performance does not drop as the missingness is increased. Setz et al. [30] present a classifier fusion of Linear and Quadratic classifiers with mean imputation and reduced feature modeling for an emotion recognition task. Hassan et al. [15] propose to perform MI several times to generate several samples of the original data, feed them to classifiers and create an ensemble of several neural networks. They propose a univariate and a multivariate version and show that both perform better than mean imputation and EM. Kumutha and Palaniammal [19] perform KNN imputation on gene expression data followed by bootstrapping. Khan et al. [18] propose a Bayesian MI ensemble method for one-class classification problems. They create two types of ensemble: one that averages the MI and trains a single classifier, and another that learns different classifiers on multiple imputed datasets. Their results show better performance of these methods in comparison to mean imputation as the missingness is increased.

The literature review shows that several ensemble methods exist to handle missing data while building generalizable classifiers. Bootstrapping the MI and MI of the bootstrap samples of the incomplete data are being used to learn better classifiers from incomplete data. Averaging MI and classifier fusion are other plausible techniques. Most of the research papers we reviewed did not compare different ensemble techniques or study the effect on performance as the missingness in the data increases. These papers also did not provide insights into the diversity and accuracy of the classifiers within an ensemble that might influence its performance. In this paper, we consider three types of ensemble methods to handle incomplete data: (i) bootstrapping with single or average of MI, (ii) bootstrapping with MI, and (iii) ensemble of MI. These three approaches span different ways of creating diverse ensembles on incomplete data. Within each category, different types of imputation methods are used, such as mean imputation, Gaussian random imputation and expectation maximization imputation (see Section 3.1 for details). Fusion of different types of classifiers is out of the scope of this paper. The different imputation methods used in this paper are described next.

3 Imputation Methods

Missingness can occur for several reasons and can be of different types, such as Missing Completely at Random (MCAR), Missing at Random (MAR) and Missing Not At Random (MNAR) [1]. Rubin [1] proposed a typology for different kinds of missingness distributions. MAR allows the probabilities of missingness to depend on observed data but not on missing data. An important special case of MAR, called MCAR, occurs when the distribution does not depend on any value of the observed and missing data. MNAR is a situation that is neither MAR nor MCAR and arises when the distribution of missingness depends on the missing values in the data. Mathematically, let the full data ($Y$) comprise observed data ($Y_{obs}$) and missing data ($Y_{mis}$), i.e.

$$Y = (Y_{obs}, Y_{mis}),$$

then the missing data will be MAR if

$$P(R \mid Y) = P(R \mid Y_{obs}), \tag{1}$$

where $R$ is a missingness indicator variable: $R = 1$ when $Y$ is observed and $R = 0$ when $Y$ is missing. Here $Y$ represents a group of items that is either entirely observed or entirely missing. $R$ can be an integer indicating the highest $j$ for which $Y_j$ is observed. $R$ can also be a matrix of binary indicators of the same dimension as the data. The missing data will be MCAR if

$$P(R \mid Y) = P(R). \tag{2}$$

Imputation is the process of replacing a missing value with another value. In this paper, we investigate the use of the following three base imputation methods:

  1. Mean Imputation (MEI) – In the MEI method, a missing attribute value is replaced by the mean of that attribute [18]. If there are multiple missing values in an attribute, they will all be replaced by the same value because MEI gives one imputed value.

  2. Gaussian Random Imputation (GRandI) – In this method, we find the mean ($\mu$) and standard deviation ($\sigma$) of an attribute with missing values. Then we generate a random standard normal variate ($z$) restricted to a bounded interval, and impute a missing value as

    $$\mu + z\,\sigma.$$

    Thus, the imputed value follows a Gaussian distribution. If there are multiple missing values in an attribute, they will not be imputed with the same value because a different randomly chosen $z$ is generated every time. Similarly, if a missing value is imputed multiple times, GRandI will give different imputed values.

  3. Expectation Maximization Imputation (EMI) – Dempster, Laird and Rubin [8] propose the use of an iterative solution, the Expectation Maximization (EM) algorithm, for imputing data with MAR missingness. The expectation, or E-step, of the EM algorithm computes the expected value of the sum of the variables with missing data, assuming that we have a value for the population mean and variance-covariance matrix. The maximization, or M-step, uses the expected value of the sum of a variable to estimate the population mean and covariance. When the fraction of missing values is large for one or more parameters, the convergence of this method is slower. Different initializations of EM produce different imputations for a missing value; hence, EMI can produce MI.
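The first two base methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the paper: missing entries are represented by `None`, the sample standard deviation is used, and `bound` is a hypothetical truncation limit for the random variate (the interval actually used is not recoverable from the text above).

```python
import random
import statistics

def mean_impute(column):
    """MEI: replace every missing entry (None) with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mu = statistics.mean(observed)
    return [mu if v is None else v for v in column]

def gaussian_random_impute(column, rng, bound=3.0):
    """GRandI sketch: replace each missing entry with mu + z * sigma.

    A fresh standard normal variate z is drawn for every missing entry.
    `bound` is a hypothetical limit on |z|, chosen here for illustration only.
    """
    observed = [v for v in column if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)  # sample standard deviation (an assumption)
    imputed = []
    for v in column:
        if v is None:
            z = rng.gauss(0.0, 1.0)
            while abs(z) > bound:  # redraw until z lies inside [-bound, bound]
                z = rng.gauss(0.0, 1.0)
            v = mu + z * sigma
        imputed.append(v)
    return imputed

column = [1.0, None, 3.0, None, 5.0]
print(mean_impute(column))
print(gaussian_random_impute(column, random.Random(0)))
```

Calling `gaussian_random_impute` repeatedly on the same column gives different imputed values, which is what allows GRandI to serve as an MI method, whereas `mean_impute` always returns the same result.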

The above three methods are used as the base imputation methods in this paper. They will be combined in different ways to create ensembles, which are described in detail in Sections 3.1 and 4. No Imputation (No-Imp) is the simplest method to handle missingness in the data, i.e., the incomplete data is not imputed for a given missingness ratio. In this case, a classifier is trained on the incomplete data. This serves as the baseline method to compare against other imputation approaches. We choose a C4.5 decision tree as the base classifier because it can handle missing attribute values.

3.1 Single Imputation

Single imputation refers to the approaches that impute one value for a given missing value in incomplete data. In these methods, either a single value or multiple values are generated. If multiple values are generated, then their average is used as a single value and imputed in place of the missing value. After average imputation, one classifier can be trained on the complete data set. We use the following imputation methods for single imputation:

  1. MEI

  2. Average of GRandI

  3. Average of EMI

MEI imputes one value for a given missing value; therefore, there is no need to take an average. GRandI and EMI, in contrast, generate multiple values for imputation. In these cases, the average of their MI is taken and a missing value is replaced by a single value. Both approaches, single imputation and average of MI, are shown in Figure 1.

Figure 1: Two variations of Single Imputation: (a) Single Imputation, (b) Average of MI
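The "average of MI" variant can be illustrated with a short Python sketch. It assumes a Gaussian random draw as the underlying MI method and `None` as the marker for a missing entry; the function names and the number of imputations `m` are hypothetical and chosen only for this example.

```python
import random
import statistics

def gaussian_draw_impute(column, rng):
    """One imputation run: fill each missing entry with a draw from N(mu, sigma)."""
    observed = [v for v in column if v is not None]
    mu, sigma = statistics.mean(observed), statistics.stdev(observed)
    return [rng.gauss(mu, sigma) if v is None else v for v in column]

def average_of_mi(column, m, rng):
    """Impute the column m times and average the imputed values per position.

    Observed entries are passed through unchanged; only missing positions
    receive the mean of their m imputed values.
    """
    runs = [gaussian_draw_impute(column, rng) for _ in range(m)]
    return [
        statistics.fmean(values) if original is None else original
        for original, values in zip(column, zip(*runs))
    ]

column = [1.0, None, 3.0, None, 5.0]
print(average_of_mi(column, m=5, rng=random.Random(42)))
```

The result is a single complete column, so one classifier can be trained on it, exactly as with MEI.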

4 Bagging and MI Ensemble

In a complete data set, ensemble approaches can improve the classification performance [20]. Bootstrapping or bagging is a popular ensemble learning approach where data is re-sampled with replacement several times [3, 7]. The reason for the good performance of bagging is that it introduces diversity in the data set and can lead to diverse and accurate classifiers. In this paper, we consider bagging on incomplete data for ensemble learning. A C4.5 decision tree is used as the base classifier. Since C4.5 can handle missing values, it can be used to train a No-Imp equivalent of each ensemble method for comparison purposes. Let us now define the following parameters that we will use to describe the different ensemble approaches for missing data:

  • missingness ratio – the proportion of attribute values that are missing,

  • M – number of MI, and

  • E – size of the ensemble.

We now discuss the three types of ensemble learning approaches to handle missing values.

4.1 Bagging Single Imputation

In this method, an incomplete data set is re-sampled E times, where E is the size of the ensemble. This results in E sub-samples of the incomplete data set. Depending on the missingness ratio, some sub-samples may be complete or incomplete. Then, we perform average MI (or equivalently, single imputation) on all the incomplete sub-samples and train a decision tree classifier on each of them. This leads to E classifiers, and majority voting can be used to take a final decision. In summary, this method first performs bootstrapping on the incomplete data, followed by single imputation on each of the sub-samples. Therefore, it retains the basic diversity aspect of bagging and replaces missing values with imputed values. Depending upon the particular imputation method, this can also lead to accurate classifiers. By combining the ideas of diversity and accuracy, we expect this method to perform better than the average imputation method (or equivalently, single imputation). The No-Imp equivalent method for this approach does not impute missing values in the sub-samples. Figure 2 shows the different components of Bagging Single Imputation.

Figure 2: Bagging Single Imputation
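The steps of Bagging Single Imputation (bootstrap, impute, train, vote) can be sketched as follows. This is a minimal illustration, not the paper's Weka implementation: a nearest-centroid classifier is swapped in as a simple stand-in for C4.5, mean imputation is used as the single imputation step, rows are `(features, label)` pairs with `None` for missing values, and all function names are hypothetical.

```python
import random
import statistics
from collections import Counter

def bootstrap(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def mean_impute_rows(rows):
    """Single imputation: fill each missing feature with that feature's mean in this sample."""
    n_features = len(rows[0][0])
    means = []
    for j in range(n_features):
        observed = [x[j] for x, _ in rows if x[j] is not None]
        # Falling back to 0.0 when a feature is entirely missing is an assumption.
        means.append(statistics.fmean(observed) if observed else 0.0)
    return [([means[j] if v is None else v for j, v in enumerate(x)], y) for x, y in rows]

def train_centroids(rows):
    """Stand-in base classifier (not C4.5): one mean vector per class."""
    by_class = {}
    for x, y in rows:
        by_class.setdefault(y, []).append(x)
    return {y: [statistics.fmean(col) for col in zip(*xs)] for y, xs in by_class.items()}

def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean distance)."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def bagging_single_imputation(rows, ensemble_size, rng):
    """Bootstrap the incomplete data, impute each sub-sample, train one model per sub-sample."""
    return [train_centroids(mean_impute_rows(bootstrap(rows, rng))) for _ in range(ensemble_size)]

def majority_vote(models, x):
    """Combine the ensemble members' predictions by majority voting."""
    votes = Counter(predict(model, x) for model in models)
    return votes.most_common(1)[0][0]

rows = [([0.0, 0.1], "a"), ([0.2, None], "a"), ([None, 0.0], "a"), ([0.1, 0.3], "a"),
        ([5.0, 5.1], "b"), ([None, 4.9], "b"), ([5.2, None], "b"), ([4.8, 5.0], "b")]
models = bagging_single_imputation(rows, ensemble_size=11, rng=random.Random(0))
print(majority_vote(models, [0.1, 0.2]), majority_vote(models, [5.0, 5.0]))
```

Imputation happens after re-sampling, so each ensemble member sees slightly different imputed values as well as a different bootstrap sample.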

4.2 Bagging MI

In this method, an incomplete data set is re-sampled E/M times, where E is the size of the ensemble and M is the number of MI. Then, on each of these sub-samples, MI is performed M times. This results in E imputed (complete) data sets. Thus, E separate classifiers can be trained and their results combined with majority voting. As MEI does single imputation, only E/M sub-samples will be different in this case, and MI on these bootstrap samples will generate the same imputed values. Therefore, MEI is excluded from this approach. There is no No-Imp equivalent of this method because it would have only E/M unique incomplete sub-samples and the rest of them would be duplicates. This method generates E classifiers, which is the same number as the Bagging Single Imputation method. Therefore, both methods can be fairly compared when the base classifier is the same (C4.5 in our case). It is to be noted that this method generates less diverse sub-samples than the Bagging Single Imputation method; however, MI on these sub-samples can lead to more accurate classifiers. To avoid numerical calculation problems, E should be a multiple of M in this method. Figure 3 shows the steps involved in performing Bagging MI.

The other possibility is to perform MI on the incomplete data followed by bootstrapping each of the imputed datasets. In our case, we are using a small number of MI, which means there will be higher uncertainty in the estimates. As commented by Schomaker and Heumann [29], bootstrapping of MI may be preferred for smaller imputation uncertainty (or moderate to large values of M). Therefore, we do not use this type of ensemble technique in this paper.

Figure 3: Bagging Multiple Imputation

4.3 MI Ensemble

In this method, MI is performed on the original incomplete data E times, where E is the size of the ensemble. This is done to generate E different copies of the incomplete dataset; hence E classifiers can be trained and a fair comparison can be made with the two approaches above. The results of these classifiers are combined using majority voting. This method will have the least diversity in comparison to Bagging Single Imputation and Bagging MI because the same data is always used for imputing missing values. However, the individual classifiers may be more accurate if the underlying imputation method gives good estimates for missing values. MEI does not impute multiple times, and the No-Imp method would result in duplicate copies of the original incomplete data; therefore, neither method can be applied with MI Ensemble. A graphical representation of MI Ensemble is shown in Figure 4.

Figure 4: Multiple Imputation Ensemble

4.4 Analysis

In general, all three types of ensemble methods will be more computationally expensive than running a single decision tree on an incomplete dataset. However, their performance is expected to be much higher due to the diversity and accuracy modeled by bootstrapping and MI ensemble. We keep the number of classifiers the same in all of these ensemble techniques, so that no technique benefits from a larger ensemble and the comparisons are fair. That is, bootstrapping and MI, whether combined or not, should always yield one imputed dataset per ensemble member for a given incomplete dataset.

Let E be the size of the ensemble and M the number of MI. Bagging Single Imputation generates E bootstrap samples, performs M imputations on each of these sub-samples and averages them to a single value. Therefore, it generates E × M datasets. Bagging MI generates E/M bootstrap samples followed by M MI on each of them; therefore, the number of datasets generated is E. The MI Ensemble approach, in turn, generates E different copies of the original incomplete dataset by performing MI on it. Performing MI takes the most time in creating an ensemble. We make sure that the number of decision tree classifiers is the same across these ensemble methods; however, their computational complexity will differ because they generate different numbers of datasets. Bagging Single Imputation generates the largest number of datasets; hence, it is the most computationally expensive, followed by Bagging MI and MI Ensemble. A comparison between these methods in terms of the number of bootstrap samples generated, the number of MI performed and the number of datasets created is shown in Table 1.

Method Name                 #Bootstrap Samples   #MIs   #Datasets Created
Bagging Single Imputation   E                    M      E × M
Bagging MI                  E / M                M      E
MI Ensemble                 0                    E      E
Table 1: Comparison of different ensemble imputation approaches (E: size of the ensemble, M: number of MI).
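The bookkeeping behind this comparison can be checked with a small Python helper. It is a sketch that follows the counts described in the analysis above; the function name, the dictionary keys and the example values (an ensemble of 20 classifiers with 5 imputations) are hypothetical and are not the settings used in the experiments.

```python
def dataset_counts(ensemble_size, n_imputations):
    """Count bootstrap samples, MI runs and imputed datasets for each strategy.

    ensemble_size is the number of classifiers in the ensemble and
    n_imputations is the number of MI; Bagging MI needs the former to be
    a multiple of the latter.
    """
    if ensemble_size % n_imputations != 0:
        raise ValueError("ensemble_size must be a multiple of n_imputations")
    return {
        "Bagging Single Imputation": {
            "bootstrap_samples": ensemble_size,
            "mi_per_source": n_imputations,
            "datasets_created": ensemble_size * n_imputations,  # averaged down to one per sample
            "classifiers": ensemble_size,
        },
        "Bagging MI": {
            "bootstrap_samples": ensemble_size // n_imputations,
            "mi_per_source": n_imputations,
            "datasets_created": ensemble_size,
            "classifiers": ensemble_size,
        },
        "MI Ensemble": {
            "bootstrap_samples": 0,  # MI is applied directly to the original data
            "mi_per_source": ensemble_size,
            "datasets_created": ensemble_size,
            "classifiers": ensemble_size,
        },
    }

for name, counts in dataset_counts(ensemble_size=20, n_imputations=5).items():
    print(name, counts)
```

All three strategies end with the same number of classifiers, while Bagging Single Imputation creates the most intermediate datasets, which is why it is the most expensive of the three.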

5 Experimentation

We use the Weka API (developer version 3.9.2) [13] to implement the different imputation algorithms and the decision tree classifier. The C4.5 algorithm (J48 package in Weka) is used as the base decision tree classifier because it can handle missing values. Weka uses the EMImputation package for EMI. The initial parameters in the original implementation of the package are fixed, i.e., all means are zero, all variances are one, and all covariances are zero because the data is standardized. Therefore, this method always gives one fixed imputed value for a given missing value irrespective of multiple runs. This setting prevents variation in the MI for the EMI method. Therefore, we changed the code of EMImputation such that the elements of the initial covariance matrix vary randomly within a bounded range (as the data is already standardized). This allows EMI to produce different values for every run of MI. The full source code along with the data sets used in this paper is available at https://github.com/titubeta/EnsembleImputation.

In this paper we discuss three base imputation methods (MEI, EMI and GRandI; see Section 3). The average imputations (or single imputation) of each of these base methods along with No-Imp give four methods (Methods 1–4 in Table 2). Similarly, the Bagging Single Imputation approach gives three methods, one for each of the three base imputation methods, and one corresponding to the No-Imp method (Methods 5–8 in Table 2). For both Bagging MI and MI Ensemble, there are no No-Imp or MEI methods; therefore, they give two methods each, corresponding to EMI and GRandI (Methods 9–12 in Table 2). Therefore, in total, we compare 12 different imputation methods, out of which

  • Eight are different ensemble imputation methods and four methods do not use ensemble imputation.

  • Ten are imputation methods and two methods do not involve any imputation (Methods 1 and 5 in Table 2).

#Method  Acronym      Description
1        No-Imp       No Imputation on incomplete data
2        MEI          Mean Imputation
3        GRandI       Average of Gaussian Random Imputation
4        EM           Average of Expectation Maximization imputation
5        BagNoImp     Ensemble by Bagging without imputation
6        BagMEI       Ensemble by Bagging with mean imputation
7        BagGRandI    Ensemble by Bagging with Gaussian Random Imputation
8        BagEM        Ensembles by Bagging with Expectation Maximization Imputation for each dataset
9        BagMIGRandI  Ensembles by MI over Bagging by Gaussian Random Imputation
10       BagMIEM      Ensembles by MI over Bagging by Expectation Maximization Imputation
11       MIGrandI     Ensembles by MI by Gaussian Random Imputation
12       MIEM         Ensembles by MI by Expectation Maximization Imputation
Table 2: Acronyms for different imputation methods and their descriptions.

5.1 Introducing Missing

To introduce a given ratio of missingness in the data, we adopt the following strategy. For every attribute of the data, we randomly remove the corresponding fraction of its attribute values. The attribute values are removed such that the same attribute value is not removed more than once. As the missingness ratio increases, the probability of removing every value of an entire data object also increases. This situation is particularly problematic for the EMI method because it will not impute such data objects, so the same amount of training data may not be used for training the models. For MEI and GRandI, this situation is not a problem because they replace all the missing values of such a data object either with the mean value of each attribute or with a Gaussian distributed random value. To avoid this problem, we keep track of the last attribute while removing attribute values to check whether such a case is happening. If it is, a flag is set and we do not remove that attribute value; instead, we move the index to the top of that feature and remove the first available non-missing attribute value. This prevents removing all the attribute values of a data object. Therefore, for a given missingness ratio, the total number of attribute values removed is the number of values removed per attribute multiplied by the number of features.
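A simplified variant of this procedure is sketched below in Python. It is not the authors' code: instead of the flag-and-restart rule described above, it visits rows in random order and skips any row whose last remaining value would be removed. The use of `round` to turn the ratio into a per-attribute count, the `None` marker and the function name are assumptions made for illustration.

```python
import random

def introduce_missingness(data, ratio, rng):
    """Return a copy of `data` with round(ratio * n_rows) values removed per attribute.

    data is a list of equally long rows. No value is removed twice, and a row
    is never left with all of its values missing.
    """
    n_rows, n_features = len(data), len(data[0])
    out = [list(row) for row in data]
    per_attribute = round(ratio * n_rows)
    for j in range(n_features):
        candidates = list(range(n_rows))
        rng.shuffle(candidates)  # each row index appears once, so no value is removed twice
        removed = 0
        for i in candidates:
            if removed == per_attribute:
                break
            remaining = sum(v is not None for v in out[i])
            if remaining <= 1:
                continue  # removing this value would empty the whole data object
            out[i][j] = None
            removed += 1
    return out

data = [[float(i), float(i) + 0.5, float(i) * 2.0] for i in range(20)]
incomplete = introduce_missingness(data, ratio=0.3, rng=random.Random(0))
print(sum(v is None for row in incomplete for v in row))
```

With 20 data objects, 3 features and a ratio of 0.3, six values are removed from each attribute, so 18 values are missing in total, which matches the "per-attribute count times number of features" rule stated above.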

5.2 Parameters

The following values are set for the different parameters used in the experiments:

  • Missingness ratio – varied from 0% to 30% in steps of 5%. It is to be noted that 0% missingness means complete data with no missingness.

  • - number of imputations is set to 18.

  • - size of ensemble is set to 21.

  • - for GRandI, it is set to .

  • - Number of cross validation folds is set to .

  • - times to repeat the experiment to balance out random variations. It is set to

Cross validation with the above number of folds is performed for every imputation method (or the corresponding No-Imp method), and it is repeated the above number of times by randomizing the data. The average performance across all repetitions and folds is reported as the performance metric. The performance metrics used are accuracy and Kappa-error plots. Kappa-error plots are used to study the accuracy and diversity of the members of an ensemble. The Kappa-error diagrams shown in the figures correspond to a single fold of a single repetition (times=1).
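For readers unfamiliar with Kappa-error plots, the sketch below shows how one point of such a plot is usually computed for a pair of ensemble members: the x-coordinate is Cohen's kappa between their predictions (lower agreement means more diversity) and the y-coordinate is their average error. The function names are hypothetical, and the paper's own plotting code may differ in details.

```python
from collections import Counter

def cohen_kappa(pred_a, pred_b):
    """Cohen's kappa: agreement between two classifiers corrected for chance agreement."""
    n = len(pred_a)
    observed = sum(a == b for a, b in zip(pred_a, pred_b)) / n
    count_a, count_b = Counter(pred_a), Counter(pred_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both classifiers always predict the same single class
    return (observed - expected) / (1.0 - expected)

def error_rate(pred, truth):
    """Fraction of misclassified instances."""
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)

def kappa_error_point(pred_a, pred_b, truth):
    """One point of a kappa-error plot: (kappa, mean error of the pair)."""
    mean_error = (error_rate(pred_a, truth) + error_rate(pred_b, truth)) / 2
    return cohen_kappa(pred_a, pred_b), mean_error

truth = [1, 1, 0, 0]
print(kappa_error_point([1, 1, 0, 0], [1, 0, 0, 1], truth))
```

An ensemble whose points lie towards the lower left of the plot has members that are both diverse (low kappa) and accurate (low error).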

5.3 Datasets

We use four datasets from the health domain and four from the general domain from the UCI data repository [22] to evaluate the different ensemble imputation methods. The description of these datasets is presented in Table 3.

Domain       Dataset                 # Data Objects   # Features
Health       Breast Tissues
             New Thyroid
             Parkinsons
             Pima Indians diabetes
Non-Health   Column
             Glass
             Seeds
             Wine
Table 3: Datasets description

5.4 Results

Tables 4–11 show the results for each of the datasets. In each table, the first column represents the imputation method. The subsequent columns show the accuracy of each imputation method as the missingness ratio is increased from complete data (0%) to 30%. To compare the various classification methods, we perform Friedman's rank sum test. Calvo and Rodrigo recommend the Bergmann and Hommel test as a post-hoc all pair-wise comparison method with high statistical power [4]. We used their R package in this paper [5]. This test computes the p-value for each pair of algorithms, corrected for multiple testing using Bergmann and Hommel's correction. Results are presented in Table 14 and Table 15. The significant differences (p-value < 0.05) are shown in bold. We summarize the results from these tables as follows:

  1. The presence of a large number of missing values deteriorates the performance of a decision tree classification algorithm with single imputation. For example, for the Breast Tissues data, the accuracy of the EM method falls from 0.629 on complete data to 0.524 at 30% missingness (Table 4). Similar performance degradation was observed for all the methods on all the datasets.

  2. All the ensemble imputation methods performed better than their corresponding single imputation methods for [...] or more missingness ratio. Ensemble methods proved more robust than the corresponding single imputation methods as the missingness ratio increased. For example, for the Breast Tissues dataset with a 30% missingness ratio, the accuracy of the single EM method decreases from 0.629 to 0.524, whereas the accuracy of Bagging Single Imputation with the EM method decreases only from 0.648 to 0.614.

  3. For smaller missingness ratios, MI over bootstrap and MI ensemble with MEI and GRandI showed no significant superiority over each other. They also perform worse than or are equivalent to methods that use imputation on bootstrap samples of incomplete data. It is to be noted that MEI and GRandI are less computationally expensive than EM.

  4. For smaller missingness ratios (up to 10%), bootstrapping of incomplete data without imputation generally shows similar performance to other ensemble imputation techniques. However, ensemble imputation methods can have a slight advantage for some datasets.

  5. Overall, the methods BagEM and BagMIEM, which combine bagging with EM, emerge as the best choice due to their robust performance at high missingness ratios (up to 30%). For example, with 30% missingness on the Breast Tissues dataset, the accuracy with no imputation degrades from 0.629 to 0.468, whereas the accuracy of BagEM degrades from 0.648 to 0.614 and the accuracy of BagMIEM degrades from 0.625 to 0.608. Similar behaviour is observed for the other datasets. The statistical test for classifiers at 30% missingness (Table 15) suggests that BagEM has an advantage over BagMIEM, as BagEM shows statistically better results against most of the classification methods. However, a high missingness ratio means that the EM method will take more time to impute missing values (as shown in Table 1).

Imputation Method   Missingness Ratio (%)
                    0      5      10     15     20     25     30
No-Imp              0.629  0.581  0.565  0.554  0.516  0.51   0.468
MEI                 0.629  0.579  0.548  0.525  0.504  0.495  0.474
GRandI              0.629  0.576  0.544  0.513  0.468  0.436  0.418
EM                  0.629  0.625  0.61   0.6    0.586  0.553  0.524
BagNoImp            0.648  0.614  0.593  0.591  0.562  0.541  0.525
BagMEI              0.648  0.613  0.583  0.576  0.553  0.536  0.526
BagGRandI           0.648  0.633  0.614  0.622  0.595  0.561  0.535
BagEM               0.648  0.642  0.638  0.647  0.64   0.625  0.614
BagMIGRandI         0.625  0.615  0.612  0.607  0.577  0.567  0.539
BagMIEM             0.625  0.624  0.626  0.631  0.629  0.626  0.608
MIGrandI            0.629  0.607  0.609  0.606  0.584  0.56   0.55
MIEM                0.629  0.626  0.622  0.614  0.617  0.62   0.594
Table 4: Breast Tissues data

| Imputation Methods / Missingness Ratio (%) | 0 | 5 | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|---|
| No-Imp | 0.915 | 0.906 | 0.902 | 0.892 | 0.881 | 0.867 | 0.857 |
| MEI | 0.915 | 0.907 | 0.899 | 0.885 | 0.888 | 0.877 | 0.862 |
| GRandI | 0.915 | 0.887 | 0.855 | 0.83 | 0.798 | 0.772 | 0.74 |
| EM | 0.915 | 0.906 | 0.904 | 0.898 | 0.886 | 0.866 | 0.855 |
| BagNoImp | 0.929 | 0.921 | 0.919 | 0.902 | 0.896 | 0.884 | 0.867 |
| BagMEI | 0.929 | 0.918 | 0.913 | 0.902 | 0.901 | 0.893 | 0.875 |
| BagGRandI | 0.929 | 0.93 | 0.923 | 0.902 | 0.89 | 0.881 | 0.855 |
| BagEM | 0.929 | 0.922 | 0.927 | 0.922 | 0.919 | 0.915 | 0.907 |
| BagMIGRandI | 0.924 | 0.928 | 0.917 | 0.902 | 0.892 | 0.874 | 0.859 |
| BagMIEM | 0.924 | 0.921 | 0.916 | 0.912 | 0.912 | 0.907 | 0.9 |
| MIGrandI | 0.915 | 0.919 | 0.916 | 0.906 | 0.898 | 0.888 | 0.866 |
| MIEM | 0.915 | 0.907 | 0.908 | 0.899 | 0.893 | 0.883 | 0.878 |

Table 5: New Thyroid data

| Imputation Methods / Missingness Ratio (%) | 0 | 5 | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|---|
| No-Imp | 0.835 | 0.834 | 0.822 | 0.827 | 0.814 | 0.809 | 0.809 |
| MEI | 0.835 | 0.824 | 0.81 | 0.806 | 0.796 | 0.79 | 0.794 |
| GRandI | 0.831 | 0.813 | 0.787 | 0.781 | 0.754 | 0.732 | 0.724 |
| EM | 0.831 | 0.833 | 0.824 | 0.829 | 0.809 | 0.804 | 0.785 |
| BagNoImp | 0.862 | 0.854 | 0.846 | 0.846 | 0.833 | 0.824 | 0.823 |
| BagMEI | 0.862 | 0.856 | 0.845 | 0.837 | 0.841 | 0.825 | 0.828 |
| BagGRandI | 0.862 | 0.863 | 0.852 | 0.843 | 0.838 | 0.828 | 0.818 |
| BagEM | 0.862 | 0.865 | 0.859 | 0.858 | 0.857 | 0.85 | 0.845 |
| BagMIGRandI | 0.849 | 0.856 | 0.845 | 0.838 | 0.832 | 0.822 | 0.816 |
| BagMIEM | 0.849 | 0.852 | 0.841 | 0.845 | 0.852 | 0.843 | 0.841 |
| MIGrandI | 0.835 | 0.861 | 0.85 | 0.848 | 0.836 | 0.832 | 0.827 |
| MIEM | 0.835 | 0.836 | 0.827 | 0.829 | 0.827 | 0.836 | 0.838 |

Table 6: Parkinsons data

| Imputation Methods / Missingness Ratio (%) | 0 | 5 | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|---|
| No-Imp | 0.731 | 0.734 | 0.723 | 0.723 | 0.719 | 0.713 | 0.706 |
| MEI | 0.731 | 0.727 | 0.713 | 0.711 | 0.7 | 0.695 | 0.691 |
| GRandI | 0.731 | 0.713 | 0.701 | 0.695 | 0.681 | 0.672 | 0.666 |
| EM | 0.731 | 0.721 | 0.719 | 0.713 | 0.71 | 0.698 | 0.69 |
| BagNoImp | 0.754 | 0.75 | 0.746 | 0.74 | 0.737 | 0.724 | 0.721 |
| BagMEI | 0.754 | 0.749 | 0.741 | 0.735 | 0.731 | 0.721 | 0.714 |
| BagGRandI | 0.754 | 0.747 | 0.743 | 0.734 | 0.728 | 0.722 | 0.713 |
| BagEM | 0.754 | 0.743 | 0.744 | 0.742 | 0.736 | 0.725 | 0.721 |
| BagMIGRandI | 0.733 | 0.739 | 0.735 | 0.731 | 0.727 | 0.713 | 0.709 |
| BagMIEM | 0.733 | 0.731 | 0.725 | 0.727 | 0.722 | 0.715 | 0.717 |
| MIGrandI | 0.731 | 0.736 | 0.738 | 0.729 | 0.721 | 0.718 | 0.709 |
| MIEM | 0.731 | 0.72 | 0.716 | 0.717 | 0.71 | 0.703 | 0.705 |

Table 7: Pima Indians Diabetes data

| Imputation Methods / Missingness Ratio (%) | 0 | 5 | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|---|
| No-Imp | 0.804 | 0.785 | 0.767 | 0.751 | 0.735 | 0.715 | 0.699 |
| MEI | 0.804 | 0.771 | 0.762 | 0.732 | 0.725 | 0.71 | 0.698 |
| GRandI | 0.804 | 0.767 | 0.734 | 0.704 | 0.673 | 0.664 | 0.62 |
| EM | 0.804 | 0.791 | 0.781 | 0.756 | 0.74 | 0.72 | 0.703 |
| BagNoImp | 0.83 | 0.81 | 0.79 | 0.769 | 0.761 | 0.737 | 0.721 |
| BagMEI | 0.83 | 0.803 | 0.789 | 0.773 | 0.766 | 0.754 | 0.731 |
| BagGRandI | 0.83 | 0.807 | 0.788 | 0.772 | 0.764 | 0.75 | 0.725 |
| BagEM | 0.83 | 0.817 | 0.81 | 0.792 | 0.788 | 0.775 | 0.758 |
| BagMIGRandI | 0.819 | 0.8 | 0.787 | 0.77 | 0.764 | 0.744 | 0.716 |
| BagMIEM | 0.819 | 0.804 | 0.798 | 0.782 | 0.779 | 0.759 | 0.748 |
| MIGrandI | 0.804 | 0.788 | 0.783 | 0.761 | 0.757 | 0.738 | 0.717 |
| MIEM | 0.804 | 0.79 | 0.782 | 0.76 | 0.754 | 0.729 | 0.717 |

Table 8: Column data

| Imputation Methods / Missingness Ratio (%) | 0 | 5 | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|---|
| No-Imp | 0.648 | 0.623 | 0.607 | 0.59 | 0.575 | 0.546 | 0.532 |
| MEI | 0.648 | 0.626 | 0.594 | 0.578 | 0.557 | 0.531 | 0.519 |
| GRandI | 0.648 | 0.592 | 0.548 | 0.499 | 0.48 | 0.447 | 0.419 |
| EM | 0.648 | 0.637 | 0.62 | 0.61 | 0.594 | 0.557 | 0.511 |
| BagNoImp | 0.699 | 0.683 | 0.66 | 0.64 | 0.617 | 0.592 | 0.587 |
| BagMEI | 0.699 | 0.682 | 0.652 | 0.636 | 0.618 | 0.601 | 0.587 |
| BagGRandI | 0.699 | 0.682 | 0.659 | 0.639 | 0.611 | 0.583 | 0.567 |
| BagEM | 0.699 | 0.694 | 0.682 | 0.68 | 0.668 | 0.654 | 0.633 |
| BagMIGRandI | 0.667 | 0.67 | 0.646 | 0.629 | 0.601 | 0.579 | 0.557 |
| BagMIEM | 0.667 | 0.667 | 0.66 | 0.646 | 0.636 | 0.638 | 0.622 |
| MIGrandI | 0.648 | 0.661 | 0.649 | 0.633 | 0.615 | 0.59 | 0.577 |
| MIEM | 0.648 | 0.637 | 0.618 | 0.612 | 0.595 | 0.593 | 0.592 |

Table 9: Glass data

| Imputation Methods / Missingness Ratio (%) | 0 | 5 | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|---|
| No-Imp | 0.891 | 0.864 | 0.854 | 0.834 | 0.83 | 0.818 | 0.806 |
| MEI | 0.891 | 0.858 | 0.848 | 0.842 | 0.828 | 0.817 | 0.808 |
| GRandI | 0.891 | 0.845 | 0.806 | 0.779 | 0.752 | 0.722 | 0.686 |
| EM | 0.891 | 0.893 | 0.889 | 0.873 | 0.874 | 0.853 | 0.836 |
| BagNoImp | 0.903 | 0.885 | 0.883 | 0.872 | 0.867 | 0.853 | 0.851 |
| BagMEI | 0.903 | 0.872 | 0.864 | 0.865 | 0.854 | 0.846 | 0.842 |
| BagGRandI | 0.903 | 0.888 | 0.883 | 0.871 | 0.868 | 0.859 | 0.854 |
| BagEM | 0.903 | 0.901 | 0.898 | 0.896 | 0.89 | 0.889 | 0.888 |
| BagMIGRandI | 0.892 | 0.877 | 0.874 | 0.865 | 0.861 | 0.859 | 0.843 |
| BagMIEM | 0.892 | 0.889 | 0.89 | 0.884 | 0.886 | 0.888 | 0.883 |
| MIGrandI | 0.891 | 0.876 | 0.882 | 0.871 | 0.866 | 0.864 | 0.852 |
| MIEM | 0.891 | 0.893 | 0.891 | 0.881 | 0.881 | 0.875 | 0.874 |

Table 10: Seeds data

| Imputation Methods / Missingness Ratio (%) | 0 | 5 | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|---|
| No-Imp | 0.889 | 0.883 | 0.861 | 0.851 | 0.84 | 0.827 | 0.809 |
| MEI | 0.889 | 0.875 | 0.859 | 0.85 | 0.838 | 0.825 | 0.81 |
| GRandI | 0.889 | 0.861 | 0.827 | 0.788 | 0.765 | 0.714 | 0.672 |
| EM | 0.889 | 0.89 | 0.884 | 0.88 | 0.864 | 0.834 | 0.805 |
| BagNoImp | 0.919 | 0.917 | 0.913 | 0.902 | 0.896 | 0.88 | 0.868 |
| BagMEI | 0.919 | 0.911 | 0.903 | 0.899 | 0.886 | 0.88 | 0.868 |
| BagGRandI | 0.919 | 0.931 | 0.929 | 0.922 | 0.916 | 0.893 | 0.881 |
| BagEM | 0.919 | 0.92 | 0.913 | 0.92 | 0.928 | 0.924 | 0.921 |
| BagMIGRandI | 0.901 | 0.917 | 0.921 | 0.911 | 0.907 | 0.885 | 0.881 |
| BagMIEM | 0.901 | 0.897 | 0.896 | 0.898 | 0.905 | 0.911 | 0.913 |
| MIGrandI | 0.889 | 0.91 | 0.915 | 0.921 | 0.913 | 0.894 | 0.886 |
| MIEM | 0.889 | 0.892 | 0.884 | 0.884 | 0.885 | 0.862 | 0.89 |

Table 11: Wine data
Table 12: Kappa-error plots for 10% missingness for different ensemble methods.
Table 13: Kappa-error plots for 30% missingness for different ensemble methods.

| | BagNoImp | BagMEI | BagGRandI | BagEM | BagMIGRandI | BagMIEM | MIGrandI | MIEM |
|---|---|---|---|---|---|---|---|---|
| BagNoImp | n/a | 1.000 | 1.000 | 1.000 | 0.127 | 0.127 | **0.003** | **0.003** |
| BagMEI | 1.000 | n/a | 1.000 | 1.000 | 0.127 | 0.127 | **0.003** | **0.003** |
| BagGRandI | 1.000 | 1.000 | n/a | 1.000 | 0.127 | 0.127 | **0.003** | **0.003** |
| BagEM | 1.000 | 1.000 | 1.000 | n/a | 0.127 | 0.127 | **0.003** | **0.003** |
| BagMIGRandI | 0.127 | 0.127 | 0.127 | 0.127 | n/a | 1.000 | 1.000 | 1.000 |
| BagMIEM | 0.127 | 0.127 | 0.127 | 0.127 | 1.000 | n/a | 1.000 | 1.000 |
| MIGrandI | **0.003** | **0.003** | **0.003** | **0.003** | 1.000 | 1.000 | n/a | 1.000 |
| MIEM | **0.003** | **0.003** | **0.003** | **0.003** | 1.000 | 1.000 | 1.000 | n/a |

Table 14: Friedman’s post-hoc test with Bergmann and Hommel’s correction with complete data (no missingness). The significant differences (p-value < 0.05) are shown in bold.

| | BagNoImp | BagMEI | BagGRandI | BagEM | BagMIGRandI | BagMIEM | MIGrandI | MIEM |
|---|---|---|---|---|---|---|---|---|
| BagNoImp | n/a | 1.000 | 1.000 | **0.006** | 1.000 | 0.075 | 1.000 | 1.000 |
| BagMEI | 1.000 | n/a | 1.000 | **0.008** | 1.000 | 0.118 | 1.000 | 1.000 |
| BagGRandI | 1.000 | 1.000 | n/a | **0.002** | 1.000 | **0.039** | 1.000 | 1.000 |
| BagEM | **0.006** | **0.008** | **0.002** | n/a | **0.000** | 1.000 | **0.006** | 0.186 |
| BagMIGRandI | 1.000 | 1.000 | 1.000 | **0.000** | n/a | **0.002** | 1.000 | 0.346 |
| BagMIEM | 0.075 | 0.118 | **0.039** | 1.000 | **0.002** | n/a | 0.088 | 1.000 |
| MIGrandI | 1.000 | 1.000 | 1.000 | **0.006** | 1.000 | 0.088 | n/a | 1.000 |
| MIEM | 1.000 | 1.000 | 1.000 | 0.186 | 0.346 | 1.000 | 1.000 | n/a |

Table 15: Friedman’s post-hoc test with Bergmann and Hommel’s correction with 30% missingness ratio. The significant differences (p-value < 0.05) are shown in bold.
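
Friedman-type tests compare methods through their ranks on each dataset. As a rough illustration only, the sketch below computes the average rank of the eight ensemble methods from the 30% columns of Tables 4 to 11. It does not reproduce the p-values of Table 15, which involve Bergmann and Hommel's correction; the function name `average_ranks` and the use of the tabulated mean accuracies as input are assumptions made for this example.

```python
methods = ["BagNoImp", "BagMEI", "BagGRandI", "BagEM",
           "BagMIGRandI", "BagMIEM", "MIGrandI", "MIEM"]

# Accuracy at 30% missingness, one row per dataset (Tables 4 to 11).
acc_30 = {
    "Breast Tissues": [0.525, 0.526, 0.535, 0.614, 0.539, 0.608, 0.55, 0.594],
    "New Thyroid":    [0.867, 0.875, 0.855, 0.907, 0.859, 0.9, 0.866, 0.878],
    "Parkinsons":     [0.823, 0.828, 0.818, 0.845, 0.816, 0.841, 0.827, 0.838],
    "Pima":           [0.721, 0.714, 0.713, 0.721, 0.709, 0.717, 0.709, 0.705],
    "Column":         [0.721, 0.731, 0.725, 0.758, 0.716, 0.748, 0.717, 0.717],
    "Glass":          [0.587, 0.587, 0.567, 0.633, 0.557, 0.622, 0.577, 0.592],
    "Seeds":          [0.851, 0.842, 0.854, 0.888, 0.843, 0.883, 0.852, 0.874],
    "Wine":           [0.868, 0.868, 0.881, 0.921, 0.881, 0.913, 0.886, 0.89],
}


def rank_row(row):
    """Rank accuracies within one dataset (1 = best); ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda i: -row[i])
    ranks = [0.0] * len(row)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and row[order[end + 1]] == row[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def average_ranks(table):
    """Average the per-dataset ranks of every method."""
    per_dataset = [rank_row(row) for row in table.values()]
    return {m: sum(r[j] for r in per_dataset) / len(per_dataset)
            for j, m in enumerate(methods)}


avg = average_ranks(acc_30)
for method, rank in sorted(avg.items(), key=lambda kv: kv[1]):
    print(f"{method:12s} {rank:.3f}")
```

On these values BagEM obtains the best average rank (1.0625), followed by BagMIEM (2.125), which is consistent with the ordering discussed in the observations above.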

5.5 Ensemble Diversity

Kappa-error plots [24] are a method for understanding the diversity-error behaviour of an ensemble. These plots contain one point for each pair of classifiers in the ensemble. The x coordinate is a measure of the diversity of the two classifiers, known as the kappa ($\kappa$) measure, where low values suggest high diversity. The y coordinate is the average error of the two classifiers. When the agreement of the two classifiers equals that expected by chance, $\kappa = 0$; when they agree on every instance, $\kappa = 1$ [9]. Negative values of $\kappa$ mean a systematic disagreement between the two classifiers.
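
The points of such a plot can be computed from the predictions of the ensemble members. The sketch below is a minimal illustration that assumes the pairwise kappa is the usual chance-corrected agreement, (observed − expected) / (1 − expected), computed on predicted labels; the function names and the toy predictions are hypothetical and are not taken from the experiments.

```python
from itertools import combinations


def kappa(pred_a, pred_b, labels):
    """Chance-corrected agreement between two label sequences."""
    n = len(pred_a)
    observed = sum(a == b for a, b in zip(pred_a, pred_b)) / n
    expected = sum((pred_a.count(c) / n) * (pred_b.count(c) / n) for c in labels)
    if expected == 1.0:  # both classifiers always output the same single class
        return 1.0
    return (observed - expected) / (1.0 - expected)


def error(pred, truth):
    """Fraction of misclassified instances."""
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)


def kappa_error_points(predictions, truth):
    """Return one (kappa, average error) point per pair of classifiers."""
    labels = sorted(set(truth) | {p for pred in predictions for p in pred})
    return [
        (kappa(a, b, labels), (error(a, truth) + error(b, truth)) / 2)
        for a, b in combinations(predictions, 2)
    ]


# Toy example with three classifiers on four test instances.
truth = [0, 0, 1, 1]
predictions = [
    [0, 0, 1, 1],  # always correct
    [0, 1, 0, 1],  # agrees with the first classifier only at chance level
    [1, 1, 0, 0],  # systematically disagrees with the first classifier
]
for k, e in kappa_error_points(predictions, truth):
    print(f"kappa = {k:+.2f}, average error = {e:.2f}")
```

In this toy case the first pair gives a kappa of 0 (chance-level agreement) and the pair formed by the first and third classifiers gives a kappa of -1 (systematic disagreement), matching the interpretation given above.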

We draw kappa-error plots for four datasets, i.e., Breast Tissue, New-Thyroid, Column, and Seeds, for the different ensemble methods. The scales of the $\kappa$ and error axes are the same for each given dataset, so the different ensemble methods can be compared easily. Tables 12 and 13 show the kappa-error plots of the testing phase of the first run of the first cross-validation fold for each dataset at missingness ratios of 10% and 30%, respectively. The rows show the datasets and the columns show the kappa-error plots for the different ensemble imputation methods.

Some of the graphs show only a few points, which means that only a few of the classifiers have distinct results. Such graphs generally belong to the MIEM method, which suggests that this method does not create diverse classifiers. Otherwise, the graphs suggest that most of the ensemble methods have a similar diversity pattern. However, BagEM classifiers are more accurate, as their groups of points lie lower than those of the other ensemble methods. Accurate classifiers with reasonable diversity are the reason for the robust performance of BagEM at high missingness ratios.

6 Conclusion and Future Work

Handling missing data is a challenging task in data mining applications. MI methods are commonly employed because they can model the uncertainty due to missingness. Bootstrapping is another method through which diversity may be incorporated in the incomplete data. Combining both ideas can lead to more accurate and diverse classifiers, which in turn can yield ensembles that are robust to high missingness ratios. In this paper, we present different variations of combining ideas from MI and bootstrapping for data imputation. Our results show that ensemble based imputations perform better than their single imputation counterparts for smaller missingness ratio of or more. The performance of MI over bootstrap samples with EM as the base imputation method does not degrade much for up to 30% missingness ratio. It is consistently observed that no imputation on incomplete data with bootstrapping performs better than single imputation and is equivalent to other ensemble imputation methods for missingness ratios of up to 10%. The kappa-error plots further verify that bagging and MI lead to diverse and accurate classifiers. Thus, their ensembles are more robust to missingness in comparison to MI ensembles or single imputation methods. The findings of this paper are important from a data scientist's perspective because, based on the missingness ratio in the data, they can choose the right type of classification strategy without resorting to trial and error. In future, we plan to use non-decision tree based machine learning classifiers with these ensemble methods.
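
As a rough illustration of the simplest strategy summarised above (bootstrapping on incomplete data followed by single mean imputation), the following sketch builds a small ensemble in plain Python. Several choices are assumptions made for this example and are not taken from the experiments: a nearest-centroid rule stands in for the decision-tree base classifier, a test row is imputed with the column means of each bootstrap sample, predictions are combined by majority vote, and the toy data and all function names are hypothetical.

```python
import random
from collections import Counter


def mean_impute(rows):
    """Fill None entries with the column mean of the observed values (MEI)."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        # 0.0 is an arbitrary fallback for a column with no observed value.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    filled = [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]
    return filled, means


def fit_centroids(rows, labels):
    """Stand-in base learner (assumption): one mean vector per class."""
    groups = {}
    for r, y in zip(rows, labels):
        groups.setdefault(y, []).append(r)
    return {y: [sum(col) / len(col) for col in zip(*rs)] for y, rs in groups.items()}


def nearest_class(centroids, row):
    """Return the class whose centroid is closest in squared distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], row)))


def fit_bag_mei(rows, labels, n_models=15, seed=0):
    """Bootstrap the incomplete data, then impute and train inside each sample."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        filled, means = mean_impute([rows[i] for i in idx])
        models.append((fit_centroids(filled, [labels[i] for i in idx]), means))
    return models


def predict_bag_mei(models, row):
    """Majority vote over the ensemble; missing test values use each model's means."""
    votes = Counter()
    for centroids, means in models:
        filled = [means[j] if v is None else v for j, v in enumerate(row)]
        votes[nearest_class(centroids, filled)] += 1
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated classes with a few missing values.
X = [[0.1, 0.3], [0.4, None], [0.2, 0.5], [None, 0.1], [0.3, 0.2], [0.5, 0.4],
     [9.8, 9.5], [9.1, None], [None, 9.9], [9.4, 9.2], [9.7, 9.6], [9.3, 9.0]]
y = [0] * 6 + [1] * 6
models = fit_bag_mei(X, y)
print(predict_bag_mei(models, [0.2, 0.3]), predict_bag_mei(models, [9.5, None]))
```

Replacing `mean_impute` with a stochastic or EM-based imputer, and repeating the imputation several times per bootstrap sample, would move this sketch towards the MI-based variants discussed in the paper.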

References

  • Rubin, DB. 1976. Inference and missing data. Biometrika, 581–592.
  • Baneshi, M. & Talei, A. 2012. Assessment of internal validity of prognostic models through bootstrapping and multiple imputation of missing data. Iranian Journal of Public Health 41(5), 110.
  • Breiman, L. 1996. Bagging predictors. Machine Learning 24(2), 123–140.
  • Calvo, B. & Rodrigo, GS. 2016. scmamp: Statistical comparison of multiple algorithms in multiple problems. The R Journal 8(1), 248–256.
  • Calvo, B. & Santafe, G. 2016. Package scmamp. Available online: https://cran.r-project.org/web/packages/scmamp/scmamp.pdf. Accessed on: 15-Jul.-2018.
  • Conroy, B., Eshelman, L., Potes, C. & Xu-Wilson, M. 2016. A dynamic ensemble approach to robust classification in the presence of missing data. Machine Learning 102(3), 443–463.
  • Dahiya, S., Handa, S. & Singh, N. 2017. A feature selection enabled hybrid-bagging algorithm for credit risk evaluation. Expert Systems 34(6).
  • Dempster, A., Laird, N. & Rubin, DB. 1977. Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B 39(1), 1–38.
  • Dietterich, TG. 2000. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning 40(2), 139–157.
  • Donders, ART., van der Heijden, GJ., Stijnen, T. & Moons, KG. 2006. Review: A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology 59(10), 1087–1091.
  • Feelders, A. 1999. Handling missing data in trees: surrogate splits or statistical imputation? European Conference on Principles of Data Mining and Knowledge Discovery (pp. 329–334).
  • Gelman, A. & Hill, J. 2006. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
  • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. & Witten, IH. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11(1), 10–18.
  • Harel, O. & Zhou, XH. 2007. Multiple imputation: review of theory, implementation and software. Statistics in Medicine 26(16), 3057–3077.
  • Hassan, MM., Atiya, AF., El-Gayar, N. & El-Fouly, R. 2007. Regression in the presence missing data using ensemble methods. Neural Networks, 2007. IJCNN 2007. International Joint Conference on (pp. 1261–1265).
  • Horton, NJ. & Lipsitz, SR. 2001. Multiple imputation in practice: comparison of software packages for regression models with missing variables. The American Statistician 55(3), 244–254.
  • Huang, MW., Lin, WC., Chen, CW., Ke, SW., Tsai, CF. & Eberle, W. 2016. Data preprocessing issues for incomplete medical datasets. Expert Systems 33(5), 432–438.
  • Khan, SS., Hoey, J. & Lizotte, D. 2012. Bayesian multiple imputation approaches for one-class classification. Canadian Conference on Artificial Intelligence (pp. 331–336).
  • Kumutha, V. & Palaniammal, S. 2013. An enhanced approach on handling missing values using bagging k-NN imputation. Computer Communication and Informatics (ICCCI), 2013 International Conference on (pp. 1–8).
  • Kuncheva, LI. 2004. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience.
  • Kuncheva, LI. & Rodriguez, JJ. 2007. Classifier ensembles with a random linear oracle. IEEE Transactions on Knowledge and Data Engineering 19(4), 500–508.
  • Lichman, M. 2013. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
  • Lin, TH. 2010. A comparison of multiple imputation with EM algorithm and MCMC method for quality of life missing data. Quality & Quantity 44(2), 277–287.
  • Margineantu, DD. & Dietterich, TG. 1997. Pruning adaptive boosting. Proc. 14th International Conference on Machine Learning (pp. 211–218). Morgan Kaufmann.
  • Nanni, L., Lumini, A. & Brahnam, S. 2012. A classifier ensemble approach for the missing feature problem. Artificial Intelligence in Medicine 55(1), 37–50.
  • Quinlan, JR. 2014. C4.5: Programs for machine learning. Elsevier.
  • Rezvan, PH., Lee, KJ. & Simpson, JA. 2015. The rise of multiple imputation: a review of the reporting and implementation of the method in medical research. BMC Medical Research Methodology 15(1), 30.
  • Schmitt, P., Mandel, J. & Guedj, M. 2015. A comparison of six methods for missing data imputation. Journal of Biometrics & Biostatistics 6(1), 1.
  • Schomaker, M. & Heumann, C. 2016. Bootstrap inference when using multiple imputation. arXiv preprint arXiv:1602.07933.
  • Setz, C., Schumm, J., Lorenz, C., Arnrich, B. & Tröster, G. 2009. Using ensemble classifier systems for handling missing data in emotion recognition from physiology: one step towards a practical system. Affective Computing and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd International Conference on (pp. 1–8).
  • Su, X., Khoshgoftaar, TM. & Greiner, R. 2009. Making an accurate classifier ensemble by voting on classifications from imputed learning sets. International Journal of Information and Decision Sciences 1(3), 301–322.
  • Tran, CT., Zhang, M., Andreae, P., Xue, B. & Bui, LT. 2017. Multiple imputation and ensemble learning for classification with incomplete data. Intelligent and Evolutionary Systems: The 20th Asia Pacific Symposium, IES 2016, Canberra, Australia, November 2016, Proceedings (pp. 401–415).
  • Twala, B. & Cartwright, M. 2005. Ensemble imputation methods for missing software engineering data. Software Metrics, 2005. 11th IEEE International Symposium (10–pp).
  • Twala, B. & Cartwright, M. 2010. Ensemble missing data techniques for software effort prediction. Intelligent Data Analysis 14(3), 299–331.
  • Valdiviezo, HC. & Van Aelst, S. 2015. Tree-based prediction on incomplete data using imputation or surrogate decisions. Information Sciences 311, 163–181.
  • Wu, W. & Jia, F. 2013. A new procedure to test mediation with missing data through nonparametric bootstrapping and multiple imputation. Multivariate Behavioral Research 48(5), 663–691.