1 Introduction
Repeated measure designs are commonly present in medical and scientific research. The simplest design captures the comparison of two different treatments or time points over repeatedly observed subjects. For inferring the predominance of a treatment the paired ttest is commonly applied. It is finitely exact under normality and asymptotically exact, if the latter assumption is dropped. However, first limitations of the paired ttest arise when missing values are present. This is frequently the case in biomedical studies, e.g. due to dropout or insufficient probes. In addition, rare diseases often lead to small sample sizes in matched pairs. Hence, conducting completecase analyses through the paired ttest while neglecting all information from the incomplete observations might lead to improper inferences, especially when missing values are not occurring completely random resulting into selection bias or when sample sizes are small
(Schafer, 1999). Therefore, scientific research has been going on involving missing value mechanisms in the test statistic. One simple approach is to impute missing values singly and conduct statistical tests as if there has not been any missing values so far. However, this procedure fails to account for the uncertainty involved in the imputation technique and statistical inference can be heavily distorted (Schafer, 1999; Rubin, 2004; Sterne et al., 2009). Multiply imputing missing values overcome this issue and the combination of them into a potentially asymptotic exact test statistic has been derived by Rubin (2004). Various algorithms have been developed to impute missing values. For example, the mice package in Ris able to impute missing values multiple times by assuming a parametric model for the data generating process and iteratively draw through fully conditional specification missing values from a predictive posterior distribution. Recently, nonparametric regression and classification methods such as the random forest have been proposed for the imputation of missing values
(Stekhoven and Bühlmann, 2011). In extensive simulation studies (Stekhoven and Bühlmann, 2011; Waljee et al., 2013; Ramosaj and Pauly, 2017), the missForest approach accurately imputes missing values and outperforms existing procedures in simulations such as MICE in terms of normalizing root mean squared or missclassification error. Thus, we would expect that inference drawn from the imputed data might lead to reliable conclusions. However, as recently pointed out by Dunson (2018) for machine learning techniques in general, ’there is a huge danger in doing so due to the lack of uncertainty quantification’. In the present paper, we investigate this issue by checking statistical validity for drawing inference after multiple imputation.Differing to imputation, research has also been going on in developing statistical testing procedures that only take into account the observed information (Lin and Stivers, 1974; Ekbohm, 1976; Bhoj, 1978; Looney and Jones, 2003; Kim et al., 2005; Samawi and Vogel, 2014; Maritz, 1995; Yu et al., 2012). In particular, statistical permutation tests were provided recently by Amro and Pauly (2017) and Amro et al. (2018) that require no parametric assumptions and possess nice small and large sample properties while only using all observed information. Below we will compare them to the previously mentioned inference strategies based on imputation techniques.
Similar studies on classical imputation techniques have shown that inference based on classical multiple imputation techniques and incomplete dataanalysis can differ, arising the question under which condition multiple imputation lead to correct statistical inference (Fay, 1992; Meng, 1994; Sterne et al., 2009; Peto, 2007). Meng (1994) for example, provided a general framework under which an imputation method leads to correct inference. He coined the term congeniality for assessing differences in imputation and analysis model under a Bayesian scheme. Under specific conditions defined in his work such as
secondmoment properness, selfefficiency
and information regularity, multiple imputation can lead to even more efficient estimates than solely based on incomplete data. However, if the imputation model is not correctly specified, inference is generally not valid. Therefore, the main task of the imputer is to deliver a predictive model that is generalizable, covering potential submodels in the subsequent analysis while correctly reflecting the uncertainty by the imputation
(Meng, 1994). Due to the flexibility and predictive accuracy of the Random Forest model without making assumptions on the data generating process, this method might be regarded as a potential candidate model for imputing missing values according to Meng (1994). However, less attention has been paid to the interesting question, whether a Random Forest based imputation scheme leads to correct statistical inferences in e.g. matched pairs design. Therefore, the aim of the current paper is trilateral. First, we aim to i) enlighten the mystery of correct statistical inference in matched pairs design when advanced Machine Learning techniques are used for multiply imputing missing values, ii) to check whether imputation accuracy measures such as root mean squared errors are capable of reflecting correct statistical inferences and iii) recommend a valid statistical testing procedure in matched pairs under ignorable missing mechanisms such as missing completely at random.After stating the statistical model and procedures of interest in Section 2, we describe multiple imputation techniques in Section 3. Section 4 explains the application of the weighted permutation procedure. In an extensive simulation study (Section 5) under various sample sizes and covariance structures, we then compute the typeIerror rate and the power of the paired ttest under different imputation schemes and oppose it to a recently proposed permutation test. The results are discussed in Sections and the analysis of a breast cancer gene study exemplifies potentially diverse findings in Section 6.
2 Statistical Model and Hypotheses
We consider the general matched pairs design given by a sample , where
are i.i.d. random vectors on some probability space
with mean vector and an arbitrary covariance matrix . Let the vectors indicate whether a certain component is observed or missing on subject . Denoting componentwise multiplication of vectors by , then we only observe . Hence our framework has the following form, where stands as a placeholder for :(1) 
Here, are the complete pairs and partially observed pairs. Rubin defines the missing mechanism through a parametric distributional model on
and classifies their presence through Missing Completely at Random (MCAR), Missing at Random (MAR) and Missing not at Random (MNAR) schemes
(Rubin, 2004). In our work, we restrict our attention to missing mechanisms being MCAR. To fix some notation, let denote the index set of the complete pairs, i.e. for all . Similarly, resp. denote the index sets of observations with missing second for resp. first for component and set , , . In this setup (1), we are now interested in testing the null hypothesis
of equal means against the twosided alternative . Tests for the onesided analogue can be achieved by transforming the corresponding twosided value to .3 Imputing Incomplete Data
A usual habit is to impute missing values in the sample and conduct further statistical analyzes on the imputed data set . Various imputation schemes have been analyzed with respect to their imputation accuracy. Recently, promising results were obtained by a random forest based imputation scheme implemented as missForest in R (Stekhoven and Bühlmann, 2011)
. It is based on a nonparametric regression model combining several weak learners to construct strongly predictive models while including random subsampling in feature selection and bootstrapping. In several competitive simulation studies
(Stekhoven and Bühlmann, 2011; Waljee et al., 2013; Ramosaj and Pauly, 2017), missForest outperformed different imputation techniques as, e.g.,NN based techniques or the usually applied MICE algorithm. The latter is based on an iterative algorithm generating MarkovChains for potential imputation values
(Van Buuren and GroothuisOudshoorn, 2011). However, imputation retrieves the risk that imputer and data analyst could share completely different viewpoints for subsequent analyses. While the imputer is interested in using a method that is generalizable and thus minimizes normalized root mean squared error (NRMSE) or proportion of false classification (PFC), the interest of a statistical data analyst lies more on uncertainty quantification and valid inference. From a biostatistical viewpoint, it is unclear whether a method with comparable low NRMSE results in a testprocedure that reasonably controls typeIerror under the nullhypothesis. For this task, an ideal imputation scheme should be based on preserving the distributional property of the data generating process. That is, should hold for , where is the imputed value of the th observation. In this case, a natural test statistic for under a distributionpreserving imputation scheme would be the paired ttest statistic given by(2) 
Here, denotes the dimensional vector of s, the difference contrasts, , and , the mean differences of the first and second components of the complete and imputed pairs, respectively, which are pooled in the vector . Moreover, building the quadratic form in the matrix
leads to the usual empirical variance estimator in the denominator. Althought the stated
nulldistribution in (2) is only valid for independent bivariate normal vectors, the pairedttest is asymptotically exact even under nonnormal observations due to the central limit theorem and the law of large numbers.
3.1 Mice
Multiple Imputation by Chained Equations, often referred as fully conditional specification, uses a set of conditional parametric models for the imputation of missing values (Hughes et al., 2014). Imputation is then conducted by introducing a regression model for every variable with a suitable prior and draws are made iteratively form the corresponding predictive posterior of the regression model. That is, for every variable a separate model with parameter is assumed and simulation is conducted by the following iteration steps for ,
where and . The final imputation is then the th draw, i.e. . This allows for a higher flexibility in contrast to joint modeling, where the specification of a parametric joint model with joint parameter
can be challenging under different variable types. Under the ignorability assumption, imputations in joint modeling using the Bayesian framework are then draws from the posterior predictive distribution of the missing data given the observed observations
(Schafer, 1997). In case of finite samples sizes, compatibility and a noninformative margins condition, both methods, MICE and joint modeling, draw from the same predictive distribution of
. This is especially the case if data results from a multivariate normal distribution
(Hughes et al., 2014) such that statistical inference should be  at least asymptotically  valid in our case.Depending on the conditional distribution of resp. , , different algorithms have been proposed for MICE. In case of normally distributed conditionals, a linear model is assumed and missing values are drawn from it using a multistage wildbootstrap approach (Van Buuren et al., 2006). In the sequel, this approach will be named NORM and is implemented in the Rpackage mice. Predictive Mean Matching (PMM
), is more flexible towards deviations from the normality assumption including nonlinear association and heteroscedasticity
(Schenker and Taylor, 1996; Morris et al., 2014; Vink et al., 2014; Yu et al., 2007). In contrast to NORM, PMM imputes values by randomly drawing from a set of possible donors, that are determined based on the euclidean distance between observed items and predicted values of missing variables under the same linear model as in NORM (Van Buuren et al., 2006). This way, the regression model is only used for finding appropriate matches between observed and missing cases.3.2 Random Forest Imputation
Random Forest is a specific class of CART algorithms combining bagging and feature subspacing at each terminal node during the tree construction process. Due to its capability of detecting collinearity effects and accommodating nonlinear relations and interactions (Shah et al., 2014), the Random Forest has become popular for its usage in missing value estimation (Burgette and Reiter, 2010; Stekhoven and Bühlmann, 2011; Doove et al., 2014; Shah et al., 2014). In several simulation studies and medical data records, the implemented routine missForest in R has been used as a favorite choice for imputation (Stekhoven and Bühlmann, 2011; Waljee et al., 2013; Ramosaj and Pauly, 2017). Extending the usage of missForest and RandomForest based imputation schemes to multiple imputations that account for uncertainty, some attempts have been made to incorporate these methods into the MICE framework (Doove et al., 2014; Shah et al., 2014). However, it seems that none of these methods simultaneously considered the validity of statistical inference and imputation accuracy, focusing for example on either high imputation accuracy (Stekhoven and Bühlmann, 2011) or the estimation of unbiased regression coefficients in twoway interaction regression models (Doove et al., 2014). In this paper, we aim to fill this gap by considering Random Forest based imputation methods within the framework of

Multiple Imputation by repeatedly applying missForest (RF MI)

Multiple imputation using Random Forest within the MICE framework (RF MICE) (Doove et al., 2014)
while comparing the corresponding typeIerror and power of the paired ttest and the NRMSE in matched pairs. There are a couple of differences between RF MI and RF MICE, focusing either on more stable prediction of missing values or on extra variability due to imputation. RF MI initially imputes mean resp. mode values for missing values originating from continuous resp. categorical outcomes. RF MICE, however, randomly selects one candidate from the observed part for imputation. In contrast to RF MI, where the prediction of missing values is based on averaging donors in each leaf, RF MICE randomly selects candidates among the donors. In addition, the iteration process in RF MICE is repeated for a fixed number of times, whereas in RF MI, iteration is stopped when a stopping criterion measuring the imputation accuracy is reached.
3.3 Rubin’s Rule
In the matched pairs design with imputed missing values, the quantity of interest in the null hypothesis is the difference in expectations
, where a natural unbiased estimate under a distributionpreserving imputation scheme is given by
. Rubin proposed to impute multiple times, in oder to reflect the uncertainty of the imputation scheme in later inference. Hence, if an imputation scheme estimates missing values times, leading to estimates of mean value differences and of empirical variances, i.e. for , a combination of them is given by and (Rubin, 2004). Rubin (2004) suggested to use as the total variance of the estimate , where is the variance between the completedata estimates, i.e. . The combined test statistic for the matched pairs design can then be approximated by a t reference distribution with degrees of freedom, i.e.(3) 
where and being the relative increase in variance due to missingness. However, under small samples and a modest proportion of missing data, the approximation of the degrees of freedom for the t reference distribution in might be inappropriate (Barnard and Rubin, 1999). Adjusted degrees of freedom for are given by
(4) 
where . In the sequel, we will restrict our attention to the choice given in , since is always smaller than and coincides with , if the sample size is sufficiently large (Barnard and Rubin, 1999). This is especially the case when considering datasets with sample size vectors of the form for larger in Section 5. So far, theoretical guarantees for the validity of (2) resp. (4) using the missForest imputation procedure have not been delivered. Hence, it is completely unclear whether the paired ttest under the random forest based imputation scheme will lead to valid inference.
4 Permuting Incomplete Data
Maritz (1995) and Yu et al. (2012) proposed permutation techniques for incomplete paired data. However, these methods have the drawback that certain distributional assumptions (such as symmetry) are required to develop a valid level test even at least asymptotically. A novel permutation method overcame these problems while preserving the finite sample exactness property under similar assumptions (Amro and Pauly, 2017; Amro et al., 2018)
. Besides, the suggested method has the advantage that it is asymptotically robust against heteroscedasticity and skewed distributions. The idea of
Amro and Pauly (2017); Amro et al. (2018) is based on permuting the complete and the incomplete cases separately and then combining the results using either a general weighted test statistic or a multiplication combination test involving both, the paired ttest and the Welchtype test statistic.Based on a simulation study (Amro et al., 2018), the two permutation methods showed similar behavior in most of the considered cases. Therefore, we only consider the method of Amro and Pauly (2017) for further analysis. Their suggested test statistic consists of the following weighting scheme:
(5) 
where, is used to weight the inference from the complete pairs and the incomplete ones. Moreover, is the usual paired ttype statistic, , denote the differences of the first and second component for while is their empirical variance. Furthermore, is the Welchtype test statistic where and are the sample means and and the corresponding empirical variances. A recommended value of is (Amro and Pauly, 2017). This choice guarantees the correct treatment in the extreme scenarios of and
. In order to derive empirical quantiles of the test statistic
, a separate permutation technique for the complete and incomplete pairs has been used. While the components of the complete cases are permuted independently, for the missing parts , the components are shuffled together (Amro and Pauly, 2017). This method has the advantage of using all observed information while not being based on any parametric assumption. In particular, it was shown to be asymptotically correct in general and even finitely exact, if the distribution of the data is invariant with respect to the considered permutations (keyword: exchangeable in a certain way).5 Simulation strategy
In our simulation study, we compare the methods described in Section 3  4 with respect to their

typeIerror rate control at level and

power to detect deviation from the null hypothesis.
We thereby restrict the analysis to the paired ttest based on four imputation techniques; The established and easy to use MICE approach by Van Buuren and GroothuisOudshoorn (2011) with NORM and PMM as well as the missForest procedures RF MI and RF MICE described in Section 3. The latter was shown to be the method of choice in terms of accuracy in several simulation studies (Stekhoven and Bühlmann, 2011; Waljee et al., 2013; Ramosaj and Pauly, 2017). The crucial simulation study was conducted within the statistical computing environment R based on MonteCarlo runs. Small to large sized paired data samples are generated by
where is an i.i.d. bivariate random vector with zero mean and identity covariance matrix. Different choices of symmetric as well as skewed residuals are considered such as standardized normal, exponential, Assymetric Laplace or the distribution with degrees of freedom. For the covariance matrix , we considered a homoscedastic and a heteroscedastic setting given by the choices
for varying correlation factors . Missing values are generated under the MCAR scheme by randomly inserting them to the first or second component of the bivariate vector until a fixed amount of missing values of size for the second and for the first component is achieved. Furthermore, the sample sizes were chosen as and later increased by adding for to the balanced sample size and multiplying the unbalanced setting with a factor . In order to assess the power of the pairedttest for imputation methods and the permutation test of Amro and Pauly (2017), we set with shifting parameter . Here, the permutation method was based on permutation runs. Furthermore, we used imputations for the multiple imputation settings. In addition to reporting typeIerror () and power of the tests, we also report the NRMSE of the imputation methods defined by
Where, , is the true value whenever missing values are present, , and with , is the imputed value of missing observations in the th iteration using a multiple imputation procedure (Rubin, 2004).
Dist  

RF MI  RF MICE  PMM  NORM  RF MI  RF MICE  PMM  NORM  
Normal  0.9  0.052  0.070  0.058  0.061  0.045  0.056  0.073  0.063  0.062  0.047 
0.5  0.052  0.118  0.071  0.061  0.045  0.054  0.117  0.073  0.062  0.045  
0.1  0.050  0.163  0.077  0.055  0.038  0.054  0.162  0.077  0.057  0.042  
0.1  0.054  0.201  0.085  0.055  0.041  0.064  0.188  0.085  0.051  0.041  
0.5  0.052  0.260  0.085  0.048  0.039  0.052  0.216  0.078  0.047  0.040  
0.9  0.051  0.302  0.048  0.036  0.038  0.058  0.167  0.054  0.046  0.042  
Exp  0.9  0.049  0.070  0.060  0.059  0.046  0.074  0.085  0.075  0.076  0.058 
0.5  0.052  0.118  0.071  0.058  0.045  0.074  0.124  0.078  0.065  0.051  
0.1  0.048  0.165  0.076  0.052  0.039  0.077  0.173  0.088  0.063  0.045  
0.1  0.052  0.195  0.078  0.050  0.040  0.083  0.190  0.091  0.065  0.052  
0.5  0.050  0.262  0.084  0.048  0.040  0.084  0.221  0.092  0.06  0.056  
0.9  0.045  0.295  0.043  0.034  0.038  0.097  0.183  0.065  0.056  0.049  
Chisq  0.9  0.049  0.070  0.060  0.059  0.046  0.054  0.073  0.062  0.061  0.046 
0.5  0.052  0.118  0.071  0.058  0.045  0.055  0.115  0.073  0.056  0.041  
0.1  0.048  0.165  0.076  0.052  0.039  0.052  0.167  0.079  0.055  0.044  
0.1  0.052  0.195  0.078  0.050  0.040  0.054  0.187  0.081  0.054  0.042  
0.5  0.050  0.262  0.084  0.048  0.040  0.058  0.219  0.082  0.057  0.049  
0.9  0.045  0.295  0.043  0.034  0.038  0.062  0.166  0.053  0.046  0.039  
Ass.  0.9  0.050  0.070  0.060  0.061  0.045  0.065  0.076  0.066  0.066  0.075 
Laplace  0.5  0.051  0.108  0.067  0.054  0.040  0.065  0.118  0.077  0.062  0.047 
0.1  0.050  0.167  0.077  0.055  0.040  0.069  0.170  0.083  0.061  0.048  
0.1  0.056  0.194  0.083  0.050  0.042  0.069  0.191  0.083  0.057  0.046  
0.5  0.052  0.252  0.082  0.046  0.046  0.070  0.217  0.085  0.053  0.044  
0.9  0.049  0.278  0.037  0.031  0.047  0.090  0.182  0.061  0.051  0.049 
Dist  

RF MI  RF MICE  PMM  NORM  RF MI  RF MICE  PMM  NORM  
Normal  0.9  0.260  0.315  0.299  0.300  0.264  0.733  0.792  0.780  0.780  0.747 
0.5  0.271  0.396  0.319  0.284  0.239  0.768  0.847  0.803  0.771  0.719  
0.1  0.306  0.477  0.349  0.290  0.238  0.832  0.909  0.846  0.788  0.734  
0.1  0.338  0.551  0.383  0.306  0.250  0.858  0.934  0.869  0.799  0.729  
0.5  0.439  0.703  0.463  0.345  0.302  0.947  0.985  0.924  0.864  0.805  
0.9  0.869  0.967  0.657  0.605  0.628  1.000  1.000  0.973  0.967  0.957  
Exp  0.9  0.287  0.332  0.312  0.315  0.269  0.777  0.814  0.800  0.801  0.760 
0.5  0.336  0.435  0.364  0.318  0.271  0.818  0.862  0.825  0.785  0.742  
0.1  0.379  0.516  0.394  0.328  0.269  0.863  0.909  0.857  0.796  0.745  
0.1  0.408  0.570  0.414  0.332  0.279  0.888  0.928  0.861  0.813  0.756  
0.5  0.528  0.727  0.496  0.380  0.343  0.952  0.978  0.918  0.841  0.810  
0.9  0.876  0.950  0.676  0.617  0.661  1.000  0.998  0.968  0.954  0.958  
Chisq  0.9  0.255  0.311  0.290  0.295  0.254  0.739  0.796  0.783  0.784  0.752 
0.5  0.278  0.391  0.322  0.287  0.243  0.776  0.848  0.810  0.774  0.729  
0.1  0.315  0.500  0.367  0.301  0.248  0.831  0.905  0.843  0.786  0.729  
0.1  0.345  0.550  0.390  0.308  0.256  0.874  0.938  0.873  0.809  0.748  
0.5  0.442  0.700  0.448  0.342  0.304  0.951  0.985  0.927  0.856  0.802  
0.9  0.868  0.966  0.661  0.610  0.636  1.000  1.000  0.970  0.966  0.957  
Ass.  0.9  0.287  0.336  0.315  0.316  0.270  0.767  0.812  0.791  0.792  0.753 
Laplace  0.5  0.327  0.421  0.354  0.317  0.263  0.814  0.859  0.822  0.791  0.739 
0.1  0.367  0.519  0.385  0.322  0.267  0.858  0.908  0.851  0.798  0.743  
0.1  0.400  0.572  0.412  0.336  0.277  0.889  0.935  0.874  0.812  0.754  
0.5  0.504  0.726  0.497  0.375  0.342  0.952  0.979  0.920  0.846  0.815  
0.9  0.877  0.950  0.669  0.614  0.660  1.000  0.999  0.971  0.955  0.959 
The NRMSE has been reported as a key measure for assessing imputation quality of various imputation techniques (Stekhoven and Bühlmann, 2011; Waljee et al., 2013; Ramosaj and Pauly, 2017). However, it is completely unclear whether an imputation technique with relatively low NRMSE automatically lead to typeIerror rate control or higher power in incompletely observed matched pairs designs. Below, we shed light into this obstacle.
5.1 NRMSE Results
The simulation was conducted under with homogeneous covariance matrix and a sample size of for different correlation coefficients and different distributions as described above. It can be readily seen from Figure 1 that RF MI resulted in the lowest NRMSE overall settings under consideration. For low to moderate correlation factors (up to ) RF MICE showed the second best overall NRMSE, followed by PMM and NORM. For larger correlation factors this ordering changes with NORM showing the second lowest NRMSE, followed by RF MICE and NORM. Anyhow, our findings are in line with previous results (Waljee et al., 2013; Ramosaj and Pauly, 2017) in that they also result in a recommendation to apply RF MI if NRMSE is the major assessment criterion. However, we will see below that NRMSE is not a suitable measure for evaluating an imputation method when aiming to conduct inference on small to moderate sample sizes; at least in matched pairs designs.
5.2 TypeI Error Control Results
Simulation results of typeIerror of the  permutation test as well as the paired ttest based on imputation are summarized in Table 1 under various covariance structures and residual distributions for . Starting with the permutation test of Amro and Pauly (2017) we see that it tends to be an adequate exact level test among all considered ranges of for homoscedastic covariance structures and most heteroscedastic settings. Only in the heteroscedastic cases with skewed exponential (and to some extend also for asymmetric Laplace) distributed residuals, the permutation test has some difficulties in controlling typeIerror rate. In those cases, the MICE procedure under the algorithmic implementation of NORM has shown favorable typeI error rate control, but this with the cost of considerably less power (Table 2). For other residual distributions, NORM obtains the significance level of well, while being slightly conservative, especially when the correlation structure turns positive. This effect of a decrease in typeI error with increasing positive correlation factors could also be obtained for the PMM procedure and partly for the RF MICE. Here, the permutation test seems not to be affected by the various degrees of correlations. In contrast to the permutation test and the MICE procedures, both RF MI and RF MICE, imputation procedures based on Random Forest models, failed to control typeI error rate in almost all settings. Especially RF MI, which is based on repeatedly using missForest, heavily inflates typeI error of the paired ttest. RF MICE, which introduces more variability in the imputation draws, could be able to reduce typeI error, but is only in strongly positive correlated settings able to drop below the threshold.
In addition to these small sample size settings, we are also interested in typeI error rate control when sample sizes increases, while missing rates remain nearly unchanged. For moderate to large sample sizes, we considered the choices and , where ranges from (balanced case) and (unbalanced), respectively. Figure 2 summarizes typeI error rate control at for these settings. The results indicate that the imputation procedures based on Random Forest models, RF MI and RF MICE, failed to control typeI error rate also in larger samples. This especially holds for RF MI with typeI error rates between and for the largest sample size setting () while RF MICE was only slightly liberal then (). Only the permutation test and the MICE procedure under the algorithmic implementation of NORM control the typeI error accurate for moderate to large sample sizes, while PMM even got more conservative.
5.3 Power Results
In addition to the typeIerror rate, we simulated the power of the permutation test and the paired ttest based on the four imputation approaches. Table 2 summarizes the power analysis for the sample sizes under the range of and for a homoscedastic covariance structure. Power simulations for the other combinations of sample sizes and covariance structures were similar, see the supplement for details. It can be seen from Table 2 that RF MI combined with Rubin’s Rule has the largest power among all considered tests and procedures. However, this is not remarkable due to its extreme liberal behavior. Thereafter, RF MICE and the permutation test performed comparable in terms of power. Only in case of large positive correlation performed considerably better. Contrary, parametric imputation methods such as PMM and NORM resulted into much lower statistical power. For example NORM, which can be seen as very competitive to the permutation test in terms of typeI error rate control, often exhibits less power than . Hence, the permutation test qualifies as a suitable test statistic throughout different correlations while controlling typeIerror and delivering a comparably high power in matched pairs designs with partially observed data.
6 Analysis of the Cancer Genome Atlas Dataset
We consider data from a breast cancer study to illustrate the behavior of the considered imputation methods as opposed to the permutation method in a clinical study. Our data comes from the Cancer Genome Atlas (TCGA) project, which was launched in 2005 with a financial support from the National Institutes of Health. It aims to understand the genetic basis of several types of human cancers through the application of highthroughput genome analysis techniques. TCGA is willing to improve the ability of diagnosing, treating and even preventing cancer through discovering the genetic basis of carcinoma. Molecular information such as mRNA/miRNA expressions, protein expressions, weight of the sample as well as clinical information about the participants were collected. Clinical and RNA sequencing records could be obtained from breast cancer patient, where of them provided both, normal and tumor tissues. We used the
transformed data set for gene expressions. Luckily, no missing values were present, so that the data could be analyzed by means of the classical paired ttest. The results are then compared with the above procedures after artificially inserting missing values as follows: we generated Bernoulli distributed random variables
such that with leading to an overall missing rate of .Gene  Group  Selected Auxiliary Gene Expressions 

TP53  Tumor Protein p53  TPCN2, TPD52L2, TPST1, TP63, TP53TG1, TPD52L1, TPRKB, TP53INP2, TP53BP1, TPRN, TPM4, TPRA1, TPP2, TPPP3, TPSB2, TPT1, TPM3, TPO, TPK1, TP53INP1, TPBG, TPX2, TPTE2P1, TPSAB1, TPMT, TP53BP2, TPR, TPRG1, TP53I13, TPCN1, TPI1P2, TPI1, TPRA1, TPM2, TPM1, TPST2, TPR, TPI1, TP53INP2, TPRN 
ABCC1  Multidrug Resistanceassociated Protein 1  ABCC10, ABCC11, ABCC3, ABCC4, ABCC5, ABCC6P2, ABCC6, ABCC9, ABCC10, ABCC11, ABCC2, ABCC3, ABCC4, ABCC5, ABCC6P2, ABCC6, ABCC9 
HRAS  Transforming Protein p21  HRASLS5, HRASLS2, HRASLS5, HRASLS 
GSTM1  Glutathione Stransferase Mu 1  GSTA4, GSTCD, GSTK1, GSTM2, GSTM3, GSTM4, GSTM5, GSTO1, GSTO2, GSTP1, GSTT2, GSTZ1, GSTA4, GSTCD, GSTK1, GSTM2, GSTM3, GSTM4, GSTM5, GSTO1, GSTO2, GSTP1, GSTT2, GSTZ1 
ERBB2  ErbB2 Receptor Tyrosine Kinase 2  ERBB2IP, ERBB3, ERBB4, ERBB2IP, ERBB3, ERBB4 
CD8A  CD8a Molecule  CD80, CD81, CD82, CD83, CD84, CD86, CD8B, CD80, CD81, CD82, CD83, CD84, CD86, CD8B 
C1D  Nuclear Nucleic AcidBinding Protein  PRKD1, PRKD2, PRKD3, PRKDC, TSNAX, PRKD1, PRKD2, PRKD3, PRKDC, TSNAXIP1, TSNAX 
GBP3  GuanylateBinding Protein  GBP1, GBP2, GBP4, GBP5, GBP1, GBP2, GBP4, GBP5 
*The sceintific names of the genes were provided from GeneCards; http://www.genecards.org.^{1}^{1}1Footnote
Based on previous studies, six genes have been identified to be significantly related to breast cancer: TP53, ABCC1, HRAS, GSTM1, ERBB2 and CD8A (Finak et al., 2008; Harari and Yarden, 2000; Munoz et al., 2007; De Jong et al., 2002). Another two genes; C1D and GBP3 were under investigation but did not show any significant relation towards breast cancer patients (Qi et al., 2018). In this paper, we aim to test the hypothesis whether mean genetic expressions of the eight genes differ significantly between normal and tumor tissues of breast cancer patients. Boxplots representing the characteristics of the genes are shown in Figure 3. For more details regarding the dataset, we refer to Firehouse (www.gdac.broadinstitute.org).
Gene  T(Comp)  RF MI  RF MICE  PMM  NORM  
TP53  0.153  0.054  0.011  0.046  0.056  0.041 
ABCC1  0.001  0.016  0.008  0.022  0.085  0.017 
HRAS  0.008  0.002  0.001  0.001  0.001  
GSTM1  0.002  0.008  0.001  0.001  0.003  
ERBB2  0.000  
CD8A  0.933  0.558  0.205  0.280  0.418  0.378 
C1D  0.414  0.480  0.278  0.287  0.489  0.529 
GBP3  0.006  0.01  0.064  0.135  0.103  0.059 
The bold figures represent false decisions (rejecting a true or failing to reject a false at )^{2}^{2}2Footnote
Gene  RF MI  RF MICE  PMM  NORM 

TP53  0.021  0.054  0.04  0.069 
ABCC1  0.001  0.009  0.011  0.007 
HRAS  0.002  0.003  0.003  
GSTM1  0.015  0.156  0.001  0.011 
ERBB2  
CD8A  0.559  0.682  0.618  0.621 
C1D  0.217  0.292  0.297  0.364 
GBP3  0.006  0.068  0.021  0.064 
The bold figures represent false decisions (rejecting a true or failing to reject a false at ).^{3}^{3}3Footnote
The data set delivers in total gene expressions. From the viewpoint of an imputer, a rich data base that can be used to impute missing values is not only restricted to the paired observations of the gene expressions required for analysis. Meng (1994), for example, mentioned that additional information (called auxiliary variables in the sequel) should be considered in an imputation method. Since in general, the number of available gene expression results into a problem, in which least square estimates used in PMM and NORM will suffer from instable estimation, restriction to a suitable subset is required. In Table 3, the gene expressions selected for auxiliary candidates are given. The selection of additional gene expressions was based on their membership to groups, that have been specified based on the biochemical structure of the eight genes. In addition, all gene expressions consisting of at least one missing values have been dropped, in oder to make the result comparable in the later analysis.
All eight genes have been tested separately for mean equality between normal and tumor tissue using the

paired ttest in the completecase analysis before artificially inserting missing values,

the paired ttest after (multiple) imputing missing values with RF MI, RF MICE, PMM and NORM,

the paired ttest after (multiple) imputing missing values with RF MI, RF MICE, PMM and NORM using auxiliary variables,

the weighted permutation test after inserting missing values.
The resulting values of the corresponding tests are summarized in Table 4 and 5. In contrast to the simulation study, we extended the number of imputation draws to . The paired ttest in the completecase analysis identified five out of eight genes having significantly different genetic expressions in normal and tumor tissues; genes ABCC1, HRAS, GSTM1, ERBB2, and GBP3. Mean gene expressions in TP53, CD8A, and C1D have been identified as non significantly different between normal and tumor tissues. Same results for all eight genes could be obtained using the weighted permutation test . However, all considered imputation methods except PMM led to different results for the TP53 gene indicating significantly different results in mean gene expressions between normal and tumor tissues. Even NORM, which indicated favorable results among the imputation procedures during simulation, failed to deliver similar results as the paired ttest in complete case analysis or . Here, including auxiliary variables for TP53 gene helped NORM and also in case of RF MICE. However, this was not the case for RF MI or PMM. For example, the PMM test decisions were flipped after rerunning it with auxiliary variables, finding three more significant results of which one (TP53) was different to the the completecase analysis before inserting missings. To discuss another example, all imputation techniques (RF MI, RF MICE, PMM and NORM) without auxiliary information had difficulties in rejecting the null hypothesis of equal mean gene expressions in GBP3 gene leading again to different results compared to the paired ttest in completecase analysis or . After including auxiliary variables, only the imputation tests RF MI and PMM found the same significance. A similar observation can be made for gene expression GSTM1, where adding extra information lead to a nonrejection of the imputation test RF MICE.
7 Discussion
We investigated the effect of imputation procedures on inference in matched pairs designs. Our simulation results indicated that NRMSE is not a suitable measure for evaluating an imputation method with respect to subsequent inference. Although imputation methods based on the random forest have shown high predictive accuracy, they failed to control typeIerror under various settings and distributions. Moreover, even incorporating uncertainty in the test statistic by multiply imputing missing values as suggested by Rubin (2004) failed to deliver better typeIerror control for the random forest based methods. Results turn into the right direction, when a modified version called RF MICE in the mice
package is used. This because of the additional variance introduced in the random draws from observations falling in the corresponding leaf of a random decision tree. However, this imputation method realized a serious increase in NRMSE and a considerable loss of power while still exhibiting a rather liberal behavior under the null hypothesis. Parametric methods such as PMM and NORM in
mice performed better in terms of typeI error rate control, but realized even less power to detect deviations from the null hypothesis. In contrast, a recently proposed permutation test that only works with the observed data at least delivered acceptable results and can be considered as the favorite choice when testing the equality of means in matched pairs with missing values.In a real data example with gene expressions form breast cancer patients, the effect of auxiliary variables in an imputation scheme has been considered as recommended by Meng (1994). For nonparametric imputation schemes based on random forest models, the effect of auxiliary variables were less effective compared to parametric choices such as PMM and NORM. Eventhough, all the parametric and nonparametric imputation schemes; RF MI, RF MICE, PMM and NORM failed in mirroring all the significant and the non significant genes as obtained using the paired ttest in a complete case analysis before artificially inserting missing values. Contrary, the permutation procedure, capable of working with missings, led to the same conclusions. From these findings we recommend a cautionary use of imputation techniques (with or without auxiliary variables) for drawing inference.
In summary, it is important to which further extent statistical analysis will be conducted when dealing with missing values. In terms of inference, imputation methods can fail to deliver suitable results as shown in the simulation and data example. Here, the permutation test of Amro and Pauly (2017) appeared to be the more reasonable choice. However, if the statistical analysis on partially observed data aims to predict outcomes, imputation methods based on random forest models by Stekhoven and Bühlmann (2011) and Ramosaj and Pauly (2017) could be a suitable solution. Future research will extend the usual investigation to more complex designs involving more than two endpoints and possibly multiple groups as well as the inclusion of more auxiliary variables. Regarding the latter we avoided a situation as parametric imputation methods such as PMM and NORM may run into difficulties. However, random forest models are known to be capable of dealing with highdimensional feature problems and it would be interesting to also investigate this in case of inference after imputation.
8 Acknowledgments
We acknowledge the support of the Daimler AG represented by the research department ’Computer Vision and Camera Systems’. This work was also supported by the German Academic Exchange Service (DAAD) under the project: Research Grants  Doctoral Programmes in Germany, 2015/16 (No. 57129429).
Conflict of Interest: None declared.
References
 Amro et al. (2018) Amro, L., Konietschke, F., and Pauly, M. (2018). MultiplicationCombination Tests for Incomplete Paired Data. arXiv preprint arXiv:1801.08821.
 Amro and Pauly (2017) Amro, L. and Pauly, M. (2017). Permuting incomplete paired data: A novel exact and asymptotic correct randomization test. Journal of Statistical Computation and Simulation, 87(6):1148–1159.
 Barnard and Rubin (1999) Barnard, J. and Rubin, D. B. (1999). Smallsample degrees of freedom with multiple imputation. Biometrika, 86(4):948–955.
 Bhoj (1978) Bhoj, D. S. (1978). Testing equality of means of correlated variates with missing observations on both responses. Biometrika, 65(1):225–228.
 Burgette and Reiter (2010) Burgette, L. F. and Reiter, J. P. (2010). Multiple imputation for missing data via sequential regression trees. American Journal of Epidemiology, 172(9):1070–1076.
 De Jong et al. (2002) De Jong, M., Nolte, I., Te Meerman, G., Van der Graaf, W., Oosterwijk, J., Kleibeuker, J., Schaapveld, M., and De Vries, E. (2002). Genes other than brca1 and brca2 involved in breast cancer susceptibility. Journal of Medical Genetics, 39(4):225–242.
 Doove et al. (2014) Doove, L. L., Van Buuren, S., and Dusseldorp, E. (2014). Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72:92–104.
 Dunson (2018) Dunson, D. B. (2018). Statistics in the big data era: Failures of the machine. Statistics & Probability Letters, 136:4–9.
 Ekbohm (1976) Ekbohm, G. (1976). On comparing means in the paired case with incomplete data on both responses. Biometrika, 63(2):299–304.
 Fay (1992) Fay, R. E. (1992). When are inferences from multiple imputation valid ? In Proceedings of the Survey Research Methods Section of the American Statistical Association, volume 81, pages 227–32.
 Finak et al. (2008) Finak, G., Bertos, N., Pepin, F., Sadekova, S., Souleimanova, M., Zhao, H., Chen, H., Omeroglu, G., Meterissian, S., Omeroglu, A., et al. (2008). Stromal gene expression predicts clinical outcome in breast cancer. Nature Medicine, 14(5):518.
 Harari and Yarden (2000) Harari, D. and Yarden, Y. (2000). Molecular mechanisms underlying erbb2/her2 action in breast cancer. Oncogene, 19(53):6102.
 Hughes et al. (2014) Hughes, R. A., White, I. R., Seaman, S. R., Carpenter, J. R., Tilling, K., and Sterne, J. A. (2014). Joint modelling rationale for chained equations. BMC Medical Research Methodology, 14(1):28.
 Kim et al. (2005) Kim, B. S., Kim, I., Lee, S., Kim, S., Rha, S. Y., and Chung, H. C. (2005). Statistical methods of translating microarray data into clinically relevant diagnostic information in colorectal cancer. Bioinformatics, 21(4):517–528.
 Lin and Stivers (1974) Lin, P.E. and Stivers, L. E. (1974). On difference of means with incomplete data. Biometrika, 61:325–334.
 Looney and Jones (2003) Looney, S. W. and Jones, P. W. (2003). A method for comparing two normal means using combined samples of correlated and uncorrelated data. Statistics in Medicine, 22(9):1601–1610.
 Maritz (1995) Maritz, J. S. (1995). A permutation paired test allowing for missing values. Australian Journal of Statistics, 37(2):153–159.
 Meng (1994) Meng, X.L. (1994). Multipleimputation inferences with uncongenial sources of input. Statistical Science, pages 538–558.
 Morris et al. (2014) Morris, T. P., White, I. R., and Royston, P. (2014). Tuning multiple imputation by predictive mean matching and local residual draws. BMC Medical Research Methodology, 14(1):75.
 Munoz et al. (2007) Munoz, M., Henderson, M., Haber, M., and Norris, M. (2007). Role of the mrp1/abcc1 multidrug transporter protein in cancer. IUBMB life, 59(12):752–757.
 Peto (2007) Peto, R. (2007). Doubts about QRISK score: total/HDL cholesterol should be important [electronic response to hippisleycox j, et al]. British Medical Journal.
 Qi et al. (2018) Qi, Q., Yan, L., and Tian, L. (2018). Testing equality of means in partially paired data with incompleteness in single response. Statistical Methods in Medical Research, 0(0).
 Ramosaj and Pauly (2017) Ramosaj, B. and Pauly, M. (2017). Who wins the Miss Contest for Imputation Methods? Our Vote for Miss BooPF. arXiv preprint arXiv:1711.11394.
 Rubin (2004) Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys, volume 81. John Wiley & Sons.
 Samawi and Vogel (2014) Samawi, H. M. and Vogel, R. (2014). Notes on two sample tests for partially correlated (paired) data. Journal of Applied Statistics, 41(1):109–117.
 Schafer (1997) Schafer, J. L. (1997). Analysis of incomplete multivariate data. Chapman and Hall/CRC.
 Schafer (1999) Schafer, J. L. (1999). Multiple imputation: a primer. Statistical Methods in Medical Research, 8(1):3–15.
 Schenker and Taylor (1996) Schenker, N. and Taylor, J. M. (1996). Partially parametric techniques for multiple imputation. Computational Statistics & Data Analysis, 22(4):425–446.
 Shah et al. (2014) Shah, A. D., Bartlett, J. W., Carpenter, J., Nicholas, O., and Hemingway, H. (2014). Comparison of random forest and parametric imputation models for imputing missing data using mice: a caliber study. American Journal of Epidemiology, 179(6):764–774.
 Stekhoven and Bühlmann (2011) Stekhoven, D. J. and Bühlmann, P. (2011). Missforest: nonparametric missing value imputation for mixedtype data. Bioinformatics, 28(1):112–118.
 Sterne et al. (2009) Sterne, J. A., White, I. R., Carlin, J. B., Spratt, M., Royston, P., Kenward, M. G., Wood, A. M., and Carpenter, J. R. (2009). Multiple imputation for missing data in epidemiological and clinical research: Potential and pitfalls. British Medical Journal, 338:b2393.
 Van Buuren et al. (2006) Van Buuren, S., Brand, J. P., GroothuisOudshoorn, C. G., and Rubin, D. B. (2006). Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation, 76(12):1049–1064.
 Van Buuren and GroothuisOudshoorn (2011) Van Buuren, S. and GroothuisOudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3):1–67.
 Vink et al. (2014) Vink, G., Frank, L. E., Pannekoek, J., and Van Buuren, S. (2014). Predictive mean matching imputation of semicontinuous variables. Statistica Neerlandica, 68(1):61–90.
 Waljee et al. (2013) Waljee, A. K., Mukherjee, A., Singal, A. G., Zhang, Y., Warren, J., Balis, U., Marrero, J., Zhu, J., and Higgins, P. D. (2013). Comparison of imputation methods for missing laboratory data in medicine. British Medical Journal Open, 3(8):e002847.
 Yu et al. (2012) Yu, D., Lim, J., Liang, F., Kim, K., Kim, B. S., and Jang, W. (2012). Permutation test for incomplete paired data with application to cDNA microarray data. Computational Statistics & Data Analysis, 56(3):510–521.
 Yu et al. (2007) Yu, L.M., Burton, A., and RiveroArias, O. (2007). Evaluation of software for multiple imputation of semicontinuous data. Statistical Methods in Medical Research, 16(3):243–258.