Incomplete data are nearly inevitable in medical studies and require careful attention, since missing data can seriously affect the validity of statistical inference when not handled properly (Rubin, 1987). A commonly used technique to obtain valid inference with incomplete datasets is multiple imputation: A procedure that replaces each missing value by several plausible values, thereby creating multiple completed datasets. The imputation procedure is followed by an analysis procedure in which each dataset is analyzed separately before these individual results are pooled into a single estimate. The pooling of multiple parameter estimates is useful to reflect an additional source of uncertainty that imputed values bring to the analysis: Since we do not know how well imputed values correspond to the unobserved values, differences between imputations are used to quantify and incorporate this uncertainty in statistical inference (Rubin, 1987).
When the imputation and analysis procedures have been completed, new data may be added to the existing dataset. Such multistage data are common, with longitudinal (new variables) and sequential (new cases) designs as well-known examples. New data raise the question of what to do with the existing imputations. We can overwrite the existing imputations with new values, or we can keep the existing imputations and impute the new data only (Rubin, 2003; Harel, 2003; McGinniss and Harel, 2016).
2 Problem illustration
We illustrate these methods and their operational aspects with the Project on Preterm and Small for Gestational Age Infants (POPS) (Verloove-Vanhorick et al., 1986; Veen et al., 1991). The POPS study included about 94% of the children born very preterm and/or with a very low birth weight in The Netherlands in 1983. POPS has multiple waves since these children have been followed at various ages (e.g. 1, 5, 10, 14, 19 years) to evaluate physical, cognitive and psychosocial outcomes. Like many longitudinal studies, POPS suffers from dropout that should not be ignored: Not all surviving participants completed participation at age 19, and those who dropped out differed systematically from full responders on multiple outcomes of interest (Hille et al., 2005). Each wave of data has missing values that we aim to treat with multiple imputation.
Overwriting existing imputations with new values is called re-imputation and treats the combined data of multiple waves as a new missing data problem. This strategy merges the incomplete data of previous waves with new variables and performs a multiple imputation on these combined data. Each new wave of data results in a new set of imputed datasets. Updating existing imputations with new data has consequences for the analysis of the old data: Repeating previous analyses with newly imputed data may result in different parameter estimates or conclusions. We demonstrate this with an adapted example from Van Dommelen et al. (2014). The researchers used the POPS data to predict the effect of early catch-up growth (developing towards the median of the growth charts in the first year of life) on health and well-being in young adulthood. Multiple imputation of the data available at age 14 showed that catch-up growth in weight did not predict length and weight at age 14. We observed the opposite when we ran the same analysis with data from age 19 included in the imputation: Catch-up growth did predict length and weight at age 14. Re-imputation thus potentially lacks replicability, and contrasting conclusions such as these raise the question of which analysis should be trusted.
Keeping the existing imputations and imputing only the new data is known as nested imputation (Rubin, 2003). This strategy combines the completed datasets from previous waves with the new incomplete data and performs a multiple imputation on each of these combined datasets (Rubin, 2003; Harel, 2003; Harel and Schafer, 2003). Every completed dataset from the previous wave is imputed several times in each new wave, so the total number of nested datasets equals the product of the numbers of imputations per wave. Nested imputation preserves completed datasets from previous waves, which is useful since all existing analyses remain unchanged. The strategy can be computationally challenging, however, since the number of imputed datasets increases exponentially with each additional wave of data. These datasets require specific pooling rules to take the nested structure into account, and the complexity of these rules increases sharply when the number of stages goes up (Shen, 2000; Rubin, 2003; McGinniss and Harel, 2016). Nested imputation would be highly inconvenient for datasets with several waves.
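The growth in the number of nested datasets can be made concrete with a minimal Python sketch. The function name and the choice of 5 imputations per wave are illustrative assumptions, not values prescribed by the strategy.

```python
from math import prod

def n_completed_datasets(imputations_per_wave):
    """Total number of completed datasets when every completed dataset of
    one wave is imputed again in the next wave (nested structure)."""
    return prod(imputations_per_wave)

# Hypothetical example: 5 imputations in each of 1 to 4 waves.
nested_counts = [n_completed_datasets([5] * waves) for waves in range(1, 5)]
# Re-imputation starts from scratch in every wave, so its count stays at 5.
re_imputation_counts = [5 for _ in range(1, 5)]

print(nested_counts)         # [5, 25, 125, 625]
print(re_imputation_counts)  # [5, 5, 5, 5]
```

With these assumed numbers, a fourth wave already requires fitting and pooling 625 analyses under nested imputation.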
As a potentially more convenient alternative, we propose to append imputations to the existing ones by doing a single imputation on the new data of each of the completed datasets. Appended imputation is a special case of nested imputation with a single imputation in every wave after the first. The method has several attractive features. Whereas re-imputation includes future information to overwrite existing imputations, the successive structure of appended imputation preserves existing imputations and allows for replicable conclusions. Appended imputation is computationally convenient, since the resulting imputations can be pooled with a procedure for non-nested datasets.
While appended imputation has operational advantages over re-imputation and nested imputation, differences between strategies may affect the validity and the efficiency of statistical inference with the completed data. To be more specific, it is unclear whether the initial multiple imputation of appended imputation can account for sufficient imputation uncertainty to obtain unbiased and confidence-valid estimates. Moreover, the difference between imputing the entire dataset (re-imputation) and imputing new data only (nested imputation) in multistage data might influence the validity of statistical inference as well. The current study aims to evaluate the statistical properties of the three introduced strategies for creating imputations in a multistage context. Section 3 discusses potential effects of updating or preserving existing imputations on validity. In Section 4, the statistical performance of the three strategies is evaluated by means of simulation. We apply the three strategies to the POPS dataset in Section 5 and conclude the paper with a discussion in Section 6.
When the statistical model of interest (i.e. the analysis model) describes relations between variables, unobserved values should be imputed in such a way that the completed data properly reflect these relationships. This condition has been met if the variables in the imputation model correspond to the variables in the analysis model: An assumption that we call congeniality between the imputation and the analysis (Meng, 1994). Any discrepancy between the variables in the imputation model and the analysis model results in an uncongenial imputation model.
Uncongeniality can be problematic for the validity of statistical inference, in particular when the imputation model contains fewer variables than the analysis model. Ignoring variables in the imputation model assumes there is no relationship between the incomplete and the omitted variable, resulting in subjects with (falsely) unrelated imputations (Meng, 1994). Mixing these subjects with (correctly) correlated observed values results in systematic underestimation of the strength of this relationship and undercoverage of nominal confidence intervals of relationship parameters (Meng, 1994; Zhu and Raghunathan, 2015; Xie and Meng, 2015; Daniels et al., 2014; Gelman and Raghunathan, 2001). In contrast, validity is generally not threatened by uncongeniality if the imputation model contains more variables than the analysis model. The information from extra variables results in so-called superefficient parameter estimates, which have smaller variances than can be obtained with a congenial imputation model (Rubin, 1996; Xie and Meng, 2015).
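The attenuation caused by an imputation model that omits a related variable can be illustrated with a small simulation. The sketch below is a simplified assumption-laden example, not the procedure used in this paper: it uses two standard normal variables with a true correlation of 0.6, deletes 40% of one variable completely at random, and fills the gaps with draws from that variable's marginal distribution, thereby ignoring the other variable. All names and settings are hypothetical.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2024)
n, rho = 20000, 0.6  # illustrative sample size and true correlation
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rho * xi + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for xi in x]

# Delete 40% of x completely at random.
missing = [rng.random() < 0.4 for _ in range(n)]
x_obs = [xi for xi, m in zip(x, missing) if not m]
mean_obs = sum(x_obs) / len(x_obs)
sd_obs = (sum((xi - mean_obs) ** 2 for xi in x_obs) / (len(x_obs) - 1)) ** 0.5

# Uncongenial imputation: draw x from its marginal distribution, ignoring y.
x_imp = [rng.gauss(mean_obs, sd_obs) if m else xi for xi, m in zip(x, missing)]

r_full = pearson(x, y)         # correlation before deletion
r_imputed = pearson(x_imp, y)  # correlation in the completed data
print(round(r_full, 2), round(r_imputed, 2))
```

Because the imputed values are unrelated to the second variable, only the observed share of cases carries the relationship, and the correlation in the completed data shrinks towards zero. Repeating the imputation and pooling does not repair this, since every completed dataset is attenuated in the same direction.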
The three imputation strategies include new variables in the imputation models in different ways and consequently deal with congeniality differently. Re-imputation updates existing imputations with information obtained from the new variables and can result in congenial imputation models for analysis models that relate old and new variables. If the analysis model is limited to variables from previous waves while the imputation model includes new variables, the updated imputations include extra information and may result in sharper inferences than the analysis using the existing imputations (which is illustrated by the example in Section 2).
Since nested and appended imputation ignore any information available from later waves, existing imputations were created under the assumption that the old variables are unrelated to the new variables. Although the imputation model of the existing imputations and any analysis model relating old and new variables are thus uncongenial, estimates of nonzero relationships are not necessarily incorrect. Whether these parameters are estimated accurately depends on the combination of missing values as reflected by the missingness pattern. Estimates may be correct under monotone missingness, which is characterized by the following two situations: 1.) Incomplete cases have missing data in the new variables only; 2.) Incomplete cases have missing data on old variables and all new variables. In contrast, relationships will be underestimated when missingness is nonmonotone, such that incomplete cases have missing data in the old variables while the new variables have been observed. If the old and new variables are related, newly observed values provide essential information about the unobserved values in the old data, but the existing imputations could not take this information into account since it was not in their imputation model. Monotone missingness does not suffer from biased relationships: There are no newly observed values to provide information about the existing imputations and their relations with new variables.
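The distinction between the two monotone situations and the nonmonotone situation can be checked case by case. The following Python sketch is one possible reading of the definitions above; the function names, the use of `None` for a missing value, and the example cases are assumptions for illustration.

```python
from collections import Counter

def classify_case(old, new):
    """Classify one case given its old-wave and new-wave values (None = missing)."""
    old_missing = any(v is None for v in old)
    new_missing = any(v is None for v in new)
    new_observed = any(v is not None for v in new)
    if not old_missing and not new_missing:
        return "complete"
    if not old_missing:
        return "monotone"     # situation 1: missing data in new variables only
    if not new_observed:
        return "monotone"     # situation 2: old variables and all new variables missing
    return "nonmonotone"      # old value missing while a new value is observed

def pattern_summary(cases):
    """Count the classifications over a list of (old, new) pairs."""
    return Counter(classify_case(old, new) for old, new in cases)

# Hypothetical cases with two old and two new variables.
cases = [
    ([1.2, 0.4], [0.3, 1.1]),    # complete
    ([1.2, 0.4], [None, 1.1]),   # monotone, situation 1
    ([1.2, None], [None, None]), # monotone, situation 2
    ([1.2, None], [0.3, None]),  # nonmonotone
]
print(pattern_summary(cases))
```

Only cases in the last category carry newly observed information that the existing imputations could not have used.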
3.2 Imputation uncertainty
Proper quantification of uncertainty requires reflection of imputation uncertainty: The variance of parameter estimates must correctly reveal the extra uncertainty of adopting imputations as data (Rubin, 1987). The pooling procedure therefore accommodates differences between imputed datasets, such that parameter estimates are confidence-valid. Whereas re-imputation uses regular pooling rules for multiple imputation, nested multiple imputation requires specific pooling rules that respect the nested data structure (Rubin, 2003; Harel, 2003; Shen, 2000). These pooling rules for two-level nested datasets are more complex than the standard pooling rules for non-nested data, as shown in Table 3 (Shen, 2000; McGinniss and Harel, 2016).
Datasets completed with appended imputation have nests of size one, such that the pooling rules for nested imputation simplify to the regular pooling rules (Rubin, 2003). However, single imputation in general underestimates variance (Rubin, 1987), and it is unclear whether the multiple imputation structure of the first stage can account for sufficient imputation uncertainty to justify single imputation in later waves. A full multiple imputation at each later stage (i.e. nested imputation) may be necessary to obtain confidence-valid results.
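The simplification for nests of size one can be verified numerically. The Python sketch below implements the point estimate and total variance of the regular pooling rules (Rubin, 1987) and of the two-level nested rules as they are commonly stated (Shen, 2000; Rubin, 2003). It is written from the standard form of these rules rather than from Table 3, it omits the degrees of freedom, and all function names and input values are hypothetical.

```python
from statistics import mean

def pool_rubin(q, u):
    """Regular pooling for m independent completed datasets.
    q: point estimates, u: their sampling variances."""
    m = len(q)
    q_bar = mean(q)
    u_bar = mean(u)
    between = sum((qi - q_bar) ** 2 for qi in q) / (m - 1)
    total = u_bar + (1 + 1 / m) * between
    return q_bar, total

def pool_nested(q, u):
    """Nested pooling for m nests of n datasets each.
    q[l][k], u[l][k]: estimate and variance for dataset k in nest l."""
    m, n = len(q), len(q[0])
    nest_means = [mean(nest) for nest in q]
    q_bar = mean(nest_means)
    u_bar = mean([mean(nest) for nest in u])
    ms_between = n * sum((qm - q_bar) ** 2 for qm in nest_means) / (m - 1)
    if n > 1:
        ms_within = sum((qk - qm) ** 2
                        for nest, qm in zip(q, nest_means)
                        for qk in nest) / (m * (n - 1))
    else:
        ms_within = 0.0  # no within-nest variation exists; its weight (1 - 1/n) is zero
    total = u_bar + (1 + 1 / m) * ms_between / n + (1 - 1 / n) * ms_within
    return q_bar, total

# Hypothetical estimates and variances from 5 completed datasets.
q = [1.0, 1.2, 0.9, 1.1, 1.05]
u = [0.04, 0.05, 0.045, 0.05, 0.04]

regular = pool_rubin(q, u)
# Appended imputation: the same 5 datasets seen as 5 nests of size one.
appended = pool_nested([[qi] for qi in q], [[ui] for ui in u])
print(regular, appended)
```

With nests of size one the within-nest term drops out and the between-nest term equals the usual between-imputation variance, so both functions return the same pooled estimate and total variance.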
To conclude, nested and appended imputation in general make more restrictive assumptions about the missingness pattern of incomplete data than re-imputation, and appended imputation may underestimate variance. In the current simulation study, we investigated these assumptions and their consequences for statistical inference, resulting in the following research questions:
In which situations could possible uncongeniality between the imputation and complete-data models created under nested and appended imputation affect the validity of statistical inferences?
Would the three imputation strategies perform similarly under a monotone missing data mechanism?
Does appended imputation result in confidence-valid parameter estimates?
4 Simulation study
4.1 Simulation setup
We evaluated the performance of re-imputation, nested imputation and appended imputation in a two-stage setup, resembling a longitudinal design with two waves. The first-wave data consisted of one completely observed and one incomplete covariate. The second-wave data had two incomplete variables. To investigate the consequences of potential uncongeniality, we manipulated the correlation structure and the missingness pattern of the data.
Since the relation between old and new data is crucial for valid inference, we distinguished correlations between variables within waves from correlations between variables from different waves. We specified 16 correlation structures, as presented in Table 1.
For each of these correlation structures, we drew samples from the multivariate standard normal distribution. These data were made incomplete under monotone and non-monotone missingness (see Table 2) by randomly deleting values in accordance with the missingness pattern. Each combination of missing values had the same probability and every sample had a total of 20% missing values.
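The sampling step can be sketched in Python as follows. The sketch assumes a simplified structure with one common within-wave correlation and one common between-wave correlation for four variables (two per wave); the actual structures are those of Table 1. The sample size, the correlation values and all function names are hypothetical, and the deletion step, which depends on the missingness patterns of Table 2, is left out.

```python
import math
import random

def correlation_matrix(rho_within, rho_between):
    """4 x 4 correlation matrix; variables 0-1 form wave 1, variables 2-3 form wave 2."""
    w, b = rho_within, rho_between
    return [[1, w, b, b],
            [w, 1, b, b],
            [b, b, 1, w],
            [b, b, w, 1]]

def cholesky(a):
    """Lower-triangular Cholesky factor; fails if `a` is not positive definite."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0:
                    raise ValueError("correlation matrix is not positive definite")
                low[i][j] = math.sqrt(d)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def draw_sample(n, rho_within, rho_between, rng):
    """Draw n rows from a multivariate normal with zero means, unit variances
    and the requested correlation structure."""
    low = cholesky(correlation_matrix(rho_within, rho_between))
    rows = []
    for _ in range(n):
        z = [rng.gauss(0, 1) for _ in range(4)]
        rows.append([sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(4)])
    return rows

# Hypothetical setting: within-wave correlation 0.5, between-wave correlation 0.3.
sample = draw_sample(1000, 0.5, 0.3, random.Random(1))
print(len(sample), len(sample[0]))  # 1000 4
```

Under this simplified structure, not every pair of within-wave and between-wave correlations is admissible: the Cholesky step raises an error when the between-wave correlation is too large relative to the within-wave correlation.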
The samples were completed using re-imputation, nested imputation and appended imputation. Under re-imputation, datasets with incomplete first-wave and second-wave data were imputed 5 times. Under nested and appended imputation, we imputed each first-wave dataset 5 times and added the incomplete second-wave data to each completed dataset. We imputed these partially completed datasets 5 times (nested imputation) or once (appended imputation), resulting in a total of 5 datasets after re-imputation and appended imputation, and 25 datasets after nested imputation.
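The data flow of the three strategies can be sketched in Python. The sketch only illustrates which values are overwritten and how many datasets result. Its imputer is a deliberately crude placeholder (a random draw from the observed values in the same column) that ignores relations between variables, so it should not be mistaken for the imputation method used in this study. All names and the toy data are hypothetical.

```python
import random

def impute_once(rows, rng):
    """Placeholder imputer: replace each None by a random observed value
    from the same column. A real analysis would use a model-based method."""
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [[rng.choice(observed[j]) if v is None else v for j, v in enumerate(r)]
            for r in rows]

def join(old_rows, new_rows):
    """Put old-wave and new-wave variables of each case side by side."""
    return [old + new for old, new in zip(old_rows, new_rows)]

def re_impute(old, new, m, rng):
    """Impute the combined incomplete data m times; old imputations are discarded."""
    combined = join(old, new)
    return [impute_once(combined, rng) for _ in range(m)]

def nested_impute(old_completed, new, n_per_nest, rng):
    """Keep each completed old dataset and impute the new data n_per_nest times."""
    return [[impute_once(join(done, new), rng) for _ in range(n_per_nest)]
            for done in old_completed]

def appended_impute(old_completed, new, rng):
    """Nested imputation with a single imputation per nest."""
    return [nest[0] for nest in nested_impute(old_completed, new, 1, rng)]

rng = random.Random(7)
# Toy data: wave 1 has one complete and one incomplete variable, wave 2 two incomplete ones.
old = [[rng.gauss(0, 1), None if rng.random() < 0.2 else rng.gauss(0, 1)]
       for _ in range(50)]
new = [[None if rng.random() < 0.2 else rng.gauss(0, 1) for _ in range(2)]
       for _ in range(50)]

old_completed = [impute_once(old, rng) for _ in range(5)]  # existing imputations
re_sets = re_impute(old, new, 5, rng)
nested_sets = nested_impute(old_completed, new, 5, rng)
appended_sets = appended_impute(old_completed, new, rng)
print(len(re_sets), sum(len(nest) for nest in nested_sets), len(appended_sets))  # 5 25 5
```

In the nested and appended results, the first-wave columns of every dataset are identical to the existing completed dataset they were built on, whereas re-imputation draws them anew.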
After imputation, a linear regression model was fitted on each completed dataset.
Results were pooled using the method-specific pooling rules presented in Table 3. Quantities of interest were variable means and regression coefficients of the fitted model. True values were population means and population regression coefficients derived from the correlation structure of the data.
Table 3: Pooling rules for independent and nested datasets.
|Parameter||Independent datasets||Nested datasets|
|Note. The rules use the complete-data estimate and its sampling variance for each dataset in each nest.|
We operationalized performance as validity and efficiency. Imputations were considered valid if pooled parameter estimates were unbiased and the coverage of their confidence intervals was at least the nominal 95%, adhering to the criteria for proper imputation (Rubin, 1987). Relative efficiency referred to the width of the 95% confidence interval compared to the other imputation strategies.
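These performance criteria can be computed from the pooled results of repeated samples. The Python sketch below is a minimal illustration; the function names and the input values are hypothetical, and expressing relative efficiency as a ratio of average interval widths is an assumption about how the comparison could be made.

```python
from statistics import mean

def evaluate(estimates, lowers, uppers, true_value):
    """Bias, coverage and average width of the confidence intervals
    over repeated samples for one parameter and one strategy."""
    bias = mean(estimates) - true_value
    coverage = mean([1.0 if lo <= true_value <= hi else 0.0
                     for lo, hi in zip(lowers, uppers)])
    width = mean([hi - lo for lo, hi in zip(lowers, uppers)])
    return {"bias": bias, "coverage": coverage, "width": width}

def relative_width(result_a, result_b):
    """Average interval width of strategy A relative to strategy B."""
    return result_a["width"] / result_b["width"]

# Hypothetical pooled results from four repeated samples; true value 1.0.
strategy_a = evaluate([0.9, 1.1, 1.0, 1.2],
                      [0.7, 0.9, 0.8, 1.05],
                      [1.1, 1.3, 1.2, 1.35], true_value=1.0)
strategy_b = evaluate([0.95, 1.05, 1.0, 1.1],
                      [0.7, 0.8, 0.75, 0.85],
                      [1.2, 1.3, 1.25, 1.35], true_value=1.0)
print(strategy_a, relative_width(strategy_a, strategy_b))
```

A ratio below one indicates that the first strategy produced narrower intervals on average, which only counts as an efficiency gain if its coverage also reaches the nominal level.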
While all analyses following re-imputation resulted in valid estimates, the validity of nested and appended imputation depended on the missingness pattern and the correlation structure of the data. Under monotone missingness, variable means and regression coefficients were valid, and non-monotone missingness resulted in proper variable mean estimation. The validity of regression coefficients under non-monotone missingness depended on the correlation structure in the data.
When correlations within timepoints were lower than correlations between timepoints, regression coefficient estimates were clearly biased and their confidence intervals undercovered. In contrast, when correlations of variables within waves dominated correlations of variables between waves, bias of regression coefficients was small and their coverage exceeded the nominal level. The numerical results for one of the regression coefficients are presented graphically in Figure 1 and shown in Table 4. Other regression coefficients performed qualitatively similarly; their results are available upon request.
|Note. RI = re-imputation, NI = nested imputation, AI = appended imputation.|
The 95% confidence intervals of variable means had similar relative widths across the three strategies. Estimation of regression coefficients was slightly more efficient under nested imputation than under re-imputation or appended imputation.
When taken together, we found that:
All techniques performed similarly under monotone missingness;
Appended and nested imputation were on par in all scenarios;
Appended and nested imputation performed well when correlations within timepoints were high;
Appended and nested imputation performed poorly when correlations between timepoints were high and correlations within timepoints were low.
5 Data application
In practice, violating congeniality in nested and appended imputation may have less severe consequences than our simulation suggested. Real datasets often contain auxiliary variables that provide extra information to reduce the impact of ignoring future data (Daniels et al., 2014; Xie and Meng, 2015). To compare the performances of the imputation strategies in practice, we applied them to the POPS dataset (Verloove-Vanhorick et al., 1986).
We predicted five outcomes at age 19 (second-wave data: length, cognition, health-related quality of life, internalizing problems, and externalizing problems) from four types of catch-up growth (first-wave data: weight, length, head circumference, and weight-length), either unadjusted or adjusted for potential confounders (also first-wave data), using analysis models specified by Van Dommelen et al. (2014). Potential confounders were neonatal factors (gestational age and sex) and environmental factors (maternal age at birth, maternal smoking during pregnancy, maternal diabetes, socioeconomic status, parity, ethnicity, and target length). The researchers selected cases born small for gestational age without severe complications from the incomplete POPS cohort, with separate selections for weight, for length, for head circumference, and for weight adjusted for length. A more elaborate description of case selection and operationalization of variables can be found in the original article (Van Dommelen et al., 2014).
We completed the dataset using the three imputation strategies. As in the original article (Van Dommelen et al., 2014), we re-imputed the incomplete first-wave and second-wave data multiple times, and we multiply imputed the first-wave data under nested and appended imputation. We added the incomplete second-wave data to each completed dataset and imputed these partially completed datasets multiple times (nested imputation) or once (appended imputation).
After multiple imputation, we made the aforementioned data selections and fitted eight linear regression models per outcome variable to the appropriate data. The models predicted the outcome from catch-up growth in weight, length, head circumference or weight adjusted for length, either unadjusted or adjusted for potential confounders. Quantities of interest were regression coefficients of catch-up growth predictors and their confidence intervals.
5.2.1 Assumptions for congeniality
Although none of the data selections followed a strictly monotone missingness pattern, a large share of the missing values came from cases without observations in the second-wave data, with the weight-length predictor as the exception.
Potential problems arising from nonmonotone missingness may be mitigated by strong correlations within waves. Each catch-up growth predictor had at least one correlation with another catch-up growth predictor (i.e. within the first-wave data) that exceeded the correlation with each of the outcomes (i.e. between the first-wave and second-wave data). Hence, we considered nested and appended imputation appropriate for these data.
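The check on within-wave versus between-wave correlations can be written down as a short Python sketch. Comparing absolute correlations is an assumption, and the function name, variable labels and correlation values below are hypothetical rather than taken from the POPS data.

```python
def within_wave_dominates(corr, predictors, outcomes):
    """For each predictor, report whether at least one correlation with another
    predictor exceeds its correlation with every outcome (in absolute value)."""
    result = {}
    for p in predictors:
        strongest_within = max(abs(corr[p][q]) for q in predictors if q != p)
        strongest_between = max(abs(corr[p][o]) for o in outcomes)
        result[p] = strongest_within > strongest_between
    return result

# Hypothetical correlations for two predictors and two outcomes.
corr = {
    "weight": {"length": 0.70, "outcome_a": 0.30, "outcome_b": -0.20},
    "length": {"weight": 0.70, "outcome_a": 0.75, "outcome_b": 0.10},
}
check = within_wave_dominates(corr, ["weight", "length"], ["outcome_a", "outcome_b"])
print(check)  # {'weight': True, 'length': False}
```

A predictor flagged as `False` is more strongly related to a later outcome than to any variable of its own wave, which is the situation in which nested and appended imputation performed poorly in the simulation.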
5.2.2 Parameter estimates
Regression coefficients of catch-up growth predicting length at age 19 are presented in Table 5. The three imputation strategies resulted in similar point estimates and confidence intervals, and agreed on their conclusions in seven out of eight models (weight, length, head circumference, either adjusted or unadjusted, and weight-length unadjusted). The adjusted model of weight-length resulted in contrasting conclusions: Weight-length predicted length after re-imputation, but not after nested and appended imputation. Over all models, nested imputation and re-imputation were approximately equally efficient in terms of the relative width of the 95% confidence interval, and both were more efficient than appended imputation. Results of other outcome variables were qualitatively similar and are available upon request.
Table 5: Regression coefficients (b) of catch-up growth predicting length at age 19, with 95% confidence intervals and their widths, per imputation strategy.
|Re-imputation||Nested imputation||Appended imputation|
|b||95% CI||Width||b||95% CI||Width||b||95% CI||Width|
|Note. HC = head circumference, WL = weight adjusted for length.|
The current study investigated appended imputation as an alternative to re-imputation and nested imputation as generic strategies to deal with multistage incomplete data. Appended imputation is attractive since 1.) it keeps the old imputations in place, thereby preserving the validity of statistical analyses performed on the earlier waves; and 2.) it is computationally and logistically more convenient than nested imputation. We investigated the inherent dangers of appended imputation, and found that it could be used for missing data patterns close to monotone, and for situations where the correlations relating variables within waves dominate the correlations relating variables between waves. Longitudinal datasets suffering from dropout in particular could benefit from appended imputation. We do not recommend appended imputation over re-imputation when correlations between timepoints are high and correlations within timepoints are low, unless there is an explicit desire to maintain reproducibility of historic analyses at the expense of statistical validity.
- Daniels et al. (2014) MJ Daniels, Chenguang Wang, and BH Marcus. Fully Bayesian inference under ignorable missingness in the presence of auxiliary covariates. Biometrics, 70(1):62–72, 2014.
- Gelman and Raghunathan (2001) Andrew Gelman and Trivellore E Raghunathan. Using conditional distributions for missing-data imputation. Statistical Science, 15:268–69, 2001.
- Harel (2003) O. Harel. Strategies for data analysis with two types of missing values. PhD thesis, Citeseer, 2003.
- Harel and Schafer (2003) O. Harel and J. Schafer. Multiple imputation in two stages. In Proceedings of Federal Committee on Statistical Methodology 2003 Conference. Citeseer, 2003.
- Hille et al. (2005) ETM Hille, L Elbertse, J Bennebroek Gravenhorst, René Brand, SP Verloove-Vanhorick, et al. Nonresponse bias in a follow-up study of 19-year-old adolescents born as preterm infants. Pediatrics, 116(5):e662–e666, 2005.
- McGinniss and Harel (2016) J. McGinniss and O. Harel. Multiple imputation in three or more stages. Journal of Statistical Planning and Inference, 176:33–51, 2016.
- Meng (1994) Xiao-Li Meng. Multiple-imputation inferences with uncongenial sources of input. Statistical Science, pages 538–558, 1994.
- R Core Team (2016) R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016. URL http://www.R-project.org/.
- Rubin (1987) D. B. Rubin. Multiple imputation for nonresponse in surveys, volume 81. John Wiley & Sons, 1987.
- Rubin (2003) D. B. Rubin. Nested multiple imputation of NMES via partially incompatible MCMC. Statistica Neerlandica, 57(1):3–18, 2003.
- Rubin (1996) Donald B Rubin. Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434):473–489, 1996.
- Shen (2000) Zijin Shen. Nested multiple imputation. PhD thesis, Harvard University, Cambridge, MA, 2000.
- Van Buuren and Groothuis-Oudshoorn (2011) S. Van Buuren and K. Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 2011.
- Van Dommelen et al. (2014) Paula Van Dommelen, Sylvia M Van Der Pal, J Bennebroek Gravenhorst, Frans J Walther, Jan M Wit, and KM Van der Pal de Bruin. The effect of early catch-up growth on health and well-being in young adults. Annals of Nutrition and Metabolism, 65(2-3):220–226, 2014.
- Veen et al. (1991) Sylvia Veen, Martina H Ens-Dokkum, Anneke M Schreuder, S Pauline Verloove-Vanhorick, JH Ruys, and R Brand. Impairments, disabilities, and handicaps of very preterm and very-low-birthweight infants at five years of age. The Lancet, 338(8758):33–36, 1991.
- Verloove-Vanhorick et al. (1986) S Pauline Verloove-Vanhorick, RA Verwey, R Brand, J Bennebroek Gravenhorst, MJNC Keirse, and JH Ruys. Neonatal mortality risk in relation to gestational age and birthweight: results of a national survey of preterm and very-low-birthweight infants in the Netherlands. The Lancet, 327(8472):55–57, 1986.
- Xie and Meng (2015) Xianchao Xie and Xiao-Li Meng. Dissecting multiple imputation from a multi-phase inference perspective: What happens when God’s, imputer’s and analyst’s models are uncongenial. Statist. Sinica, 2015.
- Zhu and Raghunathan (2015) Jian Zhu and Trivellore E Raghunathan. Convergence properties of a sequential regression multiple imputation algorithm. Journal of the American Statistical Association, 110(511):1112–1124, 2015.