Missing data are frequently encountered in economic, medical and social science disciplines. Valid statistical inferences for missing data depend crucially on correct identification of the underlying missingness mechanism, which was divided by Rubin (1976) into three categories. The missingness is called missing at random (MAR) or ignorable if it does not depend on the missing values themselves conditioning on the observed data, and it is called missing not at random (MNAR) or nonignorable otherwise. A degenerate case of MAR is missing completely at random (MCAR), where the missingness does not depend on either the observed or the missing data.
Under the MAR assumption, both the propensity score and outcome regression models are nonparametrically identifiable, and it is therefore always tractable to conduct valid inferences. A wide range of statistical approaches have been developed for MAR data analysis, including likelihood-based approaches (Dempster et al., 1977; Horton and Laird, 2001; Ibrahim, 1990)
, multiple imputation(Rubin and Schenker, 1986; Vach and Schumacher, 1993; Rubin, 2004), semiparametric methods (Zhao et al., 1996; Rubins et al., 1994) and the inverse probability weighting method (Rosenbaum and Rubin, 1983; Rubins et al., 1994, 1995). To alleviate the risk of possible misspecifications of propensity score or outcome regression models, estimators of double or multiple robustness have been proposed and have attracted much attention (Laird and Pauler, 1999; Kang and Schafer, 2007; Han and Wang, 2013; Chan and Yam, 2014; Han, 2014a, b, 2016a, 2016b; Chen and Haziza, 2017; Han et al., 2019). For a more comprehensive literature review on the analysis of MAR data, see, for example, Little and Rubin (2002); Tsiatis (2006); Kim and Shao (2013) and references therein.
Things become much more challenging when the MAR mechanism is violated or data are MNAR. The foremost challenge is parameter identifiability: the underlying generating model is often not identifiable based on the observed data. For MNAR data, even completely parametric models for the data-generating model may be not identifiable (Heckman, 1979; Qin et al., 2002; Greenlees et al., 1982; Miao et al., 2016), let alone semiparametric models (Qin et al., 2002; Tang et al., 2003; Kim and Yu, 2011; Shao and Wang, 2016; Liu et al., 2021) or completely nonparametric models (Robins and Ritov, 1997)
. When no general identification results are available for MNAR data, the joint distribution of the full data can only be identified under specific model assumptions. A popular condition for model identifiability with MNAR data is the existence of an ‘instrumental variable’(Wang et al., 2014) or ‘ancillary variable’ (Miao and Tchetgen Tchetgen, 2016)
, which does not affect the missingness but may affect the conditional distribution of the response variable. Quite a few studies of MNAR data under the instrumental variable condition have been conducted.Wang et al. (2014)
found that the identification issue for MNAR data can be overcome with the help of instrumental variables, and they then proposed the use of the generalized method of moments to estimate model parameters. This approach was extended byZhao and Shao (2015) to generalized linear models under a parametric propensity score model, and this was further extended by Shao and Wang (2016) to allow for the semiparametric propensity model of Kim and Yu (2011). In the presence of an instrumental variable or ancillary variable, double robust estimation and semiparametric efficient estimation under MNAR data have also been investigated (see, e.g., Morikawa and Kim, 2016; Miao and Tchetgen Tchetgen, 2016; Ai et al., 2020; Liu et al., 2021). However, an instrumental variable may not be readily available or may not be straightforward to find in practice, which complicates the identifiability and inferences of the existing statistical approaches. Tang and Ju (2018) and Wang and Kim (2021) provide more comprehensive reviews of statistical inferences for nonignorable missing-data problems.
These discussions arguably reveal that methods for handling MAR data and MNAR data are totally different: the former are relatively easy whereas the latter are much more difficult. Correctly determining which mechanism is responsible for data being missing is crucial to the subsequent development of valid inference methods. This raises the hypothesis testing problem of whether the missingness mechanism is MAR or MNAR.
A relative simple counterpart of this hypothesis testing problem is whether the missingness mechanism is MCAR or MAR. Many tests for MCAR have been developed in recent decades, since the MCAR category is at the centre of interest of many behavioural and social scientists confronted with missing data (Simon and Simonoff, 1986). Little (1988) constructed a test by comparing the means of recorded values of each variable between groups of different missingness patterns. Chen and Little (1999) extended Little’s test to longitudinal data by comparing the means of the general estimating equations across different missingness patterns with zero, with any departure from zero then indicating rejection of the MCAR hypothesis. More extensions of Little’s idea to comparisons of the means, the covariance matrices and/or the distributions across different missingness patterns have also been investigated (see, e.g., Jamshidian and Jalal, 2010; Kim and Bentler, 2002; Li and Yu, 2015). Recently, Zhang et al. (2019) proposed a nonparametric approach for testing MCAR based on empirical likelihood (Owen, 1988, 1990, 2001). Their approach also provides a unified procedure for estimation after the MCAR hypothesis has been rejected.
In contrast to MCAR, testing for MAR has not received much attention so far. The first contribution in this direction was the nonparametric test proposed by Breunig (2019)
, which was based on an integrated squared distance. Under a generalized linear regression model,Duan et al. (2020)
proposed to test MAR by a quadratic form of the difference of the estimators of the regression coefficient under the MAR and MNAR assumptions, respectively. To the best of our knowledge, these are the only two formal tests for MAR. They both require the existence of an instrumental variable to guarantee identifiability, because their test statistics depend on consistent estimates under MNAR. However, as mentioned previously, the identification of an instrumental variable is usually not straightforward, and, even worse, it may not exist. Also, consistent parameter estimation itself under MNAR is rather challenging.
In this paper, we propose two score tests for MAR under a linear logistic model when a completely parametric model and a semiparametric location model, respectively, are imposed on the outcome regression model, respectively. Compared with the tests proposed by Breunig (2019) and Duan et al. (2020), the new tests have at least three advantages. The first remarkable advantage is that no identification condition is required under MNAR, which implies that no instrumental variable is needed. However, without an instrumental variable, the tests of Breunig (2019) and Duan et al. (2020) may fail to work. The second advantage is that the new tests involve much easier calculations, because the underlying unknown parameters need only be estimated under MAR. As we have pointed out, identifiability is not an issue for MAR data and parameter estimation has been well studied. Third, our simulation results indicate that the two new tests are often more powerful than or at least comparable to that proposed by Duan et al. (2020).
2 Score test
Let denote an outcome that is subject to missingness and let
be a fully observed covariate vector whose first component is. We denote by the missingness indicator of the outcome, with if is observed and otherwise. We wish to test whether the missingness mechanism is MAR or MNAR, namely . Suppose that the missingness probability or the propensity score follows a linear logistic model
with . Under the model (2.1), the testing problem of interest is equivalent to , because the missingness mechanism is MAR if and MNAR otherwise.
Suppose that are independent and identically distributed observations from . Let denote the conditional density function of given . The loglikelihood based on the observed data is
The likelihood ratio test is the most natural and preferable for testing . Unfortunately, Miao et al. (2016) showed that parameter identifiability is not guaranteed even when a parametric model is postulated for . Without parameter identifiability, consistent parameter estimation is not feasible, and therefore nor is the likelihood ratio test, under general parametric assumptions because these require consistent parameter estimation under the null and alternative hypotheses. The Wald test has the same problem.
The score test was introduced by Rao (1948) as an alternative to the likelihood ratio test and Wald test. The most significant advantage of the score statistic is that it depends only on estimates of parameters under ; in other words, it automatically circumvents the notorious identifiability issue under MNAR. This motivates us to consider testing by a score test. Let denote the partial differential operator with respect to . The score function with respect to is
which depends on the unknown parameters and .
The score test statistic is constructed by replacing and
with their estimators under the null hypothesis. The likelihood function under becomes
In this situation, a natural estimator for is the maximum likelihood estimator , where is the likelihood function of under the null hypothesis. Estimation of depends on model assumptions on . To finish the construction of the score test, we consider postulating either a fully parametric or semiparametric model on .
2.1 Score test under a parametric model
When or equivalently the missingness mechanism is MAR, . Under a fully parametric model on , this motivates us to estimate by , where is the likelihood function of under . The score test statistic is then
To calculate the -value of a score test statistic, we need to determine the sampling distribution of under . The exact form of this sampling distribution is in general intractable. A more practical solution is to approximate it by its null limiting distribution under or MAR.
Let and be the true values of and , respectively. Our theoretical results on are built on the following regularity conditions on and :
and is of full rank.
(i) The parameter space of is independent of and compact. (ii) The true value of is an interior point of . (iii) is identifiable, i.e. for any different elements and in . (iv) . (v) is continuous in for almost all .
(i) is twice differentiable with respect to for almost all , and is continuous at . (ii) is positive definite. (iii) There exist a and positive functions and such that and , and
for all satisfying .
Under Condition (C1), the limit function of is well defined. Conditions (i) and (ii) in Condition (C2) are trivial. Condition (iii) guarantees that is a unique maximizer of the likelihood . Condition (iv) provides an envelope for . These conditions plus the continuity condition of Condition (C2)(v) are sufficient for the consistency of . Under Condition (C3), the loglikelihood can be approximated by a quadratic form of . The asymptotic normality of follows immediately. Condition (C3) also guarantees that the matrices defined in Theorem 1 are well defined.
Assume Conditions (C1)–(C3) and that is true. As goes to infinity, where , and
2.2 Score test under a semiparametric location model on
The score function depends on through the conditional mean , which is equal to under . Instead of imposing a fully parametric conditional density model, it is sufficient to assume a parametric model for , where is an unknown parameter. Under , this model assumption is equivalent to a location model on the completely observed data , namely where satisfies . We estimate by the least square estimator
Given the estimators and of and , the score test statistic under the location model on is
Let be the true value of . To establish the limiting distribution of , we impose the following regularity conditions :
(i) holds for all . (ii) The parameter space of is compact, and is an interior in . (iii) is the unique minimizer of .
(i) is twice differentiable with respect to . (ii) There exists such that and holds for all . (iii) There exists such that and
holds for all . (iv) is positive definite and is well defined.
Conditions (D1) and (D2) are the analogues of Conditions (C2) and (C3) for the conditional mean model .
Assume Conditions (C1), (D1) and (D2) and that is true. As goes to infinity, where and
A consistent estimator for the asymptotic variance is
We reject if is large enough. Its -value is approximately . The result in Theorem 2 and the variance estimator allow the error given and to depend on , or to have a heterogeneous variance. If we assume that and are conditionally independent given , then and and reduces to .
2.3 Local power
To study the asymptotic power of the proposed score tests, we consider the following local alternative:
where is fixed. The local alternative (2.2) tends to the null hypothesis at a root- rate as goes to infinity. A test for is root- consistent if it can detect the local alternative (2.2) as goes to infinity for any fixed . We expect that both of the proposed score tests have root- consistency, which is a desirable property of a nice test for MAR.
Assume Condition (C1) and that the alternative is true. Let and be the true values of , and , respectively. (i) If Conditions (C2) and (C3) are satisfied, then, as goes to infinity, where is defined in Theorem 1. (ii) If conditions (D1) and (D2) are satisfied, then, as goes to infinity, where and is defined in Theorem 2.
Under the propensity model (2.1), a nonzero characterizes the departure of the true missingness mechanism from the null hypothesis. Theorem 3 indicates that if for some fixed the alternative is true, then both of the score test statistics converge in distribution to nondegenerate distributions with nonzero location parameters. Because the absolute values of the location parameters are increasing functions of , the powers of both score tests tend to as goes to infinity, which means that both of them are root- consistent.
We conduct simulations to evaluate the finite-sample performance of the proposed score tests. Specifically, we compare the following three tests: (1) S1, the proposed score test under a parametric model on , (2) S2, the proposed score test under a semiparametric location model on and (3) DUAN, the test proposed by Duan et al. (2020). We generate data from two examples. In Example 1, which comes from Duan et al. (2020), an instrumental variable is present, whereas there is no instrumental variable in Example 2. All our simulation results are calculated based on 5000 simulated samples and the significance level is set to 5%.
Let follow a multivariate normal distribution
such that ,
The missingness indicator for follows a Bernoulli distribution
with success probability
follows a Bernoulli distribution with success probability, where is the standard normal distribution function. We consider three choices for , namely and , two choices for , namely 0.5 and 1, four choices for , namely 0, 0.25, 0.5, and 0.75, and 11 choices for , namely , for .
The DUAN test requires the existence of an instrumental variable. Under the settings of Example 1, the variable is an instrumental variable and therefore the DUAN test is applicable. For data generated from this example, we model by in the construction of the score test S1, and we model by for S2. Tables 1 and 2 present the simulated rejection rates of S1, S2,and DUAN when the sample size . The rejection rates corresponding to the DUAN test are directly copied from Tables 3 and 4 in Duan et al. (2020), which were calculated based on 1000 simulated samples. Although the true missingness indicator is generated from a probit model, we model it by the logistic model (2.1) in the constructions of the proposed two score tests.
The coefficient quantifies the departure of the true missingness mechanism from the null hypothesis. When , the null hypothesis holds and the results reported are all type I errors. We see that all three tests have desirable controls on their type I errors. As increases, the true missingness mechanism departs more and more from the null hypothesis and, as expected, all tests have increasing powers. The proposed two score tests are more powerful than DUAN in most situations. When , their power gains against DUAN can be greater than 25%; see the case with and . As increases from 0.5 to 1, the power gain can be as large as 41%; see the case with and . These observations show that the proposed score tests have obvious advantages over the DUAN test. Meanwhile, the two score tests have almost the same powers in all cases, although the S2 test requires much weaker model assumptions. We also conduct simulations for , and the simulation results, provided in the Supplementary Material, are similar.
For better comparison, we display the power (versus ) lines of the S2 and DUAN tests in Figs. 1 and 2, corresponding to and 1, respectively. The power lines of the S1 test coincide with those of the S2 test and hence are omitted. It is clear that the power lines of the S2 test always lie above those of the DUAN test, or S2 is uniformly more powerful, except for two scenarios where , and or . In the two exceptional cases, compared with DUAN, S2 is more powerful for small and becomes less powerful for large . As quantifies the departure of the true missingness mechanism from the null hypothesis, a possible explanation for this phenomenon is that the score test is usually most powerful for ‘local’ alternatives, but may be suboptimal when the alternative is not very local.