Bayesian Semiparametric estimation with nonignorable nonresponse
Statistical inference with nonresponse is quite challenging, especially when the response mechanism is nonignorable. Although existing methods often require correct model specifications for response models, the models cannot be verified based on the observed data and misspecification of the response models can lead to a seriously biased inference. To overcome this limitation, we develop an effective Bayesian semiparametric method for the response mechanism using logistic penalized spline methods. Using Polya-gamma data augmentation, we developed an efficient posterior computation algorithm via Markov Chain Monte Carlo. The performance of the proposed method is demonstrated in simulation studies and an application to a longitudinal data.READ FULL TEXT VIEW PDF
Statistical inference with nonresponse is quite challenging, especially ...
Logistic linear mixed model (LLMM) is one of the most widely used statis...
The Polya-Gamma data augmentation (PG-DA) algorithm is routinely used fo...
Multi-dimensional functional data arises in numerous modern scientific
A new statistical procedure, based on a modified spline basis, is propos...
The mixed effects model for repeated measures (MMRM) has been widely use...
Preventing periodontal diseases (PD) and maintaining the structure and
Bayesian Semiparametric estimation with nonignorable nonresponse
Handling missing data in inappropriate ways may lead to crucial selection bias in data analysis. In particular, specification of the response mechanism is essential for analyzing such data. If the response mechanism is misspecified, the statistical inference on the parameter of interest would be seriously biased. Nevertheless, it is often assumed that the response mechanism is ignorable or missing at random (MAR) because the assumption does not require any specification for the response mechanism (Rubin, 1976). Moreover, there have been several useful methods for ignorable missing data enjoying nice properties such as the double robustness (Robins et al., 1994; Kang and Schafer, 2007) and the multiple robustness (Han, 2014). In real data analysis, however, there are many unacceptable situations to believe the ignorability. Therefore, it is requisite to develop a method for analyzing nonignorable or missing not at random (MNAR) data (Little and Rubin, 2002).
In order to analyze MNAR data, (i) a response model, which is a parametric model of the response mechanism, needs to be correctly specified as well as (ii) the outcome model(Greenlees et al., 1982; Diggle and Kenward, 1994). It has been criticized to analyze missing data under the MNAR assumption due to the strong assumption, and several types of semiparametic models have been considered. Tang et al. (2003) and Zhao and Shao (2015) proposed a semiparametric estimator for the outcome model without specifying any response model by using an instrumental variable. On the contrary, Qin et al. (2002), Chang and Kott (2008), and Kott and Chang (2010) proposed a semiparametric estimator for the response model without specifying any outcome model.
Recently, Kim and Yu (2011) and Shao and Wang (2016) proposed a semiparametric estimator for a semiparametric response model. In the semiparametric response model, terms on observed variables are modeled in a nonparametric way whereas the unobserved variable is still a simple linear function. However, we can not generally know or expect the effect of the unobserved variable. For example, if we are interested in a survey of income, and assume that lower-income earners tend to refuse for the item of income, linear logistic model would be appropriate for the response model. However, if (super) higher-income earners also tend to refuse the item, quadratic or more complicated functions would be required. In this paper, we consider a semiparametric response model with a nonparametric unobserved part as considered in Sang and Morikawa (2018).
Multiple imputation (e.g. Rubin, 1978, 1987) is nowadays a common method to analyze missing data. Galimard et al. (2014) considered a multiple imputation by chained equations via Heckman’s selection model. Instead of the classical multiple imputation method, multiple imputation with data augmentation (Tanner and Wong, 1987) in terms of Bayesian perspectives have been proposed in several papers because the classical multiple imputation is essentially not “pure” Bayesian estimation. Durrant and Skinner (2006) proposed a Bayesian estimation for MNAR data by assuming parametric forms of both the response and the outcome models. In order to avoid specification of the outcome model, Im and Kim (2017) proposed a Bayesian approach using an outcome model for observed data, not for the whole data including missing responses, because it is more reasonable to assume a parametric model of the outcome model for observed data.
However, these existing Bayesian methods still suffer from misspecification of the response model, which could be essential to make statistical inference on parameters of interest. To overcome the difficulty, we propose a semiparametric selection model in which a functional form of nonresponse is not specified as considered in Sang and Morikawa (2018), and consider Bayesian estimation based on the model. We employ logistic-type selection models with penalized spline methods to estimate the unknown functional part, and develop an efficient Bayesian computation method using Markov Chain Monte Carlo methods and Polya-gamma data augmentation (Polson et al., 2013). We demonstrate that the proposed semiparametric method can flexibly capture the latent unknown missing-data mechanism and prevent bias caused by model misspecification of the response models through simulation studies. We also apply the proposed method to a longitudinal data from clinical trial regarding drug therapies for Schizophrenia, which is available from R package “Surrogate”.
Suppose that we are interested in estimating the parametric conditional distribution , where is a response,
is a vector of covariates, andis a vector of unknown parameters. For example,
for a simple linear regression. We assume thatis always observed whereas is subject to missingness. Let be the missing indicator such that if is observed and otherwise. We assume that , which is referred to as the response mechanism. In this article, we assume that the response mechanism is not missing at random or nonignorable, that is, the missing-data mechanism depends on unobserved response. Specifically, we consider the following missing-data mechanism (selection model):
where is the logistic function, is an unknown function and is a sub-vector of known as the response instrumental variable (Wang et al., 2014). Since is completely unspecified, the selection model (1) is semiparametric. For the estimation of , we employ the P-spline of the form:
Here is the degree of the spline, denotes the function , is a set of fixed knots (whose choice will be discussed later) and and are the coefficient vectors for the parametric part and the spline part, respectively. If the knots are sufficiently spread over the range of and the number of knots is sufficiently large, then the class of functions can precisely approximate the unknown function even for small , e.g. 2 or 3. We here consider the case with fixed and locations of knots, but sensitivity analysis could be done in practice. Since parameters are used in , we put a penalty on by treating as a random effect to avoid overfitting. Specifically, we assume , where is an unknown precision parameter to be estimated from the data.
We suppose the triplet is available for , where is the sample size. The unknown parameters are in the outcome model, and and in the selection model. Let be the collection of these unknown parameters. The posterior distribution of as well as missing observation is given by
where , . Using the Polya-gamma data augmentation (Polson et al., 2013), we obtain the following augmented posterior:
where , and is a density function of the Polya-gamma distribution . Note that the integral with respect to reduces to the original posterior (2). Under the expression, the conditional distribution of is normal. For prior distributions on
, we use multivariate normal distributions forand , that is, , , and a gamma distribution for , that is, , where denotes a gamma distribution with shape parameter and rate parameter . On the other hand, the specific choice of prior distributions for would depends on the specific form of the outcome model . To describe the sampling algorithm, we define , , , , and . The sampling algorithm is given as follows:
(Sampling ) The full conditional distribution of is . Although its density have a complicated form, random samples can be efficiently generated from an algorithm given in Polson et al. (2013).
(Sampling ) The full conditional density of is proportional to
thereby the full conditional distribution of is a multivariate normal distribution with and .
(Sampling ) Similarly to , the full conditional distribution of is a multivariate normal distribution , where and .
(Sampling ) Similarly to , the full conditional distribution of is a multivariate normal distribution , where and .
(Sampling ) Generate from its full conditional distribution given by .
(Sampling in ) The full conditional distribution of is proportional to
which is an exponentially titled distribution, and is not a familiar form in general. We adopt the Metropolis-adjusted Langevin Monte Carlo algorithm to update current values of ’s. With the current value , the proposal is generated from the normal distribution , where is a user-specified step-size and
Then, the proposal is accepted with probability
where denotes the density function of .
(Sampling ) Since the augmented complete data is available, the full conditional posterior distribution of is proportional to , which is the standard posterior distribution of . Hence, we could employ existing sampling techniques to update for the assumed outcome model .
Owing to the Polya-gamma representation (3), sampling steps for unknown parameters in the spline selection model (1) are quite easy to carry out. The full conditional distribution of in is different from the assumed model by the exponential term, which comes from the non-ignorable missing-data mechanism. In fact, when the missing-data mechanism is missing at random, the selection model (1) is free from the unobserved value and does not depends on , so that the full conditional distribution is same as the assumed outcome model. In Sections 3 and 4, we use the standard linear regression model and linear mixed-effects model, respectively, for the outcome models, and we will provide some details regarding the step 7 in the corresponding part.
Finally, we address the way to select a suitable set of knots . In order to make the spline method estimate the unknown missing-data mechanism flexibly even under small
, knots should be sufficiently spread over the space of the response variable. When the response variables are free from missing, it would suffice to setand to low and high (e.g, and
) empirical quantiles of the observed responses, and set the other knots for equally spaced points betweenand . However, such crude strategy might fail under nonignorable missing since the missing value may take values out of the range of the observed responses and the selection model depends on such missing values. To address this issue, we consider two methods. The first one is a similar adjustment to the crude method, that is, modify the crude vales of and to and for some positive constant specified by users. As the second method, we consider a more data-adaptive way to deal with the idea, that is, we treat the positive constant as an unknown parameter and assign prior distributions to make posterior inference. Since the knots are appeared in the posterior distributions (3) through ’s, the full conditional posterior distribution of is proportional to
where is the prior distribution on . To generate posterior samples from the non-familiar distribution, we simply adopt a random-walk MH algorithm which generates the proposal from a bivariate normal distribution with current values and a positive constant , and accept the proposal with probability .
We investigate the performance of the proposed method together with some existing methods. To this end, we consider a simple linear regression model:
where and . Here two covariates and were independently generated from , respectively. Based on the model, we generated the response value . For the response mechanism, we independently generated the missing indicator from a Bernoulli distribution with the success probability , and is observed/missing when or . We considered the following five scenarios for :
Complementary log-log MNAR)
For each response mechanism, the overall response rates were around . We generated a random sample with and , and applied the following three methods:
CC: Simply omit the samples with missing responses and apply the linear regression model to the data.
LS: Assume a parametric linear structure for the selection mechanism, that is, , where is the logistic function, and estimate the linear regression.
BSS: Apply the proposed method with linear regression for the outcome model, using the semiparametric selection model (1) with and . We considered two cases for the setting of order : and , which are denoted by BSS1 and BSS2, respectively.
It is noted that the selection model used in LS is correctly specified in scenario S1, whereas the linear structure is misspecified in the other scenarios. On the other hand, the proposed selection model covers the true selection model in scenarios S1, S2 and S5, while the model is slightly misspecified in scenarios S3 and S4. In the proposed method, we used a uniform prior on for that controls the amount of spread of knots. For two Bayesian methods, LS and BSS, we generated posterior samples of after discarding the first samples as burn-in. We then computed the posterior means and
posterior credible intervals of each component in. In the CC method, estimators of regression coefficients were obtained by the standard least squared estimator, and the Wald-type confidence intervals of each component in were computed.
For evaluating the performance of the point estimation, we calculated mean squared error (MSE) and bias, defined as
for and , where is a point estimate of in the th replication. We set in this study. Moreover, in order to evaluate the performance in terms of interval estimation (i.e. uncertainty quantification), we calculated coverage probability (CP) and average length (AL), defined as
where is a credible interval in the th replication and denotes the length of the interval. The results for the five scenarios are shown in Tables 1 and 2 in the case with and , respectively. Overall, the CC method does not perform well as expected. Although the CC method produces reasonable point estimates compared with other methods in some cases such as scenarios 1 and 5, the interval estimation does not work well in terms of CP. In scenario 1, the LS model is the true model and it is natural that the LS achieves the best performance among the four methods while the proposed BSS methods works almost the same as the LS method. However, once the LS model is misspecified as in scenarios 25, it is observed that the posterior means tend to underestimate the true values and the CP is much smaller than the nominal level . This shows that the misspecification of the selection model may lead to severely inappropriate statistical inference. On the other hand, the proposed BSS method produces almost unbiased posterior means and the CP is acceptably close to the nominal level in all the scenarios. Comparing the two choices of in the proposed method, seems slightly better than in this study. It should also be noted that the AL of the proposed method tends to be larger than that of the CC and LS methods in scenarios 2–5 since the proposed method can successfully capture the latent unknown missing-data mechanism and reflect its estimation error. Comparing Tables 1 and 2, it should be pointed out that the performance of the proposed BSS method has been reasonably improved by increasing the number of observations.
Finally, in Figure 1, we show point-wise posterior means and credible intervals of the selection probability as a function of the response , where the covariate is fixed to , in the first replication in the simulation study with scenarios 3 and 4 and . It can be confirmed that the proposed semiparametric selection model successfully captures the true structure of missing-data mechanism in both scenarios while the standard linear selection model fails to detect the underlying missing-data mechanism especially around small values of the response variable. Such flexibility of the proposed method would lead to the superior performance in the estimation of parameters in the outcome model.
To illustrate the proposed method in practical situations, we applied the proposed method to a dataset of randomized clinical trial of drug therapies for Schizophrenia, which is available from R package “Surrogate”. In the trial, a placebo and a treatment groups were compared, and the response of interest is an integer showing severity of symptoms known as PANSS score, where high values indicate more severe symptoms. The patients were observed at weeks 1, 2, 4, 6 and 8 () of the study. In the dataset, 2,151 patients are included and some patients have missing values. If patients did not feel good enough to see doctors, they would not be able to see doctors and the corresponding values would be missing, thereby the missing-data mechanism is considered as MNAR. The overall missing rate is about . In this study, we are interested in the time-varying difference of PANSS scores between placebo and treatment groups. Let denote the PANSS score for the th individual at th time, and we modeled the individual PANSS score as
where is the indicator whether the th patient is included in Placebo () or Treatment () groups, denotes the measurement time, is an individual effect and is an error term. We assume that and are mutually independent and distributed as and . Note that in the model (4), time change of the response variable is modeled by the measurement time separately for each group. We let be the missing indicator such that if is observed, and otherwise. For the missing-data mechanism, we employ the following model:
where is the logistic function and , and is a nonparametric function modeled by P-spline. Note that the inclusion of
in the selection model addresses the time dependence of the missing indicator, and the joint distribution ofgiven the other variables is expressed as the product of the probability given in (5), so that we can still apply the same algorithm for posterior computation given in Section 2.
For the unknown parameters in the model (4), we set independent normal priors for the regression coefficients and inverse gamma priors for and . As noted in Section 2, the posterior sampling algorithm for unknown parameters in the outcome model is the same as one for complete data, thereby the posterior computation for the unknown parameters in (4
) can be easily implemented by using existing Gibbs sampling algorithm for linear mixed models(e.g. Hobert and Casella, 1996).
We considered the proposed method with and . For comparisons, we also applied the simple linear selection model that replaces a linear function of with in (5). We generated 40,000 posterior samples after discarding the first 10,000 samples as burn-in. Based on the posterior samples, we computed the posterior means and point-wise credible intervals of regression lines, which are shown in Figure 2. It is observed that the estimates of regression lines are different between the proposed method and linear selection model, and the credible intervals are overlapped at some measurement times in the linear selection method while the credible interval are clearly separated at all the measurement times. Based on the results of simulation study in Section 3, the result from the linear selection method is doubtful since it is subject to misspecification leading to serious bias in the estimation of parameters in outcome models. We also computed the posterior means and point-wise credible intervals of the selection probability (5) as a function of PANSS score with and four combinations of and , which are presented in Figure 3. It is revealed that the effect of treatment indictor on the missing probability is limited whereas the missing indictor of the previous time significantly changes the missing probability. Also, the linear selection model produces a very simple structure (almost constant over PANSS score) for missing probability while the proposed method seems to capture the underlying missing-data mechanism flexibly as a complete function of PANSS score. Such difference of flexibility in the estimation of missing probability would lead to the difference of resulting regression lines as shows in Figure 2.
Finally, we considered sensitivity check of the selection model (5). To this end, we considered an alternative selection model adding to (5), and two choices of . Based the same number of posterior samples, it has been revealed that the results have not been changed much, that is, the posterior means and credible intervals of PANSS score were quite similar to that given in Figure 2.
This paper developed an effective Bayesian approach to estimating missing mechanics with an unspecified functional form for missing variables and a parametric form of observed covariates. Based on discussion given in Sang and Morikawa (2018), the flexible estimation of the functional form of missing variable could be more significant than that of observed covariates. However, the proposed method could be extended to the case where the missing probability is an nonparametric function of both missing variable and covariates by using, for example, radial basis expansion.
For outcome models, we considered only a parametric model for simplicity, but we may employ semiparametric methods such as generalized method of moments(Yin, 2009). It should be remarked that extension of the outcome modeling could be easily done since the posterior computation for outcome models is the same as that with complete data. Since the detailed investigation would extend the scope of this paper, we left it to an interesting future study.
Finally, although we employed the logistic function as a link function in the selection models due to its popularity in this context, we may use other link functions such as Probit link. In such cases, the posterior computation algorithm given in Section 2 should be changed in appropriate ways.
This work is supported by Japan Society for the Promotion of Science (KAKENHI) grant numbers 18K12757 and 19K14592.
Multiply robust estimation in regression analysis with missing data.Journal of the American Statistical Association 109, 1159–1173.