Inverse probability weighting (IPW) has long been accepted as the standard estimation procedure under unequal probability samplings with and without replacement ever since the work of Hansen and Hurwitz (1943) and Horvitz and Thompson (1952)
. IPW always produces an unbiased or asymptotically unbiased estimator with an elegant expression, regardless of the complexity of the underlying sampling plan, and this method therefore enjoys great popularity. As well as survey sampling, it has been widely used in many other areas, including missing data problems(Robins et al., 1994; Wooldridge, 2007; Tan, 2010; Kim and Shao, 2014), treatment effect estimation or program evaluation (Rosenbaum and Rubin, 1983; Rosenbaum, 2002; Imbens and Wooldridge, 2009; Hirano et al., 2003; Cattaneo, 2010; Young et al., 2019; Zhao, 2019; Tan, 2020), personalized medicine (Zhang et al., 2012; Jiang et al., 2017), and survival data analysis (Robins and Rotnitzky, 1992; Robins, 1993; Bang and Tsiatis, 2000; Ma and Yin, 2011; Dong et al., 2020), where IPW is renamed inverse probability of censoring weighting. In recent years, accompanied by optimal subsampling, the IPW method has also proved to be an effective approach to validate statistical inferences for big data (Wang et al., 2018, 2019; Yu et al., 2020).
Roughly speaking, these applications of the IPW method can be categorized into the following three representative types.
Problem A (Missing data problem).
Let with being a response variable that is subject to missingness
being a response variable that is subject to missingness andan always-observed covariate. Denote by a non-missingness indicator, with if is observed and 0 otherwise. Suppose are independent and identical copies of . The goal is to estimate .
Problem B (Unequal probability sampling without replacement, UPS-WOR).
Let be a population of interest with known size and let be a sample drawn by an UPS-WOR. The goal is to estimate the population mean .
Problem C (Unequal probability sampling with replacement, UPS-WR).
The population and the parameter of interest are the same as in Problem B, except that the sample is drawn by a UPS-WR.
Special cases of Problem A include treatment effect estimation in the framework of Rubin’s potential outcomes (Rubin, 1974), as well as program evaluation in economics and other social sciences. In a broad sense, estimation of optimal treatment regime and survival data analysis both belong to Problem A. Poisson sampling, a popular unequal probability sampling method, can also be regarded as a special case of Problem A. See Section 3. Optimal subsamplings in the context of big data are special cases of Problems B and C.
Through weighting the observations by the reciprocal of a certain probability of inclusion in the sample, the IPW estimator is able to account for unrepresentativeness, missingness or selection bias caused by non-random lack of information or non-random selection of observations. However, the IPW estimator can be highly unstable if there are extremely small probabilities, which can result in biased estimation or poor finite-sample performance of the accompanying asymptotic-normality-based inference (Busso et al., 2014; Kang and Schafer, 2007; Robins et al., 2007; Imbens and Wooldridge, 2009; Cao et al., 2009; Han et al., 2019). As pointed out by Robins et al. (2007) with regard to double-robust estimators (which are IPW-type estimators) in missing data problems, ‘Whenever the “inverse probability” weights are highly variable, …, a small subset of the sample will have extremely large weights relative to the remainder of the sample. In this setting, no estimator of the marginal mean can be guaranteed to perform well.’ In casual inference with observational studies, this is the well-known limited- or non-overlap problem in covariate distributions in different treatment groups (Crump et al., 2009; Khan and Tamer, 2010; Yang and Ding, 2018). The IPW estimator becomes inflated disproportionately or even breaks down in survival analysis when the number of patients at risk in the tails of the survival curves of censoring times is too small (Robins and Finkelstein, 2000; Dong et al., 2020). To guarantee that the IPW estimator possesses consistency, asymptotic normality, and satisfactory finite-sample performance, it is usual to impose an unnatural lower boundedness assumption on the probabilities (Rosenbaum and Rubin, 1983; Mccaffrey et al., 2013; Sun and Tchetgen Tchetgen, 2018), although tiny probabilities are frequently encountered in practice, especially when the propensity scores are estimated from data (Yang and Ding, 2018; Ma and Wang, 2020).
To overcome this notorious problem, at least three remedies have been proposed in the literature: stabilizing, thresholding, and trimming. The stabilizing method (Hájek, 1971) rescales the IPW estimator so that the weights sum to 1 (Kang and Schafer, 2007). Although straightforward, it can often sharply reduce the instability of the IPW estimator. The thresholding method, proposed by Zong et al. (2019) for solving Problem B, replaces those probabilities that are less than a given threshold by that threshold while keeping others unchanged. The parameter of interest is then estimated by IPW with the modified probabilities. Zong et al. (2019) proposed an easy-to-use threshold determining procedure and showed that, in general, the resulting IPW estimator works better than the naive IPW estimator. This method can reduce the negative effect of highly heterogeneous inclusion probabilities, and hence leads to improved estimation efficiency, although at the cost of an estimation bias.
. However, the exact amount of trimming is usually ad hoc and can affect the performance of the IPW estimator and the corresponding confidence interval in nontrivial ways.Ma and Wang (2020) systematically investigated the large-sample behaviour of the IPW estimator after trimming and found it to be sensitive to the choice of trimming threshold and subject to a non-negligible bias. They proposed a bias-corrected and trimmed IPW estimator with an adaptively trimming threshold, and a Studentized bootstrap procedure for interval estimation. Their estimator was shown to be insensitive to small probability weights as well as being able to correct the bias caused by trimming. The bias correction technique is built on local polynomial regressions (Fan and Gijbels, 1996), which further require a bandwidth. Inappropriate choices of the trimmed threshold and the bandwidth may affect the performance of their estimator. More importantly, the bias correction technique depends on the target quantity (e.g. in Problem A) to be weighted, which makes their method inapplicable to weighted optimization problems, such as optimal treatment regime estimation (Zhang et al., 2012).
The final point estimators of the stabilizing, trimming, and thresholding methods are all based on IPW, although they adopt different strategies to reduce the detrimental effect of extremely small probabilities. These IPW-type estimators inevitably inherit certain weaknesses of the naive IPW estimator: they are either still unstable or biased. Also, the accompanying intervals, regardless of whether they are asymptotic-normality-based or resampling-based, often exhibit much undercoverage. See our simulation results in Section 4.
In this paper, we propose a biased-sample empirical likelihood weighting (ELW) estimation method to serve the same general purpose as IPW in handling incomplete or biased data while overcoming its instability. We systematically investigate its finite- and large-sample properties in the context of missing data problems and unequal probability samplings with and without replacement. The proposed ELW estimation method has several advantages over the IPW-type methods and the usual EL method.
The ELW method circumvents the use of inverse probabilities and therefore never suffers from extremely small or even zero selection probabilities. It takes the maximum empirical likelihood estimates of the probability masses of a multinomial distribution as weights, which always range from 0 to 1. This is the most significant advantage of the ELW method over IPW and its variants.
The ELW weights are always well defined. By contrast, the usual empirical likelihood weights suffer from the well-known convex hull constraint or the empty-set problem: they are undefined if the origin lies outside the convex hull of certain transformed data points (Tsao, 2004; Chen et al., 2008; Liu and Chen, 2010).
Like the stabilized IPW estimator, the ELW weights always sum to 1, which gives the ELW estimator the nice property of shift-equivariance. Unfortunately, the naive IPW estimator, the trimmed IPW estimator of Zong et al. (2019), and the IPW estimator of Ma and Wang (2020) are all sensitive to a location shift in the response or the parameter of interest.
The ELW weights are very convenient to calculate. Their calculation involves only solving a univariate rational equation, which can be done efficiently by the commonly used bisection algorithm. In contrast to the IPW estimator of Ma and Wang (2020), the ELW estimator is free of any tuning parameter and is hence more computationally efficient. The ELW weights depend only on the propensity scores and the total sample size, and therefore the ELW method is directly applicable to weighted optimization problems.
As we shall show, the ELW estimator is theoretically more efficient than the IPW estimator in at least two scenarios: missing data problems and unequal probability samplings without replacement. Treatment effect estimation in observational studies under the potential outcome framework of Rubin (1974) can be regarded as a special case of missing data problems. This is a bonus of ELW, since the construction of the ELW weights makes use of side information. Under unequal probability sampling with replacement, we cannot tell which of the ELW and IPW estimators wins theoretically. Nevertheless, our simulation results indicate that the ELW estimator often has smaller mean square errors and the accompanying interval has better coverage accuracy in most cases.
A crucial requirement of ELW is knowledge of the size of the finite population of interest or a larger independent and identically distributed sample that includes the observed data as a subsample. This is also required by the original IPW method and some of its variants, and is available in most situations. For example, in missing data problems, the size of the overall dataset is clearly known, and in unequal probability sampling problems, the size of the finite population from which the sample was drawn is usually known a priori, since we need to construct a sampling frame before sampling. This mild requirement implies that the ELW method has many potential applications beyond missing data problems, sample surveys and casual inference.
The remainder of this article is organized as follows. In Section 2, we introduce the ELW method by studying Problem A, namely estimating the population mean of a response when data are subject to missingness. In Section 3, we extend the ELW method to unequal probability sampling by solving Problems B and C. A simulation study and a real-life data analysis are conducted in Sections 4 and 5 to demonstrate the usefulness and advantage of the ELW method. Section 6 concludes with some discussion. All technical proofs can be found in the supplementary material.
2 Empirical likelihood weighting
For ease of exposition, for the time being, we assume that the inclusion probability or propensity score is completely known and always positive, although our method allows to take zero values. The case with unknown propensity score is considered in Section 2.4. We take the parameter of interest to be for a user-specific function , for example in Problem A.
Denote the data by , with or simply , where and ; the covariates with do not come into play in most of this paper. The data is in fact a biased sample of the underlying population if all are not equal. The standard IPW estimator of is
This expression for the IPW estimator indicates that it becomes extremely unstable when some of the with are close to zero, and that the terms with actually contribute nothing to it. The Hájek estimator, or stabilized IPW (SIPW) estimator,
replaces by , so that the weights of sum to 1 and the estimator becomes more stable. Since the size is known, the zero-value together with the other single-value contain information about . The IPW estimator and its variants ignore such side information, and are not able to utilize it as well, and they consequently have potential losses of efficiency. As a popular and flexible non-parametric technique, empirical likelihood (EL) (Owen, 1988, 1990, 2001) can conveniently and efficiently make use of side information to achieve improvements in efficiency. This motivates us to develop a biased-sample empirical likelihood weighting (ELW) estimation method to serve the same purpose as the IPW estimator, while overcoming its instability and improving its estimation efficiency.
2.1 ELW estimator
Let the distribution function of be
, where the inequality holds element-wise for vector-valued. To estimate , it suffices to estimate . We consider the problem of estimating by discarding those with , although these quantities may be partially accessible. The likelihood based on the remaining data is
where . In the principle of EL, we model by a discrete distribution function . This function is not well defined, because those with are not fully available. Nevertheless, our final estimator, say , of satisfies , making the resulting maximum EL estimators of and well-defined statistics.
With in place of in (2) and taking logarithms, we have the empirical log-likelihood
Those that are feasible satisfy
We emphasize that although those with appear in and , they have no likelihood contribution or any influence on the resulting EL method.
The proposed EL estimator of , or equivalently of the , is obtained by maximizing the empirical log-likelihood (3) subject to (4). For fixed , the maximum of the log-EL in (3) subject to (4) is attained at
This immediately gives , the EL estimator of . Accordingly, the EL estimators of and are
and . Finally, the EL estimator or the ELW estimator of is
Obviously, both and are well-defined statistics because .
The focus of this paper is the derivation of a better weighting method than IPW. We achieve this by taking the probability masses of the maximum empirical likelihood estimator of as weights. Our ELW method requires the total sample size and the (estimated) selection probabilities. After multiplication by the likelihood in (2) is a degenerate case (with only one capture occasion) of the full likelihood for capture–recapture data of Liu et al. (2017), in which is unknown and the focus of estimation. The case with known has been studied by Li and Qin (1998) for biased and truncated data, although their interest was not in better weighting, but rather in estimating the distributions of the target and truncation variables, together with the unknown parameters involved.
When the propensity scores are available for those with the ELW method works even if the responses and the covariates are missing simultaneously for all data with . However, it may be subject to loss of efficiency when all covariates are observed. To improve efficiency in this case, we may incorporate estimating equations such as in the definition of the empirical log-likelihood (3), where is a user-specific function and . This strategy was also adopted by Qin et al. (2008) to improve efficiency and by Han (2013) and Han and Wang (2014) to construct multiple robust estimators. The price paid for this modification, however, is that the resulting ELW weights are undefined if lies outside the convex hull of (Tsao, 2004; Chen et al., 2008; Liu and Chen, 2010). The probability that the ELW weights are undefined can be large when the sample size is small and/or the dimension of is high (Tsao, 2004); the calculational burden of the ELW weights will also be heavier. As the focus of this paper is on the development of an estimation method that is robust to small probabilities, we do not incorporate side information in the proposed ELW method for the time being.
2.2 Practical implementation
The key to calculating the proposed EL estimators, including the EL estimator of and the ELW estimator , is to calculate by maximizing . This necessitates a double iterative algorithm because involves an implicit function , and thus it seems to be rather a difficult task. We find a more convenient solution, in which we need only solve a univariate equation.
Mathematically, is a solution to
Putting this expression into (6) leads to an equivalent equation for :
As (11) has multiple roots, it is necessary to identify the interval containing the desired root. Denote the observed by and define for . Equation (11) is further equivalent to , where . Because , , and is a consistent estimator of , the desired root of should lie between and . Actually, there must exist one and only one solution to between and . Because , it follows that , , and that is strictly decreasing between and . By the intermediate value theorem, there must exist one and only one solution, denoted by , in such that . It is worth noting that if all the are equal and equal to , then and the resulting are all equal to , and the ELW estimator reduces to the sample mean, i.e., . Otherwise, all () are not equal to each other, and , , and are all non-degenerate.
The proposed ELW estimation procedure can be implemented by Algorithm 1. The second step can be efficiently achieved by a bi-section search algorithm, and the remaining steps all involve only closed-form calculations, making the ELW procedure extremely easy to implement.
2.3 Finite- and large-sample properties
The non-zero EL weights are for . We use the maximum weight ratio among the non-zero EL weights to quantify the dispersion between the EL weights. The following proposition establishes an upper bound on .
Suppose take distinct values . If there exists such that and then .
Proposition 1 indicates that the ELW method works even if the smallest is as small as zero. However, the maximum weight ratio of the IPW estimator, namely , has no such a guarantee, and the IPW estimator becomes extremely unstable when some of the are close to zero. In particular, it fails to work when is exactly zero. Our ELW estimator successfully and completely overcomes this issue, which is its most significant advantage over the traditional IPW in finite-sample performance.
Next, we show that asymptotically our ELW estimator is unbiased and more efficient than the IPW estimator. This is a bonus of using ELW, and also a significant advantage that it has over the conventional IPW in large-sample performance. For ease of presentation, let , and, for two generic functions and , define , , and . We denote for a vector or matrix .
Let be the truth of . Suppose that and that and are both finite. As goes to infinity,
the ELW estimator is more efficient than the IPW estimator and the SIPW estimator i.e. and where the equalities hold only if is degenerate.
The assumption guarantees that with probability tending to 1, the propensity scores are not all equal to each other, and so the ELW estimator is non-degenerate. A reasonable estimator of is required in the construction of Wald-type confidence intervals for . Inspired by the fact that , we propose to estimate with the ELW method by
where , , and
It is worth stressing that the ELW-based variance estimator is again insensitive to small probabilities, since it circumvents the use of inverse probabilities.
2.4 Estimated propensity score
In many situations, such as missing data problems and causal inference, the propensity score is unknown and needs be estimated from the observed data. The ELW and IPW estimators have different large-sample behaviours if we take the variability of the estimated propensity score into account. Suppose that
is parametrically modelled bywith true parameter value .
The following conditions are satisfied:
The function is continuous in and for all .
There exists a constant such that (1) and (2) .
The truth of satisfies and .
Condition 1(a) holds when the non-missingness indicator follows logistic and probit models, which are commonly used in the literature. Condition 1(b) together with the other conditions guarantee the consistency of and hence the consistency of the ELW estimator. Under Condition 1(c), with probability tending to 1, the observed propensity scores are not all equal to each other, and so the ELW estimator is non-degenerate.
Under Condition 1(a), and therefore , , and We still use . In addition, we denote and
Assume Condition 1 and that satisfies where the influence function has zero mean. Suppose that and are all finite. As goes to infinity,
According to Theorem 2
, the ELW, IPW and SIPW estimators still follow asymptotic normal distributions when the propensity score involves a finite-dimensional unknown parameter. However, in general, the efficiency gain of the ELW estimator over the IPW and SIPW estimators is no longer guaranteed.
The most popular choice for is the maximum likelihood estimator. For example, if data are missing at random (Rubin, 1974) and the covariates are all observed, the maximum likelihood estimator is the maximizer of . In this case,
where The asymptotic variance of the ELW estimator is
Again, an ELW estimator can be constructed for .
2.5 Resampling-based interval estimation
Based on Theorems 1 and 2, we can construct Wald-type confidence intervals for once a consistent estimator for the asymptotic variance is available. The asymptotic normality of the ELW, IPW, and SIPW estimators requires and . If this is violated, the Wald-type confidence intervals may not have the promised coverage probability. This dilemma can be overcome by resampling. We propose to construct confidence intervals for by the resampling method in Algorithm 2.
In the case of the estimated propensity score , we replace and by and , respectively. The ELW variance estimator converges in probability to , which is assumed to be positive definite. This, together with Theorems 1 and 2, implies that converges in distribution to the standard normal, an obviously continuous distribution. By Corollary 2.1 of Politis and Romano (1994), the empirical distribution of is a uniformly consistent estimator of the distribution of , which is formally summarized in Theorem 3. This validates the interval estimator produced by Algorithm 2.
3 Extension to unequal probability samplings
We have shown that the ELW method works for missing data problems, where data are independent and identically distributed. The results in Theorem 1 also hold for Poisson sampling, a special case of UPS-WOR. Under Poisson sampling, the indicators are all independent of each other and is random. Suppose there exists a function such that ; then can be regarded as missing data and the ELW method applies. However, Theorem 1 no longer holds under more general UPS-WORs, where the sample size is fixed and the data are correlated, or UPS-WRs, where a piece of data may be sampled multiple times. In this section, we extend the ELW method to UPS-WOR and UPS-WR by solving Problems B and C.
3.1 ELW for UPS-WOR
With the notation of Problem B, denote the inclusion probability of by . The population size and all the are known a priori. The IPW estimator of is the famous Horvitz–Thompson estimator (Horvitz and Thompson, 1952)
). Although the ELW estimators for UPS-WOR and for missing data problems have the same form, their random behaviours are totally differentm because the joint distributions of the observed data are different.
The asymptotic normality of the IPW estimator, although difficult, has been established for many commonly used samplings, including simple random sampling with or without replacement (Erdös and Rényi, 1959; Hájek, 1960), rejective sampling with unequal probabilities (Hjek, 1964), stratified unequal probability sampling with or without replacement (Krewski and Rao, 1981; Bickel and Freedman, 1984), and two-phase sampling (Chen and Rao, 2007). See Wu and Thompson (2020) for a comprehensive review.
We establish the asymptotic normality of the ELW estimator following the results of Patterson et al. (2001) and Brändén and Jonasson (2012). A crucial requirement in these papers for the series of random elements is linear negative dependence or negative association, which are defined formally as follows.
Definition 1 (Patterson et al. (2001)).
A sequence of random variables,
A sequence of random variables,is said to be linearly negatively dependent (LIND) if for any disjoint subsets of indices and positive constants
for any real numbers and .
Definition 2 (Brändén and Jonasson (2012)).
A family of random variables is said to be negatively associated (NA) if for all increasing functions and such that there exists such that depends only on and depends only on .
Patterson et al. (2001) proved that if a series of random variables is NA, then it must be LIND, and that for LIND random series, asymptotic normality of the IPW estimator holds under mild additional conditions. Brändén and Jonasson (2012) showed that many commonly used samplings, including conditional Poisson sampling, Sampford sampling, Pareto sampling, and pivotal sampling satisfy the NA property. In summary, these results imply that, under mild additional conditions, the IPW estimator for these commonly used samplings without replacement follows an asymptotic normal distribution.
Suppose that there is a sequence of large datasets, , of size , all of which are regarded as finite sets of non-random numbers. The large dataset under study is one of them. For a particular large dataset , a sample of size is taken from it with a prespecified UPS-WOR. Define if is selected, and 0 otherwise for . Then, . Suppose that and , as .
Let be an array of -dimensional constant vectors and be an array of row-wise LIND random variables. Let and and define Suppose that
For notational simplicity, we shall suppress the subscript and use instead of . This theorem implies that . This result is sufficient to establish the asymptotic normalities of the IPW and ELW estimators. Define . We redefine the notation from Section 2. For two generic functions and