Monotone population characteristics arise naturally in many survey problems. For example, average salary might be increasing in pay grade, average cholesterol level could be decreasing in physical activity time, etc. In large-scale surveys, there is often interest in estimating the characteristics of domains within the overall population, including those of domains with small sample sizes. One possibility to handle small domains is to apply small area estimation methods. However, that requires switching from the design-based to a model-based paradigm, which can be undesirable. An alternative approach is to remain within the design-based paradigm but take advantage of qualitative assumptions about the population structure, when such are available.
Isotonic regression has been widely studied outside of the survey context. Notable works on this topic include Brunk (1955), Van Eeden (1956), Brunk (1958), Robertson et al. (1988), and Silvapulle and Sen (2005). In contrast, the incorporation of isotonic regression techniques into survey estimation and inference has only recently been studied. Wu et al. (2016) considered the case in which both the sampling design and monotone restrictions are taken into account in domain estimation. They proposed a design-weighted constrained estimator by combining domain estimation and the Pooled Adjacent Violators Algorithm (PAVA) (Robertson et al., 1988). Further, they showed that their proposed constrained estimator improved estimation and variability of domain means, under both linearization-based and replication-based variance estimation.
Although the constrained estimator proposed by Wu et al. (2016) improves the precision of the usual survey sampling estimators, it must be used with care, since invalid population constraint assumptions could lead to biased domain mean estimators. The main objective of this work is to develop diagnostic methods to detect population departures from monotone assumptions. In particular, we propose the Cone Information Criterion for Survey Data (CIC) as a data-driven method to determine whether or not it is better to use the constrained estimator to estimate the population domain means. The Cone Information Criterion (CIC) was originally developed for the i.i.d. case by Meyer (2013a).
In Section 2, we describe the constrained estimator proposed by Wu et al. (2016) and explain some of its properties, such as adaptive pooling of domains and linearization-based variance estimation. Section 3 contains the proposed CIC along with some of its theoretical properties. In particular, we show that CIC consistently chooses the correct estimator based on the underlying shape of the population domain means, in the sense that, with probability going to 1 as the sample size increases, CIC will determine that pooling of domains that violate monotonicity constraints is unwarranted. Section 4 demonstrates the performance of the CIC under a broad variety of simulation scenarios. In Section 5 we apply our CIC methodology to the 2011-2012 National Health and Nutrition Examination Survey (NHANES) laboratory data. Lastly, Section 6 states some general conclusions of the work developed in this paper, and contains a brief discussion of future related areas of research.
2 Constrained Domain Mean Estimator for Survey Data
We begin by reviewing the survey setting and the constrained estimator proposed by Wu et al. (2016). Consider a finite population , and let denote a domain for . Assume that constitute a partition of the population . Denote as the population size of domain . Given a study variable , let be the population domain means,
Suppose we draw a sample using the probability sampling design . Let be the sample size of . We are going to consider the case where the sampling design is measurable, i.e., both first-order and second-order inclusion probabilities are strictly positive, where is the indicator variable of whether or not. Denote as the corresponding sample in domain obtained from . Further, let . For simplicity in our notation, we will omit the subscript from these and related quantities from now on.
Consider the problem of estimating the population domain means . When no qualitative information is assumed on the population domains, we can consider either the Horvitz-Thompson estimator (Horvitz and Thompson, 1952) or the frequently preferred Hájek estimator (Hájek, 1971), which are given by
respectively, where . We will refer to them as unconstrained estimators of . Note that both estimators in Equation 1 use only the information contained in the domain being estimated, leading to large standard errors for domains with small sample sizes.
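For concreteness, the two unconstrained estimators are commonly written in the following standard forms. The notation is an assumption made here for illustration: $s_d$ denotes the sample in domain $d$, $\pi_k$ the first-order inclusion probability of element $k$, $N_d$ the population size of domain $d$, and $\hat N_d$ its design-based estimator.

```latex
\hat{\bar y}_{d} = \frac{1}{N_d} \sum_{k \in s_d} \frac{y_k}{\pi_k},
\qquad
\tilde{\bar y}_{d} = \frac{1}{\hat N_d} \sum_{k \in s_d} \frac{y_k}{\pi_k},
\qquad
\hat N_d = \sum_{k \in s_d} \frac{1}{\pi_k}.
```

The Horvitz-Thompson form (left) requires the domain size $N_d$ to be known, whereas the Hájek form (middle) replaces it with its estimate. In both cases the sum runs only over the sampled elements of the domain itself.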
Suppose now that we want to incorporate monotonicity assumptions into the estimation of domain means. For instance, assume the population domain means are isotonic over the domains. That is, (analogously, , although we do not consider this case explicitly here). Wu et al. (2016) proposed a domain mean estimator that respects monotone constraints, given by the ordered vector which optimizes
The objective function in Equation 2 can be written in matrix terms as , where , , is a consistent estimator of , and .
where for . Moreover, we can make use of the Pooled Adjacent Violators Algorithm (PAVA) (Robertson et al., 1988) along with and the weights to compute the constrained estimator efficiently. Observe that the constrained estimator in Equation 3 consists of adaptively collapsing neighboring domains. Furthermore, the above procedure can be simplified in the obvious way when applied to the Horvitz-Thompson estimator with weights , leading to the constrained estimator vector with entries of the form . We refer to Wu et al. (2016) for a discussion of the properties of these constrained estimators, including design consistency and asymptotic distribution.
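The adaptive collapsing of neighboring domains can be illustrated with a minimal weighted PAVA sketch. This is not the implementation of Wu et al. (2016): the function names `weighted_pava` and `pooling_matrix` and the numerical values are hypothetical, and the inputs are simply a vector of unconstrained domain estimates with positive weights (for the Hájek setting these would be the estimated domain sizes). The sketch also builds the matrix that maps the unconstrained estimates to the constrained ones for the observed pooling.

```python
def weighted_pava(y, w):
    """Weighted non-decreasing fit by pooling adjacent violators.

    Returns (theta, blocks): the fitted values and the pooled blocks as
    inclusive (start, end) index pairs (0-based).
    """
    # Each block stores [weighted mean, total weight, first index, last index].
    blocks = []
    for i, (yi, wi) in enumerate(zip(y, w)):
        blocks.append([float(yi), float(wi), i, i])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, _, e2 = blocks.pop()
            m1, w1, s1, _ = blocks.pop()
            tot = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / tot, tot, s1, e2])
    theta = []
    for m, _, s, e in blocks:
        theta.extend([m] * (e - s + 1))
    return theta, [(s, e) for _, _, s, e in blocks]


def pooling_matrix(blocks, w):
    """Matrix P with P @ y equal to the pooled weighted means for `blocks`."""
    D = len(w)
    P = [[0.0] * D for _ in range(D)]
    for s, e in blocks:
        tot = sum(w[s:e + 1])
        for i in range(s, e + 1):
            for j in range(s, e + 1):
                P[i][j] = w[j] / tot
    return P


# Hypothetical example with three domains: domains 1 and 2 violate the
# ordering and are pooled, while domain 3 is left alone.
y = [2.0, 1.0, 3.0]   # unconstrained domain estimates (made-up values)
w = [1.0, 3.0, 2.0]   # positive weights (made-up values)
theta, blocks = weighted_pava(y, w)
P = pooling_matrix(blocks, w)
```

In this example the pooled value for the first two domains is the weighted average (1·2 + 3·1)/4 = 1.25, and the first two rows of `P` are both (0.25, 0.75, 0). Because the pooling depends on the observed sample, the matrix is random.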
We conclude this section by defining some of the quantities we will use in the development of the CIC. Note that the estimator has a random weighted projection matrix associated with it, which is defined by the pooling obtained from the PAVA and the weights . That is, is the matrix such that . For example, suppose and that PAVA chooses to pool domains 1 and 2, but not domains 2 and 3. Hence, , and . Then,
be the unbiased estimator of the covariance matrix of, given by
where . Further, for any , let be the pooled population mean of domains through . That is,
For any indexes such that and , let , be the Hájek estimators of and , respectively. By standard linearization arguments (Särndal et al., 1992, Chapter 5), the approximate covariance of and is given by
Moreover, given that for all , a design consistent estimator of the approximate covariance in Equation 4 is
3 Main results
In this section, we present the Cone Information Criterion for Survey Data (CIC). The CIC is a tool that may be used to validate the monotone estimator in Equation 2 as an appropriate estimator of population domain means. In what follows, we define the CIC for the Horvitz-Thompson estimator and propose a natural extension that applies to the Hájek setting. Further, main properties of the CIC are shown along with their theoretical foundation.
3.1 Cone Information Criterion for Survey Data (CIC)
For the Horvitz-Thompson estimator, we define the CIC as
where is the projection matrix associated with .
The proposed CIC shares similar features with the Akaike Information Criterion (AIC) (Akaike, 1973) and the Bayesian Information Criterion (BIC) (Schwarz, 1978), which have been broadly used for model selection. The first term measures the deviation between the constrained estimator and the unconstrained estimator, while the second term can be seen as a penalty for the complexity of the constrained estimator. The penalty term is large when the number of different groups chosen by the constrained estimator is also large, meaning that the number of different parameters to estimate (or effective degrees of freedom) of the constrained estimator is high.
The development of the proposed CIC proceeds similarly to that of the Cone Information Criterion proposed by Meyer (2013a). Its motivation comes from properties of the Predictive Squared Error (PSE) under the Horvitz-Thompson setting, which is defined as
where is the vector of Horvitz-Thompson domain mean estimators obtained from a sample that is independent of , where is drawn using the same probability sampling design as . Furthermore, define the Sum of Squared Errors (SSE) as
We define CIC as an estimator of PSE that involves SSE. Proposition 1 establishes a relationship between PSE and SSE; its proof and all subsequent ones are included in the Appendix.
Proposition 1. .
Motivated by Proposition 1, an estimate of PSE can be derived by estimating both and . The first term has a straightforward unbiased estimator SSE, and an estimator for the covariance term can be obtained using the observed pooling on . As we will show later, the latter term can be estimated by the asymptotically unbiased estimator under certain assumptions. That produces the proposed CIC in Equation 6.
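One way to see why a covariance term appears is the generic covariance-penalty identity sketched below. The notation is assumed for illustration rather than taken from the text: $\hat y$ is the vector of unconstrained estimators with mean $\mu$ and covariance matrix $\Sigma$, $\hat y^{*}$ is an independent copy of $\hat y$, $\hat\theta$ is the constrained estimator computed from $\hat y$, and $W$ is a fixed symmetric weight matrix. This is a sketch of the standard argument, not necessarily the exact statement or proof of Proposition 1.

```latex
\begin{align*}
\mathrm{PSE}
  &= E\big[(\hat y^{*}-\hat\theta)^{\top} W (\hat y^{*}-\hat\theta)\big]
   = \mathrm{Tr}(W\Sigma)
     + E\big[(\hat\theta-\mu)^{\top} W (\hat\theta-\mu)\big], \\
E(\mathrm{SSE})
  &= E\big[(\hat y-\hat\theta)^{\top} W (\hat y-\hat\theta)\big]
   = \mathrm{Tr}(W\Sigma)
     + E\big[(\hat\theta-\mu)^{\top} W (\hat\theta-\mu)\big]
     - 2\,\mathrm{Tr}\big[W\,\mathrm{cov}(\hat\theta,\hat y)\big], \\
\mathrm{PSE}
  &= E(\mathrm{SSE}) + 2\,\mathrm{Tr}\big[W\,\mathrm{cov}(\hat\theta,\hat y)\big].
\end{align*}
```

The first line uses the independence of $\hat y^{*}$ and $\hat\theta$, so the cross term vanishes. In the second line the cross term does not vanish, because $\hat\theta$ is computed from $\hat y$; this is the source of the covariance penalty. Subtracting the two lines gives the third.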
However, recall that the use of the Horvitz-Thompson estimator requires information about the population domain sizes , which is often unavailable in practical survey applications. Therefore, analogously to Equation 6, we extend the CIC to the Hájek setting by using the estimator instead of SSE, and instead of , where denotes the estimator of the covariance matrix of and , which is based on the observed pooling of and is defined element-wise as
Hence, the proposed CIC for the Hájek estimator setting is
3.2 Assumptions
In order to state our theoretical results properly, we need the following assumptions.
The number of domains is a fixed known constant.
The non-random sample size satisfies .
for . Also, for some constants and any integers such that , then with .
For all , , , and .
, where denotes the set of all distinct tuples from .
Assumption (A1) states that the number of domains will not change as the population size changes. Assumption (A2) states that the sample size is asymptotically strictly less than the population size but greater than zero, which intuitively means that the sample size and the population size are of the same order. The boundedness of the finite population fourth moment in Assumption (A3) is used several times in our proofs to show that the approximate scaled covariances in Equation 4 are asymptotically bounded and that their estimators are consistent. In addition, Assumption (A4) ensures that the population size and the subpopulation sizes are of the same order. Further, it establishes that the pooled population domain means converge to some constant limiting domain means with rate . The consistency result of CIC is based on whether the constants are strictly monotone or not. Assumption (A5) implies that neither the first-order nor the second-order inclusion probabilities can tend to zero as increases. Moreover, this assumption states that the sampling design covariances () tend to zero, i.e., sampling designs that produce asymptotically highly correlated elements are not allowed. Lastly, Assumptions (A6)-(A8) are similar to the higher-order assumptions considered by Breidt and Opsomer (2000). They involve fourth moment conditions on the sampling design and hold for simple random sampling without replacement and for stratified simple random sampling with fixed stratum boundaries (Breidt and Opsomer, 2000).
3.3 Properties of CIC
Under the above assumptions, CIC is an asymptotically unbiased estimator of PSE when the pooling obtained from applying the PAVA to the vector with weights is unique. To show this, we first prove that certain poolings are chosen with probability tending to zero as tends to infinity. This is stated in Theorem 1, which makes use of the Greatest Convex Minorant (GCM).
The GCM provides an illustrative way to express monotone estimators. Figure 1 displays an example of sample domain means with their respective monotone estimates (Figure 1(a)), and a plot of their corresponding cumulative sum diagram and GCM (Figure 1(b)). The GCM consists of points, indexed from 0 to , and their left-hand slopes are the values. The points indexed by 0 and are the boundaries of the GCM, and the rest are its interior points. Three possible scenarios can be identified for each of the interior points: the slope of the GCM changes (corner points); the GCM slope does not change and the cumulative sum coincides with the minorant (flat spots); or the GCM slope does not change but the cumulative sum is strictly above the minorant (points above the GCM). The example displayed in Figure 1(b) shows that the indexes 1, 2, 5 correspond to corner points, the index 6 to a flat spot, and the indexes 3, 4 to points above the GCM. In particular, note that flat spots correspond to cases where consecutive domain means are equal ().
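The three types of interior points can be identified numerically with a short sketch. The function name `gcm_classification` is hypothetical, the weights are assumed to be positive, and the data below are made up so as to reproduce the classification described above (corner points 1, 2, 5; flat spot 6; points above the GCM 3, 4); they are not the values behind Figure 1.

```python
def gcm_classification(y, w, tol=1e-12):
    """Return (theta, corners, flat, above) for the cumulative sum diagram.

    theta holds the left-hand slopes of the GCM (the isotonic fit); the other
    three lists hold interior indexes (1-based, as in the text).
    """
    D = len(y)
    # Cumulative sum diagram: x_j = sum of weights, g_j = sum of w_i * y_i.
    xs, gs = [0.0], [0.0]
    for yi, wi in zip(y, w):
        xs.append(xs[-1] + wi)
        gs.append(gs[-1] + wi * yi)

    # Lower convex hull of the diagram; its vertices are where the slope changes.
    hull = []
    for j in range(D + 1):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            cross = ((xs[b] - xs[a]) * (gs[j] - gs[a])
                     - (gs[b] - gs[a]) * (xs[j] - xs[a]))
            if cross <= tol:   # b lies on or above the chord from a to j
                hull.pop()
            else:
                break
        hull.append(j)

    theta = [0.0] * D
    flat, above = [], []
    for a, b in zip(hull, hull[1:]):
        slope = (gs[b] - gs[a]) / (xs[b] - xs[a])
        for j in range(a + 1, b + 1):
            theta[j - 1] = slope            # left-hand slope at index j
            if j == b:
                continue                    # hull vertex: corner or boundary
            gcm_j = gs[a] + slope * (xs[j] - xs[a])
            (above if gs[j] - gcm_j > tol else flat).append(j)
    corners = hull[1:-1]                    # interior hull vertices
    return theta, corners, flat, above


# Made-up domain means with unit weights.
y = [1, 2, 5, 4, 3, 6, 6]
w = [1] * 7
theta, corners, flat, above = gcm_classification(y, w)
# theta   -> [1.0, 2.0, 4.0, 4.0, 4.0, 6.0, 6.0]
# corners -> [1, 2, 5]; flat -> [6]; above -> [3, 4]
```

Domains 3 to 5 violate the ordering and are pooled to their average of 4, which places the cumulative sums at indexes 3 and 4 strictly above the minorant. Domains 6 and 7 have equal means, so the cumulative sum at index 6 lies on the minorant without a change of slope. The fixed tolerance `tol` is a simplification; with noisy floating-point data a scale-aware tolerance would be preferable.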
Theorem 1. Let and , for , where , and . Also, let be the GCM points of the cumulative sum diagram with points . Define and to be the indexes of points strictly above and indexes of its corner points, respectively. Based on the sample , define and , with , and let , , , and be the analogous sample quantities of , , , and . Let and denote the events where and , respectively. Then, and .
To better understand Theorem 1, note that for every pair of mutually exclusive sets , , there are certain poolings (groupings) allowed by to obtain . In particular, if (i.e. no flat spots), then there is a unique pooling allowed by . Speaking somewhat loosely, and referring as ‘bad poolings’ to those poolings of that are chosen with zero asymptotic probability, Theorem 1 states that bad poolings correspond to those pairs of disjoint sets , that do not satisfy and .
One case of particular interest is when there are no flat spots on the GCM corresponding to , i.e., . This scenario is equivalent to saying that, asymptotically, there is a unique pooling allowed by . Moreover, under this scenario, it can be proved (Theorem 2) that the proposed CIC in Equation 6 is an asymptotically unbiased estimator of the PSE in Equation 7.
Theorem 2. If , then CICPSE+.
In practice, the proposed CIC can be used as a decision tool that validates the use of the constrained estimator as an estimate of the population domain means. The decision rule is to choose the estimator, either constrained or unconstrained, that produces the smallest CIC value. As mentioned above, CIC is an overall measure that balances the deviation of the constrained estimator from the unconstrained one against the complexity of the estimator. Because CIC accounts for estimator complexity, it avoids the undesirable situation of always choosing the unconstrained estimator over the constrained estimator. Although we focus on the Hájek version of the CIC (Equation 8) for the rest of this section, the following properties are also valid under the Horvitz-Thompson setting.
Let CIC and CIC denote the CIC values for the unconstrained and constrained estimators, respectively. From Equation 8, that is,
where . As with AIC and BIC, we might choose the estimator that produces the smallest CIC value. We show that this decision rule is asymptotically correct when choosing the shape based on the limiting domain means (Theorem 5), and also that the decision made from CIC is consistent with the decision made from PSE (Theorem 6). Theorems 3 and 4 contain theoretical properties of that are required to establish Theorem 5.
Theorem 3. For any domains where , ,
Theorem 4. Let be the weighted isotonic population domain mean vector of with weights . Then,
Theorem 3 states that the scaled is asymptotically bounded and also, that is a consistent estimator of with a rate of . Hence, both the covariance between and , and its proposed estimate are well defined. Theorem 4 establishes that the constrained estimator gets closer to the weighted isotonic population domain mean with a rate of . This theorem generalizes the results in Wu et al. (2016), where only the case of monotone limiting domain means was considered. Recall that if and only if the population domain means are monotone increasing. Theorem 5 shows that CIC consistently chooses the correct estimator based on the order of the limiting domain means .
Finally, Theorem 6 establishes that the chosen estimator driven by PSE in Equation 7 is analogous to the decision made by CIC.
Observe that neither Theorem 5 nor Theorem 6 deals with the case where the entries of the vector are non-strictly monotone. Although in that case we would like both PSE and CIC to choose the constrained estimator, neither of them is able to choose it universally. Nevertheless, we show in the Simulations section that the constrained estimator is chosen with a high frequency under the non-strictly monotone scenario.
4 Simulations
We demonstrate the CIC performance through simulations under several settings. We consider the set-up in Wu et al. (2016) as a baseline to produce our simulation scenarios. For the first set of simulations, we generate populations of size using limiting domain means . Each element in the population domain is independently generated from a normal distribution with mean . That is, for a given domain , for . Samples are generated using a stratified simple random sampling design without replacement in all strata. The strata constitute a partition of the total population of size . We make use of an auxiliary random variable to define the stratum membership of the population elements, with created by adding random noise to , for . Stratum membership of is then determined by sorting the vector , creating blocks of elements based on their ranks, and assigning these blocks to the strata. Also, we set , , , and . The number of replications per simulation is 10000.
The vector of limiting domain means is created using the sigmoid function given by for . We consider three different scenarios for : the monotone scenario, where ’s are strictly increasing; the flat scenario, where ’s are non-strictly increasing; and the non-monotone scenario, where ’s are not monotone increasing. The limiting domain means in the monotone scenario are given by for . The flat scenario is formed by “pulling down” until it is equal to , that is, where . For the non-monotone scenario, we pull down until it gets below by using . Note that the only difference among these three scenarios lies in the right tail. For each of the above scenarios, the total population size varies from . Further, the total sample size is divided among the 4 strata as for , which makes the sampling design informative. Once the sample is generated, the Hájek domain mean estimators are computed along with the CIC in Equation 8.
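The sampling steps described above can be sketched as follows. Every numerical value in this sketch (domain means, domain sizes, noise levels, stratum allocation) is a placeholder chosen for illustration, not a value used in our simulations, and the function names are hypothetical. The auxiliary variable is assumed here to be the domain's limiting mean plus normal noise.

```python
import math
import random


def simulate_population(mu, n_per_domain, sigma, rng):
    """Return a list of (domain, y) pairs with y ~ N(mu[domain], sigma^2)."""
    return [(d, rng.gauss(mu[d], sigma))
            for d in range(len(mu)) for _ in range(n_per_domain)]


def assign_strata(pop, mu, n_strata, noise_sd, rng):
    """Sort units by z = mu[domain] + noise and cut the ranks into blocks."""
    z = [mu[d] + rng.gauss(0.0, noise_sd) for d, _ in pop]
    order = sorted(range(len(pop)), key=z.__getitem__)
    block = math.ceil(len(pop) / n_strata)
    strata = [0] * len(pop)
    for rank, k in enumerate(order):
        strata[k] = rank // block
    return strata


def stratified_srswor(strata, n_h, rng):
    """Draw n_h[h] units without replacement in stratum h.

    Returns {unit index: first-order inclusion probability}.
    """
    sample = {}
    for h, size in enumerate(n_h):
        units = [k for k, s in enumerate(strata) if s == h]
        for k in rng.sample(units, size):
            sample[k] = size / len(units)
    return sample


def hajek_domain_means(pop, sample, n_domains):
    """Hajek estimator per domain: weighted mean with weights 1 / pi_k."""
    num = [0.0] * n_domains
    den = [0.0] * n_domains
    for k, pi in sample.items():
        d, y = pop[k]
        num[d] += y / pi
        den[d] += 1.0 / pi
    return [num[d] / den[d] if den[d] > 0 else float("nan")
            for d in range(n_domains)]


rng = random.Random(1)
mu = [0.2, 0.4, 0.6, 0.8]                        # placeholder domain means
pop = simulate_population(mu, 50, 1.0, rng)      # 4 domains of 50 elements
strata = assign_strata(pop, mu, 4, 1.0, rng)     # 4 strata of 50 elements
sample = stratified_srswor(strata, [5, 10, 15, 20], rng)  # unequal allocation
estimates = hajek_domain_means(pop, sample, len(mu))
```

Because the strata are built from a variable correlated with the study variable and the allocation is unequal across strata, elements with larger values are sampled at a higher rate, which is what makes the design informative. A domain with no sampled elements yields `nan` in this sketch.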
We consider the design Mean Squared Error (MSE) of any estimator given by
For each scenario mentioned above, we compute both the MSE for the unconstrained estimator MSE and for the constrained estimator MSE through simulations. In addition, we compute the MSE for the CIC-adaptive estimator , given by
Although there are no other existing methods that aim to choose between the unconstrained and the constrained estimator for survey data, we compare the performance of CIC with that of two conditional testing methods that are based on the following hypothesis test under the linear regression model setting,
The first test is a naive Wald test which depends on the sample-observed pooling. For this, we compute the test statistic
and then compare it to a , where is the number of different estimated values on .
The second test is the conditional test proposed by Wollan and Dykstra (1986). Even though the latter test is established for independent data with known variances, we instead use the estimated design variances of the sample-observed pooling obtained from Equation 5. To perform this test, we compute the test statistic as in the Wald test, but then compare it to a with point mass of at , where is the probability that under the hypothesis . Note that the conditional test might perform similarly to the Wald test when the number of domains is large.
Since both the Wald and conditional tests require the variance-covariance matrix of the domain mean estimators to be non-singular, they can be performed only when the matrix formed by the estimates in Equation 5 is in fact a valid covariance matrix. We set the significance level of these tests at .
Tables 1, 2 and 3 contain the proportion of times that the unconstrained estimator is chosen over the constrained estimator under the monotone, flat and non-monotone scenarios, respectively. In cases where the unconstrained and constrained estimators agree (i.e. the unconstrained estimator satisfies the constraint), the outcome is counted as a choice of the constrained estimator in the calculation of this proportion. The last two rows of these tables show the MSE of the constrained estimator and the CIC-adaptive estimator, relative to the MSE of the unconstrained estimator. The former ratio can be viewed as a measure of how much better (or worse) naively applying the constrained estimator is under the different scenarios, while the latter ratio shows how well the adaptive estimator balances the MSEs of the constrained and unconstrained estimators.
From Table 1, we note that CIC tends not to choose the unconstrained estimator under the monotone scenario as increases. In contrast, the unconstrained estimator is chosen most of the time under the non-monotone simulation scenario (Table 3). Flat scenario results (Table 2) show that although the proportion of times the unconstrained estimator is chosen does not tend to zero as grows, it is fairly small, meaning that CIC chooses the constrained estimator most of the time. From these three tables, we observe that CIC tends to be more conservative in choosing the unconstrained estimator over the constrained one, in comparison with both the Wald and conditional tests.