Reducing Variance with Sample Allocation Based on Expected Response Rates

05/25/2020 ∙ by Blanka Szeitl, et al.

Several techniques exist to assess and reduce nonresponse bias, including propensity models, calibration methods, and post-stratification. These approaches can only be applied after data collection, and they assume reliable information regarding unit nonresponse patterns for the entire population. In this paper, we demonstrate that sample allocation taking into account the expected response rates (ERR) has advantages in this context. The performance of ERR allocation is assessed by comparing the variances of the estimates obtained with those arising from a classical allocation proportional to size (PS) followed by post-stratification. The main theoretical tool is asymptotic calculation using the δ-method, complemented with extensive simulations. The main finding is that the ERR allocation leads to lower variances than the PS allocation when the response rates are correctly specified, and also under a wide range of conditions when the response rates cannot be correctly specified in advance.


1 Introduction

In the sample design of household or individual surveys, complex selection methods are typically needed to reduce variance or to achieve variance efficiency. Precise sample allocation methods are, in theory, suitable for producing samples that are representative with respect to pre-specified variables.
Most commonly, allocation proportional to size (PS) Lavrakas2008 and Neyman’s allocation (which involves the previously estimated, or expected, variance of the investigated variable) Lavrakas2008 are in use. However, the realized samples differ from the allocated ones. The differences between allocated and realized sample sizes are increasing with a growing reluctance among respondents to participate in surveys Stoop2004. Field experience and the analysis of current survey meta-data show that the increase in survey nonresponse is neither steady nor equal among population subgroups Meyer2015, which leads to sample bias.

Unit nonresponse (when a household/individual in a sample is not interviewed at all) has been rising worldwide in most surveys. Unit nonresponse generally arises from non-contact (the household/individual cannot be reached), from refusal (the household/individual is reached but refuses to participate in the survey) or from incapacity to answer (the household/individual would participate but cannot because of, e.g., health problems) Osier2016. Unit nonresponse only leads to bias if it is nonrandom. However, exploring whether unit nonresponse is random can be difficult, because only limited information is available on the characteristics of nonrespondents. Even if nonrespondents are similar to respondents on a limited number of characteristics (gender, age and geography), this does not mean that these groups are similar along other dimensions, such as willingness to participate in government programs, social attitudes etc. Meyer2015. The social background of unit nonresponse has attracted great research interest Goyder2002. The most commonly used hypotheses are linked to the theory of general and social trust and to the theory of social integration Amaya2016. For survey methodologists, nonresponse analysis is useful only to a certain depth, because background theories are difficult to adapt to sampling procedures and to nonresponse weighting processes. This is one of the main reasons why other studies examined basic proxies for more complex background theories (e.g. for weaker social integration) Abraham2016. Several meta-analyses identified response characteristics within population subgroups; however, stable and universal nonresponse patterns, or response probability estimates, are still missing. These studies found that single-person households, renters, and individuals out of the labor force are less likely to participate in surveys than other social groups Abraham2016, Meyer2015, a finding that can be used directly for sampling, both in defining the strata and in the allocation procedures.

The goal of this paper is to show that if significant and stable differences in nonresponse patterns can be established among given demographic subgroups of the population, sample allocation taking into account the expected response rates (ERR) has advantages. In fact, the ERR allocation leads to lower variances than those obtained with a classical allocation proportional to size (PS). The remainder of the paper is organized as follows. First, we briefly introduce the procedure of allocation proportional to size (PS) with post-stratification (Section 2.1), and in Section 2.2 the method of ERR allocation is presented. The performance of the ERR allocation is assessed by comparing the variances of the estimates resulting from the estimation procedures (Section 3), and the asymptotic calculations of the variances are done by the δ-method in Sections 4.1 and 4.2. The variance comparison is done first assuming correctly specified response rates in Section 4.3. Here, the sampling procedure assumes that the expected response rates in the strata are close to the observed ones, and slight differences between the allocated and intended samples are corrected with post-stratification. In Section 4.4 the variance comparison is done in the case of misspecified response rates, and simulation results are presented.

2 Sample Allocation

Denote the population size with $N$, and let $N_h$, $h = 1, \dots, H$, be the sizes of the strata relevant for the sampling procedure, with $\sum_{h=1}^{H} N_h = N$. In a stratified random sample, a simple random sample of $n_h$ elements is taken from each stratum $h$, with a total sample size of $\sum_{h=1}^{H} n_h$ elements.

To be able to distinguish the PS allocation from the ERR allocation later, in case of the PS allocation $n_h^{PS}$ denotes the sub-sample size within stratum $h$, and in case of the ERR allocation $n_h^{ERR}$ denotes the sub-sample size within stratum $h$.

When nonresponse is taken into account, it is important to distinguish allocated and intended sample sizes. Let $r$ denote the average response rate of the total population and $r_h$, $h = 1, \dots, H$, the expected strata-specific response rates. (Expected strata-specific response rates can be derived from previous research experiences, meta-data analysis or simple estimation based on the research design.) Clearly,

$$r = \sum_{h=1}^{H} \frac{N_h}{N} r_h .$$

If the nonresponse rate is assumed to be constant across strata, the relation between allocated and intended sample sizes is

$$n_{\text{allocated}} = \frac{n_{\text{intended}}}{r} .$$

In the sequel, the intended total sample size in the survey is denoted by $n$.

2.1 Allocation Proportional to Size (PS)

In proportional to size allocation, the sampling fraction is specified to be the same for each stratum, which also implies that the overall sampling fraction is the fraction taken from each stratum, and

$$n_h^{PS} = \frac{N_h}{N} \cdot \frac{n}{r} . \tag{1}$$

The total allocated sample size is

$$n^{PS} = \sum_{h=1}^{H} n_h^{PS} = \frac{n}{r} . \tag{2}$$

Nonresponse, particularly when the nonresponse rate differs across strata, can drastically affect the realized sample size in each stratum and in total. In the case of the PS allocation, the response rate $r$ is assumed to be constant across strata, and the expected response rate is assumed to be close to the observed one. Slight observed differences are corrected with post-stratification. Note that a stratified sample with proportional allocation is self-weighting only if the proportion of sampled individuals who respond is the same within each stratum.

2.2 Allocation Based on Different Expected Response Rates (ERR)

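In ERR allocation, the number of allocated elements in each stratum is specified using the expected strata-specific response rates $r_h$. The allocated sample size in each stratum is:

$$n_h^{ERR} = \frac{N_h}{N} \cdot \frac{n}{r_h} . \tag{3}$$

The total allocated sample size is

$$n^{ERR} = \sum_{h=1}^{H} n_h^{ERR} = \sum_{h=1}^{H} \frac{N_h}{N} \cdot \frac{n}{r_h} . \tag{4}$$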
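The two allocation rules can be compared numerically. The following is a minimal sketch, assuming the PS rule inflates every stratum's proportional share by the common average rate $r$ and the ERR rule inflates it by the stratum's own expected rate $r_h$; the function names and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def ps_allocation(strata_sizes: Sequence[float], n_intended: float,
                  avg_response_rate: float) -> List[float]:
    """PS allocation: (N_h / N) * n / r, the same inflation in every stratum."""
    total = sum(strata_sizes)
    return [N_h / total * n_intended / avg_response_rate for N_h in strata_sizes]


def err_allocation(strata_sizes: Sequence[float], n_intended: float,
                   expected_rates: Sequence[float]) -> List[float]:
    """ERR allocation: (N_h / N) * n / r_h, inflation by the stratum's own rate."""
    total = sum(strata_sizes)
    return [N_h / total * n_intended / r_h
            for N_h, r_h in zip(strata_sizes, expected_rates)]


# Hypothetical illustration values (not taken from the paper).
strata_sizes = [5000, 3000, 2000]
expected_rates = [0.8, 0.5, 0.4]
n_intended = 1000

# Average response rate as the population-weighted mean of the expected rates.
avg_rate = sum(N_h * r_h for N_h, r_h in zip(strata_sizes, expected_rates)) / sum(strata_sizes)

ps = ps_allocation(strata_sizes, n_intended, avg_rate)
err = err_allocation(strata_sizes, n_intended, expected_rates)
print("PS :", [round(x, 1) for x in ps])
print("ERR:", [round(x, 1) for x in err])
```

With these values, the expected number of respondents under ERR is 500, 300 and 200, which matches the population shares, whereas under PS it is roughly 635, 238 and 127, so the low-response strata end up under-represented before post-stratification.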

3 Estimation Procedures

In order to assess the ERR and PS allocations by comparing the variances of the estimates obtained, in this section we discuss the estimation procedures whose asymptotic variances will be derived using the δ-method in Section 4.

The task is to estimate the fraction of those who would say ’yes’ to a given question, based on samples with ERR and PS allocations, respectively. In both cases, post-stratification will be applied before estimation to properly reproduce the relative sizes of the strata in the population.

It is assumed that responding to the survey is probabilistic, occurs in stratum $h$ with probability $q_h$, and is independent of the true answer to the question of interest. The probability of nonresponse is therefore $1 - q_h$ in each stratum $h$. (For the present argument, it is irrelevant whether nonresponse applies to the entire survey or only to the current question.) Thus, the data are missing completely at random. The probability of the response ’yes’ is assumed to be $p_h$ in each stratum $h$. Under the previous assumptions, the complete data, for each stratum, would be the observation of a variable $X_h = (X_{h1}, X_{h2}, X_{h3}, X_{h4})$ with the following components:

  1. $X_{h1}$ counts the number of cases when the selected respondent did not answer and the answer would have been ’no’;

  2. $X_{h2}$ counts the number of cases when the selected respondent did not answer and the answer would have been ’yes’;

  3. $X_{h3}$ counts the number of cases when the selected respondent did answer and the answer was ’no’;

  4. $X_{h4}$ counts the number of cases when the selected respondent did answer and the answer was ’yes’.

Within stratum $h$, $X_h$ has a multinomial distribution with parameters $n_h$ and $\pi_h$, where $n_h$ is the allocated sample size for stratum $h$. Note that the allocated sample sizes are different under ERR and PS allocations. For the entire sample, the variables $X_1, \dots, X_H$ have a product multinomial distribution. Under the assumed independence, the relevant population probabilities in stratum $h$ are

$$\pi_h = \big( (1-q_h)(1-p_h),\ (1-q_h)\,p_h,\ q_h(1-p_h),\ q_h\,p_h \big)$$

and the observed sample size is $X_{h3} + X_{h4}$, instead of the allocated sample size. Thus, for each observation in stratum $h$, a post-stratification weight of

$$w_h = \frac{N_h}{N} \cdot \frac{\sum_{j=1}^{H} (X_{j3} + X_{j4})}{X_{h3} + X_{h4}}$$

is applied, which adjusts the fraction of the sample size in stratum $h$ to be equal to the population fraction of stratum $h$ but does not change the total observed sample size. After the weight is applied, $X_{hj}$ is replaced by

$$w_h X_{hj}, \quad j = 3, 4 .$$

With this, the natural estimator for the fraction of ’yes’ responses in stratum $h$ is

$$\hat{p}_h = \frac{w_h X_{h4}}{w_h X_{h3} + w_h X_{h4}} = \frac{X_{h4}}{X_{h3} + X_{h4}} , \tag{5}$$

which is the relative frequency of ’yes’ responses among all responses observed in stratum $h$. The estimator for the fraction of ’yes’ responses in the total sample is

$$\begin{align}
\hat{p} &= \frac{\sum_{h=1}^{H} w_h X_{h4}}{\sum_{h=1}^{H} w_h (X_{h3} + X_{h4})} \tag{6} \\
&= \frac{\sum_{h=1}^{H} \frac{N_h}{N} \left( \sum_{j=1}^{H} (X_{j3} + X_{j4}) \right) \frac{X_{h4}}{X_{h3} + X_{h4}}}{\sum_{h=1}^{H} \frac{N_h}{N} \sum_{j=1}^{H} (X_{j3} + X_{j4})} \tag{7} \\
&= \sum_{h=1}^{H} \frac{N_h}{N} \cdot \frac{X_{h4}}{X_{h3} + X_{h4}} \tag{8} \\
&= \sum_{h=1}^{H} \frac{N_h}{N}\, \hat{p}_h , \tag{9}
\end{align}$$

which is the weighted fraction of ’yes’ responses among all responses observed in the total sample.
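The post-stratified estimator can be illustrated with a few lines of code. This is a minimal sketch, assuming the weight in each stratum is the population share multiplied by the total number of respondents and divided by the number of respondents in that stratum; the function names and the counts are hypothetical.

```python
from typing import List, Sequence


def poststratification_weights(strata_sizes: Sequence[float],
                               yes_counts: Sequence[int],
                               no_counts: Sequence[int]) -> List[float]:
    """Weight per stratum: (N_h / N) * (all respondents) / (respondents in h)."""
    responses = [y + n for y, n in zip(yes_counts, no_counts)]
    if min(responses) == 0:
        raise ValueError("every stratum needs at least one respondent")
    total_pop = sum(strata_sizes)
    total_resp = sum(responses)
    return [N_h / total_pop * total_resp / m_h
            for N_h, m_h in zip(strata_sizes, responses)]


def poststratified_estimate(strata_sizes: Sequence[float],
                            yes_counts: Sequence[int],
                            no_counts: Sequence[int]) -> float:
    """Weighted share of 'yes' answers among all observed answers."""
    weights = poststratification_weights(strata_sizes, yes_counts, no_counts)
    weighted_yes = sum(w * y for w, y in zip(weights, yes_counts))
    weighted_all = sum(w * (y + n) for w, y, n in zip(weights, yes_counts, no_counts))
    return weighted_yes / weighted_all


# Hypothetical respondent counts per stratum (not taken from the paper).
strata_sizes = [5000, 3000, 2000]
yes_counts = [240, 150, 50]
no_counts = [160, 150, 200]

weights = poststratification_weights(strata_sizes, yes_counts, no_counts)
estimate = poststratified_estimate(strata_sizes, yes_counts, no_counts)
print(weights, estimate)
```

In this example the weighted respondent total stays at 950, the unweighted number of respondents, and the estimate equals the population-share-weighted average of the within-stratum 'yes' fractions 0.6, 0.5 and 0.2.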

4 Variance Comparison

In this section we compare the variances of estimates derived from the ERR and PS allocations. This will be done with the δ-method, whose theoretical background is briefly introduced in the first sub-section. Then, the relevant calculations will be shown in the second sub-section.

4.1 The δ-Method

The δ-method is a general method to derive asymptotic variance formulas for functions of random variables with known variances. The first formulation of this method was used for the estimation of moments of functions of samples; see Cramer1946, Oehlert1992. The same formulas are often used as approximate variances for finite but sufficiently large sample sizes Rudas2018.

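Theorem 4.1 (Multidimensional case of the δ-method).

Let $X_n$ be a sequence of $k$-dimensional vector-valued random variables such that for some parameter $e$,

$$\sqrt{n}\,(X_n - e) \xrightarrow{d} N(0, \Sigma) \tag{10}$$

and let the function $f: \mathbb{R}^k \to \mathbb{R}^l$ be differentiable at $e$. Then

$$\sqrt{n}\,\big(f(X_n) - f(e)\big) \xrightarrow{d} N\big(0,\ \partial f(e)\, \Sigma\, \partial f(e)^{T}\big) . \tag{11}$$

Here $\partial f$ denotes the derivative of $f$, where $f$ has coordinates $f_1, \dots, f_l$, each of which is a function of $k$ variables. $\partial f$ is an $l \times k$ matrix, which has one row for every coordinate of $f$ and one column for every variable of $f$, and it contains the partial derivatives of $f$.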

Consequently, in the application in the next sub-section, partial derivatives and covariance matrix play a central role.

4.2 Application of the δ-Method

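The estimator for the fraction of ’yes’ responses in the total sample is the result of the estimation procedure (9) in Section 3, which is the weighted fraction of ’yes’ responses among all responses observed in the total sample. In this section we apply the δ-method to derive the asymptotic variance of the estimator for the fraction of ’yes’ responses, first for one stratum and then extended to the total sample.
In one stratum, omitting the index $h$ everywhere except in the allocated sample size $n_h$, the estimation function is $f(x_1, x_2, x_3, x_4) = x_4 / (x_3 + x_4)$ and its partial derivatives are

$$\frac{\partial f}{\partial x_1} = \frac{\partial f}{\partial x_2} = 0, \qquad \frac{\partial f}{\partial x_3} = -\frac{x_4}{(x_3 + x_4)^2}, \qquad \frac{\partial f}{\partial x_4} = \frac{x_3}{(x_3 + x_4)^2} .$$

The partial derivative vector with the components above, evaluated at the expectations $E(X_3) = n_h q (1-p)$ and $E(X_4) = n_h q p$, is

$$\partial f = \left( 0,\ 0,\ -\frac{p}{n_h q},\ \frac{1-p}{n_h q} \right) . \tag{12}$$

As $X$ has a multinomial distribution with probability vector $\pi = \big((1-q)(1-p),\ (1-q)p,\ q(1-p),\ qp\big)$, its covariance matrix is

$$\Sigma = n_h \left( \operatorname{diag}(\pi) - \pi^{T} \pi \right) . \tag{13}$$

Based on Theorem 4.1, the asymptotic variance is obtained as $\partial f\, \Sigma\, \partial f^{T}$. Since $\pi_3 = q(1-p)$ and $\pi_4 = qp$, with the proper substitutions the asymptotic variance is

$$\begin{align}
V(\hat{p}) &= n_h \left[ \frac{p^2}{n_h^2 q^2}\, q(1-p) + \frac{(1-p)^2}{n_h^2 q^2}\, qp \right] - n_h \left[ -\frac{p}{n_h q}\, q(1-p) + \frac{1-p}{n_h q}\, qp \right]^2 \tag{14} \\
&= \frac{p(1-p)}{n_h\, q} . \tag{15}
\end{align}$$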
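The single-stratum result can be checked by simulation. The following is a minimal sketch, assuming the δ-method variance of the within-stratum estimator is $p(1-p)/(n_h q)$, with $n_h$ the allocated sample size, $q$ the response probability and $p$ the probability of a 'yes' answer; the parameter values, the function name and the number of replications are hypothetical.

```python
import random
import statistics
from typing import List


def simulate_stratum_estimates(n_allocated: int, q: float, p: float,
                               replications: int, seed: int = 0) -> List[float]:
    """Repeatedly draw one stratum sample and return the 'yes' share among respondents."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replications):
        answered_yes = 0
        answered_no = 0
        for _ in range(n_allocated):
            if rng.random() < q:        # the selected unit responds
                if rng.random() < p:    # answer is independent of responding
                    answered_yes += 1
                else:
                    answered_no += 1
        if answered_yes + answered_no > 0:  # skip samples without respondents
            estimates.append(answered_yes / (answered_yes + answered_no))
    return estimates


# Hypothetical parameter values (not taken from the paper).
n_allocated, q, p = 400, 0.5, 0.3
estimates = simulate_stratum_estimates(n_allocated, q, p, replications=2000)

empirical_variance = statistics.variance(estimates)
delta_variance = p * (1 - p) / (n_allocated * q)
print(f"empirical: {empirical_variance:.6f}  delta-method: {delta_variance:.6f}")
```

The two printed values should be close but not identical, since the δ-method formula is a large-sample approximation and the empirical variance carries Monte Carlo error.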

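The asymptotic variance in stratum $h$, with observed response probability $q_h$ and the allocation formula (1) in case of PS allocation, is

$$V(\hat{p}_h^{PS}) = \frac{p_h(1-p_h)}{n_h^{PS}\, q_h} = \frac{N\, r}{N_h\, n} \cdot \frac{p_h(1-p_h)}{q_h} \tag{16}$$

and for the total sample, since the strata are sampled independently, one obtains that

$$\begin{align}
V^{PS} &= \sum_{h=1}^{H} \left( \frac{N_h}{N} \right)^2 V(\hat{p}_h^{PS}) \tag{17} \\
&= \frac{r}{n} \sum_{h=1}^{H} \frac{N_h}{N} \cdot \frac{p_h(1-p_h)}{q_h} . \tag{18}
\end{align}$$

The asymptotic variance in stratum $h$ with the allocation formula (3) in case of ERR allocation is

$$V(\hat{p}_h^{ERR}) = \frac{p_h(1-p_h)}{n_h^{ERR}\, q_h} = \frac{N\, r_h}{N_h\, n} \cdot \frac{p_h(1-p_h)}{q_h} \tag{19}$$

and for the total sample one obtains that

$$\begin{align}
V^{ERR} &= \sum_{h=1}^{H} \left( \frac{N_h}{N} \right)^2 V(\hat{p}_h^{ERR}) \tag{20} \\
&= \frac{1}{n} \sum_{h=1}^{H} \frac{N_h}{N} \cdot \frac{r_h}{q_h}\, p_h(1-p_h) . \tag{21}
\end{align}$$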

The results for the variances of the estimates under the ERR and PS allocations are summarized in Theorem 4.2.

Theorem 4.2 (Variance of Estimates).

Denote by $N_h$ the size of the population in each stratum and by $N$ the total population size. Let $n$ be the intended total sample size, $r$ the expected response rate in the population, and $r_h$ the expected and $q_h$ the observed response rates in stratum $h$. Denote by $p_h$ the estimated parameter in each stratum $h$. Let $V^{PS}$ be the total variance of estimates based on a sample drawn by proportional allocation and $V^{ERR}$ be the total variance of estimates based on a sample drawn by allocation based on different expected response rates. Then,

$$V^{PS} = \frac{r}{n} \sum_{h=1}^{H} \frac{N_h}{N} \cdot \frac{p_h(1-p_h)}{q_h} , \tag{22}$$

$$V^{ERR} = \frac{1}{n} \sum_{h=1}^{H} \frac{N_h}{N} \cdot \frac{r_h}{q_h}\, p_h(1-p_h) . \tag{23}$$
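The two variance formulas are straightforward to evaluate. Below is a minimal sketch, assuming the PS variance is $(r/n)\sum_h (N_h/N)\,p_h(1-p_h)/q_h$ and the ERR variance is $(1/n)\sum_h (N_h/N)\,(r_h/q_h)\,p_h(1-p_h)$, where $q_h$ is the observed and $r_h$ the expected response rate; the function names and the numbers are hypothetical.

```python
from typing import Sequence


def variance_ps(strata_sizes: Sequence[float], p: Sequence[float],
                observed_rates: Sequence[float], avg_rate: float,
                n_intended: float) -> float:
    """Asymptotic variance of the post-stratified estimate under PS allocation."""
    total = sum(strata_sizes)
    return avg_rate / n_intended * sum(
        N_h / total * p_h * (1 - p_h) / q_h
        for N_h, p_h, q_h in zip(strata_sizes, p, observed_rates))


def variance_err(strata_sizes: Sequence[float], p: Sequence[float],
                 observed_rates: Sequence[float], expected_rates: Sequence[float],
                 n_intended: float) -> float:
    """Asymptotic variance of the post-stratified estimate under ERR allocation."""
    total = sum(strata_sizes)
    return 1 / n_intended * sum(
        N_h / total * p_h * (1 - p_h) * r_h / q_h
        for N_h, p_h, q_h, r_h in zip(strata_sizes, p, observed_rates, expected_rates))


# Hypothetical setup with correctly specified rates (observed == expected).
strata_sizes = [5000, 3000, 2000]
p = [0.5, 0.5, 0.5]
expected_rates = [0.8, 0.5, 0.4]
observed_rates = expected_rates
avg_rate = sum(N_h * r_h for N_h, r_h in zip(strata_sizes, expected_rates)) / sum(strata_sizes)

v_ps = variance_ps(strata_sizes, p, observed_rates, avg_rate, n_intended=1000)
v_err = variance_err(strata_sizes, p, observed_rates, expected_rates, n_intended=1000)
print(v_ps, v_err)
```

In this particular setup the ERR variance is about 0.00025 and the PS variance about 0.00027, so ERR has the smaller variance; other parameter choices can be explored by changing the lists above.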

4.3 Comparison Under Correctly Specified Response Rates

In this section we prove that in case of correctly specified response rates, $q_h = r_h$, the variance of the estimate based on the ERR allocation is less than or equal to that derived from the PS allocation:

Theorem 4.3 (The relation between the variances).

Let $V^{PS}$ be the total variance of the estimates based on a sample drawn by proportional allocation given in (22), and $V^{ERR}$ be the total variance of the estimates based on a sample drawn by allocation based on different expected response rates given in (23). If the observed response rates are equal to the expected response rates, then,

$$V^{ERR} \le V^{PS} . \tag{24}$$
Proof.

If the response rates are correctly specified, then $q_h = r_h$, and thus $r = \sum_{h=1}^{H} \frac{N_h}{N} r_h$ is the average response rate among strata. Since $N_h$, $N$ and $p_h$ are population parameters, and $n$ is a fixed constant, it is enough to see that

$$\frac{1}{\sum_{h=1}^{H} \frac{N_h}{N} \cdot \frac{1}{r_h}} \le \sum_{h=1}^{H} \frac{N_h}{N}\, r_h . \tag{25}$$

Since the left hand side of (25) is the harmonic mean of the response rates and the right hand side of (25) is the arithmetic mean of the response rates, the classic weighted harmonic-arithmetic means inequality can be used, which states that the harmonic mean is less than or equal to the arithmetic mean, and this concludes the proof. (Within the theory of Hölder means, the inequality of arithmetic and harmonic means, or briefly the AM-HM inequality (more precisely, the geometric mean is also involved, and it is then called the AM-GM-HM inequality), states that the arithmetic mean of a list of non-negative real numbers is greater than or equal to the harmonic mean of the same list, and further, that the two means are equal if and only if every number in the list is the same. There are various methods of proof, including mathematical induction, the Cauchy-Schwarz inequality, Lagrange multipliers, and Jensen’s inequality Bullen2003.) ∎
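The inequality used in the proof can be checked numerically. This is a minimal sketch of the weighted harmonic and arithmetic means, with the population shares of the strata as weights; the helper names and the values are hypothetical. The sketch checks only the mean inequality itself; the reduction of the variance comparison to this inequality is immediate when $p_h(1-p_h)$ takes the same value in every stratum.

```python
from typing import Sequence


def weighted_arithmetic_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Sum of w_h * v_h, for weights that sum to one."""
    return sum(w * v for w, v in zip(weights, values))


def weighted_harmonic_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Reciprocal of the sum of w_h / v_h, for weights that sum to one."""
    return 1 / sum(w / v for w, v in zip(weights, values))


# Hypothetical population shares and response rates.
weights = [0.5, 0.3, 0.2]
rates = [0.8, 0.5, 0.4]

hm = weighted_harmonic_mean(rates, weights)
am = weighted_arithmetic_mean(rates, weights)
print(f"harmonic: {hm:.4f}  arithmetic: {am:.4f}")

# With identical rates in all strata the two means coincide.
equal_rates = [0.6, 0.6, 0.6]
hm_equal = weighted_harmonic_mean(equal_rates, weights)
am_equal = weighted_arithmetic_mean(equal_rates, weights)
```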

4.4 Comparison Under Misspecified Response Rates

In this section we compare the ERR and PS allocation methods under misspecification, that is, when the real response rates differ from the expected ones used in the sample allocation. The comparison is done between the variances of estimates derived from the ERR and PS allocations in several simulated sampling setups with a fixed number of strata, $H$.
During the simulations, several possible combinations of the following parameters are generated: expected response rates $r_h$; observed response rates $q_h$; estimated parameter $p_h$ in every stratum. The size of the population $N$, the sizes of the strata $N_h$ and the desired total sample size $n$ are fixed. With all these parameters, a range of different base sampling setups is defined. (During the simulations, the parameters $r_h$ and $q_h$ are generated over a fixed set of values, and the estimated parameter $p_h$ takes every value on an evenly spaced grid.)
The variances of the estimates are calculated with the δ-method for each sampling setup, and Figures 1 - 4 show the simulation results.
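A simulation of this kind can be sketched as a grid search over expected and observed response rates. The code below is a minimal illustration, not the authors' simulation: it assumes three strata, a hypothetical grid of rates, equal $p_h$ in all strata, and the variance formulas $V^{PS} = (r/n)\sum_h (N_h/N)\,p_h(1-p_h)/q_h$ and $V^{ERR} = (1/n)\sum_h (N_h/N)\,(r_h/q_h)\,p_h(1-p_h)$, with $r$ taken as the population-weighted average of the expected rates $r_h$.

```python
from itertools import product
from typing import Sequence


def variance_ratio(weights: Sequence[float], p: Sequence[float],
                   expected_rates: Sequence[float],
                   observed_rates: Sequence[float]) -> float:
    """V_ERR / V_PS; the intended sample size n cancels in the ratio."""
    avg_expected = sum(w * r for w, r in zip(weights, expected_rates))
    err_sum = sum(w * p_h * (1 - p_h) * r / q
                  for w, p_h, r, q in zip(weights, p, expected_rates, observed_rates))
    ps_sum = avg_expected * sum(w * p_h * (1 - p_h) / q
                                for w, p_h, q in zip(weights, p, observed_rates))
    return err_sum / ps_sum


# Hypothetical setup: population shares, 'yes' probabilities, and rate grid.
weights = (0.5, 0.3, 0.2)
p = (0.5, 0.5, 0.5)
rate_grid = (0.2, 0.4, 0.6, 0.8)

results = []
for expected in product(rate_grid, repeat=3):
    for observed in product(rate_grid, repeat=3):
        # Total absolute misspecification of the response rates.
        misspecification = sum(abs(r - q) for r, q in zip(expected, observed))
        results.append((misspecification, variance_ratio(weights, p, expected, observed)))

share_err_better = sum(ratio < 1 for _, ratio in results) / len(results)
print(f"setups: {len(results)}, share with V_ERR < V_PS: {share_err_better:.3f}")
```

A ratio below one means that ERR has the smaller variance in that setup. Plotting the ratio against the misspecification measure, or against the spread of the observed rates, gives a rough analogue of the comparisons described in the figures.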

Figure 1 shows a comparison of the variances of the estimates obtained by ERR and PS allocations. The comparison is given as a function of the total absolute misspecification of the response rates (x-axis) and of the total absolute distance of the real response rates from their weighted average (y-axis). It can be seen that the amount of misspecification between expected and real response rates has a large impact on how the ERR allocation performs relative to the PS allocation, but this impact depends on the total absolute distance of the real response rates from their weighted average. If the real response rates are close to their weighted average, the ERR allocation performs better only if the response probabilities are not too poorly estimated. If the total absolute distance of the real response rates from their weighted average is high, the variance of the ERR allocation is smaller than that of the PS allocation, in some cases even with a higher misspecification rate.

Figure 1: Comparison of the variances of the estimates obtained by ERR and PS allocations, by the total absolute misspecification of the response rates (x-axis) and the total absolute distance of the real response rates from their weighted average (y-axis).

In Figure 2 the comparison of the variances of the estimates obtained by ERR and PS allocations is shown in terms of the total absolute misspecification of the response rates (x-axis) and the total absolute distance of the real response rates from the expected response rates (y-axis). When both of these are relatively low, the ERR allocation performs better. In the extreme areas of this plot, the ERR and PS allocations perform equally well.

Figure 2: Comparison of the variances of the estimates obtained by ERR and PS allocations, by the total absolute misspecification of the response rates (x-axis) and the total absolute distance of the real response rates from the expected response rates (y-axis).

Figure 3 is a combination of the two previously presented plots. Here, the comparison of the variances of the estimates obtained by the ERR and PS allocations is presented in terms of the total absolute distance of the response rates from their weighted average (x-axis) and the total absolute distance of the real response rates from the expected response rates (y-axis). It clearly shows that if the real response rates are closer to the expected ones than to their weighted average, the ERR allocation performs better in every possible sampling setup.

Figure 3: Comparison of the variances of the estimates obtained by the ERR and the PS allocations, by the total absolute distance of the response rates from their weighted average (x-axis) and the total absolute distance of the real response rates from the expected response rates (y-axis).

In Figure 4 the comparison of the variances of the estimates obtained by ERR and PS allocations is presented in terms of the total absolute distance of the real response rates from their weighted average (x-axis), the ratio of the variances of the estimates obtained by ERR and PS allocations (y-axis) and the total absolute distance of the real response rates from the expected response rates (colors). Simulation data are grouped by the value of $p_h$ (the proportion of ’yes’ answers) into four different set-ups of $p_h$ across the strata. The figure shows that diversity of $p_h$ across strata (upper right-hand plot) can produce bigger differences between the variances of estimates under the ERR and PS allocations, but in general, the closer the real response probabilities are to the expected ones and the farther they are from their weighted average, the better the ERR allocation performs.

Figure 4: Comparison of the variances of the estimates obtained by ERR and PS allocations, by the total absolute distance of the real response rates from their weighted average (x-axis), ratio of the variances of the estimates obtained by ERR and PS allocations (y-axis) and the total absolute distance of the real response rates from the expected response rates (colors). Charts are grouped by the four set-ups of the value of $p_h$ in the strata.

5 Conclusion

In this paper we showed how expected nonresponse rates can be incorporated into the allocation procedure in survey sampling. In the ERR allocation (allocation based on expected response rates), the stratum-specific expected response rates are used to determine the allocated sample sizes within each stratum. We assessed the method by comparing it to a standard proportional allocation method (PS), where stratum-specific response rates are not used. The assessment utilized the δ-method.
The first finding of the paper is that if the stratum-specific response rates are correctly specified, the ERR allocation performs better than the PS allocation in terms of the variances of estimates. In practice, however, it may be difficult to estimate the stratum-specific response rates precisely before sampling. In such cases, approximate response rates based on experience need to be used. Based on the simulation results presented in the paper, the ERR allocation still performs better than the PS allocation provided any of the following conditions holds:

  • the total absolute distance of the real response rates from the expected response rates is small, i.e., misspecification is moderate;

  • the total absolute distance of the real response rates from their weighted average is high, i.e., the real response rates differ from each other highly;

  • the real response rates are closer to the expected ones than to their weighted average.