Generalised regression estimation given imperfectly matched auxiliary data

05/19/2020
by   Li-Chun Zhang, et al.
University of Southampton

Generalised regression estimation allows one to make use of available auxiliary information in survey sampling. We develop three types of generalised regression estimator when the auxiliary data cannot be matched perfectly to the sample units, so that the standard estimator is inapplicable. The inference remains design-based. Consistency of the proposed estimators is either given by construction or else can be tested given the observed sample and links. Mean square errors can be estimated. A simulation study is used to explore the potential of the proposed estimators.


1 Introduction

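Let $y_i$ be the value observed for each unit $i$ in a sample $s$ from the population $U$. Design-unbiased estimation of the population total $Y = \sum_{i \in U} y_i$ can be achieved using the sample inclusion probabilities $\pi_i$ for $i \in s$. Let $x_i$ be the vector of known auxiliary values for each $i \in U$. By incorporating these auxiliary values, the generalised regression (GREG) estimator (e.g. Särndal, 1992) can often improve the efficiency of estimation. The GREG estimator of $Y$ is given by

$$\hat{Y}_{GR} = \hat{Y} + (X - \hat{X})^{\top} \hat{b} \qquad (1)$$

where $\hat{Y} = \sum_{i \in s} y_i/\pi_i$ is the Horvitz-Thompson (HT) estimator of $Y$ and $\hat{X}$ that of $X = \sum_{i \in U} x_i$, and

$$\hat{b} = \Big( \sum_{i \in s} \frac{x_i x_i^{\top}}{\pi_i c_i} \Big)^{-1} \sum_{i \in s} \frac{x_i y_i}{\pi_i c_i}$$

is a weighted least-squares estimate of the coefficients of a linear regression model of $y_i$ on $x_i$. The constants $c_i$ can be introduced given heterogeneous regression errors; it is also common to set $c_i \equiv 1$.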
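As a minimal sketch of how estimator (1) can be computed, the following NumPy function combines the HT estimates with a weighted least-squares fit. The function name `greg_total`, its argument names and the toy population are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def greg_total(y, x, pi, x_total, c=None):
    """GREG estimate of the y-total from sample data.

    y: (n,) sample values; x: (n, p) sample auxiliary vectors;
    pi: (n,) inclusion probabilities; x_total: (p,) known population x-total;
    c: (n,) optional variance constants (defaults to 1 for every unit).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    pi = np.asarray(pi, dtype=float)
    c = np.ones_like(pi) if c is None else np.asarray(c, dtype=float)

    w = 1.0 / (pi * c)                       # weights of the least-squares fit
    b_hat = np.linalg.solve((x * w[:, None]).T @ x, (x * w[:, None]).T @ y)

    y_ht = np.sum(y / pi)                    # HT estimate of the y-total
    x_ht = np.sum(x / pi[:, None], axis=0)   # HT estimate of the x-total
    return y_ht + (np.asarray(x_total, dtype=float) - x_ht) @ b_hat

# Toy check: when y is exactly linear in x, the GREG estimate equals the true total.
rng = np.random.default_rng(0)
N, n = 1000, 50
u = rng.uniform(size=N)
x_pop = np.column_stack([np.ones(N), u])     # intercept plus one covariate
y_pop = 2.0 + 3.0 * u
s = rng.choice(N, size=n, replace=False)     # simple random sample without replacement
pi = np.full(n, n / N)
estimate = greg_total(y_pop[s], x_pop[s], pi, x_pop.sum(axis=0))
true_total = y_pop.sum()
```

The toy check uses a population in which the assisting model holds without error, so the regression adjustment removes the sampling error entirely; with real data the gain depends on how well the auxiliary values predict the study variable.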

To calculate the GREG estimator (1), one needs the $x$-values for each sample unit. However, it may be impossible to match the sample and the auxiliary database perfectly, because one does not have a common, unique identifier in both sources. Record linkage based on linkage key variables (e.g. Fellegi and Sunter, 1969; Herzog et al., 2007; Christen, 2012; Harron et al., 2015), such as name and birth date, will be imperfect if some of them are incorrectly recorded in either source, so that the $y$- and $x$-values paired by a link may not actually refer to the same unit. This creates a problem for GREG estimation whenever the auxiliary data cannot be perfectly matched.

While there exists a growing literature on linear regression analysis based on linked datasets under the frequentist framework of inference (see e.g. Lahiri and Larsen (2005), Chambers (2009), Chambers and Da Silva (2020) and Zhang and Tuoto (2020)), our perspective here is different. The interest is not the regression relationship itself. The aim is to utilise the auxiliary information to improve the efficiency of population total (or mean) estimation, where the linear model plays the role of an assisting model (e.g. Särndal, 1992), and the inference is based on the sampling design rather than the model. For instance, in regression analysis the auxiliary population total is of little consequence, whereas it is of central importance to GREG estimation, and the ostensible total of the $x$-values in the auxiliary database, denoted by over , cannot be used directly when the matching between and is incomplete, i.e. they are not one-one matched in truth.

To the best of our knowledge, Breidt et al. (2018) is the only previous work that addresses the problem from our perspective. In their motivating example, the population consists of recreational fishing boat trips along the Atlantic Coast of South Carolina each year, and the $y$-value is the catch on each boat trip. The auxiliary database consists of the boats' logbooks (including catch data), which are required to be reported to the South Carolina Department of Natural Resources. The quality of record linkage is rather poor, and one cannot be sure that all the trips are reported in the logbooks. Breidt et al. (2018) consider a difference estimator, which makes use of multiple links for the sample trips. Although the estimator is biased, one is able to reduce the mean squared error (MSE) of estimation. The difference predictor is a special case of the GREG predictor given fixed regression coefficients. As these authors point out, there is a need for developing methods which allow the predictor to be estimated from the sample actually observed.

We shall develop three types of GREG estimators and their approximate variance estimators, when the matching between the population and the auxiliary database is incomplete and record linkage between them is imperfect. The conditions for design-consistent estimators are specified, which can be tested given the observed sample and links, if the conditions cannot be verified directly based on linking the entire population and auxiliary database. Thus, the MSE of estimation can be estimated.

In Section 2 we outline the underlying linkage structure of the problem and the inference framework. The GREG-estimators are developed in Section 3. A simulation study is used to study the relative merits of the proposed estimators in Section 4, also in comparison to the HT estimator that ignores the auxiliary information and the hypothetical ideal GREG estimator. Some conclusions and final remarks are given in Section 5.

2 Entity ambiguity and inference framework

Imperfect matching between separate data files arises from the ambiguity surrounding the set of unique entities underlying these data files. Record linkage, or entity resolution, results in one or several links between a record in one file and the records in another. A link between a pair of records is a match if the records refer to the same entity; otherwise the link is false. False links and missed matches are caused by errors in the key variables used for record linkage, in the absence of a true identifier (i.e. a perfect key variable). While the formulation can be extended to include duplicated records in the same file, we shall assume that duplicates are absent in the following.

Denote by the matches between the population and the auxiliary database , where in is the matching record of in . Let be the size of , which may differ from , e.g. if and are not one-one correspondent in terms of the matches. Let the population (set of) links be given as

where contains the records in linked to unit , with cardinality , and contains the units in that are linked to record , and may be empty for some . Let be the cardinality of , where if is empty.

Some explanations are needed for this set-up. In a situation where one is able to link and , one can easily impose a restriction that any record is linked to at least one unit in as well. However, in practice, it is often the case that one is only able to link the sample units in to , for , but not the records in to , because the key variables are only observed in but not . This is indeed the situation considered by Breidt et al. (2018). Thus, to ensure general applicability, we assume that the direction of linkage is from to , which allows one to ensure that each unit is linked to at least one record in , no matter how likely (or unlikely) one judges that a link may be correct. That is, for any given , one finds one or more records in that can be linked to it, but one does not look for the units in that can be linked to any given record . It follows that may be empty for some records in . (Of course, the methods developed in this paper remain applicable if all are non-empty.)

Given the population links from to , is fully observed for any sample unit in , as well as the sample links . By contrast, for any may not be fully observed in , based on which one only observes for . The example below provides an illustration.

Example

Let and . The records in are enumerated as they are known in , and in parentheses according to their unknown matches in , where the matches are shown as dashed lines. Notice that and are not one-one correspondent in terms of the matches, despite . The population unit in is an unmatched unit and the record presents an unmatched entity in . The population links are given by the solid arrows.

[Diagram: the records in A (top row) and the units in U (bottom row); dashed lines are matches, solid arrows are links.]

A: ℓ=1 (ι_2), ℓ=2 (ι_1), ℓ=3 (ι_3), ℓ=4 (ι_4), ℓ=5 (ι_5), ℓ=6 (-)

U: i=1: match ℓ=2; link to ℓ=2
   i=2: match ℓ=1; link to ℓ=2
   i=3: match ℓ=3; links to ℓ=3, ℓ=4
   i=4: match ℓ=4; links to ℓ=3, ℓ=4, ℓ=5
   i=5: match ℓ=5; link to ℓ=6
   i=6: no match; link to ℓ=6

Let the sample be from . The sample links are

such that , and . These are fully observed in . Moreover, we observe for , where , and . However, any observed can differ from , since it is possible for other units in to be linked to the records in , such as . Finally, for these sample units, the missing match is for ; the false links are for , for , and and for .
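The bookkeeping in this example can be sketched in a few lines of Python. The link structure below is read off the example diagram (unit 1 linked to record 2, unit 2 to record 2, unit 3 to records 3 and 4, unit 4 to records 3, 4 and 5, units 5 and 6 to record 6), and the sample `{2, 3, 4}` is a hypothetical choice made only for illustration; the names `alpha`, `beta` and `match` are likewise assumptions of this sketch.

```python
# Links from each unit i in U to records in A, as read from the example diagram.
alpha = {1: {2}, 2: {2}, 3: {3, 4}, 4: {3, 4, 5}, 5: {6}, 6: {6}}
# True matching record of each unit (unit 6 has no match in A).
match = {1: 2, 2: 1, 3: 3, 4: 4, 5: 5}

def reverse_links(links, records):
    """For each record, collect the units linked to it (possibly none)."""
    linked_units = {l: set() for l in records}
    for i, linked in links.items():
        for l in linked:
            linked_units[l].add(i)
    return linked_units

beta = reverse_links(alpha, range(1, 7))      # population view, unknown under setting II

sample = [2, 3, 4]                            # hypothetical sample
sample_records = set().union(*(alpha[i] for i in sample))
# What the sample reveals about each linked record: only the sampled units linked to it.
beta_obs = {l: {i for i in sample if l in alpha[i]} for l in sample_records}

false_links = sorted((i, l) for i in sample for l in alpha[i] if match.get(i) != l)
missing_matches = sorted((i, match[i]) for i in sample
                         if i in match and match[i] not in alpha[i])
```

For this hypothetical sample, record 2 is observed to be linked to unit 2 only, although in the population it is also linked to unit 1, which is the kind of discrepancy between observed and population link sets described above.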


Generally, in the presence of entity ambiguity, we have , where the false links are , and the missing matches are . For the methods of GREG estimation given imperfectly matched auxiliary data and the associated uncertainty assessment, we shall treat and all the associated linkage key variables as fixed, whatever the underlying mechanism that has generated the key-variable errors and the chosen linkage method. Hence, the population links are fixed as well. The expectation and variance of an estimator will be evaluated only with respect to the sampling design.

3 Estimators

We consider two settings: (I) linkage from to is possible and is known, (II) linkage is only possible from to and one observes only associated with . Below we first consider a class of estimators that are only feasible under the first setting, and then two classes of estimators that are feasible under both settings.

3.1 Setting-I: given

We observe fully for any , since is known. Let be the incidence weights that are non-negative constants of sampling, where for , and

(2)

In the special case of for , the weights are referred to as the multiplicity weights. One can vary subject to the constraint (2), e.g. based on the comparison scores used for record linkage (Fellegi and Sunter, 1969). In any case, the weights are constants of sampling given , , and the associated key variables.

Let be the constant auxiliary value for each , which is given by

Notice that we have , if for all , since

by virtue of (2). However, we do not assume this to be the case generally. For an illustration using the Example above, given the sample , we have for where for , and and , where and and . In particular, the multiplicity weights are given by , since , and are all of size 2. The population total is given by .
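A small numerical sketch may help to see why constraint (2) preserves the auxiliary total over the linked records. It uses the link structure read from the example diagram together with made-up record values (`x_record` is hypothetical), and takes multiplicity weights to mean that each linked record is shared equally among the units linked to it.

```python
# Links from units to records, as read from the example diagram.
alpha = {1: [2], 2: [2], 3: [3, 4], 4: [3, 4, 5], 5: [6], 6: [6]}
# Hypothetical auxiliary values of the six records in A.
x_record = {1: 10.0, 2: 20.0, 3: 30.0, 4: 40.0, 5: 50.0, 6: 60.0}

# Number of units linked to each record (records without links do not appear).
n_units = {}
for i, linked in alpha.items():
    for l in linked:
        n_units[l] = n_units.get(l, 0) + 1

# Multiplicity weights: for every linked record, the weights over its units sum to one.
weights = {(i, l): 1.0 / n_units[l] for i, linked in alpha.items() for l in linked}

# Constructed auxiliary value of each unit: weighted sum over its linked records.
x_constructed = {i: sum(weights[i, l] * x_record[l] for l in linked)
                 for i, linked in alpha.items()}

total_constructed = sum(x_constructed.values())
total_linked_records = sum(x_record[l] for l in n_units)
total_all_records = sum(x_record.values())
```

Here the constructed total equals the total over the linked records (200), but not the total over all of A (210), because record 1 is not linked to any unit. This is the situation in which the two totals differ.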

We observe the population total given and . A population incidence (PI) GREG estimator can be given by (1) based on instead of , i.e.

(3)

where . Let . The variance of the PI-GREG estimator is approximately given by that of

The estimator (3) is design-consistent as , provided the ideal GREG estimator (1) is consistent. This is a main advantage of and being known under setting-I.

3.2 Setting-II: given

Suppose only is observed over , where . Since we observe only but not necessarily for any , the incidence weights by (2) are unknown. This is the situation considered by Breidt et al. (2018), who set heuristically according to the assessed quality of the links in . Now, provided , i.e. all the matches are among the links in although one does not know them all, one may let

be the probability that a link is the match for , so that the constraint (2) is satisfied. However, one still would not know the total of the corresponding , as long as is unknown. Moreover, the probability above cannot be calculated correctly for all without knowing the other links , even if the error mechanism of the key-variables were known. Therefore, this is not a viable option. Below we consider two types of estimators, where is fully determined given the observed for any .

3.2.1 Reverse incidence weights

Let the reverse incidence weights be such that, for each , we have

(4)

While the incidence weights (2) sum to one for any record in with , the weights (4) sum to one in the opposite direction over for any unit . Hence, the adjective reverse. While the incidence weights require the knowledge of the population links , the reverse incidence weights are always available given the sample links .
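The contrast between the two directions of summation can be illustrated with a short sketch. It assumes equal reverse incidence weights over the links of each sample unit, which is only one admissible choice, and uses hypothetical sample links and record values.

```python
# Hypothetical sample links (unit -> linked records) and record values.
sample_links = {2: [2], 3: [3, 4], 4: [3, 4, 5]}
x_record = {2: 20.0, 3: 30.0, 4: 40.0, 5: 50.0}

# Equal reverse incidence weights: for every sample unit, the weights over its links sum to one.
weights = {(i, l): 1.0 / len(linked)
           for i, linked in sample_links.items() for l in linked}

# Sum over the links of each unit (always one by construction).
row_sums = {i: sum(weights[i, l] for l in linked)
            for i, linked in sample_links.items()}

# Weighted auxiliary value of each sample unit, available from the sample links alone.
x_weighted = {i: sum(weights[i, l] * x_record[l] for l in linked)
              for i, linked in sample_links.items()}

# Sum over the units linked to record 3: not constrained to be one.
col_sum_record_3 = sum(w for (i, l), w in weights.items() if l == 3)
```

Nothing beyond the links of the sampled units is needed to compute these weights, whereas the sum over the units linked to a given record (here 1/2 + 1/3 for record 3) is left unconstrained.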

Let be the constructed $x$-value of based on the reverse incidence weights. Again, one may define the weights according to the relative plausibility of the links in based on record linkage. The weights are then constants of sampling given , , and the associated key variables. Let . When is known, a population reverse incidence (PRI) GREG estimator can be given as

where and . The variance of the PRI-GREG estimator is approximately given by that of

In the general case where is unknown because is unknown, the sample reverse incidence (SRI) GREG estimator of is given as

(5)

where is the mean of the -values over . Writing as a linear estimator with the sample weights , we have . The SRI-GREG estimator has the same approximate variance as the PRI-GREG estimator, since the first-order approximations of the two only differ by a constant where . However, insofar as , the estimator (5) will be biased under repeated sampling. An additional assumption is needed for design-consistency, i.e.

(6)

Intuitively, this assumption may seem reasonable, as long as the errors of the linkage key variables associated with and are unrelated to the -values in . A more detailed condition will be given later in Section 3.2.2. For the moment, notice that in practice the assumption can be tested based on the observed statistic .

A special case of the reverse incidence weights is worth mentioning. For each , let be the best link among all . Let and for any other . It should be pointed out that this is not a special case of incidence weights (2), since is chosen among not and it is possible for any given to be the best link for more than one unit in . Let be the best-link -value of . This can be relevant for secondary users, who are only given these best-link auxiliary values, but have no access to the other links because the linkage is performed by another party.
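A brief sketch of the best-link special case, using hypothetical sample links, record values and best-link choices, shows why these weights need not satisfy constraint (2): one record can be the best link of several units.

```python
# Hypothetical sample links, record values and best-link choices.
sample_links = {2: [2], 3: [3, 4], 4: [3, 4, 5]}
x_record = {2: 20.0, 3: 30.0, 4: 40.0, 5: 50.0}
best_link = {2: 2, 3: 4, 4: 4}      # record 4 is the best link of units 3 and 4

# Best-link weights: one for the best link of a unit, zero for its other links.
weights = {(i, l): 1.0 if l == best_link[i] else 0.0
           for i, linked in sample_links.items() for l in linked}

# Best-link auxiliary value of each unit.
x_best = {i: x_record[best_link[i]] for i in sample_links}

# Total weight received by each record across the units linked to it.
col_sums = {}
for (i, l), w in weights.items():
    col_sums[l] = col_sums.get(l, 0.0) + w
```

Each unit's weights sum to one, as reverse incidence weights must, while record 4 receives a total weight of two and records 3 and 5 receive none.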

Let . The sample best-link (SBL) GREG estimator of is given as

(7)

where and . The variance of the SBL-GREG estimator is approximately given by that of

The estimator (7) is design-consistent, provided , as , and the assumption can be tested based on the observed statistic .

3.2.2 GREG over

Let be the number of links in . Let and be the total and mean of over , respectively. Let be a vector of coefficients of the same dimension as . Let

where , and , and . Clearly, is design-consistent for if , which can be tested based on the observed statistic . Below we first consider this condition in more detail, and then the estimation of given the sample links.

Let be the incidence weights (2) associated with . Let be the link-projection of onto , containing the linked records in . Let if and 0 otherwise. We assume and , as . We say and are non-informative of the -values asymptotically, as , if

(8)

In other words, as , the empirical covariance of and over tends to 0, as well as that of and over . We have then , since and in the first part of (8), and due to the second part of (8). It follows that , as , and the estimator above is consistent.

Moreover, we have , if the first part of (8) holds when are the reverse incidence weights (4), where and . Thus, the condition (6) for the SRI-GREG estimator (5) is satisfied if the reverse incidence weights are non-informative of the -values in addition to (8). In reality, requiring the first part of (8) to hold for both types of weights is essentially the same as requiring it to hold for either type of weights, since it is hard to imagine a situation where the condition holds only for one type of weights but not the other type.

To reveal the estimator of , we observe that

given any reverse incidence weights (4). Thus, can be set according to GREG of on over , where . A sample link-set (SLS) GREG estimator of is

(9)

where , and , since is the inclusion probability of a link in . The target of over sampling is , where .

The first-order Taylor expansion of the SLS-GREG estimator (9) is given by

where , and is the sum of population link-set GREG residuals over , for . The extra term arises from the estimator . Given (8), an approximate variance estimator can be that of .

3.3 Relative efficiency

Of the three types of GREG estimators above, the PI-GREG estimator (3) is based on the incidence weights (2), whereas the SRI-GREG estimator (5) and the SLS-GREG estimator (9) are based on the reverse incidence weights (4); the first two are based on GREG over , and the last one is based on GREG over . A key factor for the relative efficiency is the covariance between the dependent and independent variables of the regression.

Consider the simple linear regression model as the assisting model, where the model covariance is a scalar that is easy to comprehend. The PI-GREG estimator depends on the covariance between and , the SRI-GREG estimator on that between and , and the SLS-GREG on that between and . For any given and choice of over , we have

where refers to the model covariance between and the matched , and for two different units . Thus, given the presence of false links and the fact that , the model covariance is reduced given imperfectly matched auxiliary values, and the population GREG slope coefficient will be attenuated towards 0 compared to that given matched auxiliary values. This is the main reason why GREG estimation given imperfectly matched auxiliary values will lose efficiency compared to the ideal situation where the matches are known.
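The attenuation can be seen in a small simulation. This is a sketch under simplifying assumptions that are not taken from the paper: each unit has a single link, the link is the match with a fixed probability (`match_rate`), and a false link points to the auxiliary value of a randomly chosen unit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)   # true slope is 2

def ols_slope(x, y):
    """Slope of the simple linear regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

match_rate = 0.6                                    # hypothetical share of correct links
is_match = rng.random(n) < match_rate
# False links pick up the x-value of a randomly chosen unit, unrelated to the unit's own y.
x_linked = np.where(is_match, x, rng.permutation(x))

slope_matched = ols_slope(x, y)
slope_linked = ols_slope(x_linked, y)
```

In this set-up the covariance between y and the linked auxiliary value is roughly the matched covariance times the share of correct links, so the fitted slope drops from about 2 to about 1.2, while still remaining well away from 0.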

Meanwhile, what is important in practice is whether using the constructed auxiliary values based on the sample links can still improve the efficiency compared to the HT-estimator that ignores the auxiliary information altogether. It is possible to equate the HT-estimator with the GREG estimator that uses an intercept-only assisting model, where the covariance between and the constant independent variable is zero by definition. Using either the incidence weights or the reverse incidence weights, we have

Thus, even though one does not know which links are the matches, as long as the links can cover a certain amount of matches, GREG estimation that makes appropriate use of the auxiliary data via the links still has the capacity to improve the HT-estimator.

It is more difficult to draw general conclusions regarding the relative efficiency of the different types of GREG estimator. Take for instance the PI-GREG and the SRI-GREG estimators. The population GREG residual is under the former, where , the residual is under the latter, where . Although both and are weighted sums of over the same , the weights sum to 1 for any in for the former whilst they sum to 1 for in for the latter. The relative magnitude of the residuals cannot be determined generally for each on its own, because it also depends on how the other units are linked. In the next section, we shall use a simulation study to explore the relative efficiency of the different estimators.

4 Simulation study

4.1 Set-up

First, we generate a set of values , where

The model variance of is for . We present the results under simple random sampling without replacement, where the sample size is . Size-related unequal probability sampling does not yield any extra insight regarding the relative efficiency of the different estimators, because their relative merits are chiefly determined by how the population links are distributed over , regardless of the sampling design.

We let , so that we can easily calculate the ideal GREG estimator (1). The population matches and links are generated according to the parameters below.

  • Let be the proportion of population units with , where and . For example, if , then 40% of the units in have only one link to , 30% of them have two links and the rest 30% have 3 links.

  • Let be the proportion of units in that have a match in , where . By setting , one can emulate the general situation where and are not one-one correspondent in terms of the matches, and the ideal GREG estimator that uses all is unattainable in reality even if one knew all the matches. We let all the unique links be matches; the other units with matches are randomly selected, independently of whether a unit has 2 or 3 links. For each with , all its false links are randomly selected from .

  • Let be the proportion of units in whose matches are identified as the best links, where . Setting implies that the best-link choice is perfect given , in which case the SBL-GREG estimator (7) reaches its maximum potential. Setting means that all the known correct links are presented as the unique links. Using the SBL-GREG estimator is then unlikely to be a good option, because one could have obtained additional correct links among the units just by guessing randomly. Thus, the SBL-GREG estimator improves as varies from to . The units with and correct best links are randomly chosen from the relevant units. The best links for the remaining units are randomly chosen among the relevant false links in the respective .
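The first two steps of this link-generation scheme can be sketched as follows. The population size, the 40/30/30 split of link counts (taken from the example in the first bullet) and the number of matched units are placeholder values, and the sketch assumes that the match of unit `i` is the record with the same index; the selection of best links is not covered.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000                                   # hypothetical population size
# Number of links per unit: 40% with one link, 30% with two, 30% with three.
n_links = rng.permutation(np.repeat([1, 2, 3], [400, 300, 300]))
n_matched = 800                            # hypothetical number of units whose match is among their links

# Units with a unique link are matched; the remaining matched units are drawn at random
# from the units with 2 or 3 links, regardless of how many links they have.
has_match = n_links == 1
extra = rng.choice(np.flatnonzero(~has_match),
                   size=n_matched - int(has_match.sum()), replace=False)
has_match[extra] = True

links = []
for i in range(N):
    n_false = int(n_links[i]) - int(has_match[i])
    others = np.delete(np.arange(N), i)    # every record except the match of unit i
    false = rng.choice(others, size=n_false, replace=False).tolist() if n_false > 0 else []
    links.append(([i] if has_match[i] else []) + false)
```

Each entry of `links` lists the records linked to one unit, with the match placed first whenever it is included.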

Given each sample , we calculate the following estimates and their variance estimates.

  • The HT-estimator, and the ideal GREG estimator (1), or simply Ideal.

  • The subsample GREG-estimator, or simply Sub, which is only based on the sample units with , i.e. with known correct links. This is a practical option, because in most applications of record linkage one can identify a subset of unique links that are virtually error-free, no matter how large or small this subset is in a given situation.

  • The PI-GREG-estimator (3) with multiplicity weights or unequal incidence weights as explained below, designated as PI- and PI-, respectively.

  • The SBL-GREG estimator (7), or simply SBL, and the SRI-GREG estimator (5) with reverse incidence weights as explained below and designated as SRI-.

  • The SLS-GREG estimator (9), or simply SLS, with the same weights as SRI-.

For SRI-, the reverse incidence weight (4) assigned to the best link is in cases of , where , and for the other links in .

  • For a unit with , setting would mean that one has no plausible guess which of the two links is more likely to be correct; for a unit with , the indifferent choice would be . For an easy presentation without unnecessary finesse, we simply set whether of or 3, which refers to a choice where the weights are more or less indifferent over the multiple links of unit in .

  • Of course, in cases where is much higher than , it is no longer reasonable to set . To take advantage of the knowledge of linkage quality, one can raise the value of , according to the proportion of units with correct best link given , which is given by . For example, if , then setting around is not an unnatural choice in practice.

For the incidence weights (2), the multiplicity weight is the indifferent choice. For unequal weights of PI- when , we proceed as follows: if the matched population unit is in , assign the value to the match, where , and to the other links in ; otherwise, assign to a randomly selected link, and to the others. The value of can be large, if one has good knowledge of the linkage quality, such as when . Setting a lower value of , e.g. , emulates a situation where one has only vague ideas about the linkage quality. Notice that, given how the population links are generated above, the range of over is greater than that of over , although a large majority of records in still have between 0 and 3.

Finally, based on independent samples, the Monte Carlo expectation and variance of an estimator, generically denoted by for , are given by

We obtain the MSE of the estimator accordingly. Moreover, the Monte Carlo expectation of the associated variance estimator, denoted by for , is given as

4.2 Results

The population values of are generated with , where . The sample size is . Let the population mean be the target of estimation.

For the results in all the tables, SE is the square root of of the corresponding estimator and ESE the square root of . The variance estimator of an estimator works well, if its SE and ESE are close to each other. The relative efficiency (RE) of an estimator is given by the ratio between its variance and that of the HT-estimator, whereas RMSE designates the ratio of their MSEs. The bias of an estimator is small compared to its variance if its RMSE and RE are close to each other.
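The summary measures reported in the tables can be computed from the simulation replicates along the following lines. This is a sketch: the function name `mc_summary` and its arguments are illustrative, and the use of the divisor B-1 for the Monte Carlo variance is an assumption, since either divisor gives practically the same result for a large number of replicates.

```python
import numpy as np

def mc_summary(est, var_est, est_ht, target):
    """Summarise replicates of an estimator against those of the HT-estimator.

    est: replicate estimates; var_est: replicate variance estimates;
    est_ht: replicate HT estimates; target: true value of the estimand.
    """
    est = np.asarray(est, dtype=float)
    var_est = np.asarray(var_est, dtype=float)
    est_ht = np.asarray(est_ht, dtype=float)

    mc_var = est.var(ddof=1)                       # Monte Carlo variance (divisor B-1 assumed)
    mc_mse = mc_var + (est.mean() - target) ** 2   # variance plus squared Monte Carlo bias
    ht_var = est_ht.var(ddof=1)
    ht_mse = ht_var + (est_ht.mean() - target) ** 2

    return {
        "SE": np.sqrt(mc_var),                     # Monte Carlo standard error
        "ESE": np.sqrt(var_est.mean()),            # root of the mean variance estimate
        "RE": mc_var / ht_var,                     # ratio of variances
        "RMSE": mc_mse / ht_mse,                   # ratio of MSEs, not a root MSE
    }

# Tiny deterministic illustration with three replicates.
summary = mc_summary(est=[1.0, 2.0, 3.0], var_est=[1.0, 1.0, 1.0],
                     est_ht=[0.0, 2.0, 4.0], target=2.0)
```

Because RE is a ratio of variances, it equals the squared ratio of the SE values; for example, SE values of 0.148 and 0.204 correspond to an RE of about 0.53.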

The columns in all the tables refer to the different estimators by their shorthands given above. We simply set for GREG over , and for GREG over .

4.2.1 Low linkage quality

Table 1 provides a set of results in a situation where the linkage quality is very low. We have , such that one is only confident about 20% of the units whose links are matches. Next, we have , such that the matches are missed from the relevant links for 60% of the population units. The choice of the best link deteriorates as decreases from to . We set for all the results in Table 1, which is not unreasonable given the low linkage quality here.

Block 1
       HT     Ideal  Sub    PI- (mult.)  PI- (uneq.)  SBL    SRI-   SLS
SE     0.204  0.148  0.328  0.204        0.204        0.197  0.198  0.201
ESE    0.203  0.146  0.327  0.201        0.201        0.194  0.195  0.199
RE     1      0.525  2.595  1.002        1.001        0.930  0.939  0.968
RMSE   1      0.525  2.596  1.002        1.001        0.931  0.940  0.968

Block 2
       HT     Ideal  Sub    PI- (mult.)  PI- (uneq.)  SBL    SRI-   SLS
SE     0.205  0.148  0.332  0.205        0.205        0.203  0.199  0.203
ESE    0.204  0.146  0.325  0.202        0.202        0.198  0.195  0.200
RE     1      0.522  2.623  0.999        1.001        0.977  0.943  0.977
RMSE   1      0.522  2.625  0.999        1.001        0.979  0.944  0.978

Block 3
       HT     Ideal  Sub    PI- (mult.)  PI- (uneq.)  SBL    SRI-   SLS
SE     0.202  0.146  0.332  0.202        0.202        0.201  0.194  0.198
ESE    0.203  0.147  0.325  0.201        0.201        0.200  0.194  0.199
RE     1      0.522  2.691  1.000        0.998        0.984  0.924  0.959
RMSE   1      0.522  2.692  1.000        0.998        0.985  0.924  0.960

Table 1: Results given low linkage quality, , ,

First, since HT and Ideal do not depend on , their Monte Carlo variance and MSE all have the same expectations in Table 1, such that the variations across the three blocks reflect directly the magnitudes of the Monte Carlo simulation errors. It is seen that the results are reliable within a range of . Although the variation is greater for Sub, as it is only based on about 20 sample units, it is clearly the least efficient estimator here.

Next, as can be expected, the performance of SBL worsens as decreases. Its RE is about 1 when . However, since is unlikely to be as low as in practice, one can still expect it to be slightly more efficient than HT.

Given constant in Table 1, only small variations of the results can be detected across the three blocks, regarding the variance and MSE of each of the other estimators. It follows that the population variations of and best links across the blocks do not affect the following conclusions based on these results. Using the incidence weights, PI- and PI- do not yield any gains over HT, although they are more difficult and costly to implement because they require the knowledge of . Between SRI- and SLS, both based on the reverse incidence weights, the former is somewhat more efficient. In particular, SRI- is able to improve HT, even when is as low as , whereas it is about as efficient as SBL when and the latter is at its best.

Comparing RE and RMSE, one can see that the bias is negligible compared to the variance for SBL, SRI- and SLS that only require the sample links . Finally, comparing SE and ESE, one can see that the variance estimators work well in all the cases.

Block 1
       HT     Ideal  Sub    PI- (mult.)  PI- (uneq.)  SBL    SRI-   SLS
SE     0.206  0.149  0.333  0.199        0.198        0.171  0.186  0.192
ESE    0.204  0.146  0.324  0.194        0.194        0.168  0.183  0.190
RE     1      0.524  2.622  0.933        0.932        0.691  0.818  0.876
RMSE   1      0.524  2.624  0.933        0.932        0.691  0.822  0.878

Block 2
       HT     Ideal  Sub    PI- (mult.)  PI- (uneq.)  SBL    SRI-   SLS
SE     0.206  0.149  0.325  0.198        0.193        0.172  0.174  0.191
ESE    0.204  0.146  0.326  0.195        0.189        0.169  0.172  0.190
RE     1      0.519  2.481  0.924        0.872        0.694  0.716  0.861
RMSE   1      0.519  2.482  0.924        0.873        0.697  0.719  0.861

Block 3
       HT     Ideal  Sub    PI- (mult.)  PI- (uneq.)  SBL    SRI-   SLS
SE     0.206  0.149  0.333  0.199        0.199        0.205  0.186  0.194
ESE    0.204  0.147  0.328  0.195        0.195        0.201  0.182  0.191
RE     1      0.519  2.599  0.931        0.930        0.983  0.815  0.883
RMSE   1      0.519  2.604  0.931        0.930        0.983  0.818  0.884

Table 2: Results given low linkage quality, , ,

Table 2 provides another set of results, where remains the same but is increased to 0.8, such that 80% of the population units now have their matches included in the links, although one can only be confident that about 20% of the sample units are linked correctly. This provides a scenario where one can possibly have good knowledge of the linkage quality, although the available linkage key variables are rather noisy. Sub cannot improve given the same . SBL is much better when , where it uses correctly matched auxiliary data for 80% of the units, and its RE is about 0.69 in Table 2 compared to 0.93 in Table 1 when . But the gain easily evaporates as decreases towards . Although the results for PI- and PI- are better than before, they are still dominated by SRI- and SLS based on the reverse incidence weights, and the relative merits of the latter two follow the same pattern as before. Again, the bias is negligible compared to the variance and the variance estimators work well in all the cases.

A remark is in order regarding the second block of results in Table 2. Now that is much higher than , it is no longer reasonable to set , where the weights are more or less indifferent over the multiple links. To take advantage of the good knowledge of linkage quality, one can raise the value of . Setting is not hard to justify here, given . While this clearly improves the results for SRI-, where the RE is 0.71 against 0.82 given in the first block, it does not have any noteworthy effect for SLS. In the case of SLS, GREG is over instead of , and it seems more difficult to assign the unequal weights sensibly for this estimator. PI- is also clearly better given larger , where its RE is 0.87 compared to 0.93 in the first block where , although the improvement is not as large as for SRI-.

Finally, GREG estimation is much more efficient than the HT-estimator, at least in these results, even when one can only be certain that about 20% of the sample units are correctly matched, as long as covers a large part of . For instance, the SRI-GREG estimator achieves about 20% variance reduction in the last block, only based on more or less indifferent reverse incidence weights for the units with multiple links.

4.2.2 Better linkage quality

Two more sets of results are given in Table 3, given better linkage quality than above. For the first two blocks of results, we have and , such that 90% of the population units have matches among the links, although one is only certain about nearly half of them. This will be referred to as the medium linkage quality scenario. For the last two blocks, we have and , such that only 2% of the population units have missing matches in , and nearly 80% of the matches are given as unique links. This will be referred to as the high linkage quality scenario.

The following features are essentially the same as the results in Tables 1 and 2 given low linkage quality. In all the cases, the bias is negligible compared to the variance and the variance estimators work well. PI- and PI- using incidence weights are still largely dominated by SRI- and SLS using reverse incidence weights. For the former two estimators, PI- can improve PI- given good knowledge of the linkage quality, i.e. when ; for the latter two estimators, SRI- is still better than SLS.

Block 1
       HT     Ideal  Sub    PI- (mult.)  PI- (uneq.)  SBL    SRI-   SLS
SE     0.205  0.149  0.232  0.194        0.183        0.160  0.164  0.184
ESE    0.203  0.146  0.233  0.191        0.181        0.157  0.160  0.181
RE     1      0.532  1.278  0.900        0.800        0.612  0.638  0.805
RMSE   1      0.532  1.279  0.901        0.800        0.615  0.641  0.806

Block 2
       HT     Ideal  Sub    PI- (mult.)  PI- (uneq.)  SBL    SRI-   SLS
SE     0.204  0.147  0.229  0.191        0.192        0.183  0.173  0.184
ESE    0.203  0.146  0.233  0.190        0.191        0.181  0.173  0.183
RE     1      0.519  1.262  0.878        0.891        0.804  0.723  0.810
RMSE   1      0.519  1.263  0.879        0.892        0.804  0.723  0.811

Block 3
       HT     Ideal  Sub    PI- (mult.)  PI- (uneq.)  SBL    SRI-   SLS
SE     0.201  0.145  0.164  0.174        0.155        0.148  0.149  0.162
ESE    0.204  0.146  0.165  0.175        0.155        0.149  0.150  0.164
RE     1      0.526  0.667  0.753        0.593        0.547  0.548  0.651
RMSE   1      0.526  0.675  0.753        0.593        0.549  0.549  0.651

Block 4
       HT     Ideal  Sub    PI- (mult.)  PI- (uneq.)  SBL    SRI-   SLS
SE     0.205  0.149  0.166  0.179        0.182        0.162  0.158  0.165
ESE    0.204  0.146  0.165  0.174        0.178        0.160  0.156  0.164
RE     1      0.532  0.660  0.763        0.793        0.625  0.596  0.652
RMSE 1