Multiple system estimation using covariates having missing values and measurement error: estimating the size of the Māori population in New Zealand

by   Peter G. M. van der Heijden, et al.

We investigate use of two or more linked registers, or lists, for both population size estimation and to investigate the relationship between variables appearing on all or only some registers. This relationship is usually not fully known because some individuals appear in only some registers, and some are not in any register. These two problems have been solved simultaneously using the EM algorithm. We extend this approach to estimate the size of the indigenous Māori population in New Zealand, leading to several innovations: (1) the approach is extended to four registers (including the population census), where the reporting of Māori status differs between registers; (2) some individuals in one or more registers have missing ethnicity, and we adapt the approach to handle this additional missingness; (3) some registers cover subsets of the population by design. We discuss under which assumptions such structural undercoverage can be ignored and provide a general result; (4) we treat the Māori indicator in each register as a variable measured with error, and embed a latent class model in the multiple system estimation to estimate the population size of a latent variable, interpreted as the true Māori status. Finally, we discuss estimating the Māori population size from administrative data only. Supplementary materials for our article are available online.



page 1

page 2

page 3

page 4


Building a large synthetic population from Australian census data

We present work on creating a synthetic population from census data for ...

Nested Dirichlet Process For Population Size Estimation From Multi-list Recapture Data

Heterogeneity of response patterns is important in estimating the size o...

Bayesian Propagation of Record Linkage Uncertainty into Population Size Estimation of Human Rights Violations

Multiple-systems or capture-recapture estimation are common techniques f...

Common Misconceptions about Population Data

Databases covering all individuals of a population are increasingly used...

Estimation of the number of irregular foreigners in Poland using non-linear count regression models

Population size estimation requires access to unit-level data in order t...

drpop: Efficient and Doubly Robust Population Size Estimation in R

This paper introduces the R package drpop to flexibly estimate total pop...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The use of dual system estimation (DSE, also known as capture–recapture or the Lincoln–Peterson estimator) to estimate the size of a population which cannot be completely observed has become widespread in official statistics, particularly as a key part of making estimates from a population census (e.g. Wolter 1986, Brown et al. 1999, 2019), though also in situations involving the use of linked administrative data sources (Bakker et al., 2015, and references therein; Zhang & Dunne, 2018). The need to make efficient use of data already available to government in the construction of official statistics outputs has led to better access to administrative data, but there are important challenges in making use of it (see, for example, Hand, 2018). Linkage of the records from these sources is being widely used to understand the under– and over–coverage within them and estimate coverage corrections. We will use “registers” as a generic term for all sources containing lists of identifiable units.

When two registers are linked, in general there will be some records which remain unlinked, as there is no corresponding record in the other source. This leads to missing data for any variables which appear in only one register (item missingness). The linked data can be used to estimate the size of the population that is not present in either register, and for these unobserved records all

the variables are missing (unit missingness). There is an extensive line of research that treats this problem from a missing data perspective, starting with Zwane and van der Heijden (2007), and summarized and extended in van der Heijden et al. (2012, 2018). De Waal, van Delden and Scholtus (in press, see also 2019) provide an overview of what they call multi–source statistics, where they categorize this line of research under the subject “Overlapping variables and overlapping units” and “Undercoverage”. Another overview of the field is provided by Zhang and Chambers (2019). The missing data methodology involves the EM algorithm, using a suitable loglinear model specified by the researcher. In the EM algorithm an Expectation step and a Maximization step are alternated until convergence. In the E–step, the expected values of the missing data are derived given the observed data and the current best fitted values found in the preceding M–step. In the M–step, maximum likelihood estimates are found for the chosen model using the completed data from the E–step; see Zwane & van der Heijden (2007) for the full details. Standard errors for the estimates can be calculated using the parametric bootstrap (Buckland and Garthwaite, 1991). Van der Heijden et al. (2018) concluded that further practical experience with these methods is needed to demonstrate their usefulness in a variety of situations and encouraged their wider application.

Here we consider these methods for estimating the size of the Māori ethnic population in New Zealand, and section 2 describes the context of this problem, the registers available and the procedures which have been used to link them. Some of these registers cover only parts of the population, a common situation in official statistics applications of population size estimation. For the estimation of ethnicity in New Zealand this part coverage is related to the age of individuals. Zwane, van der Pal-de Bruin and van der Heijden (2004) approached the problem of registers that do not fully cover the population as a missing data problem. They provided solutions assuming the missingness (i.e. the undercoverage) is ignorable, which is often not a priori unrealistic as the missingness is due to the design of the registers (in contrast to other missing data problems where ignorability is often unrealistic). In the current paper we reframe the part–coverage of a register as a collapsibility problem for a covariate. Using results of van der Heijden et al. (2012), we show under what assumptions we may assume collapsibility over such a covariate, and hence, when the part–coverage of registers may be ignored. This is discussed in section 3.

Then we build up the estimation problem in section 4, starting with two registers, then progressing to three and four registers. In section 4.4 we consider estimates obtained only from the administrative registers, to examine how good estimates would be in the absence of a population census. Section 5 introduces the idea that a different concept of Māori identity is measured in each register because of differences in context or timing, an extension of the idea that the same variable is measured differently in different registers (van der Heijden et al. 2018). The model is extended to include latent classes to represent an underlying concept of Māori/non-Māori. Under this model a single estimate is obtained for the size of the Māori and non-Māori populations (instead of estimates for each register separately). Section 6 concludes by discussing the effectiveness of these approaches and the corresponding estimates of the size of the Māori ethnic population, and how the differences in the estimates derived from the different registers may be interpreted. It also discusses the sensitivity of the approaches to assumptions of perfect linkage and no overcoverage.

2 Māori ethnicity and relevant registers

Ethnicity is the principal measure of cultural identity in New Zealand, and is used across the official statistics system. Identifying the indigenous Māori population is of particular importance due to the partnership and obligations between Māori and the crown established under the Treaty of Waitangi of 1840. Māori are also a key group of policy interest; for example Waldon (2019) discusses the way measurement of ethnicity supports the measurement of health outcomes for indigenous peoples in New Zealand.

Ethnicity is therefore regularly included in statistical and administrative data collections. However, differences in questions or context and changes in perceptions can all affect how ethnicity is recorded in different registers (Simpson et al. 2016). Particularly for indigenous peoples such as the Māori, there is a need to ensure that definitions are created which take into account the ways in which members of these communities perceive themselves (Madden et al. 2019).

Official population estimates and projections for major ethnic groups in New Zealand are based principally on the responses people provide in the five yearly census, adjusted for nonresponse using a post–enumeration survey. As part of its census transformation programme, Statistics New Zealand is exploring the feasibility of a census based on administrative data (Statistics New Zealand 2012, 2014) . The ability to produce ethnicity data from administrative registers is a key consideration. Ethnicity is collected independently in a number of administrative registers as well as through the census. People do not always report the same ethnicity in each register and sometimes do not report their ethnicity at all, so there is an additional missingness problem to deal with.

A key question is how to combine ethnicity from multiple registers, when information is sometimes conflicting. Reid, Bycroft, and Gleisner (2016) compared ethnicity data from the 2013 Census with the ethnicity information collected by administrative registers, for a New Zealand resident population derived from administrative registers. They found that nearly everyone in this administrative data–based New Zealand resident population had ethnicity recorded in at least one administrative register, but that consistency with census responses varied considerably by register and by ethnic group. The method used to combine these registers has a major impact on the result. Under the assumption that census responses provide the best measure for official statistics purposes, a method that ranks registers based on their consistency with the census was applied. Using administrative data alone was found to produce a time series that reflects expected patterns of increasing ethnic diversity, with the age structure and regional distribution of ethnicity consistently in line with official measures (Statistics New Zealand, 2018). We note that, according to Statistics New Zealand’s standard classification of ethnicity, used in many administrative systems, people can and do provide multiple ethnicities. Here we focus on Māori ethnicity so that we have two mutually exclusive categories: Māori (with or without other ethnicities) and non–Māori (everyone else).

Four registers are available, the population census and three administrative registers, each with an ethnicity variable. In this application we use ethnicity information from the linked administrative data to explore alternative ways to produce official ethnic population estimates in New Zealand, and also investigate estimates produced from only the administrative data as a possible replacement for the census. In support of this we analyse a variety of census and administrative registers using the approach of Zwane and van der Heijden (2007), with a specific focus on the estimation of the size of the Māori population at the time of the 2013 population census. The analysis requires the extension of the methods to deal with multiple registers and with a variety of different types of missing data.

The population used here is the experimental administrative–based New Zealand resident population known as the ‘IDI-ERP’ (Statistics New Zealand, 2017a). The IDI-ERP is derived using signs of activity in government sources. Those who have died, or who have moved to live overseas before the reference date are excluded to minimise over-coverage, though some non-residents may remain in the dataset. We are interested in the population size, and therefore will implicitly assess the coverage of different sources within IDI-ERP relative to our estimate of the size of the New Zealand population.

The data are probabilistically linked in Stats NZ’s Integrated Data Infrastructure (IDI). The IDI provides safe access to anonymised linked microdata for research and statistics in the public interest. Data sources in the IDI (including the census) are linked to a central population spine. Perfect linkage is an essential assumption for DSE. An incorrect link could mean that the wrong ethnicity is associated with a person. In this application, if records in the registers have not been linked to the IDI spine, they do not enter the analysis, and become part of the unobserved population for the register.

The three administrative registers are:

  • Department of Internal Affairs (DIA) birth registrations data – which includes the ethnicity of the child as reported at registration

  • Ministry of Education (MOE) tertiary education enrolment data – which includes ethnicity for students

  • Ministry of Health (MOH) National Health Index system, a unified national person list – which includes ethnicity

For a more detailed explanation of these registers, see Reid et al. (2016).

Each of the administrative registers relates to different parts of the population. Birth registrations are for babies born in NZ since 1998, or those up to age 14 in 2013; tertiary education enrolments are available from the late 1990s, and are mainly for those aged between 18 and 40 years in 2013; both census and health data include all ages, and each register has an ethnicity value for around 90 % of the IDI-ERP population. Overall, almost 99 % of the IDI-ERP population have ethnicity information from at least one of these registers, and many people have information from more than one register. Table 1 provides the observed counts for ethnicity, where stands for item missingness (individuals that do not have their ethnicity registered in this register) and x stands for individuals that are not part of a register. For example, in the Census 3,225,804 are registered as non–Māori, 560,427 as Māori, for 20,619 individuals in the census no ethnicity is reported, and 595,140 individuals are missed by the census but appear in at least one of the other registers.

non–Māori 3,225,804 574,077 3,527,874 1,763,463
Māori 560,427 236,673 617,205 405,063
20,619 6,045 188,781 20,424
x 595,140 3,585,195 68,130 2,213,040
TOTAL 4,401,990 4,401,990 4,401,990 4,401,990
Table 1: Summary of Census linked to DIA, MOH and MOE, observed numbers

The aim is to produce aggregate estimates of Māori and non–Māori ethnicity by combining these four independent registers: the 2013 Census and the three administrative registers. In the next section, we first ask the question whether it is a problem that the four registers cover different parts of the population.

3 Using registers that cover different parts of the population

Usually expositions about multiple system estimation assume that each register aims to cover the same population. That is not the case here, where the Census and MOH aim to cover the complete population but DIA and MOE aim to cover only subpopulations. Building on earlier work of Zwane et al. (2004) and van der Heijden et al. (2012) we will argue here that ignoring the fact that DIA and MOE only cover a subpopulation is not problematic for estimating the size of the population. We devote some space to this problem as it is one that is regularly encountered in population size estimation. An alternative modelling approach is presented by Sutherland and Schwarz (2007), and related work is found in di Cecco et al. (2018), van der Heijden and Smith (2020), and di Cecco, di Zio and Liseo (2020).

3.1 Partial coverage and collapsibility, two registers

We now reframe the problem. We show how these two results are related to collapsibility in loglinear population size estimation models described by van der Heijden et al. (2012). See Figure 1, taken from their paper. In all models there are two registers, and . In model there is no covariate. In model there is a covariate that is related to register but not to register

, meaning that inclusion probabilities are heterogeneous across the levels of

for but homogeneous for . In model it is the other way around. In model inclusion probabilities for both and are heterogeneous across the levels of . Van der Heijden et al. show that the total population size estimate is identical in models , and and different from the estimate in . In other words, under models and the three-way array of variables and is collapsible over covariate . Under model there is no collapsibility. They use the concept of a short path to distinguish situations where there is collapsibility from situations where there is not. A short path is a sequence of connected nodes in the graph which does not contain a sub-path; note that it need not be short in the sense of having few nodes. For full details see Whittaker (1990) and van der Heijden et al. (2012). In the covariate lies on a short path between and and therefore one cannot collapse over . In and the covariate does not lie on a short path and in these cases one can collapse over the covariate.

Figure 1: Interaction graphs for loglinear models with two registers and one covariate, taken from van der Heijden at al. (2012)

Now we reframe the results from Zwane et al. (2004) in terms of the results of van der Heijden et al.. In Example 1 there are two regions: register A covers the full population, but register B covers only the north. Here, region can be considered as the covariate . If we assume that the inclusion probabilities for are homogeneous over region, there is no edge from to . Yet the inclusion probabilities for are heterogenous, as these are positive for the north and zero for the south region, and there is an edge from to . By the assumption of homogeneous inclusion probabilities for , region is not on a short path between and . Hence we can collapse over region – in other words, ignore the problem that register only covers the north region. This is the graph in model in Fig. 1. Example 2 has three regions, with A covering the north and central regions, and B covering the central and south regions. This situation is described by model : has heterogeneous inclusion probabilities over the levels of region as the inclusion probabilities are positive for the north and the middle but zero for the south, and and so does as the inclusion probabilities are zero in the north and positive in the middle and south. Hence there are two edges, one from to and another from to , and therefore is on a short path from to and it follows that we cannot collapse over (i.e. ignore) the region covariate.

3.2 Extension to more than two registers

Using the short path concept van der Heijden et al. (2012) provide the following results for three registers, see Figure 2, reproduced from their paper. In model we can collapse over the covariate as it is not on a short path. In model we cannot collapse over as it is on the short path between registers and . In model we can collapse over as is not on a short path from to . The maximal model, (using the notation that the interaction within [ ]’s and all lower order interactions involving the same terms are included), not shown in Figure 2, is not collapsible over .

We apply these results in the next sections, when we produce estimates based on linking three and four registers. At this stage of the paper we cannot immediately discuss the models used for the estimation of ethnic background. In these models there are not only variables for “being in register 1,…,4” but also covariates “ethnicity in register 1,…,4” which makes the models more complicated than the models discussed in this section. In Section 4 we start with a model for registers Census and MOH, that both aim to cover the full population. Then we add register DIA, the birth registration started in 1998. Conceptually, we consider there to be a covariate Age, for which we do not have data: inclusion probabilities are high for individuals born from 1998 onwards and zero before that time. Here we have a model similar to , as the inclusion probabilities for DIA are related to Age but for the Census and MOH we assume they are not. When we add register MOE we assume for this register that there is an unmeasured variable that is also related to age, as enrolments are available starting in the late 1990s, and that there are other factors that lead one to go into tertiary education. For Age we end up with a model having properties similar to those of model , as Age is related to two registers but if there is also a direct link between the two registers the model is collapsible over Age. The other factors leading one to go to tertiary education are further covariates that are only related to MOE and not to the other registers and the model is therefore collapsible over these covariates too. We come back to this in the next sections.

Figure 2: Interaction graphs for loglinear models with three registers and one covariate, taken from van der Heijden at al. (2012)

3.3 General result on the ignorability of partial coverage

We can extend the approach of this section to give a general result. For any number of registers, and any combination of full and partial coverages of the population, the partial coverage will be ignorable (that is, we can obtain the correct results from modelling without taking any special account of the population coverages) if the model is collapsible over the covariate (or covariates) which defines the partial coverage(s). Equivalently, the variable(s) defining the coverage(s) do not appear on any short path in the graphical representation of the model.

4 Building up the population estimates

4.1 Two registers

We start by using the two registers with the widest coverage, the Census and the MOH. Being in the Census is denoted by ( for ‘yes’, for ‘no’), and similarly for MOH, denoted by . The ethnicity variable in the Census is denoted by ( for non–Māori, for Māori, for individuals who are in but did not fill in their ethnicity, and = x for individuals that are not in ). The ethnicity variable in the MOH is denoted by and coded similarly to . In comparison to the methods employed by van der Heijden et al. (2018), the presence of the level in variables and is new, and we first extend the approach to deal with this new level with two registers.

Figure 3 illustrates the form of the data when they are coded in a matrix of individuals in the rows by variables in the columns. In the middle two columns we depict and , that indicate whether individuals are only in but not in (), in both and () or not in but only in (). At the bottom we find a horizontal band of ‘Individuals missed by both lists’, and this refers to . This last number has to be estimated to arrive at an estimate of the size of the total population of non–Māori and Māori. The first column stands for ethnicity variable . When individuals are in ( or ), there are three types of individuals, namely 0, non–Māori (light grey); 1, Māori (checkerboard pattern); and , those who have a missing value for ethnicity (gridded pattern). When individuals are not in but only in , the ethnicity variable is automatically not measured and denoted by x (white area). The last column stands for ethnicity variable , and it has similar levels to . Notice that there are three kinds of missing data: there is item missingness for those individuals that are on a list but did not provide their ethnicity; there is item missingness x for those individuals that are not on one list, and hence have no value on the corresponding ethnicity variable (if only , , and if only , ). Last, there is unit missingness for those individuals that are in neither nor .

Figure 3: Graphical representation of two linked registers

The problem can also be presented in contingency table format, see Table

2, panel 1. In the rows we have the combination of variables and , with levels , and similarly for the combination of variables and in the columns. Thus there is a table with a complicated structure. The subtable top–left shows the cross–classification for individuals in both and . The first two diagonal values are very large, showing that many individuals have the same ethnicity in both the census and MOH. The counts 108,189 and 31,995 refer to the number of individuals for which the information on ethnicity in contradicts that in . The counts for show the number of individuals that are in but whose ethnicity is missing. The aim of our analysis will be to distribute (for example) these 16,512 individuals over the levels and . It is clear that, even though most of these individuals will be non–Māori () as they are non–Māori in , not all of them are non–Māori as for non–Māori

there are still 108,189 individuals that are classified as Māori in

(). Similarly for the counts for . The 900 individuals for which the variables and are both missing ( will have to be distributed over the four cells . When individuals are not in the Census, , then is automatically missing. In the cross–classification with and , the numbers of individuals with and have to be distributed over and .

Panel 1: Observed counts
3,004,335 31,995 150,840 38,634 3,225,804
108,189 435,465 12,405 4,368 560,427
16,512 2,769 900 438 20,619
x 398,838 146,976 24,636 - 570,450
Totals 3,527,874 617,205 188,781 43,440 4,377,300
Panel 2: Fitted values under
3,170,294.8 33,787.9 38,616.0 411.6 3,243,110.3
111,242.5 448,084.8 877.6 3,534.9 563,739.8
402,709.4 10,770.8 4,905.2 131.2 418,516.6
14,130.7 142,839.1 111.5 1,126.8 158,208.1
Totals 3,698,377.4 635,482.6 44,510.3 5,204.5 4,383,574.8
Table 2: Census () linked to MOH (). Ethnicity in is denoted by and ethnicity in is denoted by , where and have levels 0 (non–Māori), 1 (Māori), (missing) and x (not in register). The data have been randomly rounded to base 3 to protect confidentiality. Source: Stats NZ.

The original 15 counts in Table 2, Panel 1, will have to be redistributed over 3 subtables of dimension . I.e., the subtable of size has to be reduced to size , the three values for have to lead to a subtable of size and similarly for the three values for . In a second step the subtable for has to be estimated, and this refers to the individuals that are missed by both lists. Thus two types of missing data are estimated.

For the simpler situation that there are no missing data of type but only of type x, van der Heijden et al. (2018) describe how this result can be found using an EM algorithm with some loglinear model specified by the researcher. Van der Heijden et al. (2018) show that the maximal loglinear model that can be fitted to the data is , where the highest order fitted margins are placed between square brackets. If we denote the levels of by , then this model can be denoted by


with identifying restrictions that the parameters , , and are free, and the other parameters are restricted to be zero. The maximal model is saturated in the sense that the fitted values are equal to the observed values. The other two–factor interactions cannot be estimated as there are no data to estimate them, for example, the interaction between and cannot be estimated as the joint marginal count for and is zero, and similarly for and , and for and .

The EM algorithm can also be applied in this more complicated situation with missing data of type and x with both types replaced by their expected values in the E-step. The result is given in Table 2, Panel 2. We created our own code (see suppl. materials B and C) instead of the computer program CAT that we used in the past (Schafer, Harding and Tusell, 2012; van der Heijden et al., 2012, supplementary materials). Due to the fitted model, in each of the three estimated

subtables the odds ratio between

and is identical and equal to 377.9, the value from the observed top–left subtable.

The lower right table in Panel 2 of Table 2 shows the estimated numbers of people missing from both census and MOH. These numbers are relatively low, due to the large overlap of the two registers. Under independence between and conditional on and , we find, for and , the estimate , and this low estimate is due to the large denominator (where and ). Although not many individuals are missed by both registers, approximately one fifth of the missed individuals are Māori.

The parameter estimates are given in Table 3. Their standard errors are very small, which is not surprising given the large observed frequencies. Notice that the estimate 4,905.2 in panel 2 of Table 2 can also be obtained as exp(8.4981). The estimated conditional odds ratio between the ethnicity variables and of 377.9 can be obtained as exp(5.9347). Also notice that the relation between and is negative, showing that being in the census, , goes together with a smaller probability of being recorded as Māori in the MOH (estimated odds ratio is exp), whereas being in MOH, , goes together with a larger probability of being recorded as Māori in the census (estimated odds ratio is exp(0.4344) = 1.54).

Estimate Std. Error z value Pr
(Intercept) 8.50 NA NA NA
A 2.06 0.002 1248.1
c 0.006
C 4.41 0.005 865.1
a 0.016
A:c 0.003
C:a 0.43 0.016 27.1
c:a 5.94 0.007 903.7
Table 3: Parameter estimates for two registers model

To estimate confidence intervals we use a procedure that can be considered, conceptually, to be a hybrid between the non-parametric and the parametric bootstrap. The nonparametric bootstrap draws random samples of size

from the frequencies observed in the sample (including in our example the frequencies of – and x, that is from Table 2, Panel 1). This approach takes the uncertainty due to these missing values into account, but not the uncertainty due to the estimated population size. As a consequence, the nonparametric bootstrap tends to underestimate the variance of

(Buckland and Garthwaite 1991, Anan, Bohning and Maruotti, 2017). The parametric bootstrap draws random samples of size from the fitted frequencies (in our example proportions derived from Table 2, Panel 2), and then sets the counts for cells which would not be observed (e.g. the cells for which and ) to zero. This approach, however, does not generate observations of missing values for the ethnicity variables and , and therefore also underestimates the variance of

because it excludes the imputation of these missing values.

Therefore we use a new, hybrid bootstrap procedure here that has the aim to combine these two properties: we use the observed counts in Panel 1 and supplement these with the fitted value of the total of the subtable for in Panel 2. These observed counts and the fitted value are used to derive multinomial probabilities (using as the denominator). Bootstrap samples of size (rounded) are then drawn with these probabilities, after which the counts in the subtable for are set to zero, and the new data analysed with model 1. In this application 2,000 bootstrap samples are used to derive a confidence interval using the percentile method.

The estimated total population size for New Zealand is 4,383,575; the 95 percent confidence interval using the percentile method, estimated with the parametric bootstrap, is 4,383,404 – 4,383,736. The census and the MOH differ in which part of this total is Māori, as summarised in Table 4, Panel A. The variable measuring ethnicity with the best validity can be used. If there is no clear preference, a practical solution could be to average the two estimates. Other considerations are discussed in section 4.3 below. When the number of registers involved is larger than two, we propose to use latent class models to bring the separate estimates into agreement with each other, see Section 5.

Māori Non-Māori
estimate 2.5 percent 97.5 percent estimate 2.5 percent 97.5 percent
Panel A: Two registers (section 4.1)

Panel B: Three registers (section 4.2)

Panel C: Four registers, restricted model (section 4.3)

Panel D: Three registers, administrative data sources only (section 4.4)

Table 4: Summary of population size estimates of non-Māori and Māori under the selected models according to the ethnicity classifications in different registers, estimated numbers and 95 percent confidence intervals.

4.2 Three registers

We now add the DIA register, denoted , with ethnicity variable , and consider data derived from three registers. The observed counts are found in the online supplementary material. Adding a third register has the advantage that we no longer have to assume conditional independence of and as in the model . Now we can fit models that have pairwise dependence between the three registers, which makes the fitted model more realistic. The cells of the contingency table with the observed frequencies are denoted with and x, and the cells of the contingency table with the fitted frequencies with , since the missing levels and x have been imputed.

As there are 6 variables, the potential number of fitted frequencies is . However, there are 8 structurally zero cells for the combination of and . Also, as in the situation for two registers, it is not possible to fit models where interacts with , with or with . Thus the maximal model is . This model has 1 (intercept) + 6 (main effects) + 12 (15 two–factor interactions minus the three interactions that cannot be fitted) + 7 (three factor interactions) = 26 parameters. See the left graph in Figure 4 for a graphical model. As described in Section 3, it is clear that DIA () has heterogeneous inclusion probabilities over age, as for some age categories the inclusion probabilities are zero. See the right graph for the resulting model. This model can be collapsed over Age, as Age is not on any short path, compare Section 3. Here we implicitly assume that coverage in at least one of the other sources is homogeneous w.r.t. age, and this is clearly reasonable for MOH. Although the census has some differential response by age, this is much less extreme than the partial population coverage of DIA, and we consider it reasonable to treat this source as homogeneous too. However, even if there is an important dependence of census on age we can make valid estimates as long as the interaction between Census and MOH is also included in the model, as then age is not on a short path and the model can be collapsed over it.

Figure 4: Interaction graphs for loglinear model on the left, extended with a covariate Age on the right

The number of unique individuals in the three linked registers is 4,378,377, the estimated number not in any register is 40,868, and this leads to an estimated population size of 4,419,245 (4,415,848 – 4,422,929). The estimated number of non–Māori and Māori in the Census, DIA and MOH can be found in Table 4, Panel B.

Parameter estimates for the maximal model are found in Table 5. With it is not surprising that it is hard to find more parsimonious models that fit the data, except for the model where the three-factor interaction between and is set to zero. Using the methodology that we propose it is possible to fit more parsimonious models but in this particular instance there is not much to gain. For three registers the maximal model allows for pairwise dependence of registers, whereas with two registers conditional independence had to be assumed. For example, for the pairwise dependence of register and we use the parameter estimates for : and ::. Then for non–Māori () the estimated conditional odds ratio between and is exp(0.332) = 1.4, and for Māori it is exp, so inclusion in goes together with inclusion in . These relations are much stronger for and , and for and . For example, for the estimated conditional odds ratio between and is exp(2.0042) = 7.4, and for the estimated conditional odds ratio between and is exp. Not surprisingly, there are strong relations between the ethnicity variables , and . For parameters :, : and : the estimates are 5.472, 5.075 and 4.156. Interestingly, the estimated three factor interaction parameter :: is negative. This means that, for example, for the non-Māori in the estimated odds-ratio between ethnicity measures in and is exp(5.4720) = 237.9, whereas for Māori in the estimated odds-ratio between the ethnicity measures in and is only exp. Thus the negative three-factor interaction estimate :: shows that for non-Māori in one of the three registers the disagreement in the other two registers is smaller than for Māori (i.e., Māori are more likely to be inconsistent between the registers than non-Māori). Table 6 illustrates this for counts marginalized over the registers and : for example, for the disagreement is 64,566 + 30,087 (relative to approximately 3,750,000), but for the disagreement is 26,063 + 18,448 (relative to approximately 625,000).

Estimate Std. Error z value Pr
(Intercept) 10.487 NA NA NA
A 0.029 0.050 0.6 0.5592
B 0.033
c 0.025
C 2.254 0.050 45.1
b 0.183
a 0.201
A:B 0.332 0.005 67.9
A:c 0.024
B:c 1.080 0.014 79.4
A:C 2.004 0.050 40.1
A:b 0.426 0.086 4.9
C:b 0.661 0.182 3.6
B:C 2.030 0.032 62.7
B:a 1.700 0.059 28.8
C:a 0.702 0.201 3.5
c:b 4.156 0.032 128.9
c:a 5.075 0.022 232.1
b:a 5.472 0.275 19.9
A:B:c 0.008
A:C:b 0.086
B:C:a 0.059
A:c:b 0.232 0.029 8.1
B:c:a 0.014
C:b:a 0.275 0.3664
c:b:a 0.029
Table 5: Parameter estimates for three registers model
3,581,229 64,566 18,264 26,063
30,087 100,639 18,448 579,949
Table 6: Contingency table of fitted frequencies for the marginal table of and

4.3 Four registers

We now add the fourth register, MOE, denoted , with ethnicity in MOE denoted as , to the analysis. Now the maximal model is . Due to the appearance of four factor interactions in the model the assumptions become less and less demanding as more registers are involved.

In this model both DIA and MOE have heterogeneous inclusion probabilities over Age. We argue here that this heterogeneity does not threaten the validity of our estimates. Assuming homogeneous inclusion probabilities over age for the Census and MOH, the heterogeneity would lead to a model where Age is related to both DIA and MOE, but not the other registers. Therefore Age is not on a short path (comparable to Figure 2, model M17) and we can collapse over Age. We conclude that the fact that DIA and MOE do not aim to cover the full population, does not threaten the validity of our estimates.

This model turns out to have some numerical instability in the sense that some log-linear parameters are (nearly) on the boundary or non-identified. We restricted 15 parameters to zero leading to the model . This stabilizes the estimates. The deviance of the model is 680.6, which is, with an observed population size of over 4 million, negligible. The population size estimates for the model are in Table 4, Panel C. The loglinear parameter estimates for the maximal model and for the restricted version are in the supplementary materials. The number of unique individuals in the four linked registers is 4,401,990, and the estimated number missed by all registers is 20,972, giving an estimated population size of 4,422,962 (4,421,894 – 4,424,080). In comparison to the three register solution, by including the fourth register the estimated population size increases by only 4,000, mostly Māori.

4.4 Three registers, ignoring the Census

We also present estimates derived only from the three administrative registers, so that we can see what would happen if the census were replaced entirely by an administrative data-based system. The observed number of individuals in at least one of the registers is 4,378,716. We estimate an additional 26,513 individuals missed by all three registers. This leads to a total population size of 4,405,229 (4,401,858 – 4,409,667). This is somewhat less than the four register estimate, that was 4,422,962.

The estimate of 4,405,229 broken down by ethnicity for each of the three registers is presented in Table 4, Panel D. The three-register Māori estimate for DIA is larger than the four-register estimate in Table 4, Panel C: 804,936 versus 775,697; for MOH it is very similar and for MOE the three-register estimate is larger as well: 780,234 versus 762,222. This suggests that in the absence of the census, the estimate of the size of the Māori population would be larger. Based on the estimates of measurement error from the latent variable analysis in section 5, where the census was the most accurate, this suggests an overestimation of the Māori population size.

5 Dealing with measurement error in Māori variables

To arrive at a final estimate of the number of non–Māori and Māori we describe two approaches, both using the concept of measurement error. Deriving a final estimate is related to the field of macro–integration of categorical outcomes, see de Waal et al. (2019 and in press) for a review of the field. Consider the joint margins of the ethnicity variables and of the four registers in Table 7 from the restricted maximal model of Section 4.3. An ad hoc approach is to consider five groups of individuals, according to how many registers record them as Māori. Then a (slightly arbitrary) choice would have to be made which categories would be used to arrive at the estimated numbers of Māori and non–Māori. For example, one could consider individuals who are estimated to be Māori in at least two registers to be Māori, allowing for a measurement error in recording a non–Māori as a Māori in at most one register. This would give a final estimate of 768,479, corresponding to an estimated probability of .

3,519,852 53,366 10,998 6,934
55,676 15,600 9,686 17,555
14,590 21,218 2,560 17,747
18,443 79,105 28,934 550,697
Table 7: Joint margins for variables (Census), (DIA), (MOH) and (MOE) estimated under the restricted maximal model.

A statistical approach to measurement error is to make use of a latent class model (McCutcheon, 1987), a technique that has also been applied to evaluate census-related multiple system estimation by Biemer et al. (2001), see for comparable work Biggeri et al. (1999), Stanghellini and van der Heijden (2004) and di Cecco et al. (2018). See also Boeschoten, de Waal and Vermunt (2019), for a comparison of the latent class model with the ad hoc approach described in the preceding paragraph. Other recent work making use of the latent class model in official statistics is Boeschoten, Oberski and de Waal (2017) and Boeschoten, de Waal and Vermunt (2019). Oberski (2018) makes a plea for the use of latent class modelling in the context of linked data sources to tackle the measurement error problem.

The latent class model assumes the existence of a categorical latent variable, and that the observed variables are independent conditional on the latent variable. Thus the latent variable “causes” the responses to the observed variables, and explains the interactions between the observed variables. Let be the joint probability for variables and , with levels indexed by and respectively, with 0 for non–Māori and 1 for Māori. Let be the latent variable, with two levels indexed by (). Let be the probability to fall in latent class . Let be the conditional probability for variable to fall in given latent class . Then the two-class latent class model is


As there are 16 counts in Table 7, and nine parameters in equation 2

(i.e. eight independent conditional probabilities and one independent class size parameter), there are 7 degrees of freedom.

We fit the latent class model with two latent classes to the data in Table 7. This gives a two-stage procedure - first fitting a model to deal with the missing data (both and x), and then applying a second, latent class, model to the resulting estimates. We find the estimates in Panel 1 of Table 8. In this latent class model, the first latent class is to be interpreted as the class for non–Māori, and the estimated probability of falling in this class is 0.827. The estimated probability for the Māori class, 0.1733, corresponds to an estimated Māori population size of approximately 770,677. Estimated conditional probabilities of being Māori for each latent class are also shown in Panel 1 of Table 8; they are consistently low for the non–Māori latent class and high for the Māori latent class. These estimated conditional probabilities can be interpreted as measurement error estimates under the assumption that the model is correct. For example, with the census has the smallest measurement error for Māori: given that the true status of someone is Māori, he or she has an estimated probability of 0.063 to say he or she is non–Māori in the census. For the non–Māori the MOH has the smallest measurement error, with an estimate of 0.003, closely followed by the census. The measurement errors are much larger in the latent class of the Māori than in the class of the non–Māori. The population size of the Māori latent class of 770,677 may be considered high when we compare it with the estimates for the Census, DIA, MOH and MOE of 733,167, 761,545, 643,429 and 770,047 respectively. However, we have to take into account that in this Māori latent class individuals have relatively large measurement error, making them report regularly that they are non-Māori. The reverse measurement error, that individuals in the non-Māori latent class report that they are Māori, is much smaller. We note that a three class latent class model is not identified if the number of observed variables is four, even though the number of cells is larger than the number of parameters (Vermunt and Magidson, 2004).

Panel 1: Estimates for Table 7
census DIA MOH MOE
Class 1 0.827 0.004 0.016 0.003 0.015
Class 2 0.173 0.937 0.937 0.826 0.922
Panel 2: Estimates for LCMSE
Class 1 0.834 0.007 0.014 0.005 0.016
Class 2 0.166 0.957 0.958 0.847 0.959
Table 8: Estimates of latent class models with two latent classes.

We have just fitted a latent class model with two classes on the margins for variables and . It is also possible to define overall models. For this purpose we first rewrite the latent class model with two latent classes as a loglinear model for the observed variables and and the latent variable (Hagenaars, 1993). The latent class model can be denoted as the loglinear model for the probabilities , where . Written as a loglinear model using –parameters, we have




We incorporate this loglinear model in the maximal model for variables and in the following way. To start with, we define probabilities that allow us to define a model for the eight observed variables and the latent variable. These probabilities that include a latent variable are related to the probabilities for the observed variables only by


In the section on four registers we saw that the maximal model is . The restrictive features of this maximal model are (1) that variables denoted by a lower case letter cannot appear in the same term as variables denoted by capitals, so that, for example, variables and cannot appear in the same interaction term, and (2) the four capitals and cannot appear together in the same term. We are particularly interested in the joint marginal counts of and .

We now modify the maximal model to include a latent variable. In the maximal model there are three groups of terms. First, there is the last term , that we replace by the latent class model . Thus we are formulating a model for observed variables and the latent variable. Second, we eliminate the terms , as the joint appearance of two or three of the lower case letters for variables and in one single term is in conflict with the latent class assumption. For example, consider . The interaction between and is explained by the latent variable . One could argue that, although it is in conflict with the latent class model to include the interaction, one could still include the interactions for , for and for . However, this would result in a non–hierarchical loglinear model and these are difficult to interpret, so we eliminate these terms completely. Third, we can retain the terms . Thus a latent class model where (3) is extended with the register variables and , is . We refer to this model as LCMSE, short for latent class multiple system estimation.

Our code for the LCMSE model can be found in the supplementary materials B and C. The LCMSE model was also fitted using lEM (Vermunt, 2007), however, lEM does not provide an easy way to calculate estimates of the missing part of the population.

The latent class parameter estimates can be found in Panel 2 of Table 8. They are similar to those in Panel 1. The deviance of the model is 10,922.25. A proper evaluation of this value should take the observed population size of 4,401,990 into account, and therefore we divide the deviance by 4,401.99, so that we have an impression of the deviance if the population size had been 1,000. Thus we find a normed deviance of 2.5, showing a good fit. The population size estimate for this model population is 4,447,071, with a 95 percent confidence interval of 4,435,301 – 4,465,050. The estimated proportion of Maori in the LCMSE is 0.166, lower than in the two-stage model, and the non–Māori proportion is correspondingly higher. It is as if the LCMSE approach is more likely to treat individuals with uncertain status as part of the non–Māori latent class. This reduces the estimated measurement errors in the Māori latent class, and increases them in the non-Māori. We prefer the integrated approach of the LCMSE because it deals with all the variability in the data appropriately, whereas the two-step approach means that the latent class estimates depend on the fitted values of the first stage model.

As a second model, we adjust the LCMSE model by including a latent variable that explains the relations between the variables and . Theoretically, such a model allows for heterogeneity of inclusion probabilities, where in one class inclusion probabilities for the four registers are higher and in the other class they are lower, see Stanghellini and van der Heijden (2004) for a medical example. So if there were two such subpopulations, the latent class model would reveal this. A model for this is (now excluding all of the original interactions as these are explained by the latent class variables). This model has a deviance of 291,464.1, and a normed deviance of 66.2. Thus we reject this model. We also considered extending the model with an interaction between and , giving . The fit improves but is still not good: the normed deviance is 43.7.

We conclude that the LCMSE model is the best model for these data. Interestingly, the estimated probability of the Māori latent class of 0.166 in this model is close to the estimated probability of Māori of 16.5 for the IDI-ERP found by Statistics New Zealand. Also, the tables analysed are for Census day 5th March 2013. The official ERP figure for March 2013 is 4,436,000 +/- about 0.5 percent ( This is very close to our 4 register estimate of 4,422,962 and the LCMSE estimate of 4,447,071.

6 Discussion

Van der Heijden et al. (2018) presented an approach for estimating the margins of auxiliary variables in the dual system estimation framework. They suggested that more experience with applications of this methodology was needed to be able to judge its usefulness. Here this approach is extended to multiple system estimation with four registers, and a more complicated missing data structure.

The data used here were probabilistically linked in Stats NZ’s Integrated Data Infrastructure (IDI), which means that the linkage has not been subjected to a clerical review stage. This increases the risk that there will be some linkage error which is not accounted for in our models, and since dual and multiple system estimation using loglinear models is critically dependent on the lack of linkage error (Wolter, 1986, Biemer and Stokes, 2004 , Gerritse et al., 2017), there is a risk that estimates will be inflated through matching error. The data have also been randomly rounded to base 3 to protect confidentiality, but we expect this to have a negligible impact on the fitted values as the number of individuals in the linked data set is larger than 4 million.

There is also a risk of overcoverage (also known as list inflation) in administrative registers, through duplicate records and the inclusion of people who are not part of the target population, which is the usual resident population in the case of the census as considered here. It is likely that some duplicates remain in the IDI-ERP despite attempts to remove as many of them as possible. There is as yet no estimate of the remaining overcoverage in the IDI-ERP. However, among the sources considered here the DIA birth register is less likely to suffer from duplicates because it does not obtain repeating data – though for the same reason it may suffer from overcoverage when people emigrate. However, the sequence of population size estimates from models with two to four registers shows only small increases, suggesting that the overcoverage is small in the added registers.

We assume that the inclusion probability is homogeneous in at least some of the registers (see section 4.2). This assumption seems most at risk for the census source with respect to age, though as we discussed previously this does not invalidate our analysis because we have homogeneity in other sources; at worst it may restrict our choice of models to ensure that age does not appear on a short path between registers. However, even in the case that age was important in all the registers, it would still be possible to use missing data methodology to estimate the unobserved parts of the registers under the assumption that the relations in parts of the population where the registers overlap also hold in the other parts - a missing at random assumption (Zwane et al. 2004).

The modelling process to choose the most parsimonious loglinear model to use in population size estimation and marginal total estimation for auxiliary variables such as Māori/non–Māori is complicated in our application by the size of the data. For most of the models except for the model for four sources, since the census and the MOH registers cover most of the population of New Zealand, the entries in the contingency tables (for example Table 2, Panel 1) are very large. Therefore any term added to the model has plenty of data for estimation, and will almost certainly be significant. Therefore we have ended up with saturated models, except for the model for four sources where the saturated model is numerically unstable, and the latent class model, that is not saturated by definition. The estimated sizes of the Māori population are plausible, but a different value is created according to the definition in each register (in this sense the Māori identifiers act as alternative variables, as in van der Heijden et al. (2018, section 4)). A preferred version of the variable, or some pragmatic combination of estimates must be chosen in order to obtain a final estimate.

The latent class analysis of the same data deals with these different definitions, and provides a consolidated single estimate, and, as a consequence, also estimates of the measurement errors in each source. In the two models for which estimates are presented the measurement errors for Māori status in the census are among the lowest (Table 8, Panel 1 and 2), which provides some support for the intuition in the statistical system that the purpose-designed census collection gives a better estimate of the size of the Māori population than one based on administrative registers which only collect this variable as a by-product of their main purpose. The preferred LCMSE model (Table 8, Panel 2) shows almost the same estimated measurement error for the Māori in census, DIA and MOE, with MOH showing a substantially greater measurement error. MOH has the lowest estimated measurement error for non-Māori, but only a little better than the Census (under either latent class approach).

Despite these differences, the estimate using only the three administrative registers without the input of the census is reasonable, although there are some differences from methods based on the census.

We conclude that the methods of van der Heijden et al. (2018) provide stable results that allow for detailed interpretation of the processes of inclusion in the registers considered, and of recording Māori status. The conditions for treating the partial coverage of the registers as ignorable are met in this study, so that the modelling can be applied directly, and we show that they can be extended to deal with different forms of missing data. The latent class approach provides a principled method to produce a common estimate accounting for differences in the definition of Māori status among the data sources, and also provides estimates of the measurement error in the different definitions which can be used to understand the quality of the administrative sources.

7 Disclaimer

The results in this paper are not official statistics, they have been created for research purposes from the Integrated Data Infrastructure (IDI) managed by Stats NZ. The opinions, findings, recommendations and conclusions expressed in this paper are those of the authors, not Stats NZ. Access to the anonymised data used in this study was provided by Stats NZ in accordance with security and confidentiality provisions of the Statistics Act 1975. Only people authorised by the Statistics Act 1975 are allowed to see data about a particular person, household, business or organisation and the results in this paper have been confidentialised to protect these groups from identification. Careful consideration has been given to the privacy, security and confidentiality issues associated with using administrative and survey data in the IDI. Further detail can be found in the Privacy impact assessment for the Integrated Data Infrastructure (Statistics New Zealand, 2017b).

8 Supplementary Materials

Supplementary Material A:

Raw data tables and parameter estimates. (file type: .pdf)

Supplementary Material B:

R Package (including data): cenzus. Openly available at

Supplementary Material C:

Computer code and extended output of parameter estimates (file type: .html)