## 1 Introduction

Estimating the size of a hidden finite set is an important problem in a variety of scientific fields. Often practical constraints limit researchers’ access to elements of the hidden set, and direct enumeration of elements may be impractical or impossible. In demographic, public health, and epidemiological research, researchers often seek to estimate the number of people within a given geographic region who are members of a stigmatized, criminalized, or otherwise hidden group [1, 2, 3, 4]. For example, researchers have developed methods for estimating the number of homeless people [5, 6], human trafficking victims [7, 8], sex workers [9, 10, 11, 12, 13], men who have sex with men [14, 15, 10, 16, 17, 18, 11, 19, 20], transgender people [21, 19], drug users [22, 23, 24, 25, 26, 27, 19, 11, 28], and people affected by disease [29, 30, 31, 32, 33, 34]. In ecology, the number of animals of a certain type within a geographic region is often of interest [35, 36, 37, 38]. Effective wildlife protection, ecosystem preservation, and pest control require knowledge about the size of free-ranging animal populations [39, 40, 41]. In intelligence analysis, military science, disaster response, and criminal justice applications, estimates of the size of hidden sets can give insight into the size of a threat or guide policy responses. Analysts may seek information about the number of combatants in a conflict, military vehicles [42, 43], extremists [44], terrorist plots [45, 46], war casualties [47], people affected by a disaster [48], and the extent of counterfeiting [49].

Despite the wide diversity in application domains, most statistical approaches to estimating the size of a hidden set fall into a few general categories. Some approaches are based on traditional notions of random sampling from a finite population [50, 51]. Others leverage information about the ordering of units [42, 43], or relational information about “network” links between units [5, 52, 53, 54, 55, 26]. Single- or multi-step sampling procedures that involve record collection or “marking” of sampled units – called capture-recapture experiments – are common when random sampling is possible [56, 57, 58, 35, 59, 23]. Sometimes exogenous, or population-level data can help: when the proportion of units in the hidden set with a particular attribute is known *a priori*, then the proportion with that attribute in a random sample can be used to estimate the total size of the set [60, 61, 62, 25, 18, 63]. Still other methods use features of a dynamic process, such as the arrival times of events in a queueing process, to estimate the number of units in a hidden set [45, 46].

Alongside these practical approaches, corresponding theoretical results provide justification for particular study designs and estimators, based on large-sample (asymptotic) arguments. Guidance for prospective study planning often depends on asymptotic approximation. For example, sample size calculation may be based on asymptotic approximation if the finite-sample distribution of an estimator is not identified or hard to analyze [64, 65, 66]. In retrospective analysis of data and the comparison of statistical approaches, researchers may choose estimators based on large-sample properties like asymptotic unbiasedness, efficiency and consistency if closed-form expressions for finite-sample biases and variances are hard to derive [67, 68].

Claims about the large-sample performance of estimators depend on specification of a suitable asymptotic regime, and it is well known that estimators can perform differently under different asymptotic regimes. Asymptotic theory in spatial statistics provides some perspective on what it means to obtain more data from the same source: informally, an “infill” asymptotic regime assumes a bounded spatial domain, with the distance between data points within this domain going to zero. An “increasing domain” or “outfill” asymptotic regime assumes that the minimum distance between any pair of points is bounded away from zero, while the size of the domain increases as the sample size increases. The latter is usually the default asymptotic setting considered by researchers studying the properties of spatial smoothing estimators [69, 70, 71]. However, under infill asymptotics, these desirable asymptotic properties of smoothing estimators often do not hold: even when consistency is guaranteed, the rate of convergence may be different [72, 73, 69, 74, 75].

When the size of the population from which the sample is drawn is the estimand of interest, intuition about large-sample properties of estimators can break down, but a similar asymptotic perspective is useful in studying the properties of estimators for the size of a hidden set: an infill asymptotic regime takes the total population size to be fixed, while the number of samples from this population increases; the outfill regime permits the sample size and population size to grow to infinity together.

In this paper, we review models and methods for estimating the size of a hidden finite set in a variety of practical settings. First we present a unified characterization of set size estimation problems, formalizing notions of size, sampling, relational structures, and observation. We then introduce the non-asymptotic regime in which sample size tends to the population size, and define the “infill” and “outfill” asymptotic regimes in which the sample size and population size may increase. We investigate a range of problems, query models, and estimators, including the German tank problem, failure time models, the multiplier method, the network scale-up estimator, the Horvitz-Thompson estimator, and capture-recapture methods. We characterize consistency and rates of estimation errors for these estimators under different asymptotic regimes. We conclude with discussion of the role of substantive and theoretical considerations in guiding claims about statistical performance of estimators for the size of a hidden set.

## 2 Setting and notation

### 2.1 Hidden sets

Let be a set consisting of all elements from a specified target population. In general, can be discrete or continuous. Let be a measure defined on such that . The *size* of is . We call a *hidden* set if the members of are not directly enumerable, or if its size cannot be ascertained from a deterministic query. When is a finite set of discrete elements, is the cardinality of . If alternatively is the union of intervals, then can be taken as Lebesgue measure.

We seek to learn about the size of by sampling its elements. Define a probability space , where is a -field, and is a probability measure on .
The measure represents a probabilistic query mechanism by which we may draw subsets of the elements of . For each possible sample , defining gives a notion of *random sampling*. Sequential sampling designs can be specified by defining the sequential sampling probabilities . Sequential samples are denoted as , and the sample size is defined as , the sum of the cardinality of each sample, which can be larger than under with-replacement sampling. An estimator of is a functional of onto .

Elements of the hidden set , or of a sample from , may have attributes, labels, or relational structures that permit estimation of from a subset. An element may be labeled or have attributes , which may be continuous, discrete, unordered, or ordered. The elements of may be connected via a relational structure, such as a graph , where the vertex set is , and edges represent relationships between elements. Alternatively, the sampling mechanism may impose a structure on the elements of a sample: if and are samples from , then the intersection is the set of elements in both samples. An *observation* on the sample consists of statistics that reflect these attributes, labels or structures of the units in , such as the value of attributes , network degrees in a graph or size of the intersection of samples .

### 2.2 Asymptotic regimes

We now formalize asymptotic regimes relevant for hidden set size estimation.

###### Definition 1 (Asymptotic regime).

Let be a probability space defined for each , and let be the set of samples from , with . An asymptotic regime is a sequence such that the limits

exist (infinity included).

We first define the trivial finite-population regime, in which the sampled set approaches the fixed population .

###### Definition 2 (Finite-population regime).

Let be a hidden discrete set of fixed size. The finite-population (non-asymptotic) regime is for all and for all , where is a positive integer.

Next, we define the “infill” asymptotic regime that arises when sampling repeatedly (with replacement between different samples) from a set of fixed finite size. This regime is an example of a superpopulation model [76, 77] which reproduces the original population for each .

###### Definition 3 (Infill asymptotic regime).

Let be a sequence of probability spaces, where assigns probability to sequential samples for any . The infill asymptotic regime is a sequence , where (any ) and are both fixed and bounded, and the number of samples as .

Sometimes it can be difficult to conceptualize sampling infinitely many times from , or the sampling design may be subject to practical constraints, so that sampling only a single or fixed number of samples, or a fixed proportion of the total population, is allowed. It is therefore also reasonable to study the performance of estimators under an asymptotic regime in which a *single* sample is obtained from the hidden set, where the size of the sample and hidden set may tend to infinity together.

###### Definition 4 (Outfill asymptotic regime).

Let be a sequence of probability spaces, where assigns probability to for any . The outfill asymptotic regime is a sequence such that and such that for each as , where may be finite or infinite.

We are primarily interested in the outfill asymptotic regime with for all . The multiplier and capture-recapture methods, described below, are special cases where may be greater than one. Figure 1 illustrates different regimes in general discrete settings.

### 2.3 Statistical properties of estimators

Let be an estimator of , defined for each . We are interested in the statistical properties of under the asymptotic regimes described above. An estimator is called unbiased if for all , where denotes expectation with respect to . Under an asymptotic regime , an estimator is asymptotically unbiased if

. There may be some slightly biased estimators whose variance is smaller than that of every unbiased estimator. A common way to balance the trade-off between bias and variance is to evaluate the mean squared error (MSE), defined as . The asymptotic MSE under a given regime is defined as . An estimator that satisfies for any under a particular asymptotic regime is called consistent for . An estimator is called MSE consistent for under a certain asymptotic regime if as under that asymptotic setting. MSE consistency implies consistency.
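
The two facts used in this paragraph can be written out explicitly. The following is a sketch in assumed notation: the symbols $\hat N_n$, $N_n$, $\mathbb{E}_n$ and $\mathbb{P}_n$ are chosen here for illustration and stand for the estimator, the target size, and expectation and probability under the $n$th sampling mechanism. It also assumes that consistency is measured on the absolute error scale. The first line is the usual split of the MSE into variance plus squared bias. The second applies Markov's inequality to the squared error, which suggests why MSE consistency implies consistency.

```latex
\begin{align*}
\mathrm{MSE}(\hat N_n)
  &= \mathbb{E}_n\big[(\hat N_n - N_n)^2\big]
   = \mathrm{Var}_n(\hat N_n) + \big(\mathbb{E}_n[\hat N_n] - N_n\big)^2, \\
\mathbb{P}_n\big(|\hat N_n - N_n| > \epsilon\big)
  &\le \frac{\mathrm{MSE}(\hat N_n)}{\epsilon^2}
  \qquad \text{for every } \epsilon > 0 .
\end{align*}
```
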
Under a particular asymptotic regime, we call a sequence of estimates *asymptotically normal* with mean , variance and rate

if the cumulative distribution function (CDF) of

converges to the CDF of a random variable, denoted by .

## 3 Ordered sets

Suppose each unit in the hidden set has a distinct label , so that the labels give a natural ordering of the elements in : we can define units if . One common scenario for discrete is that the ’s are consecutive integers. Another common situation when is equivalent to an interval in is that equals that interval. An observation of samples from an ordered set consists of sampled units and their labels .

### 3.1 Discrete set: the German tank problem

In 1943, the Economic Warfare Division of the American Embassy in London initiated a project to learn about the capacity of the German military using serial numbers found on German equipment, including tanks, trucks, guns, flying bombs, and rockets [42, 78]. In a simple conceptualization of the problem, let and consider sampling units without replacement from with probability . With i.i.d. repeated samples, an estimator for is a functional of the observations, including the sample sizes and observed labels . Let be the th order statistic for the th sample.

With one sample, the maximum likelihood estimator (MLE) for is , which is negatively biased. Goodman [43] proposed an unbiased estimator

(1) |

which is a uniformly minimum-variance unbiased estimator (UMVUE), with . An alternative estimator of takes into account the gap between and , and adjusts for the bias with the average gap between order statistics [43]. The estimator

(2) |

is also unbiased, with . The estimator can also be modified to estimate when the labels do not start with 1. In particular,

is the UMVUE of when the initial label is unknown [43], with .

When there is more than one sample, we take the MLE as the maximizer of the joint sampling probability , which is , the largest observed value across all samples. For estimators with closed forms like , we derive estimates based on each sample, and take their average as the estimator. In the remaining sections, we average the estimators under infill by default, except for the models where infinite without-replacement sampling is feasible (e.g. Sections 3.2 and 4.1). We consider the infill asymptotic regime where and , and the outfill regime where with . Figure 2 illustrates different regimes for the German tank problem. We have the following asymptotic results:

###### Theorem 3.1.

Under the finite-population and infill regimes, are consistent. Under the outfill regime, all estimators above are asymptotically unbiased with asymptotic MSE and inconsistent. Whether the initial label is known or not does not change the rate of MSE of the UMVUE.

Höhle and Held [79] investigated the same problem from a Bayesian perspective. Taking an improper uniform prior, , the posterior mode is the MLE and the posterior mean is for . The latter converges in probability to a biased quantity under the infill regime, and has the same MSE rate as under outfill asymptotics.
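
The single-sample estimators of this section can be illustrated with a short Python sketch. It assumes labels 1, ..., N and a without-replacement sample of size n, and writes the estimators in their standard textbook forms, which are assumed to correspond to the MLE, the unbiased estimator of Goodman [43], and its unknown-initial-label variant. All function names are hypothetical. The helper `exact_mean_and_variance` enumerates every equally likely sample, so unbiasedness can be checked exactly rather than by simulation.

```python
from itertools import combinations
from statistics import fmean, pvariance


def mle(labels):
    """Single-sample MLE: the largest observed label."""
    return max(labels)


def umvue_known_start(labels):
    """Unbiased estimator max * (n + 1) / n - 1, assuming labels run 1..N."""
    n = len(labels)
    return max(labels) * (n + 1) / n - 1


def umvue_unknown_start(labels):
    """Unbiased estimator of N when the first label is unknown (needs n >= 2)."""
    n = len(labels)
    return (max(labels) - min(labels)) * (n + 1) / (n - 1) - 1


def umvue_variance(N, n):
    """Exact variance of umvue_known_start: (N - n)(N + 1) / (n (n + 2))."""
    return (N - n) * (N + 1) / (n * (n + 2))


def exact_mean_and_variance(estimator, N, n):
    """Mean and variance over all size-n subsets of {1..N}, each equally likely."""
    values = [estimator(s) for s in combinations(range(1, N + 1), n)]
    return fmean(values), pvariance(values)


if __name__ == "__main__":
    print(exact_mean_and_variance(umvue_known_start, 10, 3))
    # Outfill-style growth with n = N / 2: the variance does not shrink to zero.
    for N in (100, 1000, 10000):
        print(N, umvue_variance(N, N // 2))
```

In this sketch, with the sampling fraction held at one half the variance approaches 2 instead of vanishing, which is in line with the inconsistency under the outfill regime stated in Theorem 3.1.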

### 3.2 Continuous interval

A continuous version of the German tank problem arises for estimation of the length of a finite interval using i.i.d. random samples from the continuous uniform distribution Unif . For one sample of size , the probability density . Repeated samples are independently generated under the same mechanism.

For one sample, the MLE is , which is biased. The UMVUE is

(3) |

with variance . Consider the infill regime with , , and , and the outfill regime where , and with . When there are samples, the MLE is , which is biased, but asymptotically unbiased under the infill regime when . Since

under outfill, is inconsistent. Moreover, it is asymptotically unbiased with variance under the outfill regime. We discuss outfill consistency when the density increases at polynomial and exponential rates near in the Appendix.
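
A minimal sketch of the continuous case, assuming the interval is (0, theta) so that its length is theta. The symbol `theta` and all function names are illustrative choices, not notation from the text. The unbiased estimator is written in its standard form, (n + 1) / n times the sample maximum, with variance theta^2 / (n (n + 2)), and a small seeded simulation compares the empirical moments against these values.

```python
import random


def interval_mle(sample):
    """MLE of the interval length: the sample maximum (biased downward)."""
    return max(sample)


def interval_umvue(sample):
    """Bias-corrected estimator (n + 1) / n * max(sample)."""
    n = len(sample)
    return (n + 1) / n * max(sample)


def umvue_variance(theta, n):
    """Variance of interval_umvue under Unif(0, theta) sampling."""
    return theta ** 2 / (n * (n + 2))


def simulate(theta, n, reps=20000, seed=0):
    """Empirical mean and variance of interval_umvue over repeated samples."""
    rng = random.Random(seed)
    estimates = [
        interval_umvue([rng.uniform(0, theta) for _ in range(n)])
        for _ in range(reps)
    ]
    mean = sum(estimates) / reps
    var = sum((e - mean) ** 2 for e in estimates) / reps
    return mean, var


if __name__ == "__main__":
    print(simulate(5.0, 10), umvue_variance(5.0, 10))
```

If n grows in proportion to theta, the variance formula tends to a nonzero constant, which matches the inconsistency under outfill described above.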

## 4 Bernoulli Trials

Consider a discrete hidden set consisting of unlabeled, indistinguishable units. A sample from arises by associating a binary indicator to each , for fixed , where different realizations of the ’s can be generated in different draws. The probability may be known or unknown. A single sample consists of the subset of units with positive indicators, . This is a frequently encountered situation in computer science, ecology, business, epidemiology, and many other fields [80, 81, 34, 33].

### 4.1 Binomial parameter

We first assume that is known. A single sample from gives an observation which follows Binomial distribution. When there are independent samples, we assume they are generated by the same mechanism, so

. The method of moments estimator (MME)

is an unbiased estimator of . There are two versions of the MLE, derived from continuous and discrete likelihood equations respectively. The continuous MLE is the solution of (take if it is larger than the solution), and the discrete MLE is the largest such that .

The finite-population regime arises when and , i.e. when all units are associated with indicator and observed in a single sample. We consider the infill asymptotic regime with and , where the “sample size” here represents the number of repeated samples. The outfill regime is with . Figure 3 shows how the sampling mechanism varies under different regimes for the binomial model.

###### Theorem 4.1.

Under the finite-population regime, , and are consistent. Under infill asymptotics, , , and after rounding to the nearest integer, are consistent [82]. Under outfill asymptotics, and are both asymptotically unbiased and normal with variance . The “relative error” of the discrete MLE, for any . The “relative error” of with goes to in probability [82, 83].
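
The method of moments estimator and the discrete MLE for a known success probability can be sketched as follows. The sketch assumes the repeated counts are i.i.d. Binomial(N, p) with 0 < p < 1, and the function names are hypothetical. The discrete MLE is computed from the ratio L(M) / L(M - 1), which equals the product over samples of M (1 - p) / (M - x) and decreases in M, so the search stops at the largest integer for which the ratio is at least one.

```python
import math


def mme(counts, p):
    """Method of moments: average count divided by the known probability p."""
    return sum(counts) / (len(counts) * p)


def discrete_mle(counts, p):
    """Largest integer N with L(N) >= L(N - 1); requires 0 < p < 1."""

    def log_ratio(M):
        # log of L(M) / L(M - 1), valid for M > max(counts)
        return sum(math.log(M * (1 - p) / (M - x)) for x in counts)

    N = max(counts)  # the likelihood is zero below the largest count
    while log_ratio(N + 1) >= 0:
        N += 1
    return N


if __name__ == "__main__":
    print(mme([7], 0.3), discrete_mle([7], 0.3))
```

For a single count x the search reduces to the integer part of x / p, so the two estimators differ by less than one unit in that case.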

When is unknown, the situation does not improve: negative or unstable estimates may occur, and Bayesian approaches are usually adopted to avoid these issues. Blumenthal and Dahiya [82] adopted a conjugate prior Beta for and an improper uniform prior for ; the posterior is proper if and only if [84]. Blumenthal and Dahiya [82] showed that the posterior mode is consistent under infill asymptotics, and satisfies

under the outfill regime. In particular, the MSE rate is slower compared to as in Theorem 4.1 when is known.

### 4.2 Zero-truncated Poisson

Sampling bias can sometimes be exploited to estimate the size of a hidden set. For example, a registry may record the number of times each unit has been observed, but zero counts are not recorded. Distributional assumptions can be used to estimate the proportion of unobserved zero counts, leading to estimates of the set size. Zero-truncated counting models have been used to estimate size of hard-to-reach populations, including drug users [85, 86], undocumented immigrants [87, 88], criminal population [89, 90], the number of infected households in an epidemic [91], and species richness in ecology [92, 93].

To illustrate, let be a set of indistinguishable units. To each unit , we associate a realization of the attribute . A sample from is and an observation on is , the set of all positive counts. For one sample, the sampling mechanism is given by . We define the infill asymptotic regime as and , i.e. more and more identically distributed and mutually independent realizations of are generated, leading to the samples such that for . The outfill asymptotic regime is defined as with .

When is known, estimation of reduces to the simplest binomial model as in Section 4.1, where , and all asymptotic claims follow. When is unknown, Stuart et al. [94] suggested using the MME of binomial where , leading to . This estimate is unbiased if is known, and negatively biased by Jensen’s inequality if an unbiased estimator is used for .
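
The plug-in construction described above can be sketched in Python. The sketch assumes each unit's count is Poisson with a common rate, written `lam` here, so that a unit is observed with probability 1 - exp(-lam). When the rate is unknown, the code estimates it from the positive counts by solving lam / (1 - exp(-lam)) = mean count, which is the zero-truncated Poisson likelihood equation. That particular rate estimator is an assumption made for illustration and may differ from the one used in [94]. All function names are hypothetical.

```python
import math


def size_given_rate(n_observed, lam):
    """N-hat = n / (1 - exp(-lam)): observed units over P(count > 0)."""
    return n_observed / (-math.expm1(-lam))


def rate_from_positive_counts(counts, iterations=200):
    """Solve lam / (1 - exp(-lam)) = mean(counts) by bisection."""
    m = sum(counts) / len(counts)
    if m <= 1:
        raise ValueError("mean of positive counts must exceed 1")
    lo, hi = 1e-12, m  # the left-hand side is increasing and at least lam
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid / (-math.expm1(-mid)) < m:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def size_estimate(counts):
    """Plug the estimated rate into the known-rate estimator."""
    lam = rate_from_positive_counts(counts)
    return size_given_rate(len(counts), lam)


if __name__ == "__main__":
    print(size_estimate([1, 1, 2, 1, 3, 1, 2, 1]))
```

The estimate is never smaller than the number of observed units, since the observation probability is at most one.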

### 4.3 Waiting times

Sometimes the state of a hidden unit may change, thereby making it known to an observer. For example, terrorist plots may change state from “hidden” to “executed”, making them observable by intelligence agents [45]. The temporal pattern of such state changes may give insight into the number of hidden units. Properties of waiting times to an event have been exploited to estimate the number of units in studies of terrorism, crime, and estimation of epidemiological risk population sizes [45, 95, 96, 97].

Suppose is a set of hidden units in existence at time 0, each of which is at risk of “failure” at some future time. To each , associate a failure time , and suppose failure times are observed up to some finite observation time . A sample is the set of units that have failed by the end of study, with , and an observation on is . With repeated sampling, a new observation is independent of all previous observations, taken after all units are set to be “at risk” over again. We consider the finite-population regime in which so that all failures are observed, the infill regime in which and are fixed with the number of repeated observations , and the outfill regime in which with . Figure 4 illustrates each regime under the waiting time model.

Let be the waiting time between the th and th failure. The sampling mechanism is given by

which gives rise to the likelihood . Alternatively, if we ignore the timing of events, the observed number of events can be characterized by a binomial model , which yields . Maximizing and lead to two estimates, and of . It is easy to verify that , so and are identical, and the timing of events does not contain more information about than the total number of events.

The asymptotic behavior of follows from the discussion in Section 4.1: when is known, is consistent under finite-population and infill regimes. Under the outfill regime, it is unbiased and asymptotically normal with variance .
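
The equivalence of the two likelihoods can be checked with a short calculation. The following is a sketch under stated assumptions, in notation chosen here for illustration: failure times are i.i.d. with a known distribution function $F$ and density $f$, the observation window is $[0, \tau]$, and $n$ failures are observed at ordered times $t_{(1)} < \dots < t_{(n)} \le \tau$. The timing-based likelihood $L_1$ and the count-based binomial likelihood $L_2$ then differ only by a factor that does not involve $N$, so they share the same maximizer.

```latex
\begin{align*}
L_1(N) &= \frac{N!}{(N-n)!}\,\prod_{i=1}^{n} f\big(t_{(i)}\big)\,\big[1 - F(\tau)\big]^{N-n}, \\
L_2(N) &= \binom{N}{n}\, F(\tau)^{n}\,\big[1 - F(\tau)\big]^{N-n}, \\
\frac{L_1(N)}{L_2(N)} &= \frac{n!\,\prod_{i=1}^{n} f\big(t_{(i)}\big)}{F(\tau)^{n}}
\qquad \text{(free of } N\text{)} .
\end{align*}
```
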

### 4.4 The multiplier estimator

The multiplier method, also called the method of benchmark multiplier (MBM), can be used to estimate the size of a hidden population if the number of hidden units with a certain trait, and an estimate of the overall prevalence of that trait in the hidden population, are available. Often the prevalence of the trait is estimated through expert opinion, historical data, or from a separate sample [98, 99, 23].

Let be a hidden set of units of size . To each unit in we associate a binary trait . The first sample is , and the *benchmark* is , which follows Binomial. If the trait prevalence is known, the results in Section 4.1 apply. Alternatively, suppose is estimated from another random sample, , which is independent of . We assume is a uniformly random draw from with deterministic size , among which has a positive trait. An observation on () consists of the benchmark and . Then the proportion gives the *multiplier*, which is an estimate of .

follows a hypergeometric distribution, and the mechanism of generating the observations can be defined as

. An MME for is , often called the multiplier estimator of . When more than one sample pair is drawn, note that, unlike in the binomial setting, the binary traits (like HIV status or death) of units do not change. Therefore, no new realizations of are generated, and is always fixed under the infill regime. We consider the finite-population regime in which . The infill regime is that are fixed and , where is the number of sample pairs, . The outfill regime is that with , , with only one draw of .

Since follows hypergeometric distribution, , and is positively biased by Jensen’s inequality. The multiplier estimator has essentially the same properties as the Lincoln-Petersen capture-recapture estimator in Section 5.1.1, where detailed discussion will be provided. We have the following asymptotic results:

###### Theorem 4.2.

is consistent under the finite-population regime. Under infill asymptotics, is inconsistent with MSE . Under the outfill regime, when , and , is inconsistent with MSE at least for some .
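
A minimal sketch of the multiplier estimator, written as benchmark divided by the sample proportion with the trait. The argument names are hypothetical. The helper `conditional_mean` treats the number of trait carriers in the second sample as hypergeometric, as described above, and computes the exact expectation of the estimator given that at least one carrier is sampled (the estimator is undefined otherwise). It is meant to illustrate the positive bias, not to reproduce the theorem.

```python
from math import comb


def multiplier_estimate(benchmark, sample_size, sample_with_trait):
    """Benchmark count divided by the sample proportion with the trait."""
    if sample_with_trait == 0:
        raise ValueError("no sampled unit has the trait; estimate undefined")
    return benchmark * sample_size / sample_with_trait


def conditional_mean(N, benchmark, sample_size):
    """E[estimate | at least one carrier sampled], hypergeometric sampling."""
    total = comb(N, sample_size)
    num = den = 0.0
    for m in range(1, min(benchmark, sample_size) + 1):
        pm = comb(benchmark, m) * comb(N - benchmark, sample_size - m) / total
        num += pm * multiplier_estimate(benchmark, sample_size, m)
        den += pm
    return num / den


if __name__ == "__main__":
    print(multiplier_estimate(120, 200, 30))
    print(conditional_mean(100, 30, 20))  # true size in this toy setup is 100
```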

### 4.5 The network scale-up method

Estimating the size of a hidden network or graph is an important problem in sociology, epidemiology, computer science, and intelligence applications [5, 48, 52, 54, 55, 100, 101]. A subgraph of a larger graph may contain information about the size of the larger graph [55, 102, 103]. The network scale-up method (NSUM) [5] provides an estimate for the size of a hidden population by making use of network information from a sub-sample of individuals.

Consider a graph , where is a set of units and means that are connected. is called the total population, and a subset of size is the hidden population. The network of is , where . We call the general population. A sample from a subset of , along with network degrees of the sampled units within and outside of that subset provides information for learning about the size of . Two common scenarios are illustrated in Figure 5.

We now introduce exchangeable random graph models (EGM) [104] that both scenarios are based on, or related to. Suppose each vertex is associated with some random attribute which is i.i.d. for each . The probability that units and are connected is , where is a function from to . We then denote . EGM includes Erdős-Rényi [105] and stochastic block models [106] as special cases.

#### 4.5.1 Sampling from the general population

We consider sampling uniformly at random from the general population with a fixed sample size . The sampling mechanism is . We consider the distribution of

that is slightly more general than EGMs in that we require the joint distribution

to be i.i.d. for each combination of and , instead of assuming i.i.d. ’s. This is a generalization of the commonly assumed Erdős-Rényi distribution for NSUM methods. Let . We observe network degrees and for each . Then

By canceling out we have the following MME:

(4) |

In (4), follows hypergeometric distribution conditioning on for each . The same estimator can also be derived under a different model assumption. Killworth et al. [5] considered a model where is Binomial given , and (4) is then unbiased with variance .

We consider the finite-population regime in which , i.e. . The infill regime is defined such that are fixed and the number of repeated samples goes to infinity. The outfill regime is that such that , , and .

Sometimes an intermediate step in deriving is the estimation of personal network sizes . If unbiased estimates are plugged in, would have a positive bias. Let us assume for now that the ’s are observed true values.

###### Theorem 4.3.

has a positive bias . It is not necessarily consistent under the finite-population regime, and converges to a positively biased quantity under infill. It is asymptotically normal with bias and variance under the outfill regime.

#### 4.5.2 Sampling from the hidden population

Consider a random sample where . We observe the nodes , as well as network degrees and , for each individual . Let , then

canceling out yields the MME

which is often simplified to

(5) |

Chen et al. [107] investigated the behavior of with finite-sample as well as with large , but did not specify the relationship between and . In our setting, the finite-population regime is with fixed. The infill regime is that are fixed and the sampling procedure is infinitely repeated. The outfill asymptotic regime is that with .

###### Theorem 4.4.

Under the finite-population regime, converges to . Under infill asymptotics, is always positively biased conditioning on [107], and is hence inconsistent. Under outfill asymptotics, is asymptotically normal with bias and variance .

### 4.6 Estimating a total with unequal sampling probabilities

A generalization of binomial models allows for heterogeneity in the inclusion, or “success” probabilities , that is, when the sampling is not uniformly at random. Horvitz and Thompson [50] proposed unbiased estimators for population means and totals under the setting of sampling without replacement from a finite population, where the selection probabilities can be unequal. The Horvitz-Thompson (HT) estimator for the population total is , where is the probability that unit is sampled in . The estimator is unbiased for the total population size . This estimator and its variants have been applied to the estimation of animal abundance [108] and in other fields. We consider a deterministic sample size . Then the variance of is [50]

(6) |

where is the joint probability that units and are both in the sampled set , and [50]. The finite-population regime amounts to letting for any . Under the infill regime, are fixed and the number of repeated samples . Under the outfill regime, and both increase to infinity such that . Figure 6 shows the non-uniform sampling mechanism under each regime.
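
The HT estimator of the population size is the sum of inverse inclusion probabilities over the sampled units, which can be sketched as below. The function names and the dictionary-based design representation are illustrative assumptions. `design_expectation` takes a fully enumerated sampling design, derives each unit's inclusion probability, and computes the exact expectation of the estimator, which should equal the number of units whenever every unit has a positive inclusion probability.

```python
def ht_size(inclusion_probs):
    """Horvitz-Thompson size estimate: sum of 1 / pi_i over sampled units."""
    return sum(1.0 / pi for pi in inclusion_probs)


def ht_total(values, inclusion_probs):
    """Horvitz-Thompson estimate of a population total of the given values."""
    return sum(y / pi for y, pi in zip(values, inclusion_probs))


def design_expectation(design):
    """Exact E[ht_size] for a design given as {tuple_of_units: probability}."""
    pi = {}
    for sample, prob in design.items():
        for unit in sample:
            pi[unit] = pi.get(unit, 0.0) + prob
    expected = sum(
        prob * ht_size([pi[unit] for unit in sample])
        for sample, prob in design.items()
    )
    return expected, pi


if __name__ == "__main__":
    # Toy unequal-probability design on three units, two sampled at a time.
    toy = {("a", "b"): 0.5, ("a", "c"): 0.3, ("b", "c"): 0.2}
    print(design_expectation(toy))
```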

Specifically, we consider the following setting to illustrate the asymptotic behavior of the HT estimator. Suppose consists of clusters, where the th cluster has units. We assume that is known in advance, while is observed only if a unit from cluster is sampled. In each sample, a total of units are sampled from by the following procedure: first a cluster is drawn uniformly at random each with probability . Then one unit is drawn from the units in that cluster, also uniformly at random, without replacement. An observation on sample consists of the units in , their group membership, and the sizes of groups that they belong to.

We assume that . The marginal probability that unit in cluster is sampled is

and the joint probability that two units are sampled from clusters and () is

When there are repeated observations, we assume they follow the same design and are mutually independent. In this setting, the outfill regime is defined such that each cluster in the original population is replicated and appears times in . The cluster sizes are fixed at and the number of clusters increases as . is fixed and the estimand is . The sample size satisfies . We then have:

###### Theorem 4.5.

is consistent under the finite-population regime, and MSE consistent under infill asymptotics. is unbiased and asymptotically normal with variance under the outfill regime.

## 5 Other unordered sets

### 5.1 Capture-recapture experiments

Capture-recapture (CRC) refers to a broad class of methods to estimate the size of hidden populations for which random sampling is possible [35, 58, 57, 109, 110, 111]. Estimation of the population size is based on the overlap between two or more random samples [31, 32, 8, 15]. While a wide variety of CRC estimators have been developed [112, 113, 109, 110, 114], we focus here on the two- and -sample CRC estimators with homogeneity within a closed population.

#### 5.1.1 Two-sample estimation

We first consider the common case of two-sample CRC. Let be a hidden finite set of size , where each unit has binary attributes , which are all in the beginning. We draw a sample with size from , and set for all . Then a second sample with size is drawn, independent from and uniformly at random, and we set for all . We observe , and let . Similar to the MBM, follows a hypergeometric distribution conditioning on and . The MME, , is also known as the Lincoln-Petersen estimator [115, 116].

We consider the finite-population regime with . The infill regime is that are fixed and repeated sample pairs are drawn with . Note that in contrast to the MBM, the first sample can be generated differently for repeated sampling. The outfill regime is given by with for .
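
The two-sample estimators can be sketched in a few lines. `lincoln_petersen` is the MME described above, and `chapman` is written in the standard form of Chapman's less biased variant [56], which is assumed here and not quoted from the text. The argument names and the helper `expected_chapman`, which computes the exact expectation under hypergeometric overlap, are illustrative.

```python
from math import comb


def lincoln_petersen(n1, n2, m):
    """n1 * n2 / m, where m is the overlap between the two samples."""
    if m == 0:
        raise ValueError("no recaptures; estimate undefined")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """(n1 + 1)(n2 + 1) / (m + 1) - 1; defined even when m == 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


def expected_chapman(N, n1, n2):
    """Exact expectation of chapman when m is hypergeometric(N, n1, n2)."""
    total = comb(N, n2)
    lo, hi = max(0, n1 + n2 - N), min(n1, n2)
    return sum(
        comb(n1, m) * comb(N - n1, n2 - m) / total * chapman(n1, n2, m)
        for m in range(lo, hi + 1)
    )


if __name__ == "__main__":
    print(lincoln_petersen(100, 80, 20), chapman(100, 80, 20))
    print(expected_chapman(50, 30, 25), expected_chapman(200, 30, 25))
```

In this sketch the expectation equals the true size when the two sample sizes add up to at least the population size, and falls slightly below it otherwise.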

Previous results exist on the bounds or estimates of biases and variances. These were implicitly based on asymptotic approximations: Chapman [56] showed a lower bound for the bias

under outfill, and bounded the variance as

under an asymptotic approximation that is satisfied by the outfill regime. Though these no longer hold in the finite-sample setting, it has been demonstrated through simulation that has a considerable bias under a range of settings. A less biased estimator

(7) |

was proposed [56], with bias

(8) |

for any , and variance

(9) |

under outfill [56], where means the difference between two quantities decays to 0. We have the following asymptotic result:

###### Theorem 5.1.

Under the finite-population regime, and are consistent. Under infill asymptotics, is positively biased and has MSE for at least a range of values of . is negatively biased, but the bias is within 1 if and [56]. Under the outfill regime, has bias at least and variance at least . is asymptotically unbiased with variance . Furthermore,
