# BIG sampling

Graph sampling is a statistical approach to study real graphs, which represent the structure of many technological, social or biological phenomena of interest. We develop bipartite incidence graph sampling (BIGS) as a feasible representation of graph sampling from arbitrary finite graphs. It also provides a unified treatment of the existing unconventional sampling methods which were studied separately in the past, including indirect, network and adaptive cluster sampling. The sufficient and necessary conditions of feasible BIGS representation are established, given which one can apply a family of Hansen-Hurwitz type design-unbiased estimators in addition to the standard Horvitz-Thompson estimator. The approach therefore increases the potential for efficiency gains in graph sampling. A general result regarding the relative efficiency of the two types of estimators is obtained. Numerical examples are given to illustrate the versatility of the proposed approach.

## 1 Introduction

Graph sampling provides a statistical approach to study real graphs, which represent the structure of many technological, social or biological phenomena of interest. It is based on exploring the variation over all possible subsets of nodes and edges, i.e. sample graphs, which can be taken from the given population graph, according to a specified method of sampling. Zhang and Patone (2017) synthesise the existing graph sampling theory, extending the previous works on this topic by Frank (1971, 1980a, 1980b, 2011). A general definition is given for probability sample graphs, in a manner that is similar to general probability samples from a finite population (Neyman, 1934); and the unbiased Horvitz-Thompson (HT) estimator is developed for arbitrary $T$-stage snowball sampling from finite graphs, as in finite population sampling (Horvitz and Thompson, 1952). To this end the observation procedure of graph sampling must be ancestral (Zhang and Patone, 2017), in that one needs to know which other out-of-sample nodes could have led to the observed motifs in the sample graph, had they been selected in the initial sample of nodes. Under $T$-stage snowball sampling, additional stages of sampling are generally needed in order to identify the ancestors of all the motifs observed by the $T$-th stage.

The ancestral observation procedure is a generalisation of the notion of multiplicity in indirect sampling (Birnbaum and Sirken, 1965). As an example, patients can be selected via a sample of hospitals. Insofar as each patient may receive treatment from more than one hospital, the patients are not nested in the hospitals as elements are in clusters under cluster sampling (Cochran, 1977). Therefore, to compute the inclusion probability of a sample patient, one needs to identify all the relevant hospitals including those outside the sample, which constitutes the information on “multiplicity” of sources that must be collected in addition to the sample of hospitals and patients. The same requirement exists for the other unconventional sampling methods, such as network sampling (Sirken, 2005) or adaptive cluster sampling (Thompson, 1990).

The information on multiplicity can be made apparent for the sampling methods above, when they are presented as sampling from a special type of graph which we shall refer to as a bipartite incidence graph (BIG). The nodes in a BIG are divided into two parts, which represent the sampling units and motifs of interest, respectively. An edge only exists from one node to another if the selection of the former (representing a sampling unit) leads to the observation of the latter (representing a motif), i.e. there are no edges among the nodes representing the sampling units nor among those of the motifs. In the example of indirect sampling above, hospitals are the sampling units and patients the motifs, and an edge exists only between a hospital and a patient that receives treatment there. The information on the multiplicity of a patient is then simply the knowledge of the nodes that are adjacent to the node representing this patient in the BIG.

In this paper we establish the necessary and sufficient conditions for representing any graph sampling from a given population graph as BIG sampling (BIGS), and apply it to arbitrary $T$-stage snowball sampling, as well as all the above unconventional sampling methods. Two major advantages provide the motivation.

First, generally speaking, not all the observed motifs in the sample graph can be used for estimation, but only those associated with the knowledge of their ancestors in accordance with the specified sampling method. The matter is similar in adaptive cluster sampling, where Thompson (1990) proposes to use certain units in estimation only if they are observed in a specific way but not otherwise. Under arbitrary graph sampling, the sample motifs eligible for estimation are those whose multiplicities can be identified in the associated BIG. We shall derive appropriate results to substantiate this insight, and apply them to the motifs that are observed by a given stage of snowball sampling, thereby removing the need for additional sampling for the ineligible motifs. Indeed, as we will demonstrate, applying the same idea to adaptive cluster sampling would yield other unbiased estimators beyond those considered by Thompson (1990).

Second, in addition to the HT estimator, Birnbaum and Sirken (1965) propose an unconventional Hansen-Hurwitz (HH) type estimator (Hansen and Hurwitz, 1943), which is based on the sampled hospitals and a constructed measure for each of them, derived from the related patients. This estimator is unbiased over repeated indirect sampling and easy to compute, including the second-order inclusion probabilities that are necessary for variance estimation, which are given directly by the sampling design of hospitals. By contrast, to apply the HT estimator, one must first derive all the first and second-order inclusion probabilities of the indirect sampling design of patients from that of the hospitals. The HH-type estimator has been used in many works on network sampling, as summarised by Sirken (2005); it was recast as a generalised “weight share” method for indirect sampling (Lavallée, 2007); a modified version was proposed by Thompson (1990) for adaptive cluster sampling. It has been observed that either the HH-type or HT estimator may be more efficient than the other in different applications (e.g. Thompson, 2012). Adopting the BIGS representation, we shall identify for the first time the general condition that governs the relative efficiency between them.

Thus, capitalising on both advantages, the BIGS representation of graph sampling provides a unified approach to a large number of situations, considerably extending the choices of applicable unbiased estimators. The availability of the various feasible BIGS strategies offers the potential for efficiency gains in practice.

In the rest of the paper, graph sampling and BIG sampling are described in Section 2, and the sufficient and necessary conditions for a feasible BIGS representation of graph sampling are established. In Section 3, formal BIGS representations are described for the aforementioned unconventional sampling methods (Birnbaum and Sirken, 1965; Lavallée, 2007; Sirken, 1970, 2005; Thompson, 1990). Some new unbiased estimators for adaptive cluster sampling by its feasible BIGS representations are explained and illustrated. In Section 4, we develop the BIGS representation of general $T$-stage snowball sampling, including the relevant results for identifying the sample motifs eligible for estimation. In Section 5, the general condition governing the relative efficiency of the HT and HH-type estimators under BIG sampling is presented. In Section 6, some numerical results are provided for two-stage adaptive cluster sampling by revisiting the example considered by Thompson (1991), and for an example of $T$-stage snowball sampling from an arbitrary population graph. Finally, some concluding remarks are given in Section 7.

## 2 Graph, BIG sampling

### 2.1 Graph sampling

Let $G=(U,A)$ be the population graph, with node set $U$ and edge set $A$. For simplicity of exposition we focus on simple graphs in this paper, such that there can be at most one edge between a pair of nodes , where . Let if edge and 0 otherwise. By definition if the graph is directed, but if the graph is undirected. The theory developed below can be easily adapted to multigraphs, where there can be more than one edge between any pair of nodes.

The measurement units of interest are called the motifs in . Denote by the set of all motifs in . For any , let be the nodes involved in the motif , of order . The motif of these nodes is denoted by , such that for any , we have , but , nor is it necessary that . For example, let be an undirected graph. Let consist of all the triangles in , such that , we have , and , by which the motif can be defined. As another example, let the motif of interest be the connected components in an undirected graph . Then, and , we have either , or there must exist a sequence of nodes, denoted by , such that , by which the motif can be defined. Whereas we need not have , for any .

Zhang and Patone (2017) give the following general definition of sample graphs from . Let be an initial sample of nodes taken from the sampling frame , where , according to the sampling distribution , and and for any . Given , graph sampling proceeds according to a specified observation procedure (OP), for edges that are incident to the nodes in . The observed edges, denoted by for , are specified using a reference set , where , such that any existing edge in is observed if . That is, specifies the parts of the adjacency matrix that are observed under the given OP. Denote by the nodes that are incident to the edge . Let be the set of nodes incident to the edges . The sample graph is given by

$$G_s = (U_s, A_s) \quad\text{and}\quad U_s = s_0 \cup \mathrm{Inc}(A_s).$$

The motifs that are observed in the sample graph can now be given as follows: , we have , iff . In particular, notice that does not imply in general, but must imply .

### 2.2 BIG sampling

Graph sampling can be given a BIGS representation, provided the following. Let

$$B = (F \cup \Omega; H)$$

be the BIG associated with the population graph and the motif set , where the edges exist only between and but not between any or , and an edge exists from any to iff whenever , so that graph sampling from can be represented as sampling from by incident OP (Zhang and Patone, 2017) given . Hence the term “bipartite incidence graph”.

For the aforementioned example of indirect sampling, we can simply let $F$ be the set of hospitals, and let $\Omega$ be the set of patients. We have , or , iff the patient receives treatment at hospital . We have , for a patient receiving treatment from multiple hospitals. Let denote the predecessors of in , where for any and if . Clearly, we would observe , denoted by , if any of the hospitals in is selected in the initial sample . Indirect sampling can thus be represented as BIG sampling from .

More generally, for any graph sampling from given , let for any and , iff whenever , or , according to the graph sampling design, which consists of and the OP given . For any , let

$$\alpha_i = \{k : k \in \Omega,\ \delta_{i,k} = 1\},$$

which contains all the successors of $i$ in $B$; for any $k \in \Omega$, let

$$\beta_k = \{i : i \in F,\ \delta_{i,k} = 1\},$$

which contains all the predecessors of in . In other words, or in , iff for and . The sample BIG is given by

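$$B_s = (s_0, \Omega_s; H_s) \quad\text{and}\quad \Omega_s = \alpha(s_0) = \bigcup_{i \in s_0} \alpha_i \quad\text{and}\quad H_s = H \cap (s_0 \times \Omega_s). \tag{1}$$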

Finally, to ensure ancestral OP in BIG, we must also observe even though it is not part of , where . Below we summarise in Theorem 1 the sufficient and necessary conditions, by which one can determine whether such BIGS representation of graph sampling from is feasible or not.
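As a minimal sketch of the definitions above, the following Python snippet builds a small BIG for the hospital and patient example, and derives the successors $\alpha_i$, the predecessors $\beta_k$ and the sample BIG in (1), together with the out-of-sample ancestors that ancestral observation requires. The hospital and patient labels, the edge list and the function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical BIG: an edge (i, k) in H runs from hospital i in F to patient k in Omega.
F = {"h1", "h2", "h3"}
Omega = {"p1", "p2", "p3", "p4"}
H = {("h1", "p1"), ("h1", "p2"), ("h2", "p2"),
     ("h3", "p3"), ("h3", "p4"), ("h2", "p4")}

def successors(H):
    """alpha_i: the motifs observed whenever sampling unit i is selected."""
    alpha = defaultdict(set)
    for i, k in H:
        alpha[i].add(k)
    return alpha

def predecessors(H):
    """beta_k: the sampling units whose selection leads to motif k."""
    beta = defaultdict(set)
    for i, k in H:
        beta[k].add(i)
    return beta

def sample_big(s0, H):
    """Sample motifs and edges as in (1), plus the out-of-sample ancestors."""
    alpha, beta = successors(H), predecessors(H)
    omega_s = set().union(*(alpha[i] for i in s0))
    H_s = {(i, k) for (i, k) in H if i in s0 and k in omega_s}
    ancestors = set().union(*(beta[k] for k in omega_s))
    return omega_s, H_s, ancestors - set(s0)
```

With the initial sample `{"h1"}`, the sketch returns patients `p1` and `p2`, and reports `h2` as an out-of-sample ancestor, since `p2` is also treated there. This is the multiplicity information that has to be collected in addition to the sample itself.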

###### Theorem 1.

Graph sampling from with associated motifs of interest, based on and the given OP, can be represented by ancestral BIG sampling from , iff

• (i) and , or 0 in can be determined given alone;

• (ii) , we have in , or equivalently in ;

• (iii) graph sampling OP in ensures the observation of in .

###### Proof.

Given (i), we can define the edge set of . Given (ii), BIG sampling covers all the motifs in , since is then positive for any . Given (iii), it is possible to calculate the inclusion probability of , based on for . Thus, conditions (i) - (iii) are sufficient. They are also necessary, because removing any of them would render the BIGS representation infeasible.∎

Let us illustrate the application of Theorem 1 with two examples. First, consider induced OP given , where we have , such that an edge is observed in only if both and are in . For example, let form a triangle (motif of interest) in , it is observed under induced OP only if all the three nodes are in , but not otherwise. Graph sampling from is a probability sampling design, as long as the third-order inclusion probability is positive under the given , but one would not be able to represent it by BIG sampling since condition (i) is violated, as cannot be determined given alone, for any 1, 2 or 3.

Second, let be a connected component (motif) in , where but otherwise for , . Let the initial sample size be .

• As in the previous example, BIGS representation is infeasible for induced OP in .

• Suppose incident reciprocal OP, where . We would observe all the three nodes of given . But condition (i) is still not satisfied, because we do not have .

• Suppose 2-stage snowball sampling with incident reciprocal OP, where . Now we would observe given , because the second stage snowball observation from obtained in the first stage would confirm that there are no other adjacent nodes to . Thus, in , and conditions (i) and (ii) are satisfied. Condition (iii) is also satisfied for this motif, since it is confirmed that no other nodes in could lead to it, i.e. in . BIGS representation is feasible for this motif.

## 3 Indirect, network, adaptive cluster sampling

Below we describe formally BIGS representation as a unified approach to indirect sampling, network sampling and adaptive cluster sampling.

### 3.1 Indirect sampling

Generally for indirect sampling, let be the sampling frame, and the set of measurement units of interest, which are accessible via the sampling units in . For instance, can be the hospitals and the patients treated by the hospitals in , as in Birnbaum and Sirken (1965). Or, can be all the parents and the children to the people in , as in Lavallée (2007). For any and , we have or iff can be reached given , denoted by . This completes the definition of population graph . The knowledge of multiplicity that is collected under indirect sampling ensures then ancestral BIG sampling from , where the sample BIG is given by (1), with the associated out-of-sample ancestors in .

The probability of inclusion in can be derived from the initial sampling distribution , for . The (first-order) inclusion probability of is given by

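$$\pi_{(k)} = 1 - \bar{\pi}_{\beta_k} = 1 - \Pr\Big(\bigcap_{i \in \beta_k} \{i \notin s_0\}\Big), \tag{2}$$

where $\bar{\pi}_{\beta_k}$ is the exclusion probability of $\beta_k$ in $s_0$, i.e. the probability that none of the ancestors of $k$ in $B$ is included in the initial sample $s_0$. Notice that the knowledge of the out-of-sample ancestors is required to compute $\pi_{(k)}$. Similarly, the second-order inclusion probability of motifs $k$ and $l$ is given by

$$\pi_{(kl)} = 1 - \big(\bar{\pi}_{\beta_k} + \bar{\pi}_{\beta_l} - \bar{\pi}_{\beta_k \cup \beta_l}\big). \tag{3}$$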
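To make (2) and (3) concrete, the sketch below evaluates them under one particular initial design, assumed here purely for illustration: simple random sampling without replacement of $n$ units from a frame of $N$ units. Under that assumption the exclusion probability of a set $\beta$ is $\binom{N-|\beta|}{n} / \binom{N}{n}$. The function names are hypothetical.

```python
from math import comb

def exclusion_prob(beta, N, n):
    """Probability that no unit of beta is in an SRS (without replacement) of size n.

    Assumes beta is a subset of the frame, so that len(beta) <= N.
    """
    return comb(N - len(beta), n) / comb(N, n)

def pi_first(beta_k, N, n):
    """First-order inclusion probability of motif k, as in (2)."""
    return 1.0 - exclusion_prob(beta_k, N, n)

def pi_second(beta_k, beta_l, N, n):
    """Second-order inclusion probability of motifs k and l, as in (3)."""
    union = set(beta_k) | set(beta_l)
    return 1.0 - (exclusion_prob(beta_k, N, n)
                  + exclusion_prob(beta_l, N, n)
                  - exclusion_prob(union, N, n))
```

For a different initial design only `exclusion_prob` would need to change, since (2) and (3) depend on the design solely through the exclusion probabilities.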

### 3.2 Network sampling

Sampling of siblings via an initial sample of households provides an example of network sampling (Sirken, 2005). Since the siblings may belong to different households, some of which are outside of the initial sample, the “network” relationship among the siblings is needed. Network sampling as such can be viewed as a form of indirect sampling, since the sampling unit (household) is not the unit of measurement (siblings), and the latter cannot be sampled directly. Notice that the term “network” has a specific meaning here, unlike when network refers to a whole valued graph (Frank, 1980a, b), e.g. an electricity network, where the nodes and edges have associated values that are of interest.

Let denote the sampling frame, which is the list of households from which the initial sample can be selected. Provided the OP under network sampling is exhaustive, in the sense that all the siblings are observed, if at least one of them belongs to a household in , one can treat each network of siblings as a motif of interest, such that consists of all the networks of siblings. For any and , let iff at least one of the siblings in belongs to household . This yields the population graph . Network sampling with observation of multiplicity is then equivalent to ancestral BIG sampling in , where , with the associated out-of-sample ancestors , such that the inclusion probabilities of the motifs can be calculated by (2) and (3).

### 3.3 Adaptive cluster sampling (ACS)

As a standard example of ACS (Thompson, 1990), let consist of a set of spatial grids over a given area. Let be the amount of a species, which can be found in the -th grid. Given , one would survey all its neighbour grids (in four directions) if exceeds a threshold value but not otherwise. The OP is repeated for all the neighbour grids, which may or may not generate further grids to be surveyed. The process is terminated, when the last observed grids are all below the threshold. The interest is to estimate the total amount of species (or mean per grid) over the given area.
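The adaptive observation procedure described above can be sketched as a simple search. The snippet below is an illustration under a simplifying assumption: the grids are laid out in a single row, so each grid has only a left and a right neighbour rather than neighbours in four directions. The function name and arguments are hypothetical.

```python
def acs_observe(y, s0, threshold):
    """Grids observed under the adaptive OP, for grids laid out in a row.

    y: list with the value of each grid; s0: iterable of initial grid indices.
    Simplifying assumption: the neighbours of grid i are grids i - 1 and i + 1.
    """
    observed = set(s0)
    frontier = list(s0)
    while frontier:
        i = frontier.pop()
        if y[i] <= threshold:
            continue  # a grid that does not exceed the threshold triggers no survey
        for j in (i - 1, i + 1):
            if 0 <= j < len(y) and j not in observed:
                observed.add(j)
                frontier.append(j)
    return observed
```

The loop ends once every newly observed grid is at or below the threshold, which matches the termination rule of the procedure.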

One can consider each cluster of contiguous grids, where the associated ’s all exceed the threshold value, as a network. Let a grid with below the threshold value form a singleton network consisting only of itself. The OP is network exhaustive, since all the grids in a network are observed if at least one of them is selected in . A singleton network is an edge grid, if it is contiguous to a non-singleton network. Observing a non-singleton network will lead one to observe all its edge grids, but not the other way around, due to the adaptive nature of the OP. When an edge grid is selected in , but none of the grids in its non-singleton neighbour network (NNN), the inclusion probability of this edge grid cannot be calculated correctly based on the observed sample.

Below we explain how BIGS can be used to represent the approach proposed by Thompson (1990), and how other feasible BIGS representations of ACS can exist. The alternative strategies will be illustrated using the example of Thompson (1990).

#### 3.3.1 Alternative strategies of feasible BIGS representation

One can represent ACS as BIG sampling from , where the grids are both the sampling units of and the motifs of . Let if is observed under ACS whenever , for and . However, the OP of ACS is not ancestral when an edge grid is selected in , but none of the grids in its NNN, in which case one would not observe its NNN that are its ancestors in this and its inclusion probability cannot be calculated. Thompson (1990) proposes to make an edge grid eligible for estimation only if it is selected in directly, the probability of which is known, but not when it is observed via its NNN. Denote this strategy by , with the modified HT estimator .

Another strategy is to adopt a feasible BIGS representation, by which the OP of ACS is ancestral and one can use the unmodified HT estimator. Two examples are given below.

• Strategy with incidence edges $H^*$: An edge grid is ineligible if it is only observed via its NNN but itself is not selected in . That is, set in , where grid belongs to the NNN of , such that is eligible for estimation only when it is selected in directly.

• Strategy with incidence edges $H^\dagger$: An edge grid is ineligible if itself is selected in but not its NNN. That is, set for an edge grid but keep if grid belongs to the NNN of , such that is eligible for estimation only if it is observed via its NNN.

It should be noted that while strategy is feasible in the example of Thompson (1990) considered below, it would generally be infeasible if there exists some edge grid that is contiguous to more than one NNN, since not all of them will necessarily be included in . Moreover, strategy is likely to be more efficient than , because the inclusion probability of an edge grid tends to be lower under the former, and an edge grid by definition has an associated -value below the threshold. As a matter of fact one obtains the same estimate under either the strategy or . However, is unmodified under , so that it is unchanged by the Rao-Blackwell method; whereas, being a modified HT estimator, under generally differs from its Rao-Blackwellised version, denoted by .

#### 3.3.2 Example of Thompson (1990)

The population consists of 5 grids, with $y$-values $1, 0, 2, 10, 1000$. Each grid has either one or two neighbours which are adjacent in the given sequence, as when they are 5 grids beside one another along the west-east axis. The threshold value is 5, such that only the two grids with values 10 and 1000 will lead on to their neighbours. The initial sample of size 2 is by simple random sampling (SRS) without replacement. Let consist of the same 5 grids. Under strategy , has the following incidence edges:

$$H^* = \{(1,1), (0,0), (2,2), (10,10), (10,1000), (1000,10), (1000,1000)\},$$

where as in Thompson (1990) we simply denote each grid by its $y$-value. The two grids 10 and 1000 form an NNN to the edge grid 2. The incidence edges $(10,2)$ and $(1000,2)$, which are in of the strategy , are removed to ensure ancestral OP and feasible BIGS representation. For instance, given , the OP of ACS is not ancestral in since the grids and will not be observed, but it is ancestral in , since the grid has only itself as the ancestor in . Under , we have

$$H^\dagger = \{(1,1), (0,0), (10,2), (10,10), (10,1000), (1000,2), (1000,10), (1000,1000)\},$$

where the grid 2 is only eligible for estimation when observed via its NNN $\{10, 1000\}$, but not when itself is selected in $s_0$, now that $(2,2) \notin H^\dagger$. The OP of ACS is ancestral in , since both 10 and 1000 are observed whenever 2 is observed.

Table 1 lists the details of the three strategies by BIGS representation of ACS in this case. The respective observed sample is given in addition to the initial sample . The strategy is proposed in Thompson (1990), where is given in italic in the 5 samples where it is observed but unused for estimation. The probability that it is eligible is , which is the same as its sample inclusion probability under .

Apart from the ’s in italics, the observed sample is always the same under both the strategies and . Hence, the estimate is the same by both. Nevertheless, as explained before, the two differ regarding the Rao-Blackwell method. In this case, the difference hinges on the last sample . Under , the same sample (including ) is also observed from or , but the estimate differs because is unused when . The Rao-Blackwell method yields given . In contrast, under the strategy the estimate is unchanged by Rao-Blackwellisation, because the observed sample from differs to that from or .

Under the strategy , the grid is not included in given or , yielding different to that under . Otherwise, the inclusion probability of is raised to , i.e. the same as or , which is not a good choice because is much smaller than 10 or 1000. The variance of is larger than under the strategy , although the relative efficiency 0.993 is not of a great concern here.
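The comparison between the two feasible BIGS representations can be checked by enumerating all 10 initial samples. The sketch below assumes the grid values 1, 0, 2, 10, 1000 indicated by the edge sets $H^*$ and $H^\dagger$, SRS without replacement of size 2, and the HT estimator of the population total with inclusion probabilities from (2). The helper names are hypothetical, and the grids are labelled by their values, which works here only because the values are distinct.

```python
from itertools import combinations
from math import comb

y = [1, 0, 2, 10, 1000]          # each grid is denoted by its value
N, n = len(y), 2

H_star = {(1, 1), (0, 0), (2, 2), (10, 10), (10, 1000), (1000, 10), (1000, 1000)}
H_dagger = {(1, 1), (0, 0), (10, 2), (10, 10), (10, 1000),
            (1000, 2), (1000, 10), (1000, 1000)}

def ht_estimates(H):
    """HT estimate of the population total for every initial sample s0."""
    beta = {k: {i for (i, kk) in H if kk == k} for k in y}
    # Inclusion probabilities by (2) under SRS without replacement.
    pi = {k: 1 - comb(N - len(beta[k]), n) / comb(N, n) for k in y}
    # A motif k is in the sample BIG iff one of its ancestors is in s0.
    return [sum(k / pi[k] for k in y if beta[k] & set(s0))
            for s0 in combinations(y, n)]

def mean_and_variance(estimates):
    """Design expectation and variance; all samples are equally likely under SRS."""
    m = sum(estimates) / len(estimates)
    return m, sum((t - m) ** 2 for t in estimates) / len(estimates)

mean_star, var_star = mean_and_variance(ht_estimates(H_star))
mean_dagger, var_dagger = mean_and_variance(ht_estimates(H_dagger))
relative_efficiency = var_star / var_dagger
```

Under these assumptions both estimators average to the population total 1013 over the 10 samples, and the variance ratio is about 0.993, in line with the relative efficiency quoted above. Working with the total rather than the mean does not affect this ratio.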

## 4 T-stage snowball sampling (T-SBS)

Goodman (1961) considers snowball sampling (SBS) on a special directed graph, where each node has one and only one out-edge. Frank (1977) and Frank and Snijders (1994) consider one-stage SBS from arbitrary population graphs. Zhang and Patone (2017) derive the HT estimator for general $T$-stage snowball sampling ($T$-SBS). However, additional stages of sampling are generally needed in order to identify the ancestors of all the motifs observed under $T$-SBS. The matter can be illustrated using ACS as follows.

Let be an undirected simple graph, where consists of all the grids over a given area, and iff grids and are neighbours and they both have values above the threshold. Each grid with -value below the threshold is an isolated node in . Let , yielding an initial sample of seeds according to , where . Propagation of the sample is only possible from those nodes that are not isolated in ; an isolated node with value above the threshold can only be observed if it is selected in . Let be the sample of nodes observed after the first stage, where is the first-wave snowball sample, which are the seeds for the second stage snowball sample, and so on. Denote by the observed sample of nodes after $T$ stages. Under ACS, one would eventually observe all the networks of grids, treated as the motifs of interest, which have at least one node in . However, under $T$-SBS the sampling is terminated after $T$ stages, by which time may have only covered a part of a network.
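The wave-by-wave propagation described above can be sketched as follows. The snippet assumes an undirected graph stored as a dictionary of neighbour sets, and that each stage observes every edge incident to the current seeds. The function and argument names are hypothetical.

```python
def snowball_sample(adj, s0, T):
    """Nodes observed after T stages of snowball sampling from the seeds s0.

    adj: dict mapping each node to the set of its neighbours (undirected graph).
    Assumes each stage observes every edge incident to the current seeds.
    """
    observed = set(s0)
    seeds = set(s0)
    for _ in range(T):
        # The next wave consists of the neighbours of the seeds not seen before.
        wave = set().union(*(adj[i] for i in seeds)) - observed
        if not wave:
            break  # no new nodes, so further stages add nothing
        observed |= wave
        seeds = wave
    return observed
```

On a path of four nodes, starting from one end, the observed sample covers only part of the path after one or two stages, and the whole path only once the number of stages is large enough. This is the partial coverage of a network under a fixed number of stages.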

Similarly, for any population graph , a motif in may be unobserved under $T$-SBS, even though it is observable under SBS with an infinite number of stages. Moreover, not all the observed motifs after $T$ stages are eligible for estimation, and additional stages of sampling may be required in order to observe all the ancestors that could have led to an observed motif by $T$-SBS. However, more motifs of interest may be observed during the additional sampling, which again may or may not be eligible for estimation. Below we develop the BIGS representation of $T$-SBS from an arbitrarily given population graph, by which this conundrum of ancestral observation can be resolved.

### 4.1 Observation distance to motif $k$ from within $M_k$

For any and , let be the length of the geodesic from to in , which is the shortest path from to in . Since the shortest path from to varies with the OP, let us assume incident reciprocal observation for simplicity of exposition here. For example, let , where and for and otherwise. We have and . Or, for the same , let in addition to , in which case we have and . Starting from , the number of stages required to observe all the nodes in by SBS is .

Next, for any and , let be the SBS observation distance from to , which is the minimum number of stages required to observe under SBS from , when starting from . For the above two examples of , if only , then and ; whereas with , we have . Generally, we have , since we must have if , where is the reference set of -stage SBS from . A more detailed result for connected can be given as follows.

###### Lemma 1.

and , if the nodes are connected in , then

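$$d_{i,k} = \begin{cases} \max_{j \in M_k} \nu_{ij} & \text{if } \big|\arg\max_{j \in M_k} \nu_{ij}\big| = 1 \\ 1 + \max_{j \in M_k} \nu_{ij} & \text{otherwise,} \end{cases}$$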

or if there exists a single node other than which is unconnected to in , then

$$d_{i,k} = 1 + \max_{j \in M_{k;i}} \nu_{ij}$$

where $M_{k;i}$ consists of the nodes in $M_k$ that are connected to $i$ in .

###### Proof.

Starting from , it is impossible under $T$-SBS to observe whether there are edges or not among the nodes that are unconnected to , if there are two or more of them. This leaves one with the two possibilities listed above. Let the nodes of be connected. If there is only one node (denoted by ) which requires the maximum number of steps from , then all the other nodes are observed before , which allows one to observe any edge between them and by the last step; whereas if there is more than one node like , then an additional step is needed to observe the edges among them. Similarly, such an additional step is needed when there is a single node that is unconnected to . ∎
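The first case of Lemma 1 can be illustrated with a breadth-first search. The sketch below covers only the connected case, with the starting node inside the motif and the motif containing at least two nodes; it takes the geodesic lengths $\nu_{ij}$ in the population graph, stored as a dictionary of neighbour sets, and assumes incident reciprocal observation. The function names are hypothetical, and the second case of the lemma (a single unconnected node) is not handled.

```python
from collections import deque

def geodesic_lengths(adj, i):
    """Breadth-first search: geodesic length from i to every node reachable from i."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def observation_distance(adj, M_k, i):
    """d_{i,k} by the first case of Lemma 1, for i in M_k and M_k connected."""
    nu = geodesic_lengths(adj, i)
    lengths = [nu[j] for j in M_k if j != i]
    longest = max(lengths)
    # One extra stage is needed when several nodes are tied at the maximum length.
    return longest if lengths.count(longest) == 1 else longest + 1
```

For a path on three nodes, the distance is 2 from an end node (a unique farthest node) and also 2 from the middle node (two farthest nodes at length 1, plus one stage to check for an edge between them).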

###### Corollary 1.

If there exists , where there are at least two nodes, , such that in , then BIGS representation of $T$-SBS from is infeasible.

###### Proof.

There is no edge in for such a motif , from any . Starting from any , one can reach at most one of the connected components of in . Hence, there are no edges from to either, and condition (ii) of Theorem 1 is violated. ∎

### 4.2 Observation distance to motif $k$ from outside $M_k$

To clarify the observation distance to motif from outside of , we introduce graph transformed from via a hypernode consisting of the nodes in . Let , where and , on replacing the nodes with the hypernode . Partition the edge set of as

$$A = \big(A \cap (U_h \times U_h)\big) \cup \big(A \cap (U_h^c \times U_h^c)\big) \cup \big(A \cap (U_h \times U_h^c \cup U_h^c \times U_h)\big).$$

Remove all the edges in , which are among the nodes themselves. Keep in all the edges in , which are not incident to nodes in . Regarding : for each , replace all the edges by a single edge in , and replace all the edges by a single edge in . This yields the transformed graph via hypernode .
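The three-part treatment of the edge set can be sketched as a graph contraction. The snippet below assumes a directed graph given as a set of ordered pairs, and uses a hypothetical label `"h"` for the hypernode, which must not clash with any existing node label. The function name is likewise hypothetical.

```python
def contract_to_hypernode(nodes, edges, M_k, h="h"):
    """Transform a directed graph by replacing the nodes in M_k with one hypernode h.

    edges: set of ordered pairs (i, j). Edges among the nodes of M_k are removed,
    edges not incident to M_k are kept, and all edges between an outside node
    and M_k in a given direction collapse to a single edge to or from h.
    """
    M_k = set(M_k)
    new_nodes = (set(nodes) - M_k) | {h}
    new_edges = set()
    for i, j in edges:
        if i in M_k and j in M_k:
            continue  # edge within the motif: removed
        new_edges.add((h if i in M_k else i, h if j in M_k else j))
    return new_nodes, new_edges
```

Storing the new edges in a set is what merges parallel edges: two edges from the same outside node into different nodes of `M_k` both map to the same pair ending in `h`.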

For any and , we have if for all , in which case motif cannot be reached from . Otherwise, the nodes in can be partitioned according to for each . Let contain the nodes in with geodesic length to . It takes one more step to observe starting from . Let be transformed from via the hypernode , if . Let . The observation distance from to motif in , denoted by , can be calculated as that from any to in for , where the minimum value is 1, including when . Thus, it takes stages to observe motif in G, starting from all the nodes in at once, from which we obtain the following result.

###### Lemma 2.

and , we have , where is the observation distance from hypernode to motif in , transformed from via the hypernode consisting of nodes .

### 4.3 Estimation using all the motifs observed under T-SBS

The sample graph observed under -SBS from has an associated matrix of geodesic distances, which is of dimension , where the -th element is the geodesic distance from to in the sample graph , denoted by . For instance, we have iff , in which case we have in as well. For non-adjacent nodes and in , we have , provided the connected component containing them in is fully observed in , but not otherwise. Thus, the geodesic-distance matrix based on the sample graph is generally not the same as that of the population graph . Additional sampling in is then necessary, in order to identify the ancestors of any observed motif in , as specified below.

###### Lemma 3.

For any , if then one needs at most stages of additional SBS from to observe all the ancestors of sample motif under -SBS from , if then one needs at most stages of additional SBS from .

###### Proof.

-SBS from all the nodes in is the same as -SBS from the hypernode with in the graph transformed from via the hypernode . This identifies all the nodes in , which can lead to the observation of at least one node in after stages at most. Since if , any node that is unobserved after the additional stages cannot be the ancestor of under -SBS from . ∎

Suppose -SBS from is a probability sampling design for that is of interest. For BIGS representation of -SBS from , let and . By Theorem 1, one needs to set for any that is the ancestor of motif under -SBS from . One can set in the sample graph directly, provided can be observed in starting from . Moreover, having identified all the ancestors of each observed motif by additional sampling, as guaranteed under Lemma 3, one can set for all the out-of- ancestors of under -SBS from . In this way, ancestral observation is achieved for all the motifs in , such that they all can be used for estimation.

### 4.4 BIGS representation for eligible motifs under T-SBS

Not all the motifs observed under -SBS are eligible for estimation due to the requirement of ancestral observation. Using the same idea that is expounded for ACS in Section 3.3.1, we develop below strategies of BIGS representation that are feasible based on the eligible motifs observed under -SBS, without additional sampling for ineligible motifs.

Let be the population BIG representing -SBS from , where all the observed motifs can be used for estimation. For each with ancestors in , let be a non-empty subset of , where . Consider BIG sampling with restricted ancestors from , where contains only the edges from to , for each . Since is non-empty for every , conditions (i) and (ii) of Theorem 1 remain satisfied under BIG sampling from . A motif is observed in the sample , iff contains at least one of the nodes in , regardless of the nodes in . Condition (iii) of Theorem 1 is satisfied provided the knowledge of , given which the inclusion probabilities can be calculated by (2) and (3) on replacing and by and , respectively.

To ensure that BIG sampling from is a feasible representation of -SBS from , we need to define appropriately for the observed eligible motifs. By Corollary 1, BIGS representation is feasible for any motif consisting of connected nodes. Let the observation diameter of a motif be

$$\phi_k = \max_{i \in M_k} d_{i,k}$$

which is finite for any motif of connected nodes with . Then, by definition, an observed motif with finite is eligible for estimation under -SBS from , provided we restrict its ancestors to . The result below follows.

###### Theorem 2.

Provided finite observation diameter of all , BIG sampling from is a feasible representation for -SBS from , where and .

###### Proof.

Conditions (i) and (ii) of Theorem 1 are satisfied, provided for any . Given , all the nodes in are observed under -SBS if , such that is identified for every in . Therefore, condition (iii) is satisfied as well, and all the motifs in are eligible for estimation. ∎

Additional sampling is not needed based on BIGS from with restricted ancestors as a feasible representation of -SBS from . But fewer observed motifs are used compared to BIGS representation with , which would generally require additional sampling. So there is a trade-off between statistical efficiency and operational cost. In case the uncertainty is too large to be acceptable, based on the eligible motifs in under -SBS with , additional SBS may be administered. This raises the need to update the BIGS representation for -SBS, where .

Let contain all the nodes outside of , which have maximum geodesic distance to . That is, starting from any node in , it takes at most stages of SBS to observe at least one of the nodes . Under SBS beyond , the nodes in may be identified as ancestors of eligible motifs, for … Let the diameter of motif be given by

$$\lambda_k = \max_{i,j \in M_k} \nu_{ij}$$

By Lemma 1, we have given finite . The result below follows.

###### Theorem 3.

Provided finite observation diameter of all , BIG sampling from is a feasible representation for -SBS from , where with , and with for each .

###### Proof.

Any motif is observed after at most stages starting from any node in . The first two conditions of Theorem 1 are therefore satisfied. If , then all the nodes in must have been observed at stage , so that all the nodes in are already observed after stages. It remains only to observe all the nodes in starting from , which requires at most stages. Whereas, if , then there is at least one node , which is first observed at stage , starting from any node in . Another stages may be needed to observe all the nodes in which can lead to in stages from outside . Thus, in either case, is identified for every in given , such that condition (iii) of Theorem 1 is also satisfied and all the motifs in are eligible for estimation. ∎

## 5 Two unbiased estimators under BIG sampling

For each motif $k \in \Omega$, let $y_k$ be an associated value, which is treated as an unknown constant. Let the target of estimation be the total of $y_k$ over $\Omega$, denoted by

$$\theta = \sum_{k \in \Omega} y_k .$$

In the case of $y_k \equiv 1$, $\theta$ is simply the total number of motifs in $\Omega$, which is called a graph total (Zhang and Patone, 2017); more generally, $\theta$ is a total over $\Omega$ in a valued graph.

The two unbiased estimators of Birnbaum and Sirken (1965) can be applied to any graph sampling from , provided a feasible BIGS representation of it satisfying conditions (i) - (iii) of Theorem 1. For simplicity of exposition below, we always denote the population BIG by , without distinguishing in notation whether restricted ancestors are used for the eligible motifs. The HT estimator based on is given by

$$\hat{\theta}_y = \sum_{k \in \Omega_s} \frac{y_k}{\pi_{(k)}} = \sum_{k \in \Omega} \frac{\delta_k y_k}{\pi_{(k)}} , \tag{4}$$

where if and 0 otherwise, and is given by (2), for any . Generally, to calculate the inclusion probabilities and , we need to know for each . In the special case of SRS of , we only need the cardinality of to calculate .
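The SRS special case can be sketched numerically. The Python snippet below assumes simple random sampling without replacement of `n` units from `N` sampling units, under which a motif is observed iff at least one of its `b` ancestors is drawn. The function names and example numbers are hypothetical, and the formulas are the standard SRS expressions rather than a transcription of (2) and (3).

```python
from math import comb

def srs_inclusion_prob(N, n, b):
    """P(at least one of b ancestors falls in an SRS of size n out of N units)."""
    return 1.0 - comb(N - b, n) / comb(N, n)

def srs_joint_inclusion_prob(N, n, b_k, b_l, b_union):
    """P(both motifs are observed), by inclusion-exclusion on the events
    'no ancestor of k drawn' and 'no ancestor of l drawn'.
    b_union is the number of distinct ancestors of k and l together."""
    miss_k = comb(N - b_k, n) / comb(N, n)
    miss_l = comb(N - b_l, n) / comb(N, n)
    miss_both = comb(N - b_union, n) / comb(N, n)
    return 1.0 - miss_k - miss_l + miss_both

# Hypothetical numbers: N = 5 sampling units, initial sample of size n = 2.
p_single = srs_inclusion_prob(N=5, n=2, b=2)
p_joint = srs_joint_inclusion_prob(N=5, n=2, b_k=2, b_l=1, b_union=3)
print(p_single, p_joint)
```

In this sketch the first-order probability depends only on the number of ancestors, whereas the joint probability also needs the size of the union of the two ancestor sets.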

The HH-type estimator based on the initial sample is given by

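$$\hat{\theta}_z = \sum_{i \in s_0} \frac{z_i}{\pi_i} = \sum_{i \in F} \frac{\delta_i z_i}{\pi_i} \quad\text{and}\quad z_i = \sum_{k \in \alpha_i} \omega_{ik} y_k \quad\text{and}\quad \sum_{i \in \beta_k} \omega_{ik} = 1 , \tag{5}$$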

where if and 0 otherwise, and is the inclusion probability of under , and the ’s are constants of sampling, by which are transformed to the constructed measures . We let if or in . As noted by Birnbaum and Sirken (1965), the estimator (5) is unbiased for since

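$$\theta = \sum_{k \in \Omega} y_k = \sum_{k \in \Omega} y_k \Big( \sum_{i \in \beta_k} \omega_{ik} \Big) = \sum_{i \in F} \Big( \sum_{k \in \alpha_i} \omega_{ik} y_k \Big) = \sum_{i \in F} z_i .$$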

Notice that in the special case of $|\beta_k| = 1$ for all $k \in \Omega$, there exists only a one-to-one or one-to-many relationship between the sampling units in $F$ and the motifs in $\Omega$, just as when the elements are clustered within the sampling units under cluster sampling. The two estimators $\hat{\theta}_y$ and $\hat{\theta}_z$ are then identical. More generally, different choices of the $\omega_{ik}$'s give rise to different estimates, such that $\hat{\theta}_z$ by (5) in fact defines a family of unbiased estimators. Birnbaum and Sirken (1965) consider the equal-share weights $\omega_{ik} = 1/|\beta_k|$. Under BIG sampling, this estimator and the HT estimator have the same ancestral observation requirement. Patone (2020) proposes unequal weights . Additional sampling is generally needed to calculate these weights. For the feasible BIGS representation in Theorem 3, one may need up to extra stages to observe for any . Both HH-type estimators will be illustrated in Section 6.
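The design-unbiasedness of both estimators can be checked by brute force on a toy example. The Python sketch below assumes SRS without replacement of the initial sample and uses the equal-share weights for the HH-type estimator. The BIG (the ancestor sets in `beta`), the values `y` and the sample size are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import comb

# Hypothetical toy BIG: beta[k] lists the ancestors (sampling units in F) of motif k.
F = [0, 1, 2, 3]
beta = {"a": [0], "b": [0, 1], "c": [1, 2, 3]}
y = {"a": 2.0, "b": 5.0, "c": 1.0}
N, n = len(F), 2
theta = sum(y.values())  # target total

def ht_estimate(s0):
    """HT estimator: sum of y_k / pi_(k) over the motifs observed from s0."""
    total = 0.0
    for k, anc in beta.items():
        if any(i in s0 for i in anc):
            pi_k = 1.0 - comb(N - len(anc), n) / comb(N, n)  # SRS assumption
            total += y[k] / pi_k
    return total

def hh_estimate(s0):
    """HH-type estimator with equal-share weights 1 / |beta_k|."""
    pi_i = n / N  # inclusion probability of each unit under SRS
    total = 0.0
    for i in s0:
        z_i = sum(y[k] / len(anc) for k, anc in beta.items() if i in anc)
        total += z_i / pi_i
    return total

# Enumerate every possible initial sample; under SRS all are equally likely.
samples = [set(s) for s in combinations(F, n)]
ht_mean = sum(ht_estimate(s) for s in samples) / len(samples)
hh_mean = sum(hh_estimate(s) for s in samples) / len(samples)
ht_var = sum((ht_estimate(s) - theta) ** 2 for s in samples) / len(samples)
hh_var = sum((hh_estimate(s) - theta) ** 2 for s in samples) / len(samples)
print(theta, ht_mean, hh_mean, ht_var, hh_var)
```

Both expectations equal the target total in this example, while the two variances generally differ. This is why the choice between the HT estimator and members of the HH-type family is a question of efficiency.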