# Group Testing with Correlation under Edge-Faulty Graphs

In applications of group testing in networks, e.g., identifying individuals who are infected by a disease spread over a network, exploiting correlation among network nodes provides fundamental opportunities for reducing the number of tests needed. We model and analyze group testing on n correlated nodes whose interactions are specified by a graph G. We model correlation through an edge-faulty random graph formed from G in which each edge is dropped with probability 1-r, and all nodes in the same component have the same state. We consider three classes of graphs: cycles and trees, d-regular graphs, and stochastic block models (SBM), and obtain lower and upper bounds on the number of tests needed to identify the defective nodes. Our results are expressed in terms of the number of tests needed when the nodes are independent, as functions of n, r, and the target error. In particular, we quantify the fundamental improvements that exploiting correlation offers by the ratio between the total number of nodes n and the equivalent number of independent nodes in a classic group testing algorithm. The lower bounds are derived by illustrating a strong dependence of the number of tests needed on the expected number of components. In this regard, we establish a new approximation for the distribution of component sizes in "d-regular trees" which may be of independent interest and leads to a lower bound on the expected number of components in d-regular graphs. The upper bounds are found by forming dense subgraphs in which nodes are more likely to be in the same state. When G is a cycle or tree, we show an improvement by a factor of log(1/r). For the grid, a graph with almost 2n edges, the improvement is by a factor of (1-r) log(1/r), indicating a drastic improvement compared to trees. When G has a larger number of edges, as in SBM, the improvement can scale in n.

## 1 Introduction

Group testing [DHH00] is a well-studied problem at the intersection of many fields, including computer science [DOR43, CHK+11, CGH+20], information theory [AJS19] and computational biology [KM95, FKK+97]. The goal is to find an unknown subset of items that are different from the rest using the fewest tests. The target subset is often referred to as defective, corrupted or infected, depending on the field of study. In this work, we use the term defective. To find the subset of defectives, items are tested in groups. The result of a test is positive if and only if at least one item in the group is defective. Group testing is beneficial when the number of defective items is . It is often assumed that the (expected) number of defective items is .

Over the years, this problem has been formulated via two approaches: the combinatorial approach and the information-theoretic approach. In the “combinatorial” version of the problem, it is assumed that there are defective items that are to be detected with zero error [DHH00]. Using adaptive group testing (i.e., when the choice of whom to test next depends on the results of the previous tests), there are matching upper and lower bounds on the number of tests of the form [DHH00]. Using non-adaptive group testing (i.e., when the testing sequence is pre-determined), there is an upper bound of and an almost matching lower bound of . The “information-theoretic” approach, on the other hand, assumes a prior statistic on the defectiveness of items, i.e., item is assumed to be defective with probability . The aim in this case is to identify the defective set with high probability [LCH+14]. Roughly speaking, there is a lower bound in terms of the underlying entropy of the unknowns, and an almost matching upper bound up to a factor of the lower bound.

In most existing works, it is assumed that the states of the items, whether or not they are defective, are independent of each other, which is not realistic in many applications. Group testing, for example, can identify the infected individuals using fewer tests, and therefore in a more timely manner, than individual testing during the spread of an infectious disease (e.g., COVID-19) [BMR21, VFH+21, MNB+21]. But the infection states of individuals are in general correlated, with correlation levels ranging from high to low, depending on how close they live: same household (high), same neighborhood, same city, same country (low). Correlation levels also depend on other factors, such as the frequency of contact and the number of paths between the individuals in the network of interactions. We elaborate on this further in Section 1.1. Another example is the multiaccess channel problem. In this problem, a number of users want to communicate through a common channel and we want to assign time slots to them to avoid any conflicts. Before assigning, we aim to find the number of active users that want to send a message. Using group testing, we can identify the number of active users faster by asking a subset of users if any of them is active [BMT+84, CHL01, GH08]. But again, users are not independent. Two users might become active together many times because they share the same context or communicate with each other more often. Generally, some subset of users might communicate among themselves more often and hence be more correlated. With this motivation, we aim to model such correlation, design group testing techniques that exploit it, and quantify the gain they provide in reducing the number of tests needed.

Our work is most closely related to [ACO21, AU21, NSG+21], where specific correlation models are considered and group testing methods are designed and analyzed. In [ACO21], the authors consider correlation that is imposed by a one-day spread of an infectious disease in a clustered network modeled by a stochastic block model. Each node is initially defective (infected) with some probability and in the next time step, its neighbors become defective probabilistically. The authors provide a simple adaptive algorithm and prove optimality in some regimes of operation and under some assumptions. In [AU21], the authors model correlation through a random edge-faulty graph . Each edge is realized in the graph with a given probability . So depending on how the graph is realized, it is partitioned into some connected components. Each connected component is assumed defective with probability (in which case, all the nodes in that component are defective) and otherwise non-defective with probability . The authors focus only on a subset of the realizations by studying the case in which the set of connected components across realizations forms a particular nested structure. More specifically, they only consider a subset of realizations such that each realization is obtained from another realization in the subset, i.e., two realizations have the same components except one component that is partitioned into two components. Only one realization does not obey this rule: the one with the fewest connected components. They found a near-optimal non-adaptive testing strategy for any distribution over the considered realizations of the graph and showed optimality for specific graphs.

The correlation model we consider is close to the work of [AU21]. We consider a random (edge-faulty) graph where each edge is realized with probability . In a realized graph, each connected component is assumed defective with probability . As opposed to [AU21], we do not constrain our study to a subset of the realizations and instead consider all possible realizations of the graph . Despite its simplicity, our model captures two key features. First, given a network of nodes, one expects that when a path between two nodes gets shorter, the probability that they are in the same state increases. Our proposed model captures this intuition. By adding one edge to a path, the probability of being in the same state reduces by a factor of . Second, two nodes might be far from each other, but when there are many edge-distinct paths between them, it is more likely that they are in the same state. Under our model, by adding distinct paths between two nodes, the probability of them being in the same state increases.
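As a small worked illustration of these two features, write $r$ for the probability that an edge is realized (as in the abstract). The symbols $\ell$, $k$ and $\ell_1,\dots,\ell_k$ are introduced here for illustration only, and the second line assumes the $k$ paths share no edges, so that they survive independently. Path survival gives a lower bound on the probability that the two end nodes land in the same component.

```latex
\begin{align*}
\Pr[\text{a fixed path with } \ell \text{ edges survives}] &= r^{\ell},
\qquad \frac{r^{\ell+1}}{r^{\ell}} = r, \\
\Pr[\text{at least one of } k \text{ edge-disjoint paths survives}]
&= 1 - \prod_{i=1}^{k}\bigl(1 - r^{\ell_i}\bigr).
\end{align*}
```

The first expression shrinks by a factor of $r$ with every added edge, and the second grows with every added path.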

In other related works, a graph could represent potential constraints on testing (among independent nodes) [CKM+12, SW18]. This can be viewed as a complementary direction to our setting in which a graph models the inherent dependency of the nodes, but there is no constraint on how the groups can be formed for testing. In [CKM+12], the authors design optimal non-adaptive testing strategies when each group is constrained to be path connected. In particular, they use random walks to obtain the pool of tests. In a follow-up work, [SW18] shows that either the constraints are too strong and no algorithm can do better than testing most of the nodes, or optimal algorithms can be designed matching the unconstrained version of the problem. This is attained by sampling each edge with probability ( and optimized). If a component is large enough, the algorithm tests the entire component. Our approach in this paper has similarities with [SW18] in aiming to find parts of the graph that are large and connected enough so that they remain connected with a decent probability after realizing the edges, but our techniques to find the dense subgraphs and the corresponding analysis are different.

### 1.1 Our Model

We start by motivating the key attributes that we capture in our model through an example of testing for infections in a network of people (nodes). Consider the interaction network for the spread of an infectious disease (e.g., COVID-19) in a network of people/nodes. There is an edge between two nodes if the corresponding individuals are in physical proximity for a minimum amount of time each week. Such individuals are more likely to be in the same state than those who have been distant throughout. Thus, first, the probability of being in the same state decreases as the length of the paths (i.e., the distance in the interaction network) between nodes increases. Second, infection is more likely to spread from one node to another if there are many distinct paths between them. Thus, the probability that two nodes are in the same state increases with the number of distinct paths between them.

We capture correlation through a faulty-edge graph model. Consider a graph where the node set represents the items and the edge set represents connections/correlations between them. Suppose each edge is realized with probability . After the sampling, we have a random graph that we denote by . Each node is either defective or non-defective. All nodes in the same component of are in the same state, rendering defectiveness a component property. We consider that each component is defective with probability (w.p.) independent of others. As an example, consider graph with five nodes and eight edges, and a sampled graph realization as shown in Figure 1 (left) and Figure 1 (right) respectively. When , is realized w.p. . There are two components in , namely, and ; are in the same state, which is defective w.p. , independent of the state of

This model captures the two attributes we discussed. Clearly, a long path between two nodes in has a smaller chance of survival in , compared to a short path, making the end nodes less likely to be in the same state as the length of the path between them increases. Moreover, the probability that at least one path between two nodes survives in increases with the number of distinct paths between them in , so having distinct paths between a pair of nodes in makes them more likely to be in the same state.
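The sampling process described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions made here: the function name `sample_states`, the 0-based node labels, and the use of union-find are not taken from the paper. Edges are kept with probability `r`, and each resulting component receives one independent defectiveness coin flip with probability `p`.

```python
import random


def sample_states(n, edges, r, p, rng):
    """Sample an edge-faulty realization and the resulting node states.

    n: number of nodes, labeled 0..n-1 (labeling is an assumption here)
    edges: list of (u, v) pairs of the underlying graph
    r: probability that an edge is realized
    p: probability that a component is defective
    Returns (labels, states): a component label and a boolean state per node.
    """
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        if rng.random() < r:  # the edge survives with probability r
            parent[find(u)] = find(v)

    labels = [find(x) for x in range(n)]
    # One independent coin flip per component, shared by all of its nodes.
    comp_state = {root: rng.random() < p for root in sorted(set(labels))}
    states = [comp_state[label] for label in labels]
    return labels, states
```

Calling `sample_states(5, edges, 0.5, 0.3, random.Random(0))` on any five-node edge list returns one realization; nodes that share a label always share a state.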

We aim to find the minimum expected number of tests needed to find the defective items with at most errors, where can potentially be of order . To be precise, let #ERR(H) be the number of nodes mispredicted by an algorithm on graph . Then we require

$$\mathbb{E}_{H\sim G_r}\left[\#\mathrm{ERR}(H)\right] \le \epsilon n$$

where the expectation is taken over and possible randomization of the algorithm.

Our approach is to relate the problem to an equivalent independent group testing problem with fewer nodes and provide a basis for comparison and quantification of the improvements that our methods offer by exploiting correlation.

The tests cannot be designed with the knowledge of ; only the value of is known a priori. In the extreme case of , the problem is reduced to classic group testing with independent nodes. In the extreme case of , all components of remain connected and hence the problem is reduced to independent group testing with only components of . When , the problem is non-trivial, because there can be multiple components, some with more than one node, and the number and composition of the components are unknown a priori. Thus, it is not known a priori which nodes will be in the same state. Our group testing strategies will seek to circumvent this challenge by identifying parts of that are connected enough so that they remain connected in with high probability.

### 1.2 Contributions

We obtain upper and lower bounds on the number of group tests needed to determine the states, defective or otherwise, of individual nodes in a large class of interaction graphs in presence of correlation among the states of nodes. We progressively consider 1) cycles and trees (about links), 2) regular graphs (about links) and 3) stochastic block models or SBM ( links). The correlation is captured by the factor (see Section 1.1). The bounds are obtained in terms of the number of tests needed when the states are independent, and help us quantify the efficiency brought forth by group testing in terms of

For trees and cycles, we prove an upper bound on the optimal number of tests in terms of the number of group tests when there are independent nodes. Note that one can trivially determine the states of each node by disregarding correlation and testing among nodes (e.g. using classic group testing techniques). Our upper bound therefore shows that group testing can reduce the tests by a factor of , which is less than when . As approaches the multiplicative factor reduces even further implying even greater benefits due to group testing. Our lower bound, on the other hand, shows an improvement factor .

For regular graphs, we prove new bounds for the distribution of components. This leads to a lower bound that is expressed as a series depending on and . We further prove an upper bound for a specific 4-regular graph, namely the grid, in terms of the number of group tests when there are independent items. Thus, the improvement factor is , as opposed to only for trees; this suggests that group testing becomes drastically more efficient for denser graphs.

An SBM divides a network into communities such that nodes in the same community are more connected than nodes in different communities. We show that the reduction in the test count due to group testing can be classified into three regimes: 1) strong intra-community connectivity but sparse inter-community connectivity, which reduces the effective number of independent nodes to the number of communities; 2) a fully connected graph, so that all nodes have the same state; 3) most of the nodes are isolated, so that the states of all nodes are independent. The number of tests in 1) and 3) can be determined from the characterizations of networks in which all nodes are independent, and only one test is necessary in 2).

## 2 Preliminaries and Notations

We use the following notation for the rest of the paper. Let be the expected number of tests in an optimal algorithm on graph with parameters , probability of defectiveness , and an error of . Let be the minimum expected number of tests needed for items in order to find the defective set with error probability at most , where each item is independently defective with probability . It is worth noting that the definition of error in IndepOPT is different from that in CrltOPT. In classic group testing designs, it is ensured that with probability all nodes are predicted correctly, and with probability at least one node is mispredicted, which is also the error defined in IndepOPT in our notation. When clear from the context, we may drop from the notation.

###### Lemma 2.1.

Let be the number of connected components of . Then

$$\textsc{IndepOPT}(C(G_r),\, p,\, \epsilon n) \le \textsc{CrltOPT}(G,\, r,\, p,\, \epsilon)$$

###### Proof.

Knowing the connected components of , we have independent connected components, each containing nodes in the same state. Assign a candidate node to each connected component and form a group testing problem only among the candidate nodes. Suppose the error of independent nodes is . This means that with probability , the algorithm predicts at least one node incorrectly, so the total error is at least , and the error cannot exceed in , so we have the lemma. ∎

The following theorems are known for classic group testing, also known as Probabilistic Group Testing, where there are individuals and every individual is independently defective with probability . Let be the indicator vector of items and . is the entropy of vector . Then we have a lower bound on the required number of tests, as well as near-optimal adaptive and non-adaptive algorithms, as follows.

###### Theorem 2.2.

[LCH+14] Any Probabilistic Group Testing algorithm whose probability of error is at most fraction of the individuals requires at least tests.

###### Theorem 2.3.

[LCH+14] There is an adaptive algorithm with probability of error at most such that the number of tests is less than

$$2(1+\delta)\,\bigl(H(X) + 3\mu\bigr)$$

###### Theorem 2.4.

[LCH+14] For any and , if the entropy of satisfies

 H(X)≥Γ2γ

where

 Γγ:=log2(log1/γ(2nϵ))

then with probability of error at most

 ϵ≤Γ−δ+1γ+12ϵ

there is a non-adaptive algorithm that requires no more than

 T≤elnnlog2(1/γ)(1+δ)H(X)+Γ2γ+2μ

tests.

The following combinatorial result comes in handy in Section 5.1.1.

###### Lemma 2.5.

[AVA08] The order- Fuss-Catalan numbers, which follow the recursion

$$C^{d}_{n} = \sum_{i_1+\cdots+i_d=n-1} C^{d}_{i_1}\cdots C^{d}_{i_d}$$

with , have the closed form .
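The recursion can be checked numerically against the standard closed form for Fuss-Catalan numbers. The sketch below assumes the usual base case $C^{d}_{0}=1$ and the standard closed form $\frac{1}{(d-1)n+1}\binom{dn}{n}$; both are textbook facts adopted here as assumptions, and the function names are illustrative.

```python
from functools import lru_cache
from math import comb


def fuss_catalan_closed(d, n):
    # Standard closed form binom(d*n, n) / ((d-1)*n + 1); the division is exact.
    return comb(d * n, n) // ((d - 1) * n + 1)


def fuss_catalan_recursive(d, n):
    @lru_cache(maxsize=None)
    def c(m):
        # Assumed base case C_0 = 1; otherwise apply the recursion.
        if m == 0:
            return 1
        return conv(d, m - 1)

    @lru_cache(maxsize=None)
    def conv(k, total):
        # Sum over i_1 + ... + i_k = total of c(i_1) * ... * c(i_k).
        if k == 0:
            return 1 if total == 0 else 0
        return sum(c(i) * conv(k - 1, total - i) for i in range(total + 1))

    return c(n)
```

For `d = 2` both functions return the ordinary Catalan numbers, which is a quick sanity check on the recursion.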

## 3 A Lower Bound for Sparse Graphs

In this section, we give lower bounds for the number of tests needed when the underlying graph has edges. We specifically use the result when the underlying graph is a cycle or tree. We obtain these bounds by reducing the problem to the number of tests needed for the independent case. We illustrated the connection between connected components and the number of tests in Lemma 2.1. Here, we prove some concentration lemmas for the number of connected components. In later sections, we return to these concentration lemmas and use them for other purposes.

###### Definition 3.1.

[AS16] A graph theoretic function is said to satisfy the edge Lipschitz condition if, whenever and differ in only one edge, .

Using this definition, the function as the number of connected components of is edge Lipschitz, as for two graphs that differ in only one edge, either they have the same number of connected components, or the graph with one less edge has one more connected component.
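To make the edge Lipschitz property concrete, here is a minimal sketch that counts connected components with union-find. The function name `count_components` and the 0-based node labels are choices made here for illustration. Adding or removing a single edge changes the returned count by at most one.

```python
def count_components(n, edges):
    """Return the number of connected components of a graph on nodes 0..n-1."""
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    count = n
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            # The edge merges two components, so the count drops by one.
            parent[root_u] = root_v
            count -= 1
    return count
```

An edge inside an existing component leaves the count unchanged, and an edge between two components lowers it by exactly one, which is the case distinction used in the text.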

The Edge Exposure Martingale. Let be an arbitrary order of the edges. We define a martingale where is the value of a graph theoretic function after exposing . Note that is a constant which is the expected value of , where is drawn from . This is a special case of martingales sometimes referred to as a Doob martingale process, where is the conditional expectation of , as long as the information known at time includes the information at time [AS16]. The same process can be defined for node exposure martingales, where the nodes are exposed one by one [AS16]. Node exposure can be seen as exposing one node at each step, so at the ’th step the graph has nodes, or exposing the edges connected to one node at each step, so at the ’th step only the first nodes potentially have edges between them. Then we have the following theorem.

###### Theorem 3.2.

[AS16] When satisfies the edge Lipschitz condition, the corresponding edge exposure martingale satisfies .

The same theorem can be stated for node Lipschitz functions and node exposure martingales. We can then apply Azuma’s inequality to the Lipschitz function.

###### Theorem 3.3.

[AS16] Let be a martingale with

$$|X_{i+1} - X_i| \le 1$$

for all . Then

$$\Pr\left[\,|X_m - c| > \lambda\sqrt{m}\,\right] < 2e^{-\lambda^2/2}$$

Now for , note that and . Then we have:

###### Lemma 3.4.

Let . Then with probability we have

$$\bigl|C(G_r) - \mathbb{E}[C(G_r)]\bigr| \le O\!\left(\sqrt{m\log(1/\delta)}\right).$$

Specifically, in the case that for a constant , and has edges, then with high probability the number of connected components is within .
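The step from Azuma's inequality to this lemma can be spelled out as follows. The derivation assumes that $m$ is the number of edges of the underlying graph, that the edge exposure martingale starts at $X_0 = \mathbb{E}[C(G_r)]$ and ends at $X_m = C(G_r)$, and that $\delta \le 1/2$ so that $\ln(2/\delta) \le 2\ln(1/\delta)$; these identifications are our reading of the argument.

```latex
\begin{align*}
\Pr\left[\,|X_m - X_0| > \lambda\sqrt{m}\,\right] &< 2e^{-\lambda^2/2} = \delta
  \qquad\text{for}\qquad \lambda = \sqrt{2\ln(2/\delta)}, \\
\text{hence, with probability at least } 1-\delta:\qquad
\bigl|C(G_r) - \mathbb{E}[C(G_r)]\bigr| &\le \sqrt{2m\ln(2/\delta)}
  = O\!\left(\sqrt{m\log(1/\delta)}\right).
\end{align*}
```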

Connected components play a central role in devising algorithms. If we knew them, we could reduce the problem to classic group testing. Even if we knew only the number of connected components with high probability, we could establish lower bounds based on the lower bounds for classic (independent) group testing. The above lemma tells us that for a large family of graphs, the number of components concentrates around its expectation, so knowing the expected number of components yields a lower bound on the number of tests. We now apply the above lemma to trees and cycles (for which we know the expected number of components) to get a lower bound on the number of tests.

###### Theorem 3.5.

Let be a cycle or a tree. Then we have

$$\textsc{IndepOPT}\bigl((1-r)n - 10\sqrt{n\log n},\, p,\, \epsilon n\bigr) \le \textsc{CrltOPT}(G,\, r) + O(1/n).$$

###### Proof.

In a tree, by removing each edge we get one more connected component, so after removing edges the tree has connected components and the cycle has .

Each edge is removed with probability , so the expected number of components is for trees, and for cycles. By Lemma 3.4, the number of components is with probability , and with probability The difference in tests is at most , hence additional tests. Applying Lemma 2.1 thus completes the proof. ∎
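The expected component counts used in this proof can be computed directly from the counting argument. The sketch below is illustrative: the function names are ours, and the formulas are derived from the observation that every removed tree edge adds one component, while removing $k \ge 1$ edges from a cycle leaves $k$ components. A brute-force sum over the binomial number of removed edges serves as a cross-check for the cycle.

```python
from math import comb


def expected_components_tree(n, r):
    # Each of the n-1 edges is removed independently with probability 1-r,
    # and every removed edge adds exactly one component to the initial one.
    return 1 + (n - 1) * (1 - r)


def expected_components_cycle(n, r):
    # Removing k >= 1 edges leaves k components, and k = 0 leaves one,
    # so E[C] = E[k] + Pr[k = 0] = n(1-r) + r^n.
    return n * (1 - r) + r ** n


def expected_components_cycle_by_sum(n, r):
    # Direct expectation over k ~ Binomial(n, 1-r), used as a cross-check.
    q = 1 - r
    return sum(
        comb(n, k) * q ** k * r ** (n - k) * max(k, 1) for k in range(n + 1)
    )
```

For fixed $r < 1$ both expectations grow like $(1-r)n$ up to lower-order terms, which matches the leading term in the theorem.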

The above proof also works for any graph with edges, where is a constant. In other words, when the number of edges is less than , a lower bound on the number of tests needed for almost independent nodes is also a lower bound on the number of tests under our model.

## 4 An Upper Bound for Graphs with a Few Edges: Cycles and Trees

In this section, we provide algorithms to find the defective set and provide theoretical bounds. We start with a simple cycle as a warm-up. Later, we generalize the ideas used for cycles to devise algorithms for any tree. Note that an algorithm for trees also gives an algorithm for general graphs, by considering a spanning tree of the graph, but the resulting algorithm might be far from optimal.

First, we provide an algorithm for cycles.

###### Theorem 4.1.

Let and . Then there is an algorithm that uses tests and finds the defective set with the error at most over a cycle of length .

###### Proof.

Consider the following algorithm:

1. Let . Partition the cycle into paths of the same length , except one path that may be shorter.

2. For each path, choose one of its nodes at random and let the corresponding nodes be .

3. Use an algorithm (by Theorem 2.3 for adaptive or Theorem 2.4 for non-adaptive group testing) to find the defective items among where and the probability of being defective is equal to .

4. Assign the state of all the nodes in as for all .

Note that for each , the defectiveness probability of is . The probability that is actually connected after a realization is . So the probability that is not in the same state as is . Then, assuming that we detect all ’s correctly, the error in is at most . By substituting , the error becomes less than . But we might also have probability of error for the ’s (given the criteria set in IndepOPT), meaning that with probability all the nodes are predicted correctly, and with probability we have at least one mispredicted node, and at most mispredicted nodes. So the total error from this part is at most . So the total error is at most and we have the above theorem. ∎

###### Corollary 4.2.

For the case of where is a constant, then, , and by Theorem 2.3, the number of tests is bounded by

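$$H(X)+\mu = O\!\left(\frac{\log n}{l} + c\right) = O\!\left(\frac{\log n\,\log(1/r)}{\log[1/(1-\epsilon/2)]}\right) \simeq O\!\left(\frac{\log n\,\log(1/r)}{\epsilon}\right).$$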

Note that when , meaning that the correlations are strong, the algorithm performs a constant number of tests, as expected.

We now generalize the ideas to achieve a bound for trees. As for cycles, we partition the graph into groups of nodes, find the probability of being connected in a random realization, and then optimize it over . At a high level, we try to group nodes that have short paths to each other, as shorter paths survive in the graph with higher probability, so that the probability of the pairs being in the same state is maximized.

We first give a definition to formalize the number of nodes needed to make a subset of nodes connected.

###### Definition 4.3.

Let of a graph . The smallest connecting closure of is a subset such that the induced subgraph over is connected.

For example, consider the graph in Figure 1. If , then the smallest connecting closure of is , as by adding to we make S connected.

Now we provide a grouping of nodes for trees such that for each group, we don’t need to add many other nodes to make them connected. Formally:

###### Lemma 4.4.

Let be a tree. There is a partition of nodes into groups of size , except one group might have less than nodes, such that the size of the smallest connecting closure for each group is less than or equal to , for each .

###### Proof.

We prove the lemma by induction on the number of nodes of . For , the statement is trivially true. Now suppose the lemma is true for any number of nodes less than ; we prove it for .

We aim to find a set of nodes of size such that, first, the graph remains connected after removing the set, and second, the smallest connecting closure of the set is at most . Then, by removing this set and taking it as one of the final groups, we can apply the induction hypothesis to the rest of the graph.

To do that, suppose the tree is rooted at an arbitrary node. Let be a deepest leaf, meaning that any other node is at a higher or equal level than . Let be the first ancestor of such that the subtree rooted at , including , has at least nodes. If there are exactly nodes, we can group them together, and the desired group is found. So suppose the subtree rooted at has more than nodes. Note that the distance from to is less than . Now we form a group with a small connecting closure. Starting with the empty set , we add the subtree of the child of that is a descendant of it, and the child itself; call this subtree . Note that the added set has nodes, and by removing them from , remains connected.

Recursively, we repeat the same process for the other subtrees of for the updated . To be precise, has other subtrees than , as the subtree rooted at has more than nodes and has at most nodes. Then consider another subtree of , called . If , we update and add to and continue with another subtree of , which by the same argument exists. If , then again we choose a deepest leaf of and proceed with the same process as before to find another group of nodes, i.e., we start with the deepest leaf and go up in the tree until it exceeds , and repeat the procedure. Note that after moving to the subtrees of , we disregard the rest of the graph, so is an ancestor of all the nodes we encounter next. Again, for the next recursion, the subtree of the node that exceeds the updated is an ancestor of the rest of the nodes. Let us call and the other nodes at which we recurse “breaking points”. Then any pair of breaking points are ancestor and descendant, and all the nodes added to are subtrees of breaking points. So by connecting all the breaking points by a single path, which has length at most , as the distance is less than to , we connect , so the smallest closure of is less than . Moreover, we have only included some subtrees of , so by removing , remains connected and we can use the induction for the remaining tree.

We have now found that has a smallest connecting closure of size at most , and includes only subtrees of , so removing it does not disconnect the graph. We then save as a desired group and apply the induction hypothesis to the rest of the graph. The other groups formed have size (except one) and have small connecting closures. By adding to the groups, we get the desired grouping of all nodes and the proof is completed. ∎

Now we are ready to prove the upper bound for trees.

###### Theorem 4.5.

Consider a tree with nodes and let . Let . Then there is an algorithm that uses tests and finds the defective set with at most errors. I.e.,

$$\textsc{CrltOPT}(G,\, r,\, p,\, \epsilon) \le \textsc{IndepOPT}(\lceil n/l\rceil,\, p,\, \epsilon').$$

###### Proof.

Consider the following algorithm:

1. By Lemma 4.4, partition the tree into groups of the same size ; one group might be smaller than the others.

2. For each group, choose one of its nodes at random and let them be .

3. Use an algorithm to find the defective set among .

4. Assign the state of all the nodes in as , for all .

First, we calculate the probability that is connected. By Lemma 4.4, we know that each has the property that its smallest connecting closure is less than or equal to . This ensures that at most edges (over the edges already in ) are needed to make connected. Therefore, the probability of being connected is at least . So the probability that is not in the same state as is at most . The rest of the proof consists of showing that the total error is less than , as was done for the cycle, and this completes the proof. ∎

###### Corollary 4.6.

Corollary 4.2 can be recovered for trees with an additional factor of 2.

## 5 An Upper Bound for Graphs with More Edges: Grid and SBM

In this section, we focus on graphs that potentially have many edges. As the number of edges increases, the correlation between nodes increases even when is not large. As mentioned earlier, we need to target those components that are more likely to appear in various realizations.

We know that there is a threshold phenomenon in some edge-faulty graphs, meaning that when is below a threshold, there are many isolated nodes (and hence many independent tests are needed) and when is above that threshold, we have a giant component (and hence a single test suffices). Most famously, this threshold is for Erdős-Rényi graphs. For random -regular graphs, also, [GOE97] has shown that when a graph is drawn uniformly from the set of all -regular graphs with nodes and then each edge is realized with probability , is a threshold almost surely.

For the rest of this section, we first study a (deterministic) -regular graph, known as the grid (with the subtle difference that degree regularity does not hold on the boundaries of the grid), and then provide near-optimal results for the stochastic block model. When we consider (deterministic) -regular graphs, we cannot use the results of [GOE97] for random -regular graphs because we cannot be sure that the specific chosen graph is among the “good” graphs that constitute the almost sure result. So we need to develop new results on the number of connected components and their distribution for our purposes.

### 5.1 The grid

We first formally define a grid. A grid with $n$ nodes and side length $\sqrt{n}$ is a graph whose nodes are of the form $(i,j)$. Node $(i,j)$ is connected to its four nearest neighbors (if they exist), namely $(i\pm 1,j)$ and $(i,j\pm 1)$. Border nodes (those whose $i$ or $j$ coordinate is extremal) have three or two neighbors.

In order to get a lower bound, we need to know the expected number of components in . Traditionally, this has been done by finding the expected component size that nodes would belong to [AS16, GOE97]. Consider the following process. Pick a node , mark it as processed, and let it be the root of a tree. For each that is not processed and is a neighbor of , is realized with probability and added as a child of . The same process is repeated for each realized in a Breadth-First Search (BFS) order. When the process ends, there is a tree with root , and the expected size of the tree is the expected size of the component that ends up in.

An example is shown in Figure 3. Node is the root (colored in blue), and the children that are realized are in green, and the children that are not realized are in red. The component would be .

By repeating the process for each node that is not processed yet, we get a spanning forest. The expected number of components in the forest is the expected number of components in the original random graph.
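The exploration process above can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions: the function name `spanning_forest_sizes`, the square `side x side` layout, and the row-major scan order are choices made here, not part of the paper. Each edge is examined at most once, when its first endpoint is processed, so the resulting component sizes are distributed as in the edge-faulty grid.

```python
import random
from collections import deque

def spanning_forest_sizes(side, r, rng):
    """Run the BFS exploration on a side x side grid; return the tree sizes.

    Illustrative sketch: an edge to an unprocessed neighbour is realized
    with probability r at the moment its first endpoint is processed.
    """
    processed = [[False] * side for _ in range(side)]
    sizes = []
    for si in range(side):
        for sj in range(side):
            if processed[si][sj]:
                continue
            # (si, sj) becomes the root of a new tree in the forest.
            processed[si][sj] = True
            queue = deque([(si, sj)])
            size = 0
            while queue:
                i, j = queue.popleft()
                size += 1
                for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    inside = 0 <= ni < side and 0 <= nj < side
                    if inside and not processed[ni][nj] and rng.random() < r:
                        processed[ni][nj] = True
                        queue.append((ni, nj))
            sizes.append(size)
    return sizes

if __name__ == "__main__":
    rng = random.Random(0)
    trials = 200
    mean = sum(len(spanning_forest_sizes(20, 0.3, rng)) for _ in range(trials)) / trials
    print("average number of components on a 20 x 20 grid, r = 0.3:", mean)
```

The number of trees in the forest is the number of components of that realization, so averaging `len(...)` over many runs gives an empirical estimate of the expected number of components.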

Here, the challenge is that we don’t know the number of available (unprocessed) neighbors of a node. It depends strongly on the previously chosen nodes, especially when the degree is small, as in the grid. We circumvent this issue by analyzing an infinite regular tree process that effectively corresponds to a more connected graph and therefore leads to a lower bound on the expected number of connected components for the grid.

#### 5.1.1 3-regular Trees

Consider an infinite tree with root $v$ such that each node in the tree has three children. Consider the process where each edge is realized with probability $r$. Let $C(v)$ be the component that $v$ ends up in. The following lemmas approximate the distribution of $|C(v)|$.

###### Lemma 5.1.

Under the above process and for ,

$$P(|C(v)|=t)=\frac{1}{2t+1}\binom{3t}{t}\,r^{t-1}(1-r)^{2t+1}.$$
###### Proof.

Consider a subtree of the infinite tree that is rooted at $v$ and has $t$ nodes. In order for it to be realized in the process, all of its edges should be realized, and every other edge from one of its nodes to a child should not be realized. The subtree has $t-1$ edges, and each node has three potential edges to its children, so there are $3t-(t-1)=2t+1$ edges that are not realized. So the probability that the subtree is realized is $r^{t-1}(1-r)^{2t+1}$.

Let $C_t$ be the number of trees with $t$ nodes and $v$ as the root. Then

$$P(|C(v)|=t)=C_t\cdot r^{t-1}(1-r)^{2t+1}.$$

Note that $C_0=1$. We find a closed form of $C_t$ by recursion. Node $v$ has three potential subtrees, where the sum over the sizes of the subtrees is $t-1$. Then, we get the recursion

$$C_t=\sum_{\substack{i+j+k=t-1\\ i,j,k\ge 0}}C_iC_jC_k.$$

This recursion has the same initial values and the same recurrence as the second-order Catalan numbers (Lemma 2.5), so the solution has the form $C_t=\frac{1}{2t+1}\binom{3t}{t}$ and the proof is completed. ∎
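As a sanity check on the counting argument, the following minimal Python sketch compares the three-subtree recursion with the closed form and checks numerically that the resulting probabilities sum to one for a value of $r$ below $1/3$. The function names and the truncation at 200 terms are assumptions made for this illustration only.

```python
from math import comb

def tree_counts_by_recursion(t_max):
    """C_0..C_{t_max} from C_t = sum over i+j+k=t-1 of C_i*C_j*C_k, with C_0 = 1."""
    counts = [1]
    for t in range(1, t_max + 1):
        total = 0
        for i in range(t):
            for j in range(t - i):
                k = t - 1 - i - j  # sizes of the three subtrees add up to t-1
                total += counts[i] * counts[j] * counts[k]
        counts.append(total)
    return counts

def tree_count_closed_form(t):
    """Closed form binom(3t, t) / (2t + 1); the division is exact."""
    return comb(3 * t, t) // (2 * t + 1)

def component_size_pmf(t, r):
    """P(|C(v)| = t) as stated in Lemma 5.1."""
    return tree_count_closed_form(t) * r ** (t - 1) * (1 - r) ** (2 * t + 1)

if __name__ == "__main__":
    print(tree_counts_by_recursion(6))
    print([tree_count_closed_form(t) for t in range(7)])
    print(sum(component_size_pmf(t, 0.2) for t in range(1, 201)))
```

The two printed lists should agree term by term, and the last value should be numerically indistinguishable from one.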

###### Lemma 5.2.

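Let $P_\infty=P(|C(v)|=\infty)$. Then, under the above process,

$$P_\infty=\begin{cases}0 & r\le 1/3,\\[4pt] \dfrac{3r-\sqrt{r(4-3r)}}{2r^2} & \text{otherwise.}\end{cases}$$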
###### Proof.

In order for $v$ to be in an infinite component, at least one of its children should be in an infinite component. Either $v$ has one child, and this one should be in an infinite component, which happens with probability $3r(1-r)^2P_\infty$. Or $v$ has two children and at least one of them lies in an infinite component, which happens with probability $3r^2(1-r)\left(1-(1-P_\infty)^2\right)$. Or $v$ has three children and at least one of them lies in an infinite component, which happens with probability $r^3\left(1-(1-P_\infty)^3\right)$. So in total we have

$$P_\infty=3r(1-r)^2P_\infty+3r^2(1-r)\left(1-(1-P_\infty)^2\right)+r^3\left(1-(1-P_\infty)^3\right).$$

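This equation has three solutions: $0$ and $\frac{3r\pm\sqrt{r(4-3r)}}{2r^2}$. Note that $\frac{3r+\sqrt{r(4-3r)}}{2r^2}>1$, so it is not valid. Also, by the previous lemma, when $r>1/3$, the probabilities of the finite component sizes do not add up to one, so for $r>1/3$, $\frac{3r-\sqrt{r(4-3r)}}{2r^2}$ is the correct solution. Note that when $r\le 1/3$, $\frac{3r-\sqrt{r(4-3r)}}{2r^2}\le 0$, so zero is the correct solution and the lemma is proved. ∎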
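The closed form can be checked numerically. The sketch below is an illustration, not the paper's argument: it uses the equivalent extinction-probability formulation, in which $q=1-P_\infty$ is the smallest fixed point of $q=(1-r+rq)^3$, and iterates that map from $q=0$. The function names and the iteration count are assumptions made here.

```python
from math import sqrt

def p_infinity_closed_form(r):
    """Closed form for P_infinity from Lemma 5.2."""
    if r <= 1 / 3:
        return 0.0
    return (3 * r - sqrt(r * (4 - 3 * r))) / (2 * r * r)

def p_infinity_fixed_point(r, iterations=10000):
    """1 - q, where q is reached by iterating q -> (1 - r + r*q)**3 from q = 0.

    Starting from 0, the iteration increases towards the smallest fixed
    point, which is the probability that the component of v is finite.
    """
    q = 0.0
    for _ in range(iterations):
        q = (1 - r + r * q) ** 3
    return 1 - q

if __name__ == "__main__":
    for r in (0.2, 0.5, 0.8, 1.0):
        print(r, p_infinity_closed_form(r), p_infinity_fixed_point(r))
```

Convergence is slow close to the critical value $r=1/3$, so the comparison is most informative away from it.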

###### Theorem 5.3.

For , we have

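$$\begin{aligned}E(|C(v)|)&=\sum_{t=1}^{\infty}\frac{t}{2t+1}\binom{3t}{t}r^{t-1}(1-r)^{2t+1}\\ &\simeq\frac{1-r}{r}\sqrt{\frac{3}{4\pi}}\sum_{t=1}^{\infty}\frac{\sqrt{t}}{2t+1}\left(\frac{27}{4}r(1-r)^{2}\right)^{t}\end{aligned}\tag{1}$$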
###### Proof.

The proof follows from Lemma 5.1 and Lemma 5.2 by applying Stirling's approximation to the binomial coefficient. ∎
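To see how close the Stirling form is to the exact series, one can evaluate both numerically. The following Python sketch is illustrative only: the function names and the truncation point `t_max` are assumptions, and the exact terms are computed in log space to avoid overflow. As an independent check that is not taken from the paper, the mean total progeny of a subcritical branching process with mean offspring $3r$ is $1/(1-3r)$, which the exact series should reproduce for $r<1/3$.

```python
from math import exp, lgamma, log, pi, sqrt

def expected_size_exact(r, t_max=2000):
    """Truncated exact series for E|C(v)|, valid for 0 < r < 1/3."""
    total = 0.0
    for t in range(1, t_max + 1):
        # log of binom(3t, t) via log-gamma, to stay within float range
        log_binom = lgamma(3 * t + 1) - lgamma(t + 1) - lgamma(2 * t + 1)
        log_term = (log(t) - log(2 * t + 1) + log_binom
                    + (t - 1) * log(r) + (2 * t + 1) * log(1 - r))
        total += exp(log_term)
    return total

def expected_size_stirling(r, t_max=2000):
    """Truncated Stirling-based approximation from Theorem 5.3."""
    x = 27.0 / 4.0 * r * (1 - r) ** 2
    series = sum(sqrt(t) / (2 * t + 1) * x ** t for t in range(1, t_max + 1))
    return (1 - r) / r * sqrt(3 / (4 * pi)) * series

if __name__ == "__main__":
    r = 0.2
    print("exact series      :", expected_size_exact(r))
    print("1 / (1 - 3r)      :", 1 / (1 - 3 * r))
    print("Stirling estimate :", expected_size_stirling(r))
```

The Stirling estimate slightly overestimates the exact value, because the approximation of the binomial coefficient is least accurate for the small values of $t$ that carry most of the mass.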

It is worthwhile to remark that the proof generalizes to general regular tree processes.

#### 5.1.2 Lower Bound for the Expected Number of Components in a Grid

In the case of the grid, we consider 3-regular trees because, after the root, each child in the grid has at most 3 potential neighbors, and if we choose a node on the border, the root also has at most 3 potential children, as illustrated in Figure 3. So the 3-regular tree process that we analyzed corresponds to a more connected graph than the grid. Therefore the expected component size that we found in (1) leads to a lower bound on the expected number of connected components in the grid.

Let be the number of connected components. The random process in 3-regular tree is symmetric over all the nodes, so . Then immediately by Theorem 5.3 we have the following result.

###### Theorem 5.4.

For a grid with nodes and , we have

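$$E(N_C)=\frac{n}{\sum_{t=1}^{\infty}\frac{t}{2t+1}\binom{3t}{t}r^{t-1}(1-r)^{2t+1}}\simeq\frac{n}{\frac{1-r}{r}\sqrt{\frac{3}{4\pi}}\sum_{t=1}^{\infty}\frac{\sqrt{t}}{2t+1}\left(\frac{27}{4}r(1-r)^{2}\right)^{t}}$$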
###### Corollary 5.5.

Immediately by using Lemma 2.1 in conjunction with Theorem 5.4, any lower bound on the number of tests for independent nodes is also a lower bound on the number of tests on a grid under our model.

#### 5.1.3 An Upper Bound for the Grid

In this subsection, we provide an upper bound for the number of tests in a grid. We do this by partitioning the grid into subgrids of side length $k$, where $k$ is to be optimized, and computing the probability of error for each subgrid. We first estimate the probability that a subgrid becomes connected.

###### Lemma 5.6.

Let $P_k$ be the probability that a grid of side length $k$ becomes connected when each of its edges is realized with probability $r$. Then we have:

$$P_k\ge P_{k-1}\,r^{\Theta((1-r)k)}.$$
###### Proof.

Consider the subgrid of length that contains bottom-left corner node. Then the main grid can be seen as the subgrid and a path of length , where each node in the path has one edge to the subgrid. With probability the subgrid is connected. The path would be decomposed into subpaths with probability at least . Each subpath has at least one edge to the subgrid, so each one is connected to the path with probability at least . The probability that all of them connect to the subgrid then is at least and the lemma is proved. ∎

###### Theorem 5.7.

Let $P_k$ be the probability that a grid of side length $k$ becomes connected when each of its edges is realized with probability $r$. Then we have:

$$P_k\ge r^{\Theta((1-r)k^{2})}=e^{\Theta(\log(r)(1-r)k^{2})}.$$
###### Proof.

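The proof follows by applying Lemma 5.6 repeatedly: bound $P_{k-1}$ in terms of $P_{k-2}$, then $P_{k-2}$ in terms of $P_{k-3}$, and so on, and at last use $P_1=1$. The exponents add up to $\Theta\big((1-r)\sum_{j\le k}j\big)=\Theta((1-r)k^{2})$. ∎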
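For small subgrids, the connectivity probability $P_k$ can also be estimated empirically. The following Monte Carlo sketch in Python is an illustration under assumptions made here: the function names, the union-find bookkeeping, and the number of trials are not from the paper. For $k=2$ the grid is a 4-cycle, which is connected exactly when at most one edge is missing, so $P_2=r^4+4r^3(1-r)$ gives a simple reference value.

```python
import random

def grid_is_connected(k, r, rng):
    """Sample a k x k edge-faulty grid and report whether it is connected."""
    parent = list(range(k * k))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = k * k
    for i in range(k):
        for j in range(k):
            # Right and down neighbours enumerate every grid edge once.
            for ni, nj in ((i + 1, j), (i, j + 1)):
                if ni < k and nj < k and rng.random() < r:
                    a, b = find(i * k + j), find(ni * k + nj)
                    if a != b:
                        parent[a] = b
                        components -= 1
    return components == 1

def estimate_connect_probability(k, r, trials=20000, seed=0):
    """Empirical estimate of P_k from independent samples."""
    rng = random.Random(seed)
    return sum(grid_is_connected(k, r, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    r = 0.5
    print("estimate of P_2:", estimate_connect_probability(2, r))
    print("exact P_2      :", r ** 4 + 4 * r ** 3 * (1 - r))
    for k in (3, 4, 5):
        print("estimate of P_%d:" % k, estimate_connect_probability(k, 0.8))
```

Such estimates are only practical for small $k$, since $P_k$ decays quickly as $k$ grows, which is what the bound in Theorem 5.7 quantifies.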

Now similar to Theorem 4.5, by setting error probability of each group small enough, that is , we get . Then the error is less than with at most independent node tests with error .

### 5.2 Optimal Algorithm for Stochastic Block Model

In this section, we study our model on SBM graphs. We apply the techniques used to find the connectivity threshold of Erdős-Rényi graphs to determine the structure of the connected components.

A stochastic block model has $g$ clusters of size $k$, where any pair of nodes in the same cluster is connected with probability , and any pair of nodes in different clusters is connected with probability . Note that after realizing each edge with probabilities and , then based on our model each edge remains with probability $r$. So with probability $r_1$ an edge remains in the same cluster and with probability $r_2$ an edge remains between two different clusters. Here, we assume the size of the clusters is much bigger than , i.e. . We find the connected components based on $r_1$ and $r_2$.

###### Theorem 5.8.
• If and , then with high probability is connected. (first regime, one test needed)

• If and , then with high probability each cluster is connected but most of the clusters are isolated. (second regime, independent tests needed)

• If and , then with high probability has many isolated nodes. (third regime, independent tests needed)

• If and , and , then with high probability is connected. (fourth regime, one test needed)

###### Proof.

First, suppose $r_1\ge 100\log n/k$. A cut of size $i$ in a single cluster has $i(k-i)$ potential edges between its two parts. Then the probability that the specific cut is disconnected is

$$(1-r_1)^{i(k-i)}\overset{(i)}{\le}e^{-r_1i(k-i)}\overset{(ii)}{\le}e^{-100\log n\cdot i(1-i/k)}\overset{(iii)}{\le}e^{-50\,i\log gk}=\left(\frac{1}{gk}\right)^{50i}.$$

The first inequality, (i), is true because for , (ii) is true by , and (iii) is true by . Note that the number of cuts of size is , and by the union bound, the probability that any cut of size becomes disconnected is at most . But by a simple counting argument, so the probability that a cut is disconnected is at most . So with probability a single cluster of size is connected. Again, by the union bound, with probability at most there is a disconnected cluster. So with probability , all clusters are connected.

Now if we assume all the clusters are connected, if there is an edge between two clusters, then those two clusters are connected. So if we consider a graph where the nodes represent the clusters and two nodes are connected if there is at least one edge between the corresponding clusters, then we need to understand the connectivity of the new graph. The probability that there is at least one edge between two clusters is , and again if this value is more than , then is connected. If , then the probability that a cluster is isolated is more than , so most of the clusters are isolated, which proves the first two parts of the theorem.

If , then with the same argument, with high probability most of the nodes in all clusters are isolated. If we also have , then this means that most of the nodes don’t have any neighbors outside of their cluster with high probability, so in total the graph has many isolated nodes, which proves the third part.

Now suppose , and we prove the last part of the theorem. We assume each cluster is empty, and even when they’re empty with , the graph is connected with high probability. Consider a cut in with nodes. Each node has potential neighbors in other clusters. So it has at least potential neighbors outside of its clusters and the chosen cut. Then, almost similar to the first part of the theorem, the probability that this cut is disconnected is at most

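$$(1-r_2)^{i\cdot(n-k-i)}\le e^{-r_2\cdot i\cdot(n-k-i)}\le e^{-100\log n\cdot i\cdot\left(\frac{n-k-i}{n}\right)}\overset{(i)}{\le}e^{-100\log n\cdot i\cdot(1/2-1/g)}=\left(\frac{1}{n}\right)^{100i(1/2-1/g)}.$$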

Here, (i) is true by and . Again, there are cuts of size . So the probability that any cut of size is disconnected is at most . It is not hard to see that if , then . So the probability that any cut is disconnected is bounded by

$$\sum_{i=1}^{n/2}\left(\frac{1}{n}\right)^{100i(1/2-1/g-0.01)}\le n\cdot o(1/n^{2})=o(1/n).$$

So in the case of , we’ve proved the last part of the theorem. If , for , a cut of size has at most edges in the node set of size , as the graph is bipartite and the number of edges in the set is maximized when nodes are chosen from each part of the graph. So the number of potential edges to the other side of the cut is at least , as , and we can repeat the reasoning to prove that with high probability all cuts in this graph are connected. It is also easy to verify that when the cut is a single node or a pair of nodes, then the cut is disconnected with probability at most , and this completes the proof. ∎
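The regimes of Theorem 5.8 can be explored empirically with a small simulation. The Python sketch below is an illustration under assumptions made here: it takes `r1` and `r2` to be the net probabilities that an intra-cluster or inter-cluster edge remains, and the function name `sbm_component_count` and the chosen parameter values are hypothetical. It counts the connected components of one realization with union-find.

```python
import random

def sbm_component_count(g, k, r1, r2, rng):
    """Number of components in one realization with g clusters of size k.

    Illustrative assumption: a pair in the same cluster keeps its edge with
    probability r1, a pair in different clusters with probability r2.
    """
    n = g * k
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for u in range(n):
        for v in range(u + 1, n):
            same_cluster = (u // k) == (v // k)
            if rng.random() < (r1 if same_cluster else r2):
                a, b = find(u), find(v)
                if a != b:
                    parent[a] = b
                    components -= 1
    return components

if __name__ == "__main__":
    rng = random.Random(0)
    g, k = 10, 30
    for r1, r2 in ((0.5, 0.01), (0.5, 0.0), (0.0, 0.0), (0.0, 0.2)):
        print((r1, r2), sbm_component_count(g, k, r1, r2, rng))
```

With dense clusters and no inter-cluster edges the count equals the number of clusters, while with no intra-cluster edges the graph can still be connected through inter-cluster edges alone, mirroring the second and fourth regimes.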

## References

• [ACO21] S. Ahn, W. Chen, and A. Ozgur (2021) Adaptive group testing on networks with community structure. arXiv preprint arXiv:2101.02405. Cited by: §1.
• [AJS19] M. Aldridge, O. Johnson, and J. Scarlett (2019) Group testing: an information theory perspective. arXiv preprint arXiv:1902.06002. Cited by: §1.
• [AS16] N. Alon and J. H. Spencer (2016) The probabilistic method. John Wiley & Sons. Cited by: Definition 3.1, Theorem 3.2, Theorem 3.3, §3, §5.1.
• [AU21] B. Arasli and S. Ulukus (2021) Group testing with a graph infection spread model. arXiv preprint arXiv:2101.05792. Cited by: §1, §1.
• [AVA08] J. Aval (2008) Multivariate fuss–catalan numbers. Discrete Mathematics 308 (20), pp. 4660–4669. Cited by: Lemma 2.5.
• [BMT+84] T. Berger, N. Mehravari, D. Towsley, and J. Wolf (1984) Random multiple-access communication and group testing. IEEE Transactions on Communications 32 (7), pp. 769–779. Cited by: §1.
• [BMR21] V. Brault, B. Mallein, and J. Rupprecht (2021) Group testing as a strategy for covid-19 epidemiological monitoring and community surveillance. PLoS computational biology 17 (3), pp. e1008726. Cited by: §1.
• [CHK+11] M. Cheraghchi, A. Hormati, A. Karbasi, and M. Vetterli (2011) Group testing with probabilistic tests: theory, design and application. IEEE Transactions on Information Theory 57 (10), pp. 7057–7067. Cited by: §1.
• [CKM+12] M. Cheraghchi, A. Karbasi, S. Mohajer, and V. Saligrama (2012) Graph-constrained group testing. IEEE Transactions on Information Theory 58 (1), pp. 248–262. Cited by: §1.
• [CHL01] B. S. Chlebus (2001) Randomized communication in radio networks. COMBINATORIAL OPTIMIZATION-DORDRECHT- 9 (1), pp. 401–456. Cited by: §1.
• [CGH+20] A. Coja-Oghlan, O. Gebhard, M. Hahn-Klimroth, and P. Loick (2020) Optimal group testing. In Conference on Learning Theory, pp. 1374–1388. Cited by: §1.
• [DOR43] R. Dorfman (1943) The detection of defective members of large populations. The Annals of Mathematical Statistics 14 (4), pp. 436–440. Cited by: §1.
• [DHH00] D. Du, F. K. Hwang, and F. Hwang (2000) Combinatorial group testing and its applications. Vol. 12, World Scientific. Cited by: §1, §1.
• [FKK+97] M. Farach, S. Kannan, E. Knill, and S. Muthukrishnan (1997) Group testing problems with sequences in experimental molecular biology. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pp. 357–367. Cited by: §1.
• [GOE97] A. Goerdt (1997) The giant component threshold for random regular graphs with edge faults. In International Symposium on Mathematical Foundations of Computer Science, pp. 279–288. Cited by: §5.1, §5, §5.
• [GH08] M. T. Goodrich and D. S. Hirschberg (2008) Improved adaptive group testing algorithms with applications to multiple access channels and dead sensor diagnosis. Journal of Combinatorial Optimization 15 (1), pp. 95–121. Cited by: §1.
• [KM95] E. Knill and S. Muthukrishnan (1995) Group testing problems in experimental molecular biology. arXiv preprint math/9505211. Cited by: §1.
• [LCH+14] T. Li, C. L. Chan, W. Huang, T. Kaced, and S. Jaggi (2014) Group testing with prior statistics. In 2014 IEEE International Symposium on Information Theory, pp. 2346–2350. Cited by: §1, Theorem 2.2, Theorem 2.3, Theorem 2.4.
• [MNB+21] L. Mutesa, P. Ndishimye, Y. Butera, J. Souopgui, A. Uwineza, R. Rutayisire, E. L. Ndoricimpaye, E. Musoni, N. Rujeni, T. Nyatanyi, et al. (2021) A pooled testing strategy for identifying sars-cov-2 at low prevalence. Nature 589 (7841), pp. 276–280. Cited by: §1.
• [NSG+21] P. Nikolopoulos, S. R. Srinivasavaradhan, T. Guo, C. Fragouli, and S. Diggavi (2021) Group testing for connected communities. In International Conference on Artificial Intelligence and Statistics, pp. 2341–2349. Cited by: §1.
• [SW18] B. Spang and M. Wootters (2018) Unconstraining graph-constrained group testing. arXiv preprint arXiv:1809.03589. Cited by: §1.
• [VFH+21] C. M. Verdun, T. Fuchs, P. Harar, D. Elbrächter, D. S. Fischer, J. Berner, P. Grohs, F. J. Theis, and F. Krahmer (2021) Group testing for sars-cov-2 allows for up to 10-fold efficiency increase across realistic scenarios and testing strategies. Frontiers in Public Health, pp. 1205. Cited by: §1.