Structure-aware combinatorial group testing: a new method for pandemic screening

02/17/2022
by   Thais Bardini Idalino, et al.
UFSC
uOttawa

Combinatorial group testing (CGT) is used to identify defective items from a set of items by grouping them together and performing a small number of tests on the groups. Recently, group testing has been used to design efficient COVID-19 testing, so that resources are saved while still identifying all infected individuals. Due to test waiting times, a focus is given to non-adaptive CGT, where groups are designed a priori and all tests can be done in parallel. The design of the groups can be done using Cover-Free Families (CFFs). The main assumption behind CFFs is that a small number d of positives are randomly spread across a population of n individuals. However, for infectious diseases, it is reasonable to assume that infections show up in clusters of individuals with high contact (children in the same classroom within a school, households within a neighbourhood, students taking the same courses within a university, people sitting close to each other in a stadium). The general structure of these communities can be modeled using hypergraphs, where vertices are items to be tested and edges represent clusters containing high contacts. We consider hypergraphs with non-overlapping edges and overlapping edges (first two examples and last two examples, respectively). We give constructions of what we call structure-aware CFFs, which use the structure of the underlying hypergraph. We revisit old CFF constructions, boosting the number of defectives they can identify by taking the hypergraph structure into account. We also provide new constructions based on hypergraph parameters.

1 Introduction

Group testing literature dates back to the Second World War as an efficient way of testing blood samples for syphilis screening [4, 3]. The idea consists of grouping blood samples together before testing, so that negative results could save hundreds of individual tests. This idea was then applied to many other areas: screening vaccines for contamination, building clone libraries for DNA sequences, data forensics for altered documents, modification tolerant digital signatures [8, 6, 13, 14, 15, 16, 17, 18]. Currently, it is considered a promising scheme for saving time and resources in COVID-19 testing [5, 23, 24, 25, 29]. In fact, several countries, such as China, India, Germany and the United States, have adopted group testing as a way of saving time and resources [23].

In combinatorial group testing (CGT), we are given $n$ items of which at most $d$ are defective (or contaminated). We assume we can test any subset of items: if the result of the test is positive, the subset contains at least one defective (contaminated) item, and if it is negative, all items in the subset are non-defective (uncontaminated). The main goal is to minimize the number of tests for given $n$ and $d$, while determining all defective items. For a comprehensive treatment, see the text by Du and Hwang [4].

Group testing may be adaptive or non-adaptive [4]. Adaptive CGT allows us to decide the next tests according to the results of previous tests. This is the case of the binary splitting algorithm, which meets the information-theoretic lower bound on the number of tests. In this paper, we focus on non-adaptive CGT. Due to test waiting times, non-adaptive CGT is a useful approach, since we decide all groups at once and can run tests in parallel. In addition, in non-adaptive CGT, we can have more balanced group sizes (items in each test), which are limited in some real applications. For COVID-19 screening, researchers are testing how many samples can be grouped together without compromising the detection of positive results [23, 29].

Items and tests in CGT can be represented by a binary matrix where items correspond to columns and tests correspond to rows, and a 1 means that a test uses an item. A $d$-cover-free family (or $d$-CFF$(t, n)$) is a matrix with special properties that guarantee the identification of up to $d$ defective items among $n$ items using $t$ tests and a simple decoding algorithm that takes $O(tn)$ time (see Section 2).
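As a small illustration of this matrix view, the sketch below computes the outcome of every test from a matrix and a set of defective items. The function name `run_tests` and the 3 x 4 matrix are assumptions made for the example and do not come from the paper.

```python
def run_tests(matrix, defective):
    """Return y, where y[i] = 1 iff test (row) i contains a defective item."""
    return [int(any(row[j] == 1 for j in defective)) for row in matrix]


# Hypothetical matrix: 3 tests (rows) on 4 items (columns).
M = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]

print(run_tests(M, {1}))    # [1, 1, 0]
print(run_tests(M, set()))  # [0, 0, 0]
```

All rows can be evaluated independently, which is what makes the non-adaptive setting parallel.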

In this paper, we are interested in applications where the defective items are more likely to appear together in predictable subsets of items, which are given as edges of a hypergraph. For example, if we want to monitor a highly transmissible disease among students in a school, classrooms can be the edges (or regions) where, if there is one infected individual, we are likely to find many. In this way, outbreaks may be detected early while only a few classrooms have infected students. In this model, we are given a hypergraph where items are vertices and regions are edges such that there are at most $r$ edges that together contain all defective vertices. The objective is still to minimize the number of tests while identifying all defective items. A weaker version of the problem consists of simply identifying all infected edges. In this paper we initiate a more systematic study of how to build CFFs for combinatorial group testing under the hypergraph model, which we call structure-aware cover-free families.

Recent related work. A similar hypergraph model has been recently proposed as group testing in connected and overlapping communities in the context of COVID-19 testing [24, 25] and as variable cover-free families motivated by problems in cryptography [16]. The work in [24, 25] spans both adaptive and nonadaptive CGT algorithms, but there is not much emphasis on CGT matrix constructions. Our work is on efficient cover-free family constructions for the hypergraph model. The idea of structure-aware CFFs was introduced in the first author’s PhD thesis [16] under the name of variable CFFs (VCFFs) with an equivalent definition. This was inspired by applications in cryptography, where they would allow for location of clustered modifications in a signed document when using modification-tolerant digital signatures.

Our results and paper structure. Basic concepts for cover-free families are given in Section 2. The new definitions of structure-aware cover-free families and edge-identifying CFFs are given in Section 3 along with related decoding algorithms. CFF constructions for hypergraphs with non-overlapping edges are given in Section 4. In particular, we revisit known $d$-CFF constructions (Sperner, product, array group testing, polynomials in finite fields) and show how they can be viewed as structure-aware CFFs, allowing a much larger defect identification when items are clustered into conveniently chosen hypergraphs. We exemplify how these hypergraphs relate to realistic community-like structures. In a generalization of the Sperner construction ($r = 1$) we also give results under the more realistic assumption of a limited number of samples per test (Section 4.1). CFF constructions for the more general case of hypergraphs with overlapping edges are given in Section 5. We give constructions for both and using edge-colouring and strong edge-colouring of hypergraphs, to partition the hypergraph into non-overlapping subgraphs that can be constructed using results from the previous section. Proofs are in the appendix for refereeing purposes.

2 Cover-Free Families

Cover-free families were first introduced by Kautz and Singleton [19] in the context of superimposed codes. They are found under different names, such as $d$-disjunct matrices and strongly selective families [4, 27]. We can define a $d$-CFF via a matrix or a set system.

Definition 1 (CFF via matrix)

Let $d$ be a positive integer. A $d$-cover-free family, denoted $d$-CFF$(t, n)$, is a $t \times n$ 0-1 matrix where the submatrix given by any set of $d+1$ columns contains a permutation matrix (each row of an identity matrix of order $d+1$) among its rows.

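A set system $\mathcal{F} = (X, \mathcal{B})$ consists of a set $X = \{x_1, \ldots, x_t\}$ and a collection $\mathcal{B} = \{B_1, \ldots, B_n\}$ of subsets of $X$. The set system associated to a matrix $M$ is the set system $\mathcal{F}_M = (X, \mathcal{B})$ with $X$ corresponding to rows and $\mathcal{B}$ corresponding to columns of $M$, where $B_j$ has column $j$ as its characteristic vector, $j = 1, \ldots, n$. A $d$-CFF can be equivalently defined in terms of its set system $\mathcal{F}_M$, by specifying that no set of $d$ columns “covers” any other column.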

Definition 2 (CFF via set system)

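Let $d$ be a positive integer. A $d$-cover-free family, denoted $d$-CFF$(t, n)$, is a set system $\mathcal{F} = (X, \mathcal{B})$ with $|X| = t$ and $|\mathcal{B}| = n$ such that for any $d+1$ subsets $B_{i_0}, B_{i_1}, \ldots, B_{i_d} \in \mathcal{B}$, we have

$$\left| B_{i_0} \setminus \bigcup_{j=1}^{d} B_{i_j} \right| \geq 1. \qquad (1)$$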
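The covering condition can be checked by brute force on small matrices. The sketch below is only meant to make the definition concrete: the name `is_cff` and the example matrix are assumptions, and the running time is exponential in d, so it is not a practical verification method.

```python
from itertools import combinations


def is_cff(matrix, d):
    """Check that no column is covered by the union of d other columns."""
    t, n = len(matrix), len(matrix[0])
    # blocks[j] is the set of rows (tests) in which column (item) j appears.
    blocks = [{i for i in range(t) if matrix[i][j]} for j in range(n)]
    for j0 in range(n):
        others = [j for j in range(n) if j != j0]
        # Covers of size exactly d suffice, since unions only grow.
        for cover in combinations(others, min(d, len(others))):
            union = set().union(*(blocks[j] for j in cover))
            if blocks[j0] <= union:
                return False
    return True


# Columns are the six 2-subsets of a 4-set of rows (a Sperner system).
S = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]

print(is_cff(S, 1))  # True
print(is_cff(S, 2))  # False
```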

Next we show an example of a -CFF(), which can be used to test items with tests and identify up to defective items.

After running the tests on groups of items according to the rows of a $d$-CFF matrix $M$, we can run a simple algorithm to identify the invalid items. When we apply Algorithm 1 with a $d$-CFF matrix and the number of defectives is indeed bounded by $d$, then after the first loop $x$ has at most $d$ nonzero components. So for a $d$-CFF, the second loop can be removed and substituted by a simple check that the number of 1’s in $x$ does not exceed $d$; in this case, the output will be Boolean, i.e. every component is in $\{0, 1\}$, and correct. We give this more general algorithm, used in Section 3. In the case of other types of matrices, or when the hypotheses of testing are not satisfied, the algorithm classifies the items into three types of defective status (yes, no, maybe) according to the information provided by test results. Assuming correct outcomes for group testing, the items with $x_j \in \{0, 1\}$ do not give false positive/negative results.

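Input: Group testing matrix $M$ and test result $y$, with $y_i = 1$ iff the $i$-th test was positive.
Output: Vector $x$, with $x_j = 1, 0, 0.5$ if the $j$-th item is defective, nondefective, unknown, respectively.
$x \leftarrow (1, \ldots, 1)$
for i = 1, …, t do
     for j = 1, …, n do
         if $y_i = 0$ and $M_{i,j} = 1$ then $x_j \leftarrow 0$
for j such that $x_j = 1$ do
     if $\exists\, i$ such that ($M_{i,j} = 1$ and ($x_\ell = 0$ for all $\ell \neq j$ with $M_{i,\ell} = 1$)) then
          $x_j \leftarrow 1$    ▷ Item $j$ is on a failing test together with only non-defective items
     else $x_j \leftarrow 0.5$    ▷ Can’t guarantee $j$ is the cause of failures, but it may be defective
return $x$
Algorithm 1 Non-adaptive CGT algorithm to identify invalid items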
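A minimal Python sketch of this decoding procedure follows. It assumes that every item starts with status 1 and that the second loop looks for a positive test in which all other items are already known to be non-defective. The name `decode` and the example matrix are illustrative choices.

```python
def decode(matrix, y):
    """Ternary decoding: 1 = defective, 0 = non-defective, 0.5 = unknown."""
    t, n = len(matrix), len(matrix[0])
    x = [1] * n  # assumption: every item is a suspect until cleared
    # First loop: any item on a negative test is non-defective.
    for i in range(t):
        for j in range(n):
            if y[i] == 0 and matrix[i][j] == 1:
                x[j] = 0
    # Second loop: keep status 1 only if some test contains the item
    # together with nothing but cleared items.
    for j in range(n):
        if x[j] != 1:
            continue
        witnessed = any(
            matrix[i][j] == 1
            and all(x[l] == 0 for l in range(n) if l != j and matrix[i][l] == 1)
            for i in range(t)
        )
        if not witnessed:
            x[j] = 0.5
    return x


# Columns are the six 2-subsets of a 4-set of rows, which form a 1-CFF.
S = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]

print(decode(S, [1, 0, 0, 1]))  # [0, 0, 1, 0, 0, 0]
print(decode(S, [1, 1, 1, 1]))  # [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
```

The first call corresponds to item 2 being the only defective item. The second corresponds to more defectives than the matrix can handle, so every item ends up with the "unknown" status.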

For a given $n$ and $d$, we are interested in constructing $d$-CFFs with the smallest possible $t$, so we define $t(d, n) = \min\{t : \text{there exists a } d\text{-CFF}(t, n)\}$. For $d = 1$, Sperner’s theorem gives an optimal construction for $1$-CFFs. The value $t(1, n)$ grows as $\log_2 n$ as $n \to \infty$, which meets the information theoretical lower bound on the number of bits necessary to uniquely distinguish the $n$ inputs. For $d \geq 2$, the best known lower bound on $t$ for a $d$-CFF$(t, n)$ is given by $t(d, n) \geq c \frac{d^2}{\log d} \log n$ for some constant $c$ [9, 28, 30], with different values of $c$ proven in [9] and in [28].

For $d \geq 2$, there are several approaches to construct $d$-CFFs; for example, we can use codes and combinatorial designs [22]. Probabilistic methods usually provide the best existence results known, and derandomization techniques can be used to yield efficient algorithms to construct CFFs, such as in [2, 27, 10, 11]. Using this approach, polynomial time algorithms exist to construct a $d$-CFF$(t, n)$ with $t = O(d^2 \log n)$ [2, 27, 10, 11].

3 Structure-aware Cover-Free Families

In this section, we define structure-aware cover-free families (SCFFs) by adding a hypergraph structure to a CFF. Vertices correspond to columns and edges specify sets of columns where defective items may appear more likely together. We use the assumption that defective items are contained in a small number of edges inside of which any number of defective items may be found. For example, the outbreak of a disease in a school/university could be detected by associating vertices with students, edges with classrooms/courses; even if the number of infected students is high, the CFF would detect them as long as they are concentrated in a small number of classrooms/courses.

Definition 3 (Structure-aware CFFs)

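Let $t$ and $r$ be integers. Let $H = (V, \mathcal{S})$ be a hypergraph with $n = |V|$ vertices and $m = |\mathcal{S}|$ edges, and let $M$ be a $t \times n$ binary matrix with associated set system $\mathcal{F}_M = (X, \mathcal{B})$, $\mathcal{B} = \{B_1, \ldots, B_n\}$. Matrix $M$ is a structure-aware cover-free family, denoted $(\mathcal{S}, r)$-CFF$(t, n)$, if for any $r$-set of hyperedges $\{S_1, \ldots, S_r\} \subseteq \mathcal{S}$, and for any $I \subseteq \bigcup_{\ell=1}^{r} S_\ell$ and any $i_0 \in V \setminus I$, we have

$$\left| B_{i_0} \setminus \bigcup_{i \in I} B_i \right| \geq 1. \qquad (2)$$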

We observe that a $d$-CFF$(t, n)$ is equivalent to an $(\mathcal{S}, d)$-CFF$(t, n)$ where edges are singleton vertices, $\mathcal{S} = \{\{v\} : v \in V\}$.
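To make the structure-aware condition concrete, the following brute-force sketch checks it on a toy instance. It assumes the condition reads as follows: for every choice of r edges, no column is covered by the other columns indexed inside those edges. The name `is_scff`, the matrix and the hypergraph are hypothetical.

```python
from itertools import combinations


def is_scff(matrix, edges, r):
    """Brute-force check of the structure-aware covering condition."""
    t, n = len(matrix), len(matrix[0])
    blocks = [{i for i in range(t) if matrix[i][j]} for j in range(n)]
    for chosen in combinations(edges, min(r, len(edges))):
        region = set().union(*chosen)
        for i0 in range(n):
            # The largest admissible I is the region without i0;
            # smaller choices of I can only shrink the union.
            covered = set().union(*(blocks[i] for i in region if i != i0))
            if blocks[i0] <= covered:
                return False
    return True


# Toy instance: two disjoint edges on four items.
# Rows 0-1 test whole edges, rows 2-3 test one position of every edge.
M = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
edges = [{0, 1}, {2, 3}]
singletons = [{0}, {1}, {2}, {3}]

print(is_scff(M, edges, 1))       # True
print(is_scff(M, singletons, 2))  # False
```

With the two edges and r = 1, both items of one edge may be defective and still be identified, while the second call shows that the same matrix is not an ordinary 2-CFF.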

We now consider how the status of edges influences the detectability of defective items. An edge is defective if it contains a defective vertex, and non-defective otherwise. A set of edges is a defect cover if the set of defective vertices is contained in the union of these edges; such a defect cover is minimal if no proper subset is a defect cover. A minimal defect cover is always contained in the set of defective edges, but the number of defective edges may be much larger than the size of a defect cover for hypergraphs with overlapping edges. The next proposition shows that a structure-aware CFF’s ability to detect defectives depends only on the cardinality of a minimum defect cover being bounded by $r$.

Proposition 1

Let $H = (V, \mathcal{S})$ be a hypergraph, $M$ be an $(\mathcal{S}, r)$-CFF$(t, n)$ and $y$ be the result of the tests given by $M$ on items $V$. If $H$ has a defect cover with at most $r$ edges, then Algorithm 1 on inputs $M, y$ returns a Boolean output $x$ such that $x_j = 1$ if and only if item $j$ is defective.

Proof

Let $\mathcal{C}$ be a defect cover with $|\mathcal{C}| \leq r$. Let $j$ be an item and take $I = \left(\bigcup_{S \in \mathcal{C}} S\right) \setminus \{j\}$. Since $M$ is an $(\mathcal{S}, r)$-CFF$(t, n)$, Equation (2) guarantees there exists a row $i$ in $M$ that tests item $j$ and avoids all other defective items. If item $j$ is non-defective, this row will be a passing test, $y_i = 0$, and $x_j$ will be set to $0$ in the first loop. Otherwise, item $j$ is defective, and $x_j$ will remain equal to $1$ at the end of the first loop. In addition, row $i$ contains only non-defective items besides $j$, so the condition in the second loop holds for $j$ and $x_j$ will never be set to 0.5. Therefore, the output will be a Boolean vector that correctly reports the status of the items. ∎

We are also interested in identifying infected edges when the output of Algorithm 1 is not Boolean, which can happen if defective items are spread over too many edges (defect covers have size greater than $r$). For example, in schools the tests may not provide full information on infected students, but we may still extract information on which classrooms are infected. The following algorithm provides edge information based on ternary vertex information for a hypergraph $H$.

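Input: Hypergraph $H = (V, \mathcal{S})$ with $n$ vertices and $m$ edges; Group testing matrix $M$, boolean results $y$; Vector $x$, with $x_j = 1, 0, 0.5$ if the $j$-th item is defective, non-defective, unknown, respectively.
Output: Vector $z$, with $z_e = 1, 0, 0.5$ if the $e$-th edge is defective, nondefective, unknown, respectively.
for $e = 1, \ldots, m$ do    ▷ this loop gets edge status from vertices
     $z_e \leftarrow 0$;
     for each vertex $j$ in edge $S_e$ do
         if $x_j = 1$ then $z_e \leftarrow 1$
         else
              if $x_j = 0.5$ and $z_e = 0$ then $z_e \leftarrow 0.5$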
for  do this loop gets edge status from test results
     if  then
         
         for  do
              if () and (then                               
return
Algorithm 2 Edge information from vertices
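The first loop of Algorithm 2, which derives edge status from vertex status alone, can be sketched as follows. This is a partial illustration under an assumed reading: an edge is defective when it contains a known defective vertex, unknown when its only suspicious vertices are of unknown status, and non-defective otherwise. The second loop, which refines edge status from test results, is not covered, and the name `edge_status_from_vertices` is hypothetical.

```python
def edge_status_from_vertices(edges, x):
    """Return z with z[e] in {1, 0, 0.5} for each edge, from vertex status x."""
    z = []
    for edge in edges:
        status = 0  # non-defective until a suspicious vertex is seen
        for j in edge:
            if x[j] == 1:
                status = 1
            elif x[j] == 0.5 and status == 0:
                status = 0.5
        z.append(status)
    return z


# Hypothetical vertex status for five items spread over three edges.
edges = [[0, 1], [2, 3], [4]]
x = [0, 1, 0, 0.5, 0]
print(edge_status_from_vertices(edges, x))  # [1, 0.5, 0]
```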

Some CFFs may have a given value of $r$ for vertex status identification but a larger value for edge status identification. This can be useful for applications, in that infected communities are identifiable even though we do not have perfect individual identification. To capture this property, we define edge-identifying CFFs (ECFFs), which have a weaker coverage requirement than SCFFs.

Definition 4 (Edge-identifying CFFs)

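Let $t$, $r$, $H$, $M$, and $\mathcal{F}_M$ be as in Definition 3. We say $M$ is an $(\mathcal{S}, r)$-ECFF$(t, n)$ if for any $r$-subset of hyperedges $\{S_1, \ldots, S_r\} \subseteq \mathcal{S}$, $I = \bigcup_{\ell=1}^{r} S_\ell$, and any $i_0 \in V \setminus I$, we have

$$\left| B_{i_0} \setminus \bigcup_{i \in I} B_i \right| \geq 1. \qquad (3)$$
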
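The weaker edge-identifying condition can also be checked by brute force on toy instances. The sketch assumes the condition only compares items outside the chosen edges against the union of all columns inside those edges. The name `is_ecff` and both matrices are illustrative.

```python
from itertools import combinations


def is_ecff(matrix, edges, r):
    """Check that no item outside r chosen edges is covered by those edges."""
    t, n = len(matrix), len(matrix[0])
    blocks = [{i for i in range(t) if matrix[i][j]} for j in range(n)]
    for chosen in combinations(edges, min(r, len(edges))):
        region = set().union(*chosen)
        covered = set().union(*(blocks[i] for i in region))
        for i0 in range(n):
            if i0 not in region and blocks[i0] <= covered:
                return False
    return True


edges = [{0, 1}, {2, 3}]
per_edge_tests = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
single_pool = [[1, 1, 1, 1]]

print(is_ecff(per_edge_tests, edges, 1))  # True
print(is_ecff(single_pool, edges, 1))     # False
```

One test per edge is enough to tell which edge is infected here, although it cannot separate the two items inside an edge.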
Proposition 2

Let $H = (V, \mathcal{S})$ be a hypergraph, $M$ be an $(\mathcal{S}, r)$-ECFF$(t, n)$, and $y$ be the test results for $M$. Let $x$ be the output of Algorithm 1 for inputs $M, y$. Then, if $H$ has a defect cover with at most $r$ edges, Algorithm 2 applied to $H, M, y, x$ returns an output $z$ such that $\{S_e : z_e = 1\}$ forms a defect cover.

Proof

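Let $\mathcal{C}$ be any minimal defect cover with $|\mathcal{C}| \leq r$ and let $I = \bigcup_{S \in \mathcal{C}} S$. Then, any item $j \notin I$ is non-defective and Equation (3) guarantees there is a row $i$ that tests $j$ and avoids all items in $I$, and thus avoids all defective items, which means $y_i = 0$ and Algorithm 1 sets $x_j = 0$. Now, consider any edge $S_e \in \mathcal{C}$ and let $I' = \bigcup_{S \in \mathcal{C} \setminus \{S_e\}} S$. Since $\mathcal{C}$ is minimal, $S_e$ must contain a defective item $j \notin I'$. By Equation (3), using $I'$, there must be a test/row $i$ that contains $j$ and avoids $I'$. Thus, we must have $y_i = 1$, which implies Algorithm 2 sets $z_e = 1$. Therefore, $z_e = 1$ for all $S_e \in \mathcal{C}$ and possibly for a few other edges. Since every superset of a defect cover is a defect cover, $\{S_e : z_e = 1\}$ is a defect cover. ∎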

For any CFF, structure-aware CFF, or ECFF matrix we denote by the number of ones in each row of . We keep track of this quantity in some constructions, since we may have a limit on the number of ones per row, in cases where combining too many samples can result in a false negative.

4 Structure-aware CFFs: non-overlapping edges

We revisit old CFF constructions and show that we can boost the number of defectives they can identify by taking a suitable hypergraph structure into account. We also propose some new constructions. Here we consider the case of non-overlapping edges, meaning that items do not participate in more than one edge.

4.1 Sperner-type constructions for $r = 1$

A Sperner set system is a set system where no set is contained in any other set in the set system. Sperner’s theorem states that the largest Sperner set system on a $t$-set is formed by taking all subsets of cardinality $\lfloor t/2 \rfloor$. Given $n$, a $1$-CFF$(t, n)$ with minimum $t$ is obtained from Sperner’s theorem by taking $t = \min\{s : \binom{s}{\lfloor s/2 \rfloor} \geq n\}$ and the corresponding matrix having the characteristic vectors of $\lfloor t/2 \rfloor$-subsets as columns. We note that $t \sim \log_2 n$ and this is the best possible, since being a 1-CFF is equivalent to being Sperner.
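The Sperner-based construction of a 1-CFF is short enough to sketch directly. The function name `sperner_matrix` is an assumption, and the columns are simply the first n middle-sized subsets in lexicographic order; any other choice of n such subsets would do.

```python
from itertools import combinations
from math import comb


def sperner_matrix(n):
    """Build a 1-CFF with n columns and the fewest rows allowed by Sperner."""
    t = 1
    while comb(t, t // 2) < n:
        t += 1
    # Each column is the characteristic vector of a floor(t/2)-subset of rows.
    subsets = list(combinations(range(t), t // 2))[:n]
    return [[1 if i in s else 0 for s in subsets] for i in range(t)]


M = sperner_matrix(6)
print(len(M), len(M[0]))       # 4 6
print(len(sperner_matrix(12)))  # 6
```

For 6 items, 4 tests suffice because a 4-set has exactly 6 subsets of size 2; for 12 items, 5 rows give only 10 subsets, so 6 rows are needed.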

A Sperner set system with sets with cardinality can be used as a -CFF if exceeds a maximum allowed number of ones per row, . For nonoverlapping hypergraphs and , we give constructions for SCFF for both unlimited and limited .

Proposition 3 (, unlimited )

Let be a hypergraph with disjoint edges of cardinality at most that span . Let be the vertical concatenation of matrices and . Let be obtained from a -CFF matrix with in such a way that if vertex is incident to edge column of repeats column of . Let be a matrix with an identity matrix of dimension up to pasted under the items of each edge . Then, is an -ECFF and is an -CFF.
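One way to read this construction is sketched below. The top part repeats, for every vertex, the Sperner column of the edge it belongs to, and the bottom part places an identity block under the vertices of each edge. The function names, the use of a Sperner system for the edge-level matrix, and the toy hypergraph are assumptions made for illustration, so this is a sketch of the idea and not the authors' exact matrices.

```python
from itertools import combinations
from math import comb


def sperner_columns(m):
    """m distinct floor(t/2)-subsets of range(t), for the smallest feasible t."""
    t = 1
    while comb(t, t // 2) < m:
        t += 1
    return t, list(combinations(range(t), t // 2))[:m]


def edge_repeated_matrix(edges, n):
    """Stack an edge-level Sperner part on top of per-edge identity blocks."""
    t1, subsets = sperner_columns(len(edges))
    k = max(len(edge) for edge in edges)
    top = [[0] * n for _ in range(t1)]
    bottom = [[0] * n for _ in range(k)]
    for e, edge in enumerate(edges):
        for position, vertex in enumerate(sorted(edge)):
            for i in subsets[e]:
                top[i][vertex] = 1        # vertex repeats the column of its edge
            bottom[position][vertex] = 1  # identity block under the edge
    return top + bottom


def identifies_one_edge(matrix, edges):
    """Brute-force check of the covering condition when one edge is infected."""
    t, n = len(matrix), len(matrix[0])
    blocks = [{i for i in range(t) if matrix[i][j]} for j in range(n)]
    for edge in edges:
        for i0 in range(n):
            covered = set().union(*(blocks[i] for i in edge if i != i0))
            if blocks[i0] <= covered:
                return False
    return True


edges = [{0, 1}, {2, 3}, {4, 5}]
M = edge_repeated_matrix(edges, 6)
print(len(M), identifies_one_edge(M, edges))  # 5 True
```

In this toy case, 5 tests identify any set of defectives confined to a single edge of size 2 among 6 items.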

For uniform hypergraphs the construction above gives , but does not limit . The next proposition is useful for limited , as shown in the example that follows it.

Proposition 4 (, )

Let be a positive integer that limits the number of 1s in each row of the CFF. Let be a hypergraph with disjoint edges of cardinality at most that span , where . Let . Then,

  1. If and then given in Proposition 3 is an -CFF with .

  2. Otherwise, let . Take such that and . Then, there exists a -CFF matrix with .

Proof

The first statement comes from Proposition 3. The second statement comes from vertically concatenating and where is formed by a Sperner system of -subsets of a -set on the edges (repeating columns for vertices in the same edges) and is built similarly to in Proposition 3 but splitting rows (the ones in each row are split into up to new rows not exceeding ). ∎

Example 1 (Proposition 4 used for classrooms with students each)

Suppose students are divided into classrooms of bounded size. Then Proposition 4 can be used to identify all infected students, provided they are all in a single classroom ($r = 1$). The table below reports the number of tests for each scenario, depending on the limit on the number of ones per row, for the construction in Proposition 4. The second-to-last line shows the number of tests for the construction with no such limit (Proposition 3). The last line shows the lower bound given in [9] for the number of rows of a CFF required to locate any set of infected students, not necessarily concentrated in a single classroom.


Number of tests per scenario ("limit" is the maximum number of ones per row):

students (x100)    .5    1    2    3    1    2    4    6  1.5    3    6    9    2    4    8   12  2.5    5   10   15    3    6   12   18
classrooms         10   10   10   10   20   20   20   20   30   30   30   30   40   40   40   40   50   50   50   50   60   60   60   60
classroom size      5   10   20   30    5   10   20   30    5   10   20   30    5   10   20   30    5   10   20   30    5   10   20   30
limit 10           11   16   26   36   17   27   47   67   23   38   68   98   28   48   88  128   33   58  108  158   39   69  129  189
limit 15           11   16   26   36   17   27   47   67   18   28   48   68   23   38   68   98   28   48   88  128   29   49   89  129
limit 20           11   16   26   36   12   17   27   37   18   28   48   68   18   28   48   68   23   38   68   98   24   39   69   99
limit 25           10   16   26   36   12   17   27   37   18   28   48   68   18   28   48   68   18   28   48   68   24   39   69   99
limit 30           10   16   26   36   12   17   27   37   14   18   28   38   18   28   48   68   18   28   48   68   19   29   49   69
unlimited          10   15   25   35   11   16   26   36   12   17   27   37   13   18   28   38   13   18   28   38   13   18   28   38
lower bound [9]    21   66  180  270   21   66  231  496   21   66  231  496   21   66  231  496   23   66  231  496   25   66  231  496
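The row for an unlimited number of ones per test can be reproduced with a short calculation, assuming the Proposition 3 matrix has one Sperner part indexed by classrooms plus one row per position inside a classroom. Under that assumption the number of tests is the smallest Sperner size for the number of classrooms, plus the classroom size. The helper names are hypothetical.

```python
from math import comb


def sperner_rows(m):
    """Smallest t such that a t-set has at least m subsets of size floor(t/2)."""
    t = 1
    while comb(t, t // 2) < m:
        t += 1
    return t


def unlimited_tests(classrooms, size):
    """Assumed test count: Sperner part on classrooms plus one row per seat."""
    return sperner_rows(classrooms) + size


row = [unlimited_tests(m, k)
       for m in (10, 20, 30, 40, 50, 60)
       for k in (5, 10, 20, 30)]
print(row)
# [10, 15, 25, 35, 11, 16, 26, 36, 12, 17, 27, 37,
#  13, 18, 28, 38, 13, 18, 28, 38, 13, 18, 28, 38]
```

For example, 60 classrooms of 30 students (1800 students) need 8 + 30 = 38 tests when all infections are in one classroom.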

4.2 Kronecker product constructions (general )

Let $A_k$ be an $m_k \times n_k$ binary matrix, for $k = 1, 2$, and $\mathbf{0}$ be the matrix of all zeroes with the same dimension as $A_2$. The Kronecker product $P = A_1 \otimes A_2$ is a binary matrix formed of blocks $P_{i,j}$ such that $P_{i,j} = A_2$ if $(A_1)_{i,j} = 1$ and $P_{i,j} = \mathbf{0}$, otherwise. We denote by $\mathbf{1}_k$ the row matrix with $k$ ones and by $I_k$ the identity matrix of dimension $k$. The propositions given after each theorem specialize the theorem construction and generalize it to SCFFs, boosting the defective detection.
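For binary matrices stored as lists of rows, the Kronecker product can be sketched in a few lines. The function names `kron`, `ones_row` and `identity` are illustrative helpers, not notation from the paper.

```python
def kron(a, b):
    """Kronecker product: every 1 in a becomes a copy of b, every 0 a zero block."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def ones_row(k):
    """Row matrix with k ones."""
    return [[1] * k]


def identity(k):
    """Identity matrix of dimension k."""
    return [[1 if i == j else 0 for j in range(k)] for i in range(k)]


A = [[1, 0], [0, 1]]
print(kron(A, ones_row(3)))       # [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]
print(kron([[1, 1]], identity(2)))  # [[1, 0, 1, 0], [0, 1, 0, 1]]
```

Multiplying by a row of ones repeats each column of the first matrix, while multiplying by an identity spreads each 1 into a diagonal block, which separates the items inside a block.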

Theorem 4.1 (Li et al. [22] for $d = 2$, Idalino and Moura [18])

Let $A_1$ be a $d$-CFF$(t_1, n_1)$ and $A_2$ be a $d$-CFF$(t_2, n_2)$, then $A_1 \otimes A_2$ is a $d$-CFF$(t_1 t_2, n_1 n_2)$.

Proposition 5

Let be a hypergraph formed by disjoint edges of cardinality , . Let be a positive integer, and let be an -CFF. Then is an -ECFF and is an -CFF.

Theorem 4.2 (Li et al. [22] for , Idalino and Moura [18])

Let , be a -CFF, be a -CFF, be a -CFF. Let be the vertical concatenation of with . Then is a CFF.

Proposition 6

Let be a hypergraph formed by disjoint edges of cardinality , . Let be a positive integer, be an -CFF, and be an -CFF. Then the vertical concatenation of with is an -CFF. Moreover, if edges have different cardinalities bounded by , a similar construction yields an -CFF.

Figure 1: Two structure-aware CFFs for a hypergraph that consists of 12 disjoint edges of size 3: the construction in Proposition 5, using a CFF(9,12), and the construction in Proposition 6, using a CFF(9,12) and a CFF(6,12). Up to six defective items concentrated within 2 edges can be identified.

4.3 Array and Hypercube Constructions

An array-based scheme for group testing uses an array, where each entry of the array corresponds to an item to be tested and the tests are performed on rows and columns, for a total of tests. This can be used in a 2-stage algorithm, where all items at the intersection of a positive row and column should be individually tested in a second stage to resolve ambiguities [26, 12, 20]. For $d = 1$ defective item, one stage is enough. Figure 2 (a) shows a array with defective items in red. This idea can be generalized to higher dimensions, constructing a hypercube [1, 21], which is a -CFF. Figure 2 (b) shows a -dimensional hypercube, where each point represents an item and tests are given by fixing the value of one dimension. If all defective items are clustered in either a row or a column in a 2-dimensional array, we can precisely identify all of them in one round, thus this is a structure-aware -CFF for corresponding to rows and columns. We generalize this for higher dimensions in the next proposition. To simplify the notation, we take , but the next results are valid for the general case. An -hypercube group testing matrix is an -CFF matrix defined as follows. Items are in and rows/tests are given by , , . Denote for , .
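The two-dimensional array scheme can be illustrated with a short simulation. Items are cells of a grid, one test pools each row and one pools each column, and the candidates are the cells at the intersection of a positive row and a positive column. The 5 x 5 grid size and the function names are assumptions chosen for the example.

```python
def array_tests(rows, cols, defective):
    """Return the outcomes of the row tests and of the column tests."""
    row_pos = [any((r, c) in defective for c in range(cols)) for r in range(rows)]
    col_pos = [any((r, c) in defective for r in range(rows)) for c in range(cols)]
    return row_pos, col_pos


def candidates(row_pos, col_pos):
    """Cells lying on both a positive row and a positive column."""
    return {
        (r, c)
        for r, hit_r in enumerate(row_pos) if hit_r
        for c, hit_c in enumerate(col_pos) if hit_c
    }


clustered = {(2, 0), (2, 4)}  # both defectives in row 2
scattered = {(0, 0), (1, 1)}  # defectives in different rows and columns

print(candidates(*array_tests(5, 5, clustered)) == clustered)  # True
print(len(candidates(*array_tests(5, 5, scattered))))          # 4
```

When the defectives share a row, the candidate set is exactly the defective set, so one round is enough. When they are scattered, two extra cells become candidates and a second stage would be needed.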

Figure 2: (a) A array GT with items and 10 tests. (b) A hypercube GT with items and tests.

Proposition 7

Let be an -hypercube group testing matrix. Let where , , and let where . Then, for any , is an -CFF() and if , is also an -ECFF(). Moreover, is an -CFF().

4.4 Construction from polynomials

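Now we look at a construction of $d$-CFFs from polynomials over finite fields, given by Erdös et al. [7]. Let $q$ be a prime power, $k$ a positive integer, and $\mathbb{F}_q$ be a finite field. We define $M$ ($q^2 \times q^{k+1}$) as follows, for each $x, y \in \mathbb{F}_q$ and each polynomial $f \in \mathbb{F}_q[x]$ of degree at most $k$: $M_{(x,y),f} = 1$ if $f(x) = y$, and $M_{(x,y),f} = 0$ otherwise. Then, $M$ is a $d$-CFF$(q^2, q^{k+1})$ for $d \leq \frac{q-1}{k}$.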
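A small instance of the polynomial construction can be built and checked by brute force. The sketch below assumes q is prime, so that arithmetic modulo q gives the field (a prime power that is not prime would need proper finite field arithmetic). Rows are indexed by pairs (x, y) and columns by coefficient tuples. The names `polynomial_cff` and `is_cff` are illustrative.

```python
from itertools import combinations, product


def polynomial_cff(q, k):
    """Rows: pairs (x, y); columns: polynomials of degree <= k over Z_q, q prime."""
    polys = list(product(range(q), repeat=k + 1))  # coefficient tuples

    def evaluate(coeffs, x):
        return sum(c * pow(x, e, q) for e, c in enumerate(coeffs)) % q

    rows = [(x, y) for x in range(q) for y in range(q)]
    return [[1 if evaluate(f, x) == y else 0 for f in polys] for (x, y) in rows]


def is_cff(matrix, d):
    """Brute-force check that no column is covered by d other columns."""
    t, n = len(matrix), len(matrix[0])
    blocks = [{i for i in range(t) if matrix[i][j]} for j in range(n)]
    for j0 in range(n):
        others = [j for j in range(n) if j != j0]
        for cover in combinations(others, min(d, len(others))):
            if blocks[j0] <= set().union(*(blocks[j] for j in cover)):
                return False
    return True


M = polynomial_cff(3, 1)
print(len(M), len(M[0]))  # 9 9
print(is_cff(M, 2))       # True
print(is_cff(M, 3))       # False
```

With q = 3 and k = 1, two distinct polynomials agree on at most one point, so two columns can cover at most two of the three ones of any other column. Three columns can cover all three, which matches the bound of (q - 1) / k = 2.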

This $d$-CFF has an interesting structure, which allows us to discard some rows when smaller values of $d$ are enough [15]. We restrict the CFF matrix to blocks of rows by considering , and , which yields the following result.

Proposition 8 (Idalino and Moura [15], Theorem 3.2)

Let be a prime power, and , and let be the -CFF() obtained from the polynomial construction. If we restrict to the first blocks of rows, we obtain a -CFF(), for any .

For instance, for and , if we restrict a -CFF() to its first blocks of rows, we get a -CFF(), with blocks of rows we get a -CFF(), etc. Next we show that this construction is a structure-aware CFF that can tolerate as many as errors with as few as tests.

Theorem 4.3

Let and be a prime power such that . Let be a set-partition of such that for all . Then, there exists an -CFF(). If , it is also an -ECFF().

Proof

Each column of the 01-matrix is associated with a polynomial of degree at most k. Letting , identify the blocks with , for . Each row of is associated with pair , and if and only if . Let , for some , be a set of defective items. We need to show that for any column , there exists a row s.t. and , . Let and . We consider two cases.
Case i) : Taking , we know for any , ; for otherwise, since they already have the same evaluation for , this would imply they would be the same polynomial.
Case ii) : Let and take . We claim and for all . Indeed, by the block definitions, for , .

If , one block of rows in has each test coinciding with each edge. Thus, is also a -ECFF().∎

As an example, for and we have edges This gives us an -CFF() with