1.1 Background and Motivation
In the quantitative group testing (QGT) problem, whose roots can be traced back to work of Dorfman [dorfman_1943], Erdős and Rényi [erdos_1963] and Shapiro [shapiro_1960], some individuals out of a large population suffer from a rare disease. The goal is to efficiently identify those infected individuals. To this end, we are equipped with a testing procedure, whereby we can pool individuals into groups. Each test outputs the number of infected individuals in the tested group. The goal is to devise a test design that identifies the infected individuals with the fewest tests. In the literature, this problem has been alternately studied under the name of quantitative group testing [cao_2014, karimi_2019, martins_2014, wang_2015, wang_2017], coin weighing [bshouty_2009, djackov_1975, erdos_1963, fine_1960, hwang_1987, shapiro_1960] or as a special case of the pooled data problem [alaoui_2017, scarlett_2017, wang_2016]. In recent years, the problem has attracted renewed attention and found a wide range of applications, from computational biology [cao_2014, sham_2002] and traffic monitoring [wang_2015] to confidential data transfer [adam_1989, dinur_2003] and further areas [martins_2014, wang_2016].
The prevalent test design in the QGT literature assigns individuals to several tests by placing each individual independently and randomly into tests [alaoui_2017, karimi_2019, lee_2015, scarlett_2017, wang_2016]. In this paper, we employ a similar model originating from related statistical inference problems [aldridge_2016, aco_2019, johnson_2019] under which the size of each test remains fixed and participants are assigned uniformly at random with replacement. To be precise, we create a random bipartite multigraph $G$ with $n$ vertices $x_1, \dots, x_n$ “on the left” and $m$ vertices $a_1, \dots, a_m$ “on the right”. Vertices $x_1, \dots, x_n$ represent the individuals, while $a_1, \dots, a_m$ represent the tests. Two vertices $x_i$ and $a_j$ are connected if and only if individual $x_i$ participates in test $a_j$. See Figure 1 for an example. The graph will feature multiedges with high probability (w.h.p., i.e., with a probability that tends to 1 as $n \to \infty$), signifying individuals included in a test more than once. The vertices $x_1, \dots, x_n$ are colored with values in $\{0,1\}$ by $\sigma \in \{0,1\}^n$, indicating whether an individual is healthy or infected. The number of infected individuals $k$ can either be a constant fraction of the total population (linear regime) or grow sublinearly in the total population size $n$. The latter is the regime to which this paper is devoted.
Given $n$ and $k$ and a suitable choice of the degree $\Gamma$ of the test vertices, we are interested in the minimum number of tests $m$ needed to correctly identify the infected individuals with vanishing error probability. As in many inference problems, this question is two-fold. First, the information-theoretic perspective asks for the smallest number of tests if we have unlimited computational power at our disposal and are not concerned with the algorithmic running time to infer the true configuration $\sigma$. Let us denote this threshold as $m_{\mathrm{inf}}$. Second, what is the minimum number of tests such that a polynomial-time algorithm returns the correct configuration, which we will denote by $m_{\mathrm{alg}}$? Clearly, it holds that $m_{\mathrm{inf}} \le m_{\mathrm{alg}}$.
QGT fits nicely into a group of statistical inference problems, where the goal is to infer a hidden truth based on some observed signal. One notable problem in this regard that is closely related to QGT is binary group testing. The difference from QGT is that each test result does not output the number of infected individuals, but merely whether at least one infected individual is included in the test. Over the past years, both the linear and sublinear regime for binary group testing have attracted considerable attention and are by now well understood [aco_2019, johnson_2019, scarlett_2017]. For QGT, the current state of research is different. While the linear case is completely resolved by pioneering work of [alaoui_2017, scarlett_2017], only first attempts have been made to understand the sublinear regime [karimi_2019]. In this paper, we resolve this open problem and pin down the sharp information-theoretic threshold for the sublinear regime that exactly extends the linear regime threshold by [alaoui_2017, scarlett_2017]. This information-theoretic bound constitutes the primary achievement of the present paper. To this end, we borrow techniques from the theory of random constraint satisfaction problems. The guiding question is how many sets of infected individuals besides the correct set exist that are consistent with the test results. We demonstrate that w.h.p. only one configuration of individuals is consistent with the test results when the number of tests $m$ exceeds the information-theoretic threshold $m_{\mathrm{inf}}$, whereas many such configurations exist when $m$ falls below $m_{\mathrm{inf}}$, thereby deriving a sharp phase transition at $m_{\mathrm{inf}}$.
Similarly, most efficient algorithms have so far only been suggested and analyzed for the linear case, the most notable among them being the approximate message passing algorithm by [alaoui_2017]. Like all efficient algorithms suggested so far, it scales in the number of infected individuals and is therefore not order-optimal from an information-theoretic perspective. Recently, [karimi_2019] proposed the first algorithm for the sublinear regime, which is inspired by error-correcting codes and attains the same order as the message passing algorithm by [alaoui_2017]. In this paper, we present a greedy algorithm called Maximum Neighborhood (MN) that outperforms the algorithm by [karimi_2019] for certain regimes. Therefore, in combination with the bound by [alaoui_2017] it provides a new algorithmic bound for the sublinear regime. The algorithm proceeds by first identifying the total number of infected individuals in the neighborhood of each individual and then declaring the individuals with the highest (normalized) neighborhood as infected. Since neither the previously known algorithms nor the MN-Algorithm is order-optimal in terms of the information-theoretic bound, an exciting avenue for future research is to explore algorithms that either attain or get closer to the information-theoretic bound. In the following, we will state the main results of this paper precisely and provide a detailed discussion of prior literature on the quantitative group testing problem. The proofs are outlined in Section 2.
1.2 The Information-Theoretic Threshold
In our model, we set the size of each test to exactly , which maximizes the entropy of the test results. The individuals are chosen uniformly at random with replacement. Accordingly, the number of tests per individual is . Moreover, the test design is non-adaptive, meaning that all tests have to be specified upfront and an adjustment based on prior test results is not allowed. The characteristic of this paper is that we assume that the number of infected individuals grows as a polynomial in , i.e., for . It thereby extends the current literature in a way that fits well into other inference problems [aco_2019, scarlett_2017], where considerable attention has been devoted to the sublinear regime. Let
$\sigma \in \{0,1\}^n$ be a vector of Hamming weight $k$ chosen uniformly at random, where the one-entries represent the infected individuals. The vector $\sigma$ and the random bipartite multigraph $G$ described above enable us to derive $\hat\sigma = (\hat\sigma_1, \dots, \hat\sigma_m)$, which represents the sequence of test results. Specifically, $\hat\sigma_j$ is the number of infected individuals in test $a_j$, counted with multiplicity. Observe that an infected individual can participate in a test more than once and thereby contribute to the sum multiple times. We are interested in the minimum $m$ so that we can infer $\sigma$ from $G$ and $\hat\sigma$. Our first theorem shows that the corresponding information-theoretic threshold known for the linear case [alaoui_2017, scarlett_2017] extends to the sublinear regime. (All logarithms in this paper are to base $e$.)
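The model described above can be simulated directly. The following is a minimal sketch under stated assumptions, not code from the paper: individuals are indexed from 0, each test is a list of indices drawn uniformly with replacement, and the test size is taken to be half the population. The function name `sample_instance` and all parameter values are hypothetical.

```python
import random

def sample_instance(n, k, m, test_size, rng):
    """Draw a random QGT instance with fixed test size and replacement."""
    # Ground truth: exactly k infected individuals, chosen uniformly at random.
    infected = set(rng.sample(range(n), k))
    sigma = [1 if i in infected else 0 for i in range(n)]
    # Each test draws test_size individuals uniformly WITH replacement,
    # so the same individual may occupy several slots of one test.
    tests = [[rng.randrange(n) for _ in range(test_size)] for _ in range(m)]
    # A test result counts infected slots, i.e. with multiplicity.
    results = [sum(sigma[i] for i in test) for test in tests]
    return sigma, tests, results

rng = random.Random(0)
n, k, m = 100, 5, 20
# Assumption of this sketch: test size equal to half the population.
sigma, tests, results = sample_instance(n, k, m, n // 2, rng)
```

Here `tests[j]` plays the role of the multiset of individuals in test $j$, and `results[j]` is the corresponding test result.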
Suppose that , and and let
If , there exists an algorithm that given outputs w.h.p.
If , there does not exist any algorithm that given outputs with a non-vanishing probability.
The theorem is two-fold. Our main contribution is showing that for , w.h.p. there only exists one configuration which given satisfies , namely the true configuration , so that it is information theoretically possible to infer . The second part of Section 1.2 was already established for both the linear and the sublinear regime. It follows for instance from [djackov_1975, Theorem 1] by applying Stirling’s formula. Indeed, while [djackov_1975] only shows that for many satisfying configurations exist, it follows from the proof of the information-theoretic upper bound in this paper that any other randomly chosen satisfying configuration will w.h.p. be far away from the true configuration. A brief outline of this argument can be found in Appendix C.
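The notion of a configuration being consistent with the test results can be illustrated by brute force on a toy instance. The sketch below is an illustration under assumptions, not the paper's procedure: it enumerates every set of $k$ individuals and keeps those that reproduce all observed test results. The function name `consistent_configurations` and the tiny parameters are hypothetical, and the enumeration is exponential in general.

```python
import random
from itertools import combinations

def consistent_configurations(n, k, tests, results):
    """Return all size-k infected sets that reproduce the observed results."""
    found = []
    for candidate in combinations(range(n), k):
        members = set(candidate)
        # Recompute every test result under the candidate, with multiplicity.
        if all(sum(1 for i in test if i in members) == r
               for test, r in zip(tests, results)):
            found.append(candidate)
    return found

rng = random.Random(1)
n, k, m = 12, 3, 8
truth = tuple(sorted(rng.sample(range(n), k)))
# Assumption of this sketch: each test holds n // 2 slots drawn with replacement.
tests = [[rng.randrange(n) for _ in range(n // 2)] for _ in range(m)]
results = [sum(1 for i in test if i in truth) for test in tests]
solutions = consistent_configurations(n, k, tests, results)
```

The true configuration is always among `solutions`. Exhaustive search identifies it exactly when the list has length one.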
1.3 A Novel Efficient Algorithm
Having determined a sharp information-theoretic bound for the sublinear regime, the important question is how close efficient algorithms can come to this bound. Quite recently the first algorithm has been analyzed for the sublinear case [karimi_2019] that is inspired by BCH codes and attains a bound of with . In this paper, we make a further step towards understanding the algorithmic solvability of the sublinear regime of QGT by analyzing a plain greedy strategy. The algorithm is based on calculating the total sum of infected individuals in the tests an individual participates in and accordingly will be labeled Maximum Neighborhood (MN). The algorithm is defined in Algorithm 1.
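The greedy strategy just described can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: tests are lists of individual indices drawn with replacement, neighborhood sums and degrees are counted with multiplicity, scores are normalized by dividing by the degree, and the test size is taken to be half the population. The function name `maximum_neighborhood` and all parameter values are hypothetical.

```python
import random

def maximum_neighborhood(n, k, tests, results):
    """Greedy decoder: flag the k individuals with the largest normalized score."""
    score = [0] * n   # sum of adjacent test results, counted with multiplicity
    degree = [0] * n  # number of test slots occupied by each individual
    for test, r in zip(tests, results):
        for i in test:
            score[i] += r
            degree[i] += 1
    # Assumption of this sketch: normalize by dividing by the degree.
    normalized = [score[i] / degree[i] if degree[i] else 0.0 for i in range(n)]
    ranked = sorted(range(n), key=lambda i: normalized[i], reverse=True)
    top = set(ranked[:k])
    return [1 if i in top else 0 for i in range(n)]

rng = random.Random(0)
n, k, m = 200, 5, 1000  # deliberately generous number of tests for a toy run
infected = set(rng.sample(range(n), k))
sigma = [1 if i in infected else 0 for i in range(n)]
tests = [[rng.randrange(n) for _ in range(n // 2)] for _ in range(m)]
results = [sum(sigma[i] for i in test) for test in tests]
estimate = maximum_neighborhood(n, k, tests, results)
```

The number of tests in this toy run is chosen far above what the analysis requires, so that the scores of infected and healthy individuals separate cleanly. It is not meant to reflect the algorithmic bound.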
The next theorem is concerned with performance guarantees for the MN-Algorithm. We define
Suppose that , and . If , then Algorithm 1 outputs w.h.p. on input .
This plain greedy algorithm outperforms the algorithm by [karimi_2019] in ultra-sparse regimes, i.e., for . Therefore, we can now state an algorithmic bound as the combination of both:
Like previously suggested algorithms, our algorithm does not achieve the order of the information-theoretic bound. An exciting avenue for future research is to investigate whether other algorithms can be order-optimal or even achieve the information-theoretic bound. However, it might also be the case that QGT exhibits an impossible-hard-easy transition similar to the one observed for many other statistical inference problems, where the best known efficient algorithms do not attain the information-theoretic bounds.
1.4 Related Work and Discussion
The order for the minimum number of tests follows from a simple information-theoretic argument. Specifically, each test admits a maximum of different test results. The total number of test result configurations must exceed the number of possible configurations with infected individuals and therefore . It follows for that
In Dorfman’s original work [dorfman_1943], group testing was carried out adaptively, i.e., the test results of earlier rounds were used to inform the design of subsequent tests. So, if a test result returned no infected individual, no further test would be required since every individual is necessarily healthy. In contrast, further tests would be specified for individuals in a test that returned one or more infected individuals. The adaptive information-theoretic bound for QGT works out to be and an efficient algorithm is known that attains this bound [bshouty_2009]. In contrast, [djackov_1975] established an information-theoretic lower bound for non-adaptive QGT at for all sparsity regimes.
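The counting argument at the beginning of this subsection can be made concrete numerically. The following minimal sketch assumes that a single test returns one of $k+1$ possible values (an assumption made for illustration; with repeated participants a test could in principle return larger values), and writes $n$ for the population size, $k$ for the number of infected individuals and $m$ for the number of tests. The function names are hypothetical.

```python
import math

def log_binomial(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def counting_bound(n, k):
    """Smallest real m with (k + 1)^m >= C(n, k), under the k + 1 outcomes assumption."""
    return log_binomial(n, k) / math.log(k + 1)

def sparse_approximation(n, k):
    """First-order approximation k * ln(n / k) / ln(k) of the counting bound."""
    return k * math.log(n / k) / math.log(k)

# Example with k = n^theta for theta = 1/2.
n, k = 10**6, 1000
exact = counting_bound(n, k)
approx = sparse_approximation(n, k)
```

For $k = n^\theta$ the approximation equals $\frac{1-\theta}{\theta} k$, and the ratio of the two quantities tends to one as $n$ grows, since $\ln \binom{n}{k} = k \ln(n/k) + O(k)$.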
While adaptive group testing might seem like the natural design for group testing and initially attracted most attention, recent years were characterized by an increasing popularity of non-adaptive group testing, where all tests have to be specified upfront [aldridge_2014, karimi_2019, scarlett_2017, wang_2017, zhang_2013]. It is also the focus of the paper at hand. The crucial idea behind non-adaptive group testing is to assign individuals to several tests and then infer the status from the combined wisdom of the tests the individual participates in. The reasons behind the popularity of non-adaptive designs are two-fold. First, tests are often time-consuming and non-adaptive designs allow tests to be carried out in parallel rather than sequentially. Second, they allow for significant automation in processing the tests. Due to these advantages, some of today’s most important applications in QGT are non-adaptive, such as DNA screening [sham_2002], traffic monitoring [wang_2015] and computational biology [cao_2014].
The distinguishing feature of the present work is that we set $k = n^\theta$ for $\theta \in (0,1)$, thereby considering a setting where the number of infected individuals grows sublinearly in $n$. The study of the sublinear regime for QGT was initiated by [karimi_2019] and is inherently interesting. For most real-world applications, the occurrence of an event, i.e., infection by a disease or the presence of certain gene properties, scales sublinearly in the observed individuals or items. Prominent examples are Heap’s law of epidemiology [benz_2008] and decoding of genomes [emad_2014]. Not surprisingly, research on binary group testing in recent years has increasingly focused on and by now features a vast literature on the sublinear regime [aldridge_2017, aldridge_2017_a, aldridge_2014, aldridge_2016, baldassini_2013, aco_2019, johnson_2019, scarlett_2016]. Therefore, rigorously understanding the sublinear regime from an information-theoretic and algorithmic perspective constitutes the logical next step for research on QGT. In addition to introducing a novel algorithm, we pinpoint the sharp information-theoretic threshold for this sublinear regime. Our proof techniques resemble those used in [alaoui_2017] for the linear regime, complemented by an argument precluding other configurations with large overlaps with $\sigma$. This latter argument is new, but necessary for the sublinear regime since certain asymptotics that hold for small overlaps fail to hold for large overlaps.
Throughout the paper, denotes the random bipartite multigraph with describing the number of tests each individual participates in. The vector encodes which individuals are infected, and indicates the test results where . When we refer to any configuration and not the true one, we simply write for the configuration and for the corresponding test result vector. Moreover, for signifies the number of infected individuals. Additionally, we write for the set of all individuals and and for the set of healthy and infected individuals, respectively. For an individual we write for the multiset of tests adjacent to . Analogously, for a test we denote by the multiset of individuals that take part in the test. In the presence of multiedges, one individual may appear more than once in .
For each , we let be the sum of test results for all tests adjacent to . Obviously, the status of has a significant impact on this sum, increasing it by , if individual is infected. To account for this effect, we introduce a second variable that sums the adjacent test results and excludes the impact of the status of individual . Formally, for any configuration
Furthermore, let and . When we consider the specific instance , we will write and for the sake of brevity. Notably, while is known to the observer or an algorithm instantly from the test results, is not, since the individual infection status is unknown.
In subsequent sections, all asymptotic notation refers to the limit . Thus, denotes a term that vanishes in the limit of large , while stands for a function that diverges to as . We let denote a positive function from the natural numbers to such that
While we will assume that as for the information-theoretic bound, we will see that the algorithmic bound requires being a function of . As described before, every test is sized exactly with individuals assigned uniformly at random with replacement. If an individual participates in a given test more than one time, it will increase multiple times if it was infected. Given ,
is a vector of random variables with. Denote by the -algebra generated by the random bipartite graph. In particular, given , the sequence is given. Let , . Similarly, let be i.i.d. binomial variables describing the number of infected individuals per test with . Given , we obtain the sequences . Define as the event that
For the information-theoretic bound, we would like to characterize alternative configurations yielding the same test result as the true configuration. To this end, let be the set of all vectors of Hamming weight such that
In words, contains the set of all vectors with ones that label the individuals infected and healthy in a way consistent with the test results. Let .
Deriving a sharp information-theoretic bound for the sublinear regime is the principal achievement of the current work. This section provides an outline of the proof. As an information theoretic lower bound already exists [djackov_1975] that coincides with the upper bound we are able to show, we only prove part a) of Section 1.2. Moreover, we give the description and analysis of a greedy algorithm for the sublinear regime. The technical details are left to the appendix.
2.1 Information-Theoretic Upper Bound
The proof rests on techniques that are regularly employed for random constraint satisfaction problems [achlioptas_2011, achlioptas_2006, alaoui_2017, molloy_2012]. We aim to characterize the number of configurations that satisfy the test result and demonstrate that for , w.h.p., i.e., there only exists one (namely the true) configuration with infected individuals satisfying the test result. This configuration can be found via exhaustive search. Therefore, we introduce as the number of alternative configurations that are consistent with the test results and have an overlap of with . The overlap signifies the number of infected individuals under that are also infected under the alternative configuration. Formally, we define
We aim to show that for , w.h.p.,
. To this end, two separate arguments are needed. First, we show via a first moment argument that no second satisfying configuration can exist with a small overlap with. Second, we employ the classical coupon collector argument to show that a second satisfying configuration cannot exist for large overlaps, i.e., one individual flipped from healthy under to infected under an alternative configuration initiates a cascade of other changes in infection status to correct for this initial change. Though the proof relies on knowing exactly upfront, this assumption can readily be removed by just performing one additional test, where all individuals are included and which therefore returns . A similar two-fold argument was recently used to settle some important open problems for binary group testing [aco_2019].
The following two propositions rule out configurations with a small and a big overlap, respectively.
Let and and assume that . W.h.p. we have for all . The core idea is to derive an expression for and use Markov’s inequality to show that when , . Therefore, we would like to demonstrate that when , as .
The proofs of the following Lemmas are included in the appendix. Here, we provide the combinatorial meaning of the initial term of and the rationale behind its simplifications. The initial term reads as
The combinatorial meaning is immediate. The binomial coefficients count the number of configurations of overlap with . The subsequent term measures the probability that a specific configuration yields the same test result vector as . To this end, we divide individuals into three categories. The first contains those individuals exhibiting the same status under and , while the second and third category feature those individuals that are infected under and healthy under and vice versa. The probability for an individual to be in the second or third category is each, while the probability in the first category is . The key observation is that a test result is the same between and , if the number of individuals in the second category is identical to the number in the third category. We compute the sum over the amount of individuals to be flipped. Since the probability term allows for an individual included in a test multiple times to be both infected and healthy, the expression is an upper bound to . Simplifying the term yields the first lemma: For every and a random variable , we have
Using standard asymptotics, we are able to simplify this expression. For every , and , we have
The key question is how to choose so that for every and . We find that takes its maximum at . Therefore, the r.h.s. of (2.1) becomes negative, if and only if the number of tests , parametrized by , is larger than . This is formalized in the following lemma concluding the proof of Section 2.1. For every , and we have
While we could already establish that there are no feasible configurations that have a small overlap with the true configuration , we still need to ensure that there are no feasible configurations that are close to . Indeed, we can exclude configurations with a large overlap with the next proposition. Let and and assume that . W.h.p. we have for all . The proof is detailed in the technical appendix. It follows the classical coupon collector argument. If we consider a configuration different from with the same Hamming weight , at least one individual that is infected under , is labeled healthy under an alternative configuration . W.h.p. this individual participates in at least tests, whose results all change by . To compensate for these changes, we need to find individuals that are healthy under and infected under . By the coupon collector arguments (Appendix A), we require at least such individuals w.h.p., which establishes Section 2.1.
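The balls-and-bins threshold behind the coupon collector argument can be observed in a small simulation. The sketch below is an illustration under assumptions, not part of the proof: it throws balls uniformly into bins and estimates how often at least one bin stays empty, once well below and once well above the classical threshold of (number of bins) times ln(number of bins). All names and parameter values are hypothetical.

```python
import math
import random

def has_empty_bin(balls, bins, rng):
    """Throw balls uniformly at random and report whether some bin stays empty."""
    hit = [False] * bins
    for _ in range(balls):
        hit[rng.randrange(bins)] = True
    return not all(hit)

def empty_bin_frequency(balls, bins, trials, rng):
    """Fraction of independent trials in which at least one bin stays empty."""
    return sum(has_empty_bin(balls, bins, rng) for _ in range(trials)) / trials

rng = random.Random(0)
bins = 200
threshold = bins * math.log(bins)  # classical coupon collector threshold
below = empty_bin_frequency(int(0.5 * threshold), bins, 100, rng)
above = empty_bin_frequency(int(2.0 * threshold), bins, 100, rng)
```

Well below the threshold an empty bin is almost always present, while well above it an empty bin is rare, which is the dichotomy the argument exploits.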
For the linear case, several efficient algorithms have been suggested that require tests [alaoui_2017, karimi_2019]. The only analyzed algorithm for the sublinear regime is due to [karimi_2019] and is based on error-correcting codes. Here, we propose a plain greedy algorithm for the sublinear regime which outperforms this algorithm for certain regimes.
2.2.1 Performance Guarantees
Recall the random variables and , which denote the vectors consisting of the sum of the test results of any individual, once including the impact of this individual and once excluding it. as defined in Algorithm 1 is derived from by normalizing with the individual-specific number of tests. The MN-Algorithm proceeds by sorting the individuals according to and labeling the individuals with the highest as infected. We note that the normalizing constant from to vanishes in the large system limit. As our analysis is devoted to the asymptotic behavior of Algorithm 1, our proof will be based on rather than . The advantage of the normalizing factor comes into effect for moderate , where simulations show a significant improvement in the performance of the MN-Algorithm. Denote by
the joint distribution ofand by the distribution of . As and given , the total variation distance of and vanishes, i.e.,
In the first step, we would like to get a handle on the distribution of . Clearly, are identically distributed between infected and healthy individuals. However, are not, since an infected individual increases by . The central idea behind the greedy algorithm is that the different distributions of between infected and healthy individuals do not overlap w.h.p. and therefore labeling the individuals with the highest reliably recovers .
Let us start by characterizing the distribution of and . For any individual irrespective of its own infection status, . Correspondingly, .
Clearly, . The crucial idea behind showing that the algorithm succeeds, is to identify an , so that for all healthy individuals and for all infected individuals w.h.p. In that case, the distributions of do not overlap between the group of infected and healthy individuals and selecting the individuals with the highest recovers w.h.p.
The following two lemmas describe the probability for a single healthy or infected individual to be below or above the threshold stipulated by , respectively. The lemmas can be proved by a carefully executed Chernoff argument. A detailed calculation can be found in the technical appendix. For any and any constant it holds that
For any and any constant it holds that
Proof of Section 1.3.
With Sections 2.2.1 and 2.2.1, we are in a position to prove Section 1.3. From Section 2.2.1 we know the probability that the neighborhood of a healthy individual deviates by more than from its expectation. Similarly, Section 2.2.1 gives us the corresponding probability for an infected individual. By (8) it follows that replacing with in Sections 2.2.1 and 2.2.1 only adds a multiplicative error of . We need to ensure that the union bound over all healthy and infected individuals vanishes as respectively, i.e., the two distributions are separated w.h.p. Formally, we need to identify a function and a value such that
For (9) to hold, each individual term needs to vanish in the large limit. The first term of gives
Equivalently, we obtain
Since the first expression of (12) is strictly decreasing in , while the second is strictly increasing, the expression is minimized for such that both expressions are equal. As a result, we get
concluding the proof of the theorem. ∎
2.2.2 Empirical Analysis
In Figure 1(a), we compare the number of tests needed for successful reconstruction of on a log-log-scale against the population size for different -regimes. The theoretical bound suggests that the number of tests under the MN algorithm scales in . This property implies that the slope of the curves in Figure 1(a) should be close to . Indeed, the simulation demonstrates this behavior even for small values of .
Figure 1(b) visualizes the probability for successful recovery of against different numbers of tests for . Even for the small population size, we observe the phase transition as predicted by Section 1.3 (shown as dashed lines) up to a constant factor of at most . Overall, the implementation hints at the practical usability of the MN algorithm even for small population sizes.
The technical appendix contains the proofs of the main body of the paper. Appendix A features standard results on concentration bounds for binomial distributions and asymptotics that will be of use in the proofs of subsequent sections. In Appendix B, we present the proofs for the information-theoretic upper bound based on the small and large overlap argument sketched in Section 2.1. Appendix C contains the outline on establishing the strengthened version of the information-theoretic lower bound. Appendix D deals with the algorithmic bound of the MN algorithm. Throughout the appendix, we keep the notation introduced before. Moreover, in line with the main body, we set
For the information-theoretic and algorithmic perspective, we set . This choice is order-optimal by the plain information-theoretic argument of Equation 2 and maximizes the entropy of the test results.
Appendix A Preliminaries
In this section we present some standard results on concentration bounds for the parameters occurring in the described testing scheme. Afterwards, we present some technical lemmas from the theory of concentration inequalities for the binomial distribution and approximation results for random walks that are used throughout the proof section.
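We begin, for the convenience of the reader, with the basic Chernoff-Hoeffding bound [hoeffding_1963]. If $p, q \in (0,1)$, denote by
$$D(q \,\|\, p) = q \ln\frac{q}{p} + (1-q) \ln\frac{1-q}{1-p}$$
the Kullback-Leibler divergence of $\mathrm{Be}(q)$ and $\mathrm{Be}(p)$. Then the Chernoff-Hoeffding bound reads as follows. Let $X \sim \mathrm{Bin}(N, p)$ and $p < q < 1$. Then
$$\mathbb{P}(X \ge qN) \le \exp\left(-N \, D(q \,\|\, p)\right),$$
and the analogous bound holds for $\mathbb{P}(X \le qN)$ when $0 < q < p$.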
As a weaker, but often sufficient bound, we get the well-known Chernoff bound. Let and . Then
In QGT, the underlying factor graph is bipartite. The structure is induced by degree sequences and . Observe that will feature many multiedges w.h.p. The chosen test design is randomized. Nevertheless we can apply standard techniques to gain insight into the form of the underlying graph. Appendices A and A will be used to gain a better understanding about bounds of and of the underlying factor graph .
Given , with probability , we find that
Given the random experiment leading to , it follows that each is distributed as independently of all other sources of randomness. Then Appendix A implies
Taking the union bound over all individuals implies the lemma. ∎
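The Chernoff-Hoeffding bound used for such binomial quantities can be checked numerically. The sketch below assumes the bound in its standard form, $\mathbb{P}(X \ge qN) \le \exp(-N \, D(q \| p))$ for $X \sim \mathrm{Bin}(N, p)$ and $p < q < 1$, with $D$ the Kullback-Leibler divergence between two Bernoulli distributions. The symbol $N$, the function names and the parameter values are hypothetical choices for illustration.

```python
import math

def kl_bernoulli(q, p):
    """Kullback-Leibler divergence D(q || p) between Bernoulli(q) and Bernoulli(p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def binomial_upper_tail(trials, p, threshold):
    """Exact P(X >= threshold) for X ~ Bin(trials, p)."""
    return sum(math.comb(trials, j) * p**j * (1 - p)**(trials - j)
               for j in range(threshold, trials + 1))

trials, p, threshold = 400, 0.5, 240
q = threshold / trials  # relative threshold, here 0.6 > p
exact = binomial_upper_tail(trials, p, threshold)
bound = math.exp(-trials * kl_bernoulli(q, p))
```

The exact tail never exceeds the bound. A union bound over all individuals then multiplies such a tail estimate by the population size.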
Given the random experiment leading to , with probability , we find
The following lemma is a standard result ([Asymptopia, Section 12.3]) that describes a strict phase transition in the balls and bins experiment.
[Coupon Collector] Suppose that $N$ balls are thrown uniformly at random into $B$ bins. For any $\varepsilon > 0$, the probability that there is at least one empty bin is $1 - o(1)$ if $N \le (1 - \varepsilon) B \ln B$. On the other hand, this probability becomes $o(1)$ if $N \ge (1 + \varepsilon) B \ln B$.
The following lemmas are results on the asymptotic behavior of random walks. A random walk can be described by its transition probabilities. The simple random walk on $\mathbb{Z}$ has the transition probabilities $p_{i,i+1} = p_{i,i-1} = 1/2$.
([Asymptopia, Section 1.5]) The probability that a one-dimensional simple random walk with $2j$ steps will end at its original position is asymptotically given by $(\pi j)^{-1/2}$.
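The return probability of the simple random walk can be compared with its asymptotic form. The sketch below assumes the standard parametrization: after $2j$ steps the walk is back at its origin with probability $\binom{2j}{j} 4^{-j}$, which by Stirling's formula behaves like $(\pi j)^{-1/2}$. The function names are hypothetical.

```python
import math

def return_probability(half_steps):
    """Exact probability that a simple random walk is back at 0 after 2 * half_steps steps."""
    return math.comb(2 * half_steps, half_steps) / 4**half_steps

def asymptotic_return_probability(half_steps):
    """Stirling approximation 1 / sqrt(pi * half_steps)."""
    return 1.0 / math.sqrt(math.pi * half_steps)

exact = return_probability(1000)
approx = asymptotic_return_probability(1000)
```

The relative error of the approximation is of order $1/j$, so the two values agree closely once the number of steps is moderately large.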
The following asymptotic equivalence holds for every s.t. .
Let be a distribution, a real-valued function and . Then the Jensen gap is defined as
A well known upper bound on the Jensen gap for functions s.t. for all (see equation (1) of [gao_2018]) is given by
An immediate consequence is the following corollary. Let s.t. . Then, as , the following holds.
Appendix B Proof of the Information-Theoretic Upper Bound
B.1 Proof of Section 2.1
Proof of Section 2.1.
The product of the two binomial coefficients simply accounts for the number of configurations that have overlap with . Hence, with denoting the event that one specific that has overlap with belongs to , it suffices to show for that
By the pooling scheme, the size of each test is fixed to with individuals chosen uniformly at random with replacement. Clearly, all tests are independent of each other. Therefore, we need to determine the probability that for a specific and a specific test the test result is consistent with the test result under , i.e., . Given the overlap , we know for a uniformly at random drawn that and finally holds for all individuals . It can readily be derived that given and , we find
The last two components of (21) describe the probability that a one-dimensional simple random walk will return to its original position after steps, which is by Appendix A equal to . The term before describes the probability that a random variable takes the value . As long as , the expectation of given and , is at least of order , such that the asymptotic description of the random walk return probability is feasible. Note that if gets closer to , the expectation of gets finite, s.t. the random walk approximation is not feasible anymore. Therefore, using Appendix A, we can, as long as , simplify (21) in the large-system limit to