The concept of test pooling was apparently invented in 1943 by Robert Dorfman [Dor43], who suggested that it would be more effective to test WW2 would-be recruits for syphilis by mixing the blood samples of several recruits and testing the pool for antigens. If the pool tests negative then all the pool members are deemed healthy; otherwise, each member of the pool is tested separately. A simple analysis (see next section) shows that for a given probability $p$ that a recruit is infected there is an optimum pool size $s_1(p)$ that minimizes the expected number of needed tests. The lower the $p$, the larger the $s_1$ and the lower the expected number of tests required. Dorfman’s analysis has been further refined and generalized to deal with various problems such as false negatives [LLZA12] and studied as part of the broad topic of Combinatorial Group Testing [DH93, AJS19].
Note that some recursive and adaptive approaches dear to computer scientists, such as binary search, often may not work for this problem: there are pragmatic limitations on (a) the size of the pool beyond which dilution results in too many false negatives; (b) the number of samples available from a given specimen; and (c) the total time required to produce an answer.
Nevertheless, the emergence of the COVID-19 pandemic and the cost and scarcity of tests for the underlying virus have revived enormous interest in test pooling. For COVID-19, test pooling has been shown to be doable with pools as large as 64 [YAST20] and is already in use in several countries including Germany [SCea] and Israel [YAST20].
The purpose of this note is to propose a simple variation on Dorfman’s approach that we call double pooling; for clarity, we refer to Dorfman’s method as single pooling. Double pooling works as follows: given a probability $p$ of a positive test, pick an optimal pool size $s_2 = s_2(p)$. (The optimal $s_2$ is larger than the corresponding optimal $s_1$ for single pooling.) Divide the population to be tested into non-overlapping pools of size $s_2$ (the division is assumed to be random) twice. Thus every patient now belongs to two pools and is tested in two parallel rounds, $A$ and $B$. For every patient, if both pools test positive then test the patient individually. Otherwise consider that patient cleared. If the pool tests never produce false negatives the algorithm is clearly correct. (False positives only reduce efficiency.)
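The procedure just described can be sketched as a small simulation. This is a minimal illustration, not the authors' implementation: the function names, the `seed` parameter, and the idealized pool test (positive exactly when the pool contains a positive patient, with no false negatives or false positives) are all assumptions made for the example.

```python
import random

def positive_pool_members(status, pool_size, rng):
    """One round: randomly partition patients into pools of `pool_size`
    and return everyone who sits in a pool that tests positive."""
    order = list(range(len(status)))
    rng.shuffle(order)
    suspects = set()
    for start in range(0, len(order), pool_size):
        pool = order[start:start + pool_size]
        # Idealized pool test (assumption): positive iff any member is positive.
        if any(status[i] for i in pool):
            suspects.update(pool)
    return suspects

def double_pooling(status, pool_size, seed=0):
    """Run rounds A and B, then individually test patients suspected in both.
    Returns (set of patients found positive, total number of tests used)."""
    rng = random.Random(seed)
    pools_per_round = -(-len(status) // pool_size)  # ceiling division
    suspects_a = positive_pool_members(status, pool_size, rng)
    suspects_b = positive_pool_members(status, pool_size, rng)
    retest = suspects_a & suspects_b          # both pools positive
    found = {i for i in retest if status[i]}  # individual tests, assumed exact
    return found, 2 * pools_per_round + len(retest)

if __name__ == "__main__":
    status = [False] * 1012
    for i in range(11):
        status[i * 90] = True  # 11 positives at arbitrary positions
    found, tests = double_pooling(status, pool_size=23, seed=1)
    print(len(found), "positives found using", tests, "tests")
```

Because a positive patient makes both of their pools positive, every positive patient lands in the retest set, which is why the sketch never misses a case under the no-false-negative assumption.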
It turns out that double pooling is particularly advantageous for small $p$, corresponding to testing a large population of asymptomatic patients, but it remains more efficient than single pooling for moderately larger $p$ as well.
We will discuss double pooling in the next section in more detail, but to see its advantages and build an intuitive understanding we start with an example.
Assume that $p = 0.0111$. It turns out that $s_1 = 10$ is the optimal size for single pooling with this $p$ and results in an expected cost of $0.206$ tests/patient, a nice improvement over testing everyone. However, using double pooling the optimum $s_2$ is 23 and the expected cost further declines to just $0.145$, an almost $30$% improvement. (These quantities will be obvious from our analysis later.)
At first blush these gains might seem surprising, but here is a quick-and-dirty computation: assume that we are testing 1000 patients and remember that $p = 0.0111$, so we will posit that we have exactly 11 positive patients.
For single pooling, since $s_1 = 10$, we start with $1000/10 = 100$ tests. Assuming all positive cases end up in separate pools (an upper bound), we will need to do another 110 tests to deal with all the suspicious cases, hence a total of $100 + 110 = 210$ tests (which is close enough to $0.206 \times 1000 = 206$).
For double pooling, since $s_2 = 23$, we do twice 44 tests with 23 patients each (88 tests). In each round at most 11 tests will come back positive, raising suspicions about a total of at most $11 \times 22 = 242$ healthy patients. Thus a given healthy patient has probability less than $1/4$ of being a suspect in round A, and the same probability in round B. These events are quasi-independent, hence the probability of being suspected twice is only about $1/16$. Thus we expect fewer than $63$ ($\approx 1000/16$) healthy patients to be in a positive pool in both rounds and to need retesting. In addition, the 11 truly positive patients will be retested as well. Thus the total number of tests is about $88 + 63 + 11 = 162$ (which is reasonably close to the claimed $145$, since we overestimated and also because we in fact now cover $44 \times 23 = 1012$ patients).
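The back-of-the-envelope count above can be replayed in a few lines. This is only a sketch of the same rough argument (every positive patient in a separate pool, the two rounds treated as independent); the variable names are ours, and the suspect fraction is kept unrounded, so the double pooling total comes out slightly below the rounded bound.

```python
patients, positives = 1000, 11

# Single pooling: pools of 10, and every positive assumed to spoil its own pool.
s1 = 10
single_total = patients // s1 + positives * s1

# Double pooling: 44 pools of 23 per round, which covers 44 * 23 patients.
s2, pools_per_round = 23, 44
covered = pools_per_round * s2
suspected_healthy = positives * (s2 - 1)       # healthy members of positive pools
prob_suspect_once = suspected_healthy / covered
# Rounds treated as independent (the "quasi-independent" step of the argument).
expected_double_suspects = covered * prob_suspect_once ** 2
double_total = 2 * pools_per_round + expected_double_suspects + positives

print(f"single pooling: about {single_total} tests")
print(f"double pooling: about {double_total:.0f} tests for {covered} patients")
```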
In conclusion, the “magic” of double pooling comes from a paradigm that has been observed in many other situations, e.g., Bloom filters [Blo70, BM03] and balanced allocations [ABKU99, Mit01]. Although the probability of being “unlucky” in a given trial might be high, the probability of being unlucky in two or more independent trials decreases dramatically.
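Consider the expected cost attributable to one patient in the single pooling situation, where the size of the pool is $s$:
- if the patient is positive, then the cost is $\frac{1}{s} + 1$ (the patient’s share of the pool + their individual test);
- if the patient is negative, then the expected cost is $\frac{1}{s} + 1 - (1-p)^{s-1}$ (the patient’s share of the pool + their individual test iff not all the other patients are healthy).
Since the probability of being positive is $p$, the total expected cost per patient in this case is
$$C_1(p, s) = p\left(\frac{1}{s} + 1\right) + (1-p)\left(\frac{1}{s} + 1 - (1-p)^{s-1}\right) = \frac{1}{s} + 1 - (1-p)^{s}.$$
To determine the pool size $s_1(p)$ that minimizes the total for a given $p$, we take the derivative of the cost with respect to $s$:
$$\frac{\partial C_1}{\partial s} = -\frac{1}{s^2} - (1-p)^{s} \ln(1-p)$$
and set it to $0$.
The solution of interest can be expressed in terms of the Lambert $W$ function (https://en.wikipedia.org/wiki/Lambert_W_function), namely
$$s_1(p) = \frac{2\, W_0\!\left(-\tfrac{1}{2}\sqrt{-\ln(1-p)}\right)}{\ln(1-p)},$$
where $W_0$ denotes the principal branch.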
Let us define $p^*$ as the value of $p$ for which the optimum $s_1$ is exactly 10. It turns out that $p^* \approx 0.0111$, which is the value we used in the introductory example. More generally, Figure 1 shows the optimum integer $s_1$ as a function of $p$.
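As a numerical cross-check of the single pooling analysis, the sketch below evaluates the expected cost $\frac{1}{s} + 1 - (1-p)^s$, finds the best integer pool size by brute force, and solves the stationarity condition for the probability at which a given pool size is exactly optimal. The function names, the search bound `max_size`, and the fixed-point iteration are our own choices for illustration, not part of the original note.

```python
import math

def single_cost(p, s):
    """Expected tests per patient for single pooling with pool size s."""
    return 1.0 / s + 1.0 - (1.0 - p) ** s

def best_integer_pool(cost, p, max_size=200):
    """Brute-force search for the integer pool size minimizing cost(p, s)."""
    return min(range(1, max_size + 1), key=lambda s: cost(p, s))

def probability_for_optimum(s):
    """Probability p at which the continuous optimum of single_cost is exactly s.

    Setting the derivative to zero gives L * exp(-s * L) = 1 / s**2 with
    L = -ln(1 - p); for the moderate s of interest here this is solved by
    the fixed-point iteration L <- exp(s * L) / s**2.
    """
    L = 1.0 / s ** 2
    for _ in range(200):
        L = math.exp(s * L) / s ** 2
    return 1.0 - math.exp(-L)

if __name__ == "__main__":
    p_star = probability_for_optimum(10)
    s1 = best_integer_pool(single_cost, p_star)
    print(f"p* = {p_star:.5f}, best integer pool = {s1}, "
          f"cost = {single_cost(p_star, s1):.4f} tests/patient")
```

Brute force over integers is used instead of the Lambert $W$ closed form so that the sketch needs only the standard library; the closed form gives the continuous optimum, which still has to be rounded to an integer pool size in practice.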
Let us turn to double pooling: now each patient will be assigned to two random pools, each of size $s$. A patient will be tested individually iff both their pools test positive. Again let us look at the expected cost induced by the testing of one patient:
- if the patient is positive, then the cost is $\frac{2}{s} + 1$ (the patient’s share of the two pools + their individual test);
- if the patient is negative, then the expected cost is $\frac{2}{s} + \left(1 - (1-p)^{s-1}\right)^2$ (the patient’s share of the pools + their individual test iff both pools test positive).
Hence the total expected cost is
$$C_2(p, s) = p\left(\frac{2}{s} + 1\right) + q\left(\frac{2}{s} + \left(1 - q^{s-1}\right)^2\right) = \frac{2}{s} + p + q\left(1 - q^{s-1}\right)^2,$$
where for brevity $q$ stands for $1-p$.
As before, to determine the pool size $s_2(p)$ that minimizes the total cost for a given $p$, we take the partial derivative of the cost with respect to $s$:
$$\frac{\partial C_2}{\partial s} = -\frac{2}{s^2} - 2\, q^{s} \left(1 - q^{s-1}\right) \ln q$$
and set it to $0$.
This has to be solved numerically for each $p$. Solving this at $p = p^*$ we see that the optimum is $s_2 \approx 23$. More generally, Figure 2 shows the optimum integer $s_2$ as a function of $p$ and Figure 3 shows the expected cost per patient tested as a function of $p$ for both single and double pooling using optimal integer values of $s_1$ and $s_2$.
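Since the double pooling optimum has no closed form, a direct numerical search is the simplest check. The sketch below assumes the expected cost $\frac{2}{s} + p + q(1 - q^{s-1})^2$ with $q = 1 - p$, as derived from the two cases above, and compares it with single pooling; the function names, the brute-force search, and the rounded probability `0.011121` are illustrative assumptions.

```python
def single_cost(p, s):
    """Expected tests per patient for single pooling with pool size s."""
    return 1.0 / s + 1.0 - (1.0 - p) ** s

def double_cost(p, s):
    """Expected tests per patient for double pooling with pool size s."""
    q = 1.0 - p
    return 2.0 / s + p + q * (1.0 - q ** (s - 1)) ** 2

def best_integer_pool(cost, p, max_size=200):
    """Brute-force search for the integer pool size minimizing cost(p, s)."""
    return min(range(1, max_size + 1), key=lambda s: cost(p, s))

if __name__ == "__main__":
    p = 0.011121  # rounded probability at which single pooling's optimum is 10
    s1 = best_integer_pool(single_cost, p)
    s2 = best_integer_pool(double_cost, p)
    c1, c2 = single_cost(p, s1), double_cost(p, s2)
    print(f"single: s = {s1}, cost = {c1:.4f}")
    print(f"double: s = {s2}, cost = {c2:.4f}")
    print(f"savings of double over single: {100 * (1 - c2 / c1):.1f}%")
```

Sweeping `p` over a grid with the same two functions is one way to regenerate curves of the kind the figures describe.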
In principle we can generalize double pooling to $k$-pooling, whereby each patient participates in $k$ independent pools in $k$ parallel rounds. The expected cost becomes
$$C_k(p, s) = \frac{k}{s} + p + q\left(1 - q^{s-1}\right)^k, \qquad q = 1 - p.$$
Depending on $p$ this can yield further improvements, but they are probably impractical, especially as $k$ gets larger. (With triple pooling for $p = p^*$ and about 1000 samples we would need only 128 tests with pools of size 36, and with quadruple pooling only 122 tests with pools of size 47.) Even more asymptotically efficient tests can be constructed [MT11, PR11], but it is unclear whether they can be practical.
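The generalized cost is easy to tabulate. The sketch below assumes the form $\frac{k}{s} + p + q(1 - q^{s-1})^k$ with $q = 1 - p$, which reduces to the single and double pooling costs for $k = 1$ and $k = 2$; the function name and the rounded probability are our own choices.

```python
def k_pool_cost(p, s, k):
    """Expected tests per patient when each patient sits in k pools of size s."""
    q = 1.0 - p
    return k / s + p + q * (1.0 - q ** (s - 1)) ** k

if __name__ == "__main__":
    p = 0.011121  # rounded probability used in the running example
    # Pool sizes quoted in the text for k = 1..4.
    for k, s in [(1, 10), (2, 23), (3, 36), (4, 47)]:
        tests = 1000 * k_pool_cost(p, s, k)
        print(f"k = {k}, pool size {s}: about {tests:.0f} tests per 1000 samples")
```

The diminishing returns are visible directly: each extra round adds a fixed $1/s$ per patient while shrinking a retest term that is already small.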
We presented double pooling, a simple, easy-to-implement variation on test pooling, that in certain ranges for $p$, the a priori probability of a positive test, is significantly more efficient than the standard single pooling approach. Figure 4 shows the percentage of savings of double pooling over single pooling as a function of $p$. We can see that double pooling is particularly advantageous for small $p$, corresponding to large-scale testing of asymptomatic patients, and that it retains an advantage over single pooling for moderately larger $p$.
Our analysis assumes sampling from an infinite distribution, but in practice, double pooling can be implemented after accumulating a fairly small collection of samples. (There is a small efficiency penalty due to the correlation between rounds that we will discuss in the final version.) The main disadvantage of double pooling is that it is more sensitive to dilution-induced false negatives for two reasons: one physical: the pools used are larger; and one mathematical: a true positive sample will be missed if either of its two pools produces a false negative. We will discuss this further in the final version.
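The mathematical part of this sensitivity can be made concrete under a simplifying assumption that is ours rather than the note's: suppose each pool test containing a positive sample independently returns a false negative with probability $f$, the same $f$ for both methods, and that individual tests are exact. Then the probability of missing a true positive is

```latex
\[
  P_{\text{miss}}^{\text{single}} = f,
  \qquad
  P_{\text{miss}}^{\text{double}} = 1 - (1 - f)^2 = 2f - f^2 \approx 2f
  \quad \text{for small } f .
\]
```

In practice $f$ would itself grow with the pool size because of dilution, so the physical and mathematical effects compound; this sketch isolates only the second one.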
We are reaching out to our colleagues in the medical field to find out whether double pooling is practically usable for COVID testing and will update our note with their feedback.
Presently, there is an extraordinary flurry of activity and independent work on group testing for COVID. This includes an analysis of a single pooling method [Gol20] and a proposal based on binary search [Gos20]. It might well be the case that independent researchers have already obtained the same results presented here. We are encouraging members of the community to send us their comments and feedback.
We thank our colleagues Fernando Pereira and Tamás Sarlós for many useful comments.
- [ABKU99] Yossi Azar, Andrei Z. Broder, Anna R. Karlin, and Eli Upfal. Balanced allocations. SIAM J. Comput., 29(1):180–200, 1999.
- [AJS19] Matthew Aldridge, Oliver Johnson, and Jonathan Scarlett. Group testing: An information theory perspective. Foundations and Trends in Communications and Information Theory, 15(3-4):196–392, 2019.
- [Blo70] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422–426, 1970.
- [BM03] Andrei Z. Broder and Michael Mitzenmacher. Network applications of Bloom filters: A survey. Internet Mathematics, 1(4):485–509, 2003.
- [DH93] Ding-Zhu Du and Frank K. Hwang. Combinatorial Group Testing and its Applications. World Scientific, 1993.
- [Dor43] Robert Dorfman. The detection of defective members of large populations. Ann. Math. Stat., 14:436–440, 1943.
- [Gol20] Christian Gollier. Optimal group testing to exit the COVID confinement. Toulouse School of Economics, 2020. Technical report.
- [Gos20] Olivier Gossner. Group testing against COVID-19. Center for Research in Economics and Statistics, 2020. Working Papers 2020-02.
- [LLZA12] Aiyi Liu, Chunling Liu, Zhiwei Zhang, and Paul S. Albert. Optimality of group testing in the presence of misclassification. Biometrika, 99:245–251, 2012.
- [Mit01] Michael Mitzenmacher. The power of two choices in randomized load balancing. IEEE Trans. Parallel Distrib. Syst., 12(10):1094–1104, 2001.
- [MT11] Marc Mezard and Cristina Toninelli. Group testing with random pools: Optimal two-stage algorithms. IEEE Transactions on Information Theory, 57(3):1736–1745, 2011.
- [PR11] Ely Porat and Amir Rothschild. Explicit nonadaptive combinatorial group testing schemes. IEEE Transactions on Information Theory, 57(12):7982–7989, 2011.
- [SCea] Erhard Seifried, Sandra Ciesek, et al. Pool testing of SARS-CoV-2 samples increases test capacity. https://www.medica.de/de/News/Redaktionelle_News/Pool-Testen_von_SARS-CoV-2_Proben_erhoht_Testkapazitat.
- [YAST20] Idan Yelin, Noga Aharony, Einat Shaer-Tamar, Amir Argoetti, Esther Messer, Dina Berenbaum, Einat Shafran, Areen Kuzli, Nagam Gandali, Tamar Hashimshony, Yael Mandel-Gutfreund, Michael Halberthal, Yuval Geffen, Moran Szwarcwort-Cohen, and Roy Kishony. Evaluation of COVID-19 RT-qPCR test in multi-sample pools. medRxiv, 2020.