Efficient identification of infected sub-population

by Anže Slosar, et al.

When testing for infections, the standard method is to test each subject individually. If the testing methodology is such that samples from multiple subjects can be efficiently combined and tested at once, yielding a positive result if any one subject in the subgroup is positive, then one can often identify the infected sub-population with considerably fewer tests than there are test subjects. We present two such methods that allow an increase in testing efficiency (in terms of the total number of tests performed) by a factor of ≈ 10 if the population infection rate is $10^{-2}$ and by a factor of ≈ 50 when it is $10^{-3}$. Such methods could be useful when testing large fractions of the total population, as will perhaps be required during the current coronavirus pandemic.



I Introduction

During the recent coronavirus outbreak in the US, it has been suggested on Twitter that poor people who are unable to pay for COVID-19 testing can instead opt for coughing onto a rich person and wait for them to be tested. This has prompted this author to think about how a set of poor people should go about coughing onto a limited number of rich people in order to optimally determine who is infected among them. Somewhat less morbidly, the problem is the exact identification of infected people from a pool of $N$ people using fewer than $N$ tests.

Of course, this is only possible if samples from multiple people are somehow combined. In this note we consider a test in which samples from several subjects are combined into a single sample, which tests positive if one or more constituent subjects are positive. Whether this is viable in practice is beyond the scope of this note, but one would naively expect that it should be possible with testing methodology that relies on detecting trace viral fragments. Since we still need samples from all subjects, this and related techniques make sense only if testing, rather than collecting samples, is the resource-limiting step. With these caveats, let us proceed to the calculation.

| $r$ | True infected | D&C iterations | D&C $N_g$ per iteration | D&C tests | D&C cost | Coding $N_g$ | Coding $\xi$ | Coding false positives | Coding tests | Coding cost | Theoretical min. cost |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $10^{-2}$ | 1077 | 8 | 68, 35, 20, 12, 7, 4, 2, 1 | 13461 | 0.13 | 69 | 6 | 1950 | 8700/11727 | 0.09/0.12 | 0.08 |
| $10^{-3}$ | 106 | 12 | 692, 357, 196, 125, 77, 48, 30, 19, 12, 7, 3, 1 | 2003 | 0.020 | 693 | 9 | 301 | 1305/1712 | 0.013/0.017 | 0.011 |

Table 1: Results for both methods using a toy example of a perfect test and an infection rate $r$ of either $10^{-2}$ or $10^{-3}$. $N = 10^5$ in both cases. For the Divide and Conquer (D&C) method we show the number of iterations required to converge, the values $N_g$ took during these iterations and the final cost (number of tests divided by $N$). For the Group coding method we show the values of $N_g$ and $\xi$ used (given by Eqs. 2 and 4, rounded to the nearest integer) and the total number of tests used. We give two values of cost: the lower number is without a second pass to weed out false positives. In the last column we give the information-theoretical minimum cost of Eq. 1.

II Context

Let $r$ be the overall rate of infections in a population of size $N$. The information content in who is infected and who is not is given by

$$ I = -N \left[ r \log_2 r + (1-r) \log_2 (1-r) \right] \tag{1} $$

in bits, i.e. it would take on average that many questions with a yes/no answer to uniquely determine who is infected. Since each testing procedure gives at most one bit of information, it also sets the theoretical lower bound on the required number of tests. The numbers are $I \approx 0.08N$ for $r = 10^{-2}$ and $I \approx 0.011N$ for $r = 10^{-3}$. The lower the population infection rate, the fewer bits of information are needed to describe it. Of course, the existence of this lower bound does not actually guarantee that a better method exists or is practicable.
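As a quick numerical check of this bound, the following minimal Python sketch evaluates the binary-entropy expression for the two infection rates considered in this note. The function name `information_bits` and the population size of 100000 are illustrative choices and are not taken from the author's notebook.

```python
import math

def information_bits(rate: float, population: int) -> float:
    """Bits needed to specify who is infected among `population` subjects."""
    if rate <= 0.0 or rate >= 1.0:
        return 0.0  # no uncertainty if nobody or everybody is infected
    per_subject = -(rate * math.log2(rate) + (1.0 - rate) * math.log2(1.0 - rate))
    return population * per_subject

population = 100_000  # illustrative population size
for rate in (1e-2, 1e-3):
    cost = information_bits(rate, population) / population
    print(f"rate = {rate:g}: at least {cost:.3f} tests per subject")
```

The cost per subject does not depend on the population size, so the same lower bound applies to any population with the same infection rate.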

By a similar token, a single test is optimally informative when there is the same probability of getting a positive or a negative answer. Let us combine samples from $N_g$ subjects and assume that the test is positive if any one of them is positive. For an infection rate $r$, the probability of the test being negative is $(1-r)^{N_g}$, and requiring this to be $1/2$ we get

$$ N_g = -\frac{\log 2}{\log (1-r)} \tag{2} $$

So the main trick is to combine subjects in groups of size $N_g$, so that a test on such a group has about the same probability of being positive or negative, which maximizes the information gain from the test. However, we of course need to repeat tests in order to identify the actually infected subjects. Below we give two example methods.
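The balanced group size can be computed directly from the requirement that a pooled test is negative half of the time. The sketch below is only an illustration; the helper name `balanced_group_size` is hypothetical.

```python
import math

def balanced_group_size(rate: float) -> float:
    """Pool size for which a pooled test is negative with probability 1/2."""
    # Solve (1 - rate) ** size == 1/2 for size; log1p(-rate) is log(1 - rate).
    return -math.log(2.0) / math.log1p(-rate)

for rate in (1e-2, 1e-3):
    size = balanced_group_size(rate)
    p_negative = (1.0 - rate) ** round(size)
    print(f"rate = {rate:g}: group size {size:.1f}, "
          f"P(negative) at the rounded size = {p_negative:.3f}")
```

Rounded to the nearest integer, these sizes agree with the group sizes 69 and 693 listed for the Group coding method in Table 1.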

II.1 Method 1: Divide and Conquer

Method 1 is a simple divide and conquer approach. We use the estimate of the group size $N_g$ from Eq. 2 to make a first pass over the entire population, splitting it into $N/N_g$ groups. As discussed above, approximately half of these groups will test negative, and the other half are now "concentrated", with an effective infection rate $r$ approximately double that of the full population and hence $N_g$ approximately half. We repeat the process until $N_g$ reduces to unity, at which point we individually test the remaining subjects, yielding the infected sub-population.
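The procedure can be sketched in a few lines of Python for a perfect test. This is not the author's notebook code: the function names are hypothetical, and the rule used to update the infection rate between passes (the rate among members of positive pools) is an assumption, so the sequence of group sizes will differ somewhat from the one in Table 1.

```python
import math
import random

def balanced_group_size(rate):
    """Largest integer pool size that tests negative with probability >= 1/2."""
    if rate <= 0.0:
        raise ValueError("rate must be positive")
    if rate >= 0.5:
        return 1
    return max(1, int(-math.log(2.0) / math.log1p(-rate)))

def divide_and_conquer(infected, rate):
    """Return (indices of infected subjects, number of tests used)."""
    candidates = list(range(len(infected)))
    tests = 0
    while candidates:
        size = balanced_group_size(rate)
        survivors = []
        for start in range(0, len(candidates), size):
            pool = candidates[start:start + size]
            tests += 1
            if any(infected[i] for i in pool):  # perfect pooled test
                survivors.extend(pool)
        if size == 1:
            return survivors, tests  # last pass tested subjects one by one
        # Assumed update: infection rate among members of positive pools.
        rate = rate / (1.0 - (1.0 - rate) ** size)
        candidates = survivors
    return [], tests

rng = random.Random(0)
population, rate = 100_000, 1e-2
infected = [rng.random() < rate for _ in range(population)]
found, tests = divide_and_conquer(infected, rate)
print(f"{len(found)} infected found with {tests} tests "
      f"(cost {tests / population:.3f} tests per subject)")
```

Because a perfect pooled test never discards a pool containing an infected subject, the final pass returns exactly the infected sub-population; only the number of tests depends on how well the group sizes are chosen.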

II.2 Method 2: Group coding

Method 2 is somewhat more complicated, but in our simulation tests performs marginally better and has a distinct advantage that it is perfectly parallelizable, at least in the most time-consuming first step, because all groupings are decided in advance.

Again we start by generating groups of size $N_g$, with a total number of groups given by $\xi N / N_g$, where $\xi$ is a parameter that controls the number of false positives as discussed below. These groups are such that each subject appears in $\xi$ groups and no two subjects appear in exactly the same set of groups, i.e. a set of groups uniquely codes a given subject.

We then proceed to test each of these groups. We use the results to assign infected status as follows: a subject is deemed positive if every group they belong to tests positive, and negative otherwise.
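A minimal simulation of this decoding rule is sketched below. It assumes one particular way of building the groups, which the text does not specify: each of `repeats` rounds draws a fresh random partition of the population into pools of `group_size`. Every subject then belongs to exactly `repeats` groups, and distinct subjects are very likely, though not guaranteed, to have distinct sets of groups. All names are hypothetical.

```python
import random

def group_coding(infected, group_size, repeats, rng):
    """Return (indices flagged as positive, number of pooled tests)."""
    n = len(infected)
    flagged = [True] * n  # a subject stays flagged until one of its pools is negative
    tests = 0
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)  # assumed construction: random partition per round
        for start in range(0, n, group_size):
            pool = order[start:start + group_size]
            tests += 1
            if not any(infected[i] for i in pool):  # perfect test comes back negative
                for i in pool:
                    flagged[i] = False
    return [i for i in range(n) if flagged[i]], tests

rng = random.Random(1)
population, rate = 100_000, 1e-2
infected = [rng.random() < rate for _ in range(population)]
flagged, tests = group_coding(infected, group_size=69, repeats=6, rng=rng)
truly_infected = sum(infected)
print(f"{tests} pooled tests, {truly_infected} infected, "
      f"{len(flagged) - truly_infected} false positives")
```

All pooled tests are fixed before any result is known, so they can be run in parallel; only the optional retest of flagged subjects depends on the outcomes.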

If the subject is actually positive, then all of their groups will test positive, so they will be marked positive. In fact, for a perfect underlying test, there are no false negatives. On the other hand, if a subject is negative, then they will be marked positive with a probability $2^{-\xi}$, where $\xi$ is the number of groups each subject belongs to, because we have arranged that each group has equal probability of testing positive or negative. We can make $\xi$ large enough so that the number of false positives is manageable. Alternatively, we can retest all the positives, which brings the total number of tests to

$$ N_{\rm tot} = \frac{\xi N}{N_g} + r N + (1-r)\, N\, 2^{-\xi} \tag{3} $$


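We can now optimize $\xi$ for the smallest total number of tests $N_{\rm tot}$, giving a surprisingly ugly equation

$$ \xi = \log_2 \left[ (1-r)\, N_g \log 2 \right] = \log_2 \left[ -\frac{(1-r) (\log 2)^2}{\log (1-r)} \right] \tag{4} $$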
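The optimization can be checked numerically. The sketch below assumes that the expected number of tests per subject is the sum of three terms: the pooled tests (`repeats / group_size`), the retest of truly infected subjects (`rate`), and the retest of false positives, each uninfected subject being flagged with probability `2 ** -repeats`. The function names and this reading of the total are assumptions made for illustration.

```python
import math

def expected_tests_per_subject(rate, group_size, repeats):
    """Pooled tests plus one retest for every flagged subject, per subject."""
    return repeats / group_size + rate + (1.0 - rate) * 2.0 ** (-repeats)

def optimal_repeats(rate, group_size):
    """Value of `repeats` at which the derivative of the cost above vanishes."""
    return math.log2((1.0 - rate) * group_size * math.log(2.0))

for rate, group_size in ((1e-2, 69), (1e-3, 693)):
    best = optimal_repeats(rate, group_size)
    cost = expected_tests_per_subject(rate, group_size, round(best))
    print(f"rate = {rate:g}: optimum {best:.2f}, rounded {round(best)}, "
          f"expected cost {cost:.4f} tests per subject")
```

Rounded to the nearest integer, the optima agree with the values 6 and 9 listed in Table 1.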


II.3 Results

We have coded a toy example of both methods in a Jupyter notebook, which can be found at https://github.com/slosar/infections. We present the results in Table 1. We see that both methods perform reasonably close to theoretical expectations. In particular, for $r = 10^{-2}$ we can get away with $0.13N$ or $0.12N$ tests compared to the brute-force $N$ tests. This is somewhat less efficient than the theoretically optimum number of $0.08N$. For $r = 10^{-3}$, we are again performing worse than the theoretical minimum, with $0.020N$ or $0.017N$ tests rather than $0.011N$, but still with a very large efficiency gain compared to the brute-force $N$ tests. In particular, if we are willing to live with some false positives (i.e. quarantining a set of unlucky souls), then the number of tests is even lower.

It is likely that the methods could be improved further, although probably with diminishing returns.

III Conclusions

In this note we have presented two methods for identifying the infected individuals by using slightly more sophisticated methods than brute-force testing every single sample.

Both methods are, in principle, statistically perfect, with no false positives and no false negatives (assuming a second pass to weed out false positives in Method 2), but they can amplify underlying errors. For example, if the underlying test has a false negative rate of $\epsilon$, then the false negative rate of Method 2 will be approximately $\xi \epsilon$, where $\xi$ is the number of groups each subject belongs to, and similarly for Method 1.
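The amplification in Method 2 can be made concrete with a short calculation. The sketch below assumes that errors in different pooled tests are independent and that pooling does not change the per-test false negative rate; both are simplifying assumptions, and the function name is hypothetical. An infected subject is missed if any of the groups they belong to comes back falsely negative.

```python
def method2_false_negative_rate(test_fnr: float, repeats: int) -> float:
    """Probability that at least one of `repeats` pooled tests misses an infected subject."""
    return 1.0 - (1.0 - test_fnr) ** repeats

for test_fnr in (0.01, 0.05):
    exact = method2_false_negative_rate(test_fnr, repeats=6)
    linear = 6 * test_fnr  # first-order approximation, valid for small rates
    print(f"per-test rate {test_fnr:.2f}: exact {exact:.4f}, linear estimate {linear:.4f}")
```

The linear estimate is an upper bound on the exact value and is close to it as long as the per-test false negative rate is small.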

Both methods are likely impractical using the current testing methods, because of the housekeeping complexity of preparing, dividing up and combining samples without making a mess of your laboratory. However, it is conceivable that future testing machines could employ Method 1 by taking samples and directly performing the necessary combinations and divisions internally.

To answer the original question: the poor people should cough onto the insured rich in a way that makes the rich person's probability of catching the infection about 50%. Alternatively, countries should adopt medical systems in which no person would need to cough onto anybody. Otherwise, history will do it for them [10.2307/j.ctv346rs7.23].