During the recent coronavirus outbreak in the US, it has been suggested on Twitter that poor people who are unable to pay for COVID-19 testing can instead opt to cough onto a rich person and wait for them to be tested. This has prompted this author to think about how a set of poor people should go about coughing onto a limited number of rich people in order to optimally determine who among them is infected. Somewhat less morbidly, the problem is the exact identification of infected people from a pool of $N$ people using fewer than $N$ tests.
Of course, this is only possible if samples from multiple people are somehow combined. In this note we consider a test in which samples from $k$ subjects are combined into a single sample, which tests positive if one or more of the constituent subjects are positive. Whether this is viable in practice is beyond the scope of this note, but one would naively expect that it should be possible with testing methodology that relies on detecting trace viral fragments. Since we still need samples from all subjects, this and related techniques make sense only if testing, rather than collecting samples, is the resource-limiting step. With these caveats, let us proceed to the calculation.
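| True infected | Method 1: iterations | Method 1: group sizes $k$ per iteration | Method 1: cost (tests) | Method 1: cost$/N$ | Method 2: $k$ | Method 2: $L$ | Method 2: false positives | Method 2: cost (tests), without/with retest | Method 2: cost$/N$, without/with retest | Theoretical min. cost$/N$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 1077 | 8 | 68, 35, 20, 12, 7, 4, 2, 1 | 13461 | 0.13 | 69 | 6 | 1950 | 8700/11727 | 0.09/0.12 | 0.08 |
| 106 | 12 | 692, 357, 196, 125, 77, 48, 30, 19, 12, 7, 3, 1 | 2003 | 0.020 | 693 | 9 | 301 | 1305/1712 | 0.013/0.017 | 0.011 |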
Let $p$ be the overall rate of infections in a population of size $N$. The information content in who is infected and who is not is given by

$$I = -N\left[p\log_2 p + (1-p)\log_2(1-p)\right]$$

in bits, i.e. it would take on average that many questions with a yes/no answer to uniquely determine who is infected. Since each test gives at most one bit of information, this also sets the theoretical lower bound on the required number of tests. The numbers are approximately $0.08N$ for $p=0.01$ and $0.011N$ for $p=0.001$. The lower the population infection rate, the fewer bits of information are needed to describe it. Of course, the existence of this lower bound does not actually guarantee that a better method exists or is practicable.
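As a quick numerical check of this bound, the sketch below evaluates the standard binary (Shannon) entropy per subject, assuming independent infections at rate $p$. The function name `min_tests_per_subject` is ours and purely illustrative; it is not taken from the accompanying notebook.

```python
import math

def min_tests_per_subject(p: float) -> float:
    """Binary entropy of the infection rate p, in bits (tests) per subject."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # no uncertainty, so no tests are needed
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

for p in (0.01, 0.001):
    bound = min_tests_per_subject(p)
    print(f"p = {p}: at least {bound:.4f} tests per subject")
```

Multiplying the returned value by the population size gives the lower bound on the total number of tests.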
By the same token, a single test is maximally informative when it has the same probability of giving a positive or a negative answer. Let us combine samples from $k$ subjects and assume that the test is positive if any one of them is positive. The probability of the test being negative is $(1-p)^k$, and requiring this to be $1/2$ we get

$$k = -\frac{1}{\log_2(1-p)} \approx \frac{\ln 2}{p}.$$
So the main trick is to combine subjects in groups of size $k$, so that a test on such a group has about the same probability of being positive or negative, which maximizes the information gain from the test. However, we of course need to repeat tests in order to identify the subjects who are actually infected. Below we give two example methods.
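Before turning to the methods, the group size can be computed directly. The following is a minimal sketch, assuming the condition that a pooled test on $k$ subjects is negative with probability $(1-p)^k = 1/2$; the helper name `optimal_group_size` and the choice of rounding to the nearest integer are our assumptions.

```python
import math

def optimal_group_size(p: float) -> int:
    """Pool size k for which (1 - p)**k is closest to 1/2, at least 1."""
    k = -1.0 / math.log2(1.0 - p)   # exact solution of (1 - p)**k = 1/2
    return max(1, round(k))

for p in (0.01, 0.001):
    k = optimal_group_size(p)
    print(f"p = {p}: k = {k}, P(pool negative) = {(1.0 - p) ** k:.3f}")
```

For small $p$ the result is close to $\ln 2 / p$, so a tenfold drop in the infection rate allows pools that are roughly ten times larger.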
ii.1 Method 1: Divide and Conquer
Method 1 is a simple divide and conquer approach. We use the estimate of $p$ to make a first pass over the entire population, splitting it into groups of size $k$. As discussed above, approximately half of these groups will test negative, and the other half are now "concentrated", with an effective $p$ approximately double that of the full population and hence a $k$ that is approximately half. We repeat the process until $k$ reduces to unity, at which point we individually test the remaining subjects, yielding the infected sub-population.
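The procedure can be sketched as a short simulation. This is an illustrative toy, not the code from the accompanying notebook: it assumes a perfect pooled test, and it simply halves $k$ after every pass instead of re-estimating the effective infection rate. The function name `divide_and_conquer` and its arguments are hypothetical.

```python
import math
import random

def divide_and_conquer(infected, p_est, rng):
    """Return (indices found infected, number of tests used).

    infected: list of bools holding the hidden true status (perfect test assumed).
    p_est:    prior estimate of the infection rate, used only to choose the first k.
    """
    candidates = list(range(len(infected)))
    k = max(1, int(-1.0 / math.log2(1.0 - p_est)))  # (1 - p)^k is about 1/2
    tests = 0
    while k > 1 and candidates:
        rng.shuffle(candidates)
        survivors = []
        for start in range(0, len(candidates), k):
            pool = candidates[start:start + k]
            tests += 1                       # one pooled test per group
            if any(infected[i] for i in pool):
                survivors.extend(pool)       # positive pool: keep every member
        candidates = survivors
        k //= 2                              # assumption: simply halve k each pass
    tests += len(candidates)                 # final pass: individual tests
    return {i for i in candidates if infected[i]}, tests

rng = random.Random(0)
n, p = 20000, 0.01
status = [rng.random() < p for _ in range(n)]
found, tests = divide_and_conquer(status, p, rng)
print(f"{len(found)} infected found with {tests} tests ({tests / n:.3f} per subject)")
```

Because an infected subject always makes their pool test positive, no infected subject is ever discarded, so with a perfect test the returned set is exactly the infected sub-population.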
ii.2 Method 2: Group coding
Method 2 is somewhat more complicated, but in our simulated tests it performs marginally better and has the distinct advantage of being perfectly parallelizable, at least in the most time-consuming first step, because all groupings are decided in advance.
Again we start by generating groups of size $k$, with the total number of groups given by $NL/k$, where $L$ is a parameter that controls the number of false positives, as discussed below. These groups are such that each subject appears in $L$ groups and no two subjects appear in exactly the same set of groups, i.e. a set of $L$ groups uniquely codes a given subject.
We then proceed to test each of these groups. We use the results to assign infected status as follows: a subject is deemed positive if the outcomes of all of the groups they belong to are positive, and negative otherwise.
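A minimal simulation of this first step is sketched below. It makes two assumptions that are ours rather than the note's: a perfect pooled test, and groups built from `L` independent random partitions of the population, so that each subject lands in exactly `L` groups. Random partitions make distinct group sets very likely but do not strictly guarantee them. The function name `group_coding` is hypothetical.

```python
import random

def group_coding(infected, k, L, rng):
    """Return (indices marked positive, number of pooled tests used).

    infected: list of bools holding the hidden true status (perfect test assumed).
    k:        pool size; L: number of pools each subject is placed in.
    """
    n = len(infected)
    all_positive = [True] * n       # True while every pool seen so far was positive
    order = list(range(n))
    tests = 0
    for _ in range(L):
        rng.shuffle(order)          # a fresh random partition into pools of size k
        for start in range(0, n, k):
            pool = order[start:start + k]
            tests += 1
            if not any(infected[i] for i in pool):
                for i in pool:
                    all_positive[i] = False   # one negative pool clears the subject
    return {i for i in range(n) if all_positive[i]}, tests

rng = random.Random(0)
n, p, k, L = 20000, 0.01, 69, 6
status = [rng.random() < p for _ in range(n)]
marked, tests = group_coding(status, k, L, rng)
truth = {i for i, s in enumerate(status) if s}
print(f"{tests} pooled tests, {len(marked)} marked, {len(marked - truth)} not infected")
```

All pools are fixed before any result is known, so the pooled tests in this sketch could be run in any order or concurrently.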
If a subject is actually positive, then all of their groups will test positive, so they will be marked positive. In fact, for a perfect underlying test, there are no false negatives. On the other hand, if a subject is negative, then they will be marked positive with a probability of approximately $2^{-L}$, because we have arranged that each group has an equal probability of testing positive or negative. We can make $L$ large enough that the number of false positives is manageable. Alternatively, we can retest all the positives, which brings the total number of tests to

$$N_t = \frac{NL}{k} + Np + N(1-p)\,2^{-L}.$$

We can now optimize over $L$ for the smallest $N_t$, giving a surprisingly ugly equation

$$L = \log_2\left[k(1-p)\ln 2\right] = \log_2\left[-\frac{(1-p)\ln^2 2}{\ln(1-p)}\right].$$
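The optimization can also be done numerically over integer values of $L$. The sketch below assumes the cost model described above: $NL/k$ pooled tests, plus one retest for each of the $Np$ true positives and for each uninfected subject that is marked positive with probability $2^{-L}$, with $k$ set by $(1-p)^k = 1/2$. The function names and the search limit `L_max` are our own illustrative choices.

```python
import math

def total_tests_per_subject(p: float, L: int) -> float:
    """Expected number of tests per subject for Method 2 with a retest pass."""
    k = -1.0 / math.log2(1.0 - p)            # pool size giving a 50/50 pooled test
    pooled = L / k                            # N * L / k pooled tests, per subject
    retests = p + (1.0 - p) * 2.0 ** (-L)     # true positives plus false positives
    return pooled + retests

def best_L(p: float, L_max: int = 64) -> int:
    """Integer L in 1..L_max that minimizes the expected total number of tests."""
    return min(range(1, L_max + 1), key=lambda L: total_tests_per_subject(p, L))

for p in (0.01, 0.001):
    L = best_L(p)
    print(f"p = {p}: L = {L}, about {total_tests_per_subject(p, L):.4f} tests per subject")
```

The cost is fairly flat around the minimum, so choosing $L$ one larger than optimal trades a few extra pooled tests for roughly half as many false positives.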
We have coded a toy example of both methods in a Jupyter notebook, which can be found at https://github.com/slosar/infections. We present the results in Table 1. We see that both methods perform reasonably close to theoretical expectations. In particular, for $p=0.01$ we can get away with $0.13N$ or $0.12N$ tests compared to the brute-force $N$ tests. This is somewhat less efficient than the theoretically optimal number of $0.08N$. For $p=0.001$, we are again performing worse than the theoretical minimum, with $0.020N$ or $0.017N$ tests rather than $0.011N$, but still with a very large efficiency gain compared to the brute-force $N$ tests. In particular, if we are willing to live with some false positives (i.e. quarantining a set of unlucky souls), then the number of tests is even lower.
It is likely that the methods could be improved further, although probably with diminishing returns.
In this note we have presented two methods for identifying infected individuals that are slightly more sophisticated than brute-force testing of every single sample.
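Both methods are in principle statistically perfect, with no false positives and no false negatives (assuming a second pass to weed out false positives in Method 2), but they can amplify errors in the underlying test. For example, if the underlying test has a false negative rate of $f$, then the false negative rate of Method 2 will be approximately $Lf$ for small $f$, since a truly positive subject must test positive in all $L$ of their groups, and similarly for Method 1.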
Both methods are likely impractical with current testing methods, because of the housekeeping complexity of preparing, dividing up and combining samples without making a mess of your laboratory. However, it is conceivable that future testing machines could employ Method 1 by taking samples and performing the necessary combinations and divisions internally.
To answer the original question: the poor people should cough onto the insured rich in a way that makes the rich person's probability of catching the infection about 50%. Alternatively, countries should adopt medical systems in which no person would need to cough onto anybody. Otherwise, history will do it for them [10.2307/j.ctv346rs7.23].