Is Group Testing Ready for Prime-time in Disease Identification?

by   Gregory Haber, et al.

Large scale disease screening is a complicated process in which high costs must be balanced against pressing public health needs. When the goal is screening for infectious disease, one approach is group testing in which samples are initially tested in pools and individual samples are retested only if the initial pooled test was positive. Intuitively, if the prevalence of infection is small, this could result in a large reduction of the total number of tests required. Despite this, the use of group testing in medical studies has been limited, largely due to skepticism about the impact of pooling on the accuracy of a given assay. While there is a large body of research addressing the issue of testing errors in group testing studies, it is customary to assume that the misclassification parameters are known from an external population and/or that the values do not change with the group size. Both of these assumptions are highly questionable for many medical practitioners considering group testing in their study design. In this article, we explore how the failure of these assumptions might impact the efficacy of a group testing design and, consequently, whether group testing is currently feasible for medical screening. Specifically, we look at how incorrect assumptions about the sensitivity function at the design stage can lead to poor estimation of a procedure's overall sensitivity and expected number of tests. Furthermore, if a validation study is used to estimate the pooled misclassification parameters of a given assay, we show that the sample sizes required are so large as to be prohibitive in all but the largest screening programs



There are no comments yet.


page 1

page 11


Correlation Improves Group Testing

Population-wide screening to identify and isolate infectious individuals...

Adaptive Group Testing with Mismatched Models

Accurate detection of infected individuals is one of the critical steps ...

More testing or more disease? A counterfactual approach to explaining observed increases in positive tests over time

Observed gonorrhea case rates (number of positive tests per 100,000 indi...

Group design in group testing for COVID-19 : A French case-study

Group testing is a screening strategy that involves dividing a populatio...

A note on optimal design for hierarchical generalized group testing

Choosing an optimal strategy for hierarchical group testing is an import...

Neural Group Testing to Accelerate Deep Learning

Recent advances in deep learning have made the use of large, deep neural...

how well can sensitivity and sojourn time be estimated

Chronic disease progression models are governed by three main parameters...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Developing design strategies to reduce study expense is an important job for the practicing biostatistician. In many settings, measuring biomarkers can be expensive, and design strategies for reducing these costs are needed. In 1943, Dorfman1 proposed a simple method to make the testing of syphilis feasible in recruits for the U.S. Army. This simple design suggested testing a pooled collection of k samples, and only testing individual samples if the combined sample is positive. Intuitively, this design could provide a tremendous cost reduction in terms of the required number of tests if the disease prevalence is small.

There has been a vast amount of research in group testing since the original Dorfman paper. A majority of this work can be divided into either using group testing for disease screening or for prevalence estimation. Particularly for the application of group testing to screening, there has been lots of work not only in the statistics literature, but also in computer science and applied mathematics.(see, 2, 3, 4, 5, 6 among others) This is because optimality of group testing algorithms often involves choosing a particular set of group sizes that minimizes the expected number of tests required to identify all cases in a population of individuals. Deriving optimal designs often involve the development of complex algorithms that rely on dynamic programming, mathematical techniques that are usually applied in areas of applied mathematics and computer science, and less often in statistics. There has also been extensive work in prevalence estimation, where participants are tested in pools with no re-testing at the individual level. This article will focus on the use of group testing in disease screening.

Although methodological research in this area has expanded, we believe that there has been limited use of these designs as they were originally formulated in the biomedical sciences for disease screening. The syphilis example in World War II is a notable example. We have seen a reluctance by our epidemiologic collaborators to use these designs in large scale studies. In part this is due to our laboratory and epidemiology colleagues not being aware of the advantages of the group testing methodology. However, most often, scientists are afraid that the combining of tests across samples on different participants decreases the sensitivity of the test, thereby increasing the likelihood of a false negative on a grouped test. Furthermore, it is perceived that in much of the group testing literature unreasonable assumptions have been made that, in many cases, favor the use of group testing procedures over single testing. The goal of this current article is to provide a balanced view of the research in this area and to provide suggestions for evaluating its feasibility in practical settings.

There are a number of issues that have caused confusion and have made comparisons with individual testing difficult. These include:

  • Questions over how to choose a design that appropriately accounts for misclassification.

  • Assumptions of non-differential misclassification, or that the testing errors do not change with the group size.

  • Assumptions that sensitivity and specificity values are known a priori from external sources and can be readily applied to the question at hand.

A careful comparison of group testing with individual testing that takes into account these issues is important in deciding the situations where group testing should be used for disease identification in the biosciences. In what follows, we introduce several case studies that are representative of the types of problems in which group testing is appealing to researchers, but current limitations raise questions about or prevent its use. This is followed by a full discussion of each of the above issues. We then present numerical comparisons to examine the impact of incorrect assumptions regarding the misclassification parameters on a screening procedure and the feasibility of using a validation study to estimate these values.

2 Case studies

2.1 Population based screening for COVID-19 infection

Coronovirus disease 2019 (COVID-19) caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was first identified in Wuhan, China in late 2019 and is rapidly spreading worldwide with dramatic impacts on the healthcare and economic landscapes. In the United States, shortages of testing reageants hinder the ability to carry out sufficient screening for infection with SARS-CoV-2 which may ultimately threaten the ability of public health officials to adequately control the spread of the virus. The need for large scale screening, coupled with scarce testing resources, make this an ideal scenario for implementing group testing to reduce the number of tests required to carry out a screening program.

2.2 Large scale screening for HIV viral load

Monitoring of viral load in individuals diagnosed with HIV is important for determining treatment failure and making informed treatment decisions. Current World Health Organization guidelines recommend viral load testing at 6 and 12 months following initiation of antiretroviral therapy (ART), and annually thereafter.7 These recommendations have proven to be cost prohibitive, and continue to be impractical in many low resource settings. For example, in Malawi, a country with over 1 million people living with HIV and over 600,000 taking ART,8 the annual burden of sufficient viral load monitoring is enormous. In this context, group testing has regularly been considered as a cost saving measure, however concerns over false negative rates are common.9, 10, 11

2.3 Stratified cancer screening

Cervical cancer is currently the 4th most common cancer among women,12 highlighting the need for cheap and effective clinical screening. The most effective indicator of cervical cancer risk is HPV infection, however only a small number of HPV positive women will go on to develop cervical cancer. To minimize unnecessary and invasive follow-up procedures such as colposcopy, it is necessary to develop better tests for triaging HPV infections in order to identify those at greater risk of progressing to cancer. One promising method is methylation testing which captures the methylation of HPV DNA transitioning to precancer.13, 12, 14 Unfortunately, such tests are too expensive to routinely carry out for all HPV positive women. Group testing could offer one way of making such testing feasible.

2.4 Biomarker presence in cohort study

In many cohort studies specimens collected from individual participants are to be screened for a variety of biomarkers. For example, the Connect study is a cohort study funded by the National Cancer Institute planning to enroll 200,000 adults in the United States with the goal of understanding the etiology of cancer through longitudinal assessment of biomarkers, environmental exposure, and the occurrence of cancer precursors. One biomarker of interest in this study is monoclonal gammopathy of undetermined significance (MGUS), a premalignant plasma cell disorder present in about 3% of adults.15 MGUS is a precursor for multiple myeloma and other blood cancers. Screening all cohort participants for MGUS would add significant costs to the Connect study, a particular concern since it is one of many biomarkers of interest. In theory, this is an ideal case for group testing since the low prevalence of MGUS would result in a large reduction in the number of tests to screen the entire cohort.16 In practice, however, the sensitivity of pooling procedures when screening for MGUS is unknown, and a validation study would be required to assess the feasibility of group testing in this case. The costs of this validation study would have to be considered in light of the, initially unknown, potential savings from a pooling design.

3 Notation and Dorfman procedure

For a screening program, we assume a population of

individuals is represented by random variables

. We will assume that each member of this population has an identical probability,

, of having some characteristic (e.g., an infectious disease) so that . Let be a random variable representing the observed test outcome (allowing for testing error) for determining the presence of the given characteristic in any member of a pool of size and the true status of the same pool. Then, . For a given assay we define the sensitivity and specificity for a pooled test containing individuals to be and , respectively. Using these definitions, we have .

The first published group testing procedure proposed by Dorfman1 is a simple two stage procedure. To implement it, a group size is chosen and the population is divided into pools of that size. The choice of is typically made to minimize the expected number of tests given by

For each pool, an initial test is carried out to determine if any of the samples within are positive for the disease. If negative, each sample is assumed to be disease free. If positive, the pool is broken up into individual samples which are tested to identify those with the disease. Much of the more recent group testing literature has focused on more efficient designs. Typically, such designs are optimized with respect the expected number of tests and do not account for the overall sensitivity and specificity of a test. As such, for the examples in this paper we will focus on the Dorfman design for which, in addition to its simplicity, it is easiest to control the overall procedure sensitivity. For example, if a single unit test is treated as a gold standard (e.g., no misclassification) the overall sensitivity of the Dorfman procedure with initial group size will be . With non-differential misclassification, it will not be possible for any pooling design which tests all individuals in pools to improve on this value.

4 Testing with misclassification and comparison with single testing

Since nearly all of the issues impeding the use of group testing in medical settings involve questions of misclassification, we briefly review the literature related to this issue here. Beginning in the 1970s with the reinsurance of research in group testing, methodology for accounting for misclassification was proposed. The idea is that even if a test on a single sample has little or no misclassification, it is natural to think that there may be measurement error induced by the combining of samples across individuals. Graff and Roeloffs (1972)17 and Hwang (1976a)18 recognized early that the objective function to minimize should not be the expected number of tests; when tests can be misclassified. Graff and Roeloffs (1972)17 proposed a modification of the Dorfman procedure and searched for a design that minimizes total cost as a linear function of the expected number of tests, weighted expected number of good items misclassified as defective, and weighted expected number of defective items misclassified as good. Burns et al. (1987)19 generalized Graff and Roeloffs (1972) results to the situation where the probability of misclassification depends on the proportion of defective items in the group. Hwang (1976a) 18 studied a group testing model with the presence of a dilution effect, where a group containing a few defective items may be misidentified as a group containing no such items, especially when the size of the group is large. He calculated the expected cost under the Dorfman procedure in the presence of the dilution effect and derived the optimal group sizes to minimize this cost. Further, Wein and Zenios (1996)20 embedded group testing model for continuous test outcomes into a dynamic programming algorithm that derives a group testing policy to minimize the linear combination of expected cost due to false negatives, false positives, and testing. Malinovsky et al. (2016)21 characterized the optimal design in the Dorfman procedure in the presence of non-differential misclassification by maximizing the ratio between the expected number of correct classifications and the expected number of tests. Using the same criterion and testing procedure, they also characterized a cut-off point of disease prevalence where all individuals should be tested together at the first stage. Aprahamian et al. (2019)22 considered the Dorfman procedure in the population with heterogeneous prevalences23, 24 under the setting of non-differential misclassification. They investigated two models: in the first one a linear combination of the expected number of false-positives, false-negatives and total number of tests was minimized; in the second one a linear combination of the expected number of false-positives and false-negatives was minimized, subject to constrains on the upper bound of the expected total number of tests.

However, recent authors have argued that the expected number of tests should be used and that careful accounting for the number of correct classifications is an unnecessary complication 25. This later view is in direct conflict with earlier work on this subject. Further, the basis for Hitt et al.’s25 argument that misclassification need not be considered in optimal design is based on a comparison of the expected number of tests versus the ratio of the expected number of tests and the expected number of correct classifications. However, this clearly is not generally true. For example, if misclassification is not considered, the optimal design based on the expected number of tests would be to test all samples in one group, and call all individuals in the group as positive for the disease if the group test is positive, and negative otherwise. This would only require one test, but would clearly result in a high misclassification rate!

Many recent papers assume non-differential misclassification. Specifically, they assume that misclassification does not depend on the size of the group and that there is misclassification for a single test. Although the assumption of non-differential misclassification may be reasonable for some types of sample pooling, it cannot be generally assumed. Further, misclassification needs to be defined relative to a gold standard. A natural comparison of group testing screening designs is with a design where individuals are tested separately. In many cases, it is reasonable to assume that the assay tested on a single sample is the gold standard. In this case, misclassification for group testing will be relative to single testing with the sensitivity and specificity of the individual test being 1, as it also was assumed earlier by Hwang (1976a)18.

5 What is the optimal design?

As can be seen from the previous section, many works have appropriately attempted to account for misclassification for the optimal group testing design for disease identification. Minimizing the expected number of tests has been used as an objective function, which we also believe is problematic. In most of the previous works the authors have proposed minimizing a linear combination of the expected number of tests and the rate of correct classification.17, 19, 22 However, choosing the coefficients for these terms is subjective and difficult to motivate from a medical or public health perspective. Some authors include a cost for incorrect classification18, 20 which is also difficult to motivate from a medical or public health perspective. Although, the criterion proposed by21 does not require such specifications, it assigns the same weight for the expected number of tests and the expected number of correct classifications, and therefore can also be subjective. In this work we propose a simple easy to interpret objective function which is easy for the scientist to interpret and understand. We propose a different way to incorporate misclassification in the examples below.

6 Importance of correctly specifying Se(k) and Sp(k)

To choose an optimal design and understand its properties, it is essential that researchers first have knowledge of the diagnostic accuracy of the test for pooled samples. In fact, many claims by authors that group testing is more efficient than individual testing for disease screening are based on calculations that assume that the sensitivity and specificity do not depend on group size and that these quantities are known and therefore have no estimation error. There are a number of issues of concern here. First, the estimates of sensitivity and specificity are based on studies conducted in other populations. The problem of applying the sensitivity and specificity of a test in one population when it was estimated in a different population with a different mix of patients has been well recognized in the area of diagnostic medicine.26 Second, the uncertainty in the estimation of sensitivity and specificity is not taken into account in most comparisons.

Misspecification of the sensitivity and specificity can impact the testing procedure in two primary ways. First, even small differences can lead to changes in the optimal choice of design. This is particularly true if the sensitivity changes with the group size, since the expected number of tests typically decreases with the sensitivity. This will result in a poor understanding of the expected number of tests for a given design and may lead to poor decisions regarding the application of a group testing procedure in a given population.

Second, misspecification of the misclassification parameters can lead to choosing a design with very high error rates. For example, if the sensitivity of a test is decreasing with pool size then overestimating the sensitivity by even a small degree can lead to choosing a large pool size for which the assumed overall error rate estimate is overly optimistic.

These issues are illustrated in the following example.

6.1 Example

In this section, we explore numerically how misspecification of the sensitivity function, , when choosing an optimal design can lead to errors in the estimation of a procedures overall sensitivity as well as the expected number of tests. We will assume in all cases that . To illustrate different potential ways the sensitivity might be subject to differential misclassification, we consider the dilution function of Hwang (1972), , for various values of . Note that, when , corresponding to no false negatives. For , gives the probability of a single unit being positive given there is at least one positive in the group. We assume that the true assay sensitivity can be represented by taking , and will look at designs constructed assuming values of , and . Plots of for and for each value of are shown in Figure 1.

Figure 1: Plot of Hwang function, for for various values of and .

To find the optimal group size, , we consider two approaches. The first, for a given and function , solves the unconstrained optimization problem

where is the set of all possible group sizes and

The second approach enforces a lower bound on the overall sensitivity of a procedure by solving the constrained optimization problem


where is a fixed threshold value and, again, and are assumed known.

For our numerical comparisons, we set and . Results are shown in Table 1. The table contains estimates for three basic quantities:

  • , the optimal group size chosen for a given optimization procedure;

  • , the sensitivity function evaluated at the optimal group size;

  • , the expected number of tests for a given procedure based on the optimal group size.

For each quantity, an estimated value is indicated by a hat, (e.g., is the expected number of tests based on the estimated optimal group size and is the estimated value of this same estimand). By estimated sensitivity, we refer to the value to which the sensitivity would converge in a large sample validation study.

Unconstrained Constrained
0.01 0 11 12 1 0.836 0.831 0.196 0.178 0.178 11 1 1 0.836 1 0.196 0.178 1
0.01 11 12 0.976 0.836 0.831 0.193 0.178 0.178 11 1 0.976 0.836 1 0.193 0.178 1
0.05 12 12 0.884 0.831 0.831 0.184 0.178 0.178 2 1 0.966 0.95 1 0.519 0.519 1
0.1 13 12 0.775 0.826 0.831 0.172 0.178 0.178 1 1 1 1 1 1 1 1
0.3 21 12 0.404 0.797 0.831 0.125 0.199 0.178 1 1 1 1 1 1 1 1
0.05 0 5 6 1 0.889 0.877 0.426 0.401 0.399 5 2 1 0.889 0.951 0.426 0.401 0.593
0.01 5 6 0.984 0.889 0.877 0.423 0.401 0.399 5 2 0.984 0.889 0.951 0.423 0.401 0.593
0.05 5 6 0.925 0.889 0.877 0.409 0.401 0.399 2 2 0.967 0.951 0.951 0.594 0.593 0.593
0.1 6 6 0.84 0.877 0.877 0.389 0.399 0.399 1 2 1 1 0.951 1 1 0.593
0.3 10 6 0.514 0.845 0.877 0.306 0.439 0.399 1 2 1 1 0.951 1 1 0.593
0.1 0 4 4 1 0.906 0.906 0.594 0.562 0.562 4 2 1 0.906 0.952 0.594 0.562 0.681
0.01 4 4 0.987 0.906 0.906 0.589 0.562 0.562 4 2 0.987 0.906 0.952 0.589 0.562 0.681
0.05 4 4 0.937 0.906 0.906 0.572 0.562 0.562 2 2 0.968 0.952 0.952 0.684 0.681 0.681
0.1 4 4 0.877 0.906 0.906 0.552 0.562 0.562 1 2 1 1 0.952 1 1 0.681
0.3 25 4 0.414 0.797 0.906 0.424 0.779 0.562 1 2 1 1 0.952 1 1 0.681
Table 1: Estimated design parameters when the true sensitivity function is and is chosen with the assumption that or . Results are given for designs chosen based on solving the unconstrained and constrained optimization problems.

From the results, we see that the unconstrainted optimization, which considers only the expected number of tests, often yields very poor estimated overall sensitivity. This highlights the fact that such an optimization procedure can do nothing to control the overall sensitivity rates and should be used cautiously, particularly with differential misclassification. While the overall sensitivity values are always much higher when using the constrained procedure, when the assumed sensitivity function overestimates the true values the chosen group size can yield an overall sensitivity value much smaller than is assumed to be true. The large differences observed between true and estimated sensitivity tend to decrease sharply as increases. This is due to the fact that for larger the expected number of tests decreases rapidly with the group size, regardless of the sensitivity values. Differences between the estimated and true expected number of tests, however, follows the opposite pattern, with the differences increasing with . Here, overestimation and underestimation of the sensitivity values leads to overestimation and underestimation, respectively, of the expected number of tests.

7 Optimal design incorporating estimation error in and

In most cases, for group testing to be applicable it is necessary to first obtain population specific estimates of the sensitivity and specificity. To do this will require a validation study design in which individuals with known disease status (perhaps from initial individual screening) are tested in pools of varying group sizes. To date, there exists no work in the literature related to the question of how to best design such studies. However, in practice it is important to consider how large such validation studies would need to be before deciding if group testing is a reasonable approach. Another important question is, given that a validation study of a certain size is to be carried out, how large of a target population for screening is required to see an overall benefit from utilizing group testing. Answers to such questions will vary greatly depending on the underlying population and particular assay being used, but it is reasonable to assume that such considerations will show group testing is not warranted in many situations when such an approach might otherwise be desirable.

7.1 Example

In this section we explore by simulation the size of a validation study required to ensure that the estimates of are sufficiently accurate. For the constrained optimization problem described above in (1), accuracy is here defined as achieving where is some threshold value.

To estimate and , we used the validation design described in Algorithm 1 for an initial sample of size and a maximum group size . Once the misclassification parameters were estimated, they were used to find from the constrained optimization procedure described above in (1). To determine the necessary validation size, , we took 50,000 simulations and found the smallest such that the empirical probability . As above, we assumed for all . For the sensitivity functions, we considered several possibilities:

Simulations were carried out for .

Once the smallest was determined, we found the smallest total population size, , required to see a benefit from group testing following such a validation procedure. This value was determined by solving the inequality

where is the total number of assay tests required in the validation study. The expected value in this expression was taken as the average expected value across all simulations for the given validation sample size. Results are shown in Figure 2.

for  do
     if  is an integer then Randomly group units into pools of size
     else  Randomly form pools of size and construct a final pool with the remaining units and duplicate units randomly chosen from the other groups 
     end if
     if  then Assuming estimates from the stage are correct, estimate and
     end if
end for
Algorithm 1 Procedure for validation study.
Figure 2: Barplots showing required validation study size, , total number of assay tests in validation study, , and minimum population size to see a benefit from group testing, for various underlying true sensitivity functions. The bottom axes are values of .

Unsurprisingly, for all sensitivity functions we see the required validation sample sizes decrease with increasing prevalence. This is expected as smaller numbers of individuals are required to ensure an adequate number of pools with at least one positive member. For the sensitivity functions on the top row, the decrease in sensitivity is more gradual so that a larger group size can be chosen. For these cases, larger validation sample sizes are required to accurately estimate the sensitivity function. However, since the larger pool sizes will allow for a smaller expected number of tests, the additional sample size required to see a benefit from group testing is small.

8 Revisiting the case studies

8.1 Population based screening for COVID-19 infection

Large scale screening for SARS-CoV-2 is an important and pressing public health issue. Implementation of group testing to facilitate such screening currently faces several obstacles which must be considered before beginning such a program. First, prevalence values in a given region are unknown and constantly changing. This forces any design choices to be made somewhat ad hoc. This is particularly an issue as testing protocols and indications are currently in flux across state and national health departments, so that the underlying screening population characteristics can change at any time. For example, if the positive rate of tests in an area is approximately 5%, a procedure testing pools of size 5 could offer significant advantages. If, however, at a later time the prevalence of the screening population increases to 29%, or greater, due to testing only individuals at higher risk such a procedure would require more tests than individual screening. The inability of testing facilities to anticipate such swings could lead to very expensive mistakes.

A second issue is that it is not known a-priori how test sensitivity changes with pool sizes. This is particularly true as such values may vary across populations and labs given the wide range of testing techniques currently being implemented. While a validation study would be feasible for such a use case, it is unlikely that public health officials would be willing to reallocate sparse testing materials for large speculative studies at this time. In a public health crisis like COVID-19, we recommend that samples be stored so that, at a minimum, the feasibility of group testing can be evaluated at a later time.

If despite these concerns a group testing program was to be implemented, a basic prevalence estimate could be obtained using recent individual testing data. To carry out a validation procedure as described above we show the estimated validation sample sizes, total number of tests, and population size required to see a benefit from group testing for prevalence values and in Table 2. Values are reported based on two assumed underlying sensitivity functions: 1) a linear function, and 2) the Hwang function with , .

Se(k) = 1 - 0.02(k-1)
Prevalence N N
0.05 31,679 92,789 147,238 13,710 40,158 1,307,444
0.1 17,500 51,259 100,771 8,906 26,090 1,077,732
0.25 1,640 4,806 31,494 10,781 31,582 4,754,298
Table 2: Estimated validation sample size, , number of validation tests, , and necessary population size to see a benefit from group testing, , for a screening program based on prevalence values of , and . Values are calculated separately for two underlying sensitivity functions: Se(k) = 1 - 0.02(k-1) and .

8.2 Large scale screening for HIV viral load

The specifics of designing a screening program for monitoring HIV viral load will vary as different regions employee differing testing protocols and thresholds. As an example, we consider a case with suspected ART failure prevalence around 9%, a value reported among those using ART for at least 18 months for a Malawian cohort in Nicholas et al. (2019).27 While studies evaluating the pooled sensitivity for fixed group sizes have been done10, 11, such values are likely cohort specific and would need to be re-estimated before application to a specific population. Furthermore, in order to make informed decisions about an optimal group size it would be necessary to first understand how the sensitivity changes with the pool size. Using the procedure outlined above, we can look at the sample sizes required under different assumed sensitivity functions to evaluate how feasible group testing would be in this case. For example, if we assumed the linear sensitivity function for a prevalence of we would require 18,750 individuals to enroll in a validation study requiring 54,920 total tests and a population size to 103,097 to see a benefit from group testing. If, however, the sensitivity function was the Hwang function with , we would require 9,375 individuals to enroll in a validation study requiring 27,462 tests and a population size of 1,094,884. In either case, for a population of 600,000 screened semi-annually there is a clear potential for savings from group testing, even after carrying out a validation procedure. Without any prior knowledge of the sensitivity function, it would be difficult to choose an initial validation sample size as it is impossible to give conservative bounds. Still, if the resources are available for an initial large investment for a validation study, and health officials are able to deal with the possibility that the pooled sensitivity will be too low for practical use, the long term and ongoing nature of HIV viral load screening can potentially benefit largely from group testing.

8.3 Stratified cancer screening

For HPV methylation screening, we consider a program aimed at screening the entire US population for HPV related cervical cancer risk. This could be achieved by collecting samples from all women, identifying those with high risk HPV subtypes (i.e., those that act as cancer precursors), and finally administering methylation testing for the high risk HPV group. Those with positive methylation tests would then be followed more intensely to ascertain cervical cancer risk. Using 2010 population estimates and estimated prevalences from 201428, we could approximate that around 20 million women in the US would test positive for a high risk HPV subtype and we would like to design a group testing procedure to screen each of these women using methylation testing. To date, there are no population based estimates of methylation positive testing rates so we will assume a value of 5% for this example. Using these values and the validation procedure outlined above, if the underlying sensitivity function were the linear function then we would require a validation sample size of 31,679 and a total of 92.789 tests with a required population size to see a benefit from group testing of 147,238. If, however, the true sensitivity function were the Hwang function with , we would require 13,710 women for a validation procedure requiring 40,158 tests and a total population size of 1,307,444 to see a benefit from group testing. In either case, given the large population required for screening, group testing would likely provide large savings in this setting, even with a necessarily large validation study. This would be true even if the actual rate of positive methylation tests in the high risk HPV infected population were much higher. Here, the only real impediment to using group testing would be if health officials were unwilling to accept any additional loss of sensitivity due to pooling.

8.4 Biomarker presence in cohort study

For MGUS screening, we assume a prevalence of 3% and that we would like to determine the status of approximately 200,000 individuals. If the true sensitivity function were the linear function then we would require a sample size of 51,250 individuals for a validation procedure requiring 150,113 tests and a population size of 221,091 to see a benefit from group testing. If, however, the true sensitivity function were the Hwang function with , we would require 21,093 individuals for a validation procedure requiring 61,785 tests and a population size of 1,832,663 to see a benefit of group testing. Given these numbers, and lacking any a priori information on the sensitivity function, it is unlikely that researchers would attempt to implement such a validation procedure in this case. While the large sample sizes are offset somewhat by the need for repeat testing, the non-trivial possibility of finding that pooling of any size reduces the sensitivity to an unacceptable level make this an unlikely gamble for resource allocation.

9 Discussion

In this paper we have reviewed several of the issues faced by practitioners when deciding if group testing can provide a feasible solution for their screening program. In this context we have explored several issues numerically based a simple algorithm (the Dorfman two-stage procedure) and several simplifying assumptions. In practice, there exist many additional considerations which may alter the final decision concerning whether to implement group testing.

For all numerical comparisons, we have assumed grouping does not impact specificity (e.g., for all ). While this may be reasonable in many settings, the failure of this assumption can result in large increases in the number of individuals required for a validation sample. In particular, by using a minimum threshold to determine estimation accuracy we have had to assume that is monotone as a function of the validation sample size. While this holds for , this may not be true otherwise, requiring more complicated evaluation criteria and larger sample sizes.

When designing our validation procedure, we made the assumption that the sensitivity does not depend on the number of positives in a given pool (i.e., we have assumed that the sensitivity is only a function of the maximum of all pool members). In practice, this assumption may fail resulting in significantly more complicated sensitivity functions (which must now be a function of both the group size and the number of positives in the pool). This could especially be an issue when the test classification is a function of underlying continuous test output. If such issues could reasonably be suspected, it would be necessary to design the validation study which accounts for this issue.

One assumption we have made is that there is a complete lack of a-priori information on the underlying sensitivity function, necessitating the validation design to be non-parametric. However, in cases where researchers are able/willing to make certain simplifying assumptions (e.g., that sensitivity is linear in ) more efficient validation designs may be possible. In such cases, smaller validation studies could potentially make group testing feasible in a wider range of settings. However, given that the properties of the final design are sensitive to the correct specification of the dilution function, we generally make the recommendation of a non-parametric approach when designing important screening programs using group testing.

We have emphasized the importance of estimating the sensitivity and specificity for different size groups in the same population that we intend to screen. The validation study design assumes that sample is collected from a random sample of individuals from the population at hand and groups of varying size be randomly formed from these samples. There are different alternative designs for the validation sample that may lead to efficiency gains in some situations. For example, if we assume that the specificity is 1 for all group sizes (here, we assumed it was necessary to estimate these specificities in order to confirm this in our calculations), we may save resources by never grouping all negative samples together. Alternatively, rather than attempt to find the optimal design, we could simply evaluate the properties of a group testing design for a single fixed group size. If the false negative rate is too high, we could sequentially evaluate the properties for a smaller group size. This approach may be advantageous for the COVID-19 example, where it is more important to obtain a good design quickly than to spend more time to find the optimal design (i.e., the perfect is the enemy of the good).

An additional assumption we have made is that the underlying population is homogeneous with respect to the primary trait of interest. In many cases, this is reasonable as long as the validation sample is chosen representatively across the entire population and the subsequent samples are not grouped based on underlying heterogeneous clusters. The impact of heterogeneity will include additional challenges to determine the size of the validation sample and to ensure a feasible solution to the optimization problem (1). The issue is that even under the perfect test setting, we need to determine not only group sizes but also the members of the groups and number of such possibilities (number of the partition of the population) is astronomical even for the small population size. In fact, under error-free testing, the optimal partition is known only for the Dorfman procedure 23, 24. From a the practical perspective, Hwang’s method can be used for Dorfman’s procedure, and the methods developed in 29, 30 can be used for other group testing procedures. Another possibility, which also may be logistically easier to implement, is a stratification of the population, such that in each stratum, there is a homogeneous population. In such a case, the methodology developed in the present work can be used with respect to each stratum separately.

In this paper, we have focused exclusively on the Dorfman design. In the case of a homogeneous population, there are more efficient designs than Dorfman’s two-stage procedure.31, 32, 33 In many of these designs, the expected number of tests is not given in closed-form, but rather calculated using recursion or dynamic programming.29 In the presence of differential misclassification or dilution effects, expressions for the expected number of tests (an important component in the objective function to evaluate) are difficult to obtain in these cases.

Our work focused on the screening of a single disease. However, occasionally screening for multiple diseases from a single assay may be of interest. Group testing for disease screening for multiple diseases with test misclassification is an area for future research. With respect to feasibility, the subject of the current paper, we want to emphasize that any design would need a validation sample sized to be sufficient to estimate the more complex misclassification structure that would be required for such designs.


  • 1 Dorfman R. The Detection of Defective Members of Large Populations. The Annals of Mathematical Statistics 1943; 14(4): 436–440. doi: 10.1214/aoms/1177731363
  • 2 Aldridge M, Johnson O, Scarlett J. Group Testing: An Information Theory Perspective. Foundations and Trends in Communications and Information Theory 2019; 15(3-4):196–392.
  • 3 Bar-Lev SK, Boxma O, Stadje W, Van der Duyn Schouten FA. A Two-Stage Group Testing Model for Infections with Window Periods. Methodology and Computing in Applied Probability 2010; 12(3): 309–322. doi: 10.1007/s11009-008-9104-4
  • 4 Zhigljavsky A. Probabilistic Existence Theorems in Group Testing. Journal of Statistical Planning and Inference 2003; 115(1): 1–43. doi: 10.1016/S0378-3758(02)00148-9
  • 5 Macula AJ. Probabilistic Nonadaptive Group Testing in the Presence of Errors and DNA Library Screening. Annals of Combinatorics 1999; 3(1): 61–69. doi: 10.1007/BF01609876
  • 6 McMahan CS, Tebbs JM, Bilder CR. Informative Dorfman Screening. Biometrics 2012; 68(1): 287–296. _eprint: 10.1111/j.1541-0420.2011.01644.x
  • 7 World Health Organization . Consolidated Guidelines on the Use of Antiretroviral Drugs for Treating and Preventing HIV Infection: Recommendations for a Public Health Approach. . 2016. OCLC: 953973054.
  • 8 Kanyerere H, Girma B, Mpunga J, et al. Scale-up of ART in Malawi Has Reduced Case Notification Rates in HIV-Positive and HIV-Negative Tuberculosis. Public Health Action 2016; 6(4): 247–251. doi: 10.5588/pha.16.0053
  • 9 Rowley CF. Developments in CD4 and Viral Load Monitoring in Resource-Limited Settings. Clinical Infectious Diseases 2014; 58(3): 407–412. doi: 10.1093/cid/cit733
  • 10 El Bouzidi K, Grant P, Edwards S, et al. Pooled Specimens for HIV RNA Monitoring: Cheaper, but Is It Reliable?. Clinical Infectious Diseases 2014; 59(9): 1346–1347. doi: 10.1093/cid/ciu562
  • 11 Tilghman MW, Guerena DD, Licea A, et al. Pooled Nucleic Acid Testing to Detect Antiretroviral Treatment Failure in Mexico. JAIDS Journal of Acquired Immune Deficiency Syndromes 2011; 56(3): e70. doi: 10.1097/QAI.0b013e3181ff63d7
  • 12 Hernández-López R, Lorincz AT, Torres-Ibarra L, et al. Methylation Estimates the Risk of Precancer in HPV-Infected Women with Discrepant Results between Cytology and HPV16/18 Genotyping. Clinical Epigenetics 2019; 11(1): 140. doi: 10.1186/s13148-019-0743-9
  • 13 Clarke MA, Gradissimo A, Schiffman M, et al. Human Papillomavirus DNA Methylation as a Biomarker for Cervical Precancer: Consistency across 12 Genotypes and Potential Impact on Management of HPV-Positive Women. Clinical Cancer Research 2018; 24(9): 2194–2202. doi: 10.1158/1078-0432.CCR-17-3251
  • 14 Kelly H, Benavente Y, Pavon MA, Sanjose SD, Mayaud P, Lorincz AT. Performance of DNA Methylation Assays for Detection of High-Grade Cervical Intraepithelial Neoplasia (CIN2+): A Systematic Review and Meta-Analysis. British Journal of Cancer 2019; 121(11): 954–965. doi: 10.1038/s41416-019-0593-4
  • 15 Landgren O, Hofmann JN, McShane CM, et al. Association of Immune Marker Changes With Progression of Monoclonal Gammopathy of Undetermined Significance to Multiple Myeloma. JAMA Oncology 2019; 5(9): 1293–1301. doi: 10.1001/jamaoncol.2019.1568
  • 16 Kyle RA, Therneau TM, Rajkumar SV, et al. Prevalence of Monoclonal Gammopathy of Undetermined Significance. New England Journal of Medicine 2006; 354(13): 1362–1369. _eprint: 10.1056/NEJMoa054494
  • 17 Graff LE, Roeloffs R. Group Testing in the Presence of Test Error; An Extension of the Dorfman Procedure. Technometrics 1972; 14(1): 113–122. _eprint: 10.1080/00401706.1972.10488888
  • 18 Hwang FK. Group Testing with a Dilution Effect. Biometrika 1976a; 63(3): 671–680. doi: 10.1093/biomet/63.3.671
  • 19 Burns KC, Mauro CA. Group Testing with Test Error as a Function of Concentration. Communications in Statistics - Theory and Methods 1987; 16(10): 2821–2837. _eprint: 10.1080/03610928708829544
  • 20 Wein LM, Zenios SA. Pooled Testing for HIV Screening: Capturing the Dilution Effect. Operations Research 1996; 44(4): 543–569. doi: 10.1287/opre.44.4.543
  • 21 Malinovsky Y, Albert PS, Roy A. Reader Reaction: A Note on the Evaluation of Group Testing Algorithms in the Presence of Misclassification. Biometrics 2016; 72(1): 299–302. _eprint: 10.1111/biom.12385
  • 22 Aprahamian H, Bish DR, Bish EK. Optimal Risk-Based Group Testing. Management Science 2019; 65(9): 4365–4384. doi: 10.1287/mnsc.2018.3138
  • 23 Hwang FK. A Generalized Binomial Group Testing Problem. Journal of the American Statistical Association 1975; 70(352): 923–926. _eprint: 10.1080/01621459.1975.10480324
  • 24 Hwang FK. Optimal Partitions. Journal of Optimization Theory and Applications 1981; 34(1): 1–10. doi: 10.1007/BF00933355
  • 25 Hitt BD, Bilder CR, Tebbs JM, McMahan CS. The Objective Function Controversy for Group Testing: Much Ado about Nothing?. Statistics in Medicine 2019; 38(24): 4912–4923. _eprint: 10.1002/sim.8341
  • 26 Ransohoff D, Feinstein A. Problems of Spectrum and Bias in Evaluatin the Efficacy of Diagnostic Tests. The New England Journal of Medicine 1978; 299: 926–930.
  • 27 Nicholas S, Poulet E, Wolters L, et al. Point-of-care Viral Load Monitoring: Outcomes from a Decentralized HIV Programme in Malawi. Journal of the International AIDS Society 2019; 22(8). doi: 10.1002/jia2.25387
  • 28 McQuillan G, Unger ER. Prevalence of HPV in Adults Aged 18–69: United States, 2011–2014. 2017(280): 8.
  • 29 Malinovsky Y. Sterrett Procedure for the Generalized Group Testing Problem. Methodology and Computing in Applied Probability 2019; 21(3): 829–840. doi: 10.1007/s11009-017-9601-4
  • 30 Malinovsky Y, Haber G, Albert PS. An Optimal Design for Hierarchical Generalized Group Testing. Applied Statistics 2020; In press.
  • 31 Sterrett A. On the Detection of Defective Members of Large Populations. The Annals of Mathematical Statistics 1957; 28(4): 1033–1036.
  • 32 Sobel M, Groll PA. Group Testing To Eliminate Efficiently All Defectives in a Binomial Sample. Bell System Technical Journal 1959; 38(5): 1179–1252. _eprint: 10.1002/j.1538-7305.1959.tb03914.x
  • 33 Hwang FK. An Optimum Nested Procedure in Binomial Group Testing. Biometrics 1976b; 32(4): 939–943. doi: 10.2307/2529277