Identification of infected patients from a large population using clinical tests, such as blood tests and polymerase chain reaction tests, requires significant operating costs. Group testing is one approach to reducing such costs by performing tests on pools of specimens obtained from patients [1, 2]. When the fraction of infected patients in a population is sufficiently small, the infected patients can be identified with fewer tests on pools than there are patients in the population. Group testing was originally developed for blood testing by Dorfman [1], and is now applied to various fields such as quality control in product testing [3] and multiple access communication [4].
Group testing is roughly classified into non-adaptive and adaptive methods. In non-adaptive group testing, all pools are determined in advance and remain fixed throughout testing. In adaptive group testing, pools are designed sequentially, depending on the previous test outcomes. Dorfman’s original study considered the simplest adaptive procedure, the so-called two-stage testing: in the first stage, tests are performed on pools designed in advance, and in the second stage, all patients belonging to positive pools are tested individually. A generalization of two-stage testing is known as the binary splitting method [5, 6], where each positive pool from the previous stage is split into two subpools, and tests are performed on the subpools in subsequent stages until the infected patients are identified. Further, splitting the positive pools into more than two subsets sometimes reduces the number of tests required for identifying the infected patients [7]. These splitting-based methods are effective when the number of infected patients is sufficiently small. However, they are limited in their ability to correct false negative results because patients in negative pools are never tested again, even when the negative result is false.
In contrast to splitting-based designs, the active design of data sampling has been studied in statistics and machine learning under the names of experimental design [8, 9], active learning [10, 11], and Bayesian optimization [12, 13]. In these approaches, the optimal way to select training data for efficient learning is constructed under criteria that quantify the informativeness of the unknown data. Active design of data sampling is known to improve the performance of algorithms in several applications, such as text classification [14] and support vector machines [15]. Active data sampling is particularly effective when the data are uncertain owing to a noisy generative process and the number of samples that can be acquired is limited. In the context of group testing, active sampling of data corresponds to the active design of pools for the subsequent stage. Further, the presence of noise in tests and the requirement to reduce the number of tests match the situation where active sampling is expected to make a significant contribution.
In this paper, we propose an active pooling design method employing Bayesian inference for efficient identification of infected patients using group testing. Bayesian modeling can take into account the finite false probabilities of the test and provides a measure to quantify the uncertainty, namely the posterior predictive distribution. We sequentially design pools based on the predictive distribution in adaptive group testing. The procedure is executed using a statistical-physics-based algorithm, belief propagation (BP) [16, 17, 18, 19], which achieves a reasonable approximation of the estimates at a feasible computational cost. We show that the proposed pooling method effectively corrects errors with a smaller number of tests, as compared to randomly generated pools.
II Mathematical formulation
Let us denote the true state of -patients by , where and indicate that the -th patient is infected and not infected, respectively. The pooling of the patients is determined by a matrix , where is the number of pools and and indicate that the -th patient is in the -th pool and is not, respectively. The true state of the -th group, denoted by , where is the -th row vector of , is given by , where denotes the logical sum of components. Namely, when the -th pool contains at least one infected patient, the state of the -th pool is 1 (positive); otherwise, it is 0 (negative).
The test error is modeled by a function that returns 0 or 1 according to the probability conditioned by the input as
and and correspond to the true-positive (TP) and false-positive (FP) probabilities in the test, respectively [18, 20]. We assume that the test errors are independent of each other, and from the property of , the generative model of is given by , where
is a Bernoulli distribution conditioned by and .
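As an illustration, the generative model described above can be sketched in Python. This is a minimal sketch under stated assumptions: the names `F` (pooling matrix), `x_true` (true states), `p_tp`, and `p_fp`, as well as the toy pooling matrix, are introduced here for illustration and are not the notation or the MATLAB code of this study.

```python
import numpy as np

def pool_states(F, x):
    """True pool states: logical sum (OR) of the patient states in each pool."""
    # F: (M, N) 0/1 pooling matrix, x: (N,) 0/1 patient states (assumed layout).
    return (F @ x > 0).astype(int)

def noisy_tests(F, x, p_tp, p_fp, rng):
    """Independent test outcomes: a positive pool is reported positive with
    probability p_tp, a negative pool with probability p_fp."""
    t = pool_states(F, x)
    p_positive = np.where(t == 1, p_tp, p_fp)
    return (rng.random(len(t)) < p_positive).astype(int)

rng = np.random.default_rng(0)
x_true = np.array([0, 1, 0, 0, 0, 0])      # hypothetical example: one infected patient
F = np.array([[1, 1, 0, 0, 0, 0],
              [0, 0, 1, 1, 0, 0],
              [0, 1, 0, 0, 1, 1]])
print(pool_states(F, x_true))              # pools 0 and 2 contain the infected patient
y = noisy_tests(F, x_true, p_tp=0.9, p_fp=0.05, rng=rng)
print(y)
```

Setting `p_tp=1.0` and `p_fp=0.0` in this sketch recovers the noiseless case, in which the outcomes coincide with the true pool states.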
The goal of the current problem is to infer the true states of patients from the observation . To this end, we use Bayes’ formula, introducing the prior distribution of the patient states , where is the assumed infection probability. Following Bayes’ rule, the posterior distribution is given by . The -th patient’s state is identified on the basis of the marginal distribution , where denotes the components of other than . As the variable is binary, we can represent the marginal distribution using a Bernoulli probability as
and corresponds to the infection probability estimated under the test result , namely, the probability that . We have to convert the returned probability to a binary value for the identification of the patients’ states. The simplest estimate of is the maximum a posteriori (MAP) estimator given by
where is the indicator function whose value is 1 when is true, and 0 otherwise.
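For a small number of patients, the posterior marginals and the MAP estimator can be evaluated exactly by enumerating all states. The sketch below does so under assumptions made here for illustration: a factorized prior with infection probability `rho`, the test model with `p_tp` and `p_fp`, and hypothetical function names. It is exponential in the number of patients and only serves to make the definitions concrete.

```python
import itertools
import numpy as np

def exact_marginals(F, y, p_tp, p_fp, rho):
    """Posterior infection probabilities by brute-force enumeration of 2^N states."""
    n = F.shape[1]
    weighted_sum = np.zeros(n)
    total = 0.0
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits)
        t = (F @ x > 0).astype(int)                      # true pool states under x
        p_pos = np.where(t == 1, p_tp, p_fp)             # P(outcome = 1 | pool state)
        likelihood = np.prod(np.where(y == 1, p_pos, 1.0 - p_pos))
        prior = np.prod(np.where(x == 1, rho, 1.0 - rho))
        w = likelihood * prior                           # unnormalized posterior weight
        total += w
        weighted_sum += w * x
    return weighted_sum / total

def map_estimate(theta):
    """Threshold the marginal infection probabilities at 1/2."""
    return (np.asarray(theta) > 0.5).astype(int)

F = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]])
y = np.array([1, 1, 0])                                  # hypothetical outcomes
theta = exact_marginals(F, y, p_tp=0.9, p_fp=0.05, rho=0.1)
print(theta, map_estimate(theta))
```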
III Adaptive design of pools
Here, we divide -tests into -tests under pools fixed in advance as the initial stage and -tests sequentially performed on actively designed pools as the adaptive stage. Hence, . We denote the index set of patients who are in the -th pool as , where for ; otherwise, 0. We consider the determination of among possible pools denoted by based on the -th test outcomes, denoted by , which are performed on pools . The predictive distribution for the unknown result of the test , which will be performed on a certain pool , is defined as
where and . By setting , which is the estimated probability under given that all patients in the pool are not infected, the predictive distribution is expressed as
The predictive distribution measures the adequacy of the posterior distribution in describing unknown data, and is used as a modeling criterion in Bayesian inference [21]. We use the predictive distribution for the active design of pools. For an intuitive discussion, let us consider the case in which and are significantly different, being close to 1 and close to 0, respectively. This means that the posterior distribution is consistent with a new observation performed on the pool , in the sense that the current posterior matches the new test result , and is supposed to be a test error. We do not consider such an ‘explainable pool’ in the subsequent stage, because a test performed on it is not expected to improve the current posterior. Instead, we consider a pool that gives comparable and , for which the posterior at step cannot explain the test result on the pool ; the test result is therefore expected to correct the posterior so that it can explain the result.
This strategy can be expressed by the maximization of the predictive entropy at step defined as
where gives the entropy maximum. Active design of data sampling based on the entropy maximization is known as uncertainty sampling in active learning [22, 23]. As shown in eq.(7), the predictive entropy is expressed by one parameter . Regarding the predictive entropy as a function of , the maximum of the predictive entropy is achieved at given by
where is assumed. We determine the -th pool as
The remaining task is the calculation of for possible under the given test results . The mathematical form of depends on the size of , denoted by . When , we obtain
For larger pools, the correlation between the patients in the pool should be considered for the exact evaluation of . For example, when , we obtain
where is the susceptibility and denotes the average according to the posterior distribution .
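The role of the correlation for a pool of two patients can be checked numerically. The sketch below assumes that the susceptibility is the posterior covariance of the two patient states (an assumption made here), and uses a hypothetical joint posterior table `joint[a, b]` for one pair. Under this assumption, the probability that both patients are uninfected equals the one-body product plus the susceptibility.

```python
import numpy as np

def pair_all_negative(joint):
    """joint[a, b] = P(X_i = a, X_j = b) for one pair of patients (assumed input)."""
    theta_i = joint[1, :].sum()                  # marginal infection probability of i
    theta_j = joint[:, 1].sum()                  # marginal infection probability of j
    chi = joint[1, 1] - theta_i * theta_j        # covariance, taken as the susceptibility
    one_body = (1.0 - theta_i) * (1.0 - theta_j) # product of one-body terms
    return one_body, chi, joint[0, 0]

joint = np.array([[0.70, 0.10],
                  [0.05, 0.15]])                 # hypothetical joint posterior
one_body, chi, exact = pair_all_negative(joint)
# The exact probability that both are uninfected is the one-body product plus chi.
print(one_body, chi, exact)
```

When the susceptibility is neglected, only the product of the one-body terms remains.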
Next, we discuss the relationship between , , and . From the definition of , eq.(9), if , then . This indicates that the pools with are likely to be chosen when . In other words, when the probability that at least one patient in a pool is infected is larger than the probability that no one is infected, the pool tends to be chosen. This can be understood as follows. Introducing false negative probability , is equivalent to . This means that false test results are mainly contained in positive results. Hence, pools with contain significant uncertainty, as compared to . Therefore, in the active pooling design based on uncertainty, pools with are preferably chosen, when . Following the same logic, we can understand that the pool with is likely to be chosen when .
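The dependence of the predictive entropy on a single parameter can be made concrete with a short numerical check. The sketch below assumes the form that follows from the test model: if `theta` is the posterior probability that nobody in the pool is infected, the next test is positive with probability `p_fp * theta + p_tp * (1 - theta)`. The closed-form maximizer in `theta_star` is derived here under the additional assumption `p_fp < 1/2 < p_tp`, and all names are illustrative.

```python
import numpy as np

def predictive_entropy(theta, p_tp, p_fp):
    """Entropy of the predicted outcome for a pool whose members are all
    uninfected with posterior probability theta."""
    q = p_fp * theta + p_tp * (1.0 - theta)      # predicted probability of a positive result
    q = np.clip(q, 1e-12, 1.0 - 1e-12)           # guard against log(0)
    return -q * np.log(q) - (1.0 - q) * np.log(1.0 - q)

def theta_star(p_tp, p_fp):
    """Value of theta at which the predicted outcome is equally likely 0 or 1
    (assumes p_fp < 1/2 < p_tp)."""
    return (p_tp - 0.5) / (p_tp - p_fp)

p_tp, p_fp = 0.9, 0.05
grid = np.linspace(0.0, 1.0, 10001)
numeric = grid[np.argmax(predictive_entropy(grid, p_tp, p_fp))]
print(theta_star(p_tp, p_fp), numeric)           # the two values should agree closely
```

Under this form, the maximizer lies below 1/2 exactly when `p_fp < 1 - p_tp`, which is the case for the values used in this sketch.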
IV Implementation by belief propagation
The computation of the marginal distribution requires a number of sums that grows exponentially, and is thus intractable. We approximately calculate the marginal distribution using the BP algorithm [17, 18, 19]. Comparison with the exact calculation at small system sizes shows that the BP algorithm achieves sufficient approximation accuracy when applied to group testing. In this study, we use the BP algorithm as a reasonable method in terms of both approximation accuracy and computational time. The BP algorithm for calculating the infection probability given by the posterior distribution is summarized in Appendix A. We denote the obtained estimates of and the corresponding MAP estimator as and , respectively. We measure the accuracy of the MAP estimator by the TP and FP rates given by
respectively. A TP value larger than and an FP value smaller than indicate that the BP-based identification has better performance than the parallel test of -patients.
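A minimal sketch of how such rates can be computed from the true states and the estimated states is given below. It assumes the standard definitions (fraction of infected patients estimated as infected, and fraction of uninfected patients estimated as infected) and at least one patient of each kind; the function name is hypothetical.

```python
import numpy as np

def tp_fp_rates(x_true, x_hat):
    """TP rate: infected patients identified as infected, divided by all infected.
    FP rate: uninfected patients identified as infected, divided by all uninfected."""
    x_true = np.asarray(x_true)
    x_hat = np.asarray(x_hat)
    tp = np.sum((x_true == 1) & (x_hat == 1)) / np.sum(x_true == 1)
    fp = np.sum((x_true == 0) & (x_hat == 1)) / np.sum(x_true == 0)
    return tp, fp

# Hypothetical example: one of two infected patients is found, one false alarm.
tp, fp = tp_fp_rates([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
print(tp, fp)
```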
To apply the BP algorithm to adaptive testing, we need to obtain for each . Its exact computation requires multibody correlations between patients except when , whereas the BP algorithm returns only one-body information. In this study, we use the simplest approximation provided by the BP algorithm, , where is the -th component in the pool , to avoid the increase in computational time required for calculating multibody correlations. Further, to reduce the time required to compute for all possible , we focus on the subspaces of pools and ; hence, . In principle, BP can approximately compute the correlation between patients by deriving conditional posterior expectations, which requires additional computations of the order of according to the product rule of conditional joint distributions. As an example, we calculate the susceptibility using the BP algorithm and implement the active pooling design on the basis of for the case, as shown in Appendix B. Considering the susceptibility does not provide large improvements in the TP and FP rates in the problem setting studied herein. Hence, we use the one-body approximation throughout the study.
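The pool selection under the one-body approximation can be sketched as follows. The sketch assumes that the probability of an all-negative pool is approximated by the product of the individual non-infection probabilities, that the candidate pools are all pools up to a maximum size, and that maximizing the predictive entropy is equivalent to choosing the pool whose approximate value is closest to the entropy-maximizing value `(p_tp - 0.5) / (p_tp - p_fp)` (derived here assuming `p_fp < 1/2 < p_tp`). The function name, the `exclude` argument, and the numbers are illustrative.

```python
import itertools
import numpy as np

def select_pool(theta_inf, p_tp, p_fp, max_size=2, exclude=()):
    """Pick the pool (tuple of patient indices) whose one-body estimate of the
    all-negative probability is closest to the entropy-maximizing target."""
    target = (p_tp - 0.5) / (p_tp - p_fp)
    not_inf = 1.0 - np.asarray(theta_inf)        # one-body non-infection probabilities
    best, best_gap = None, np.inf
    for k in range(1, max_size + 1):
        for pool in itertools.combinations(range(len(not_inf)), k):
            if pool in exclude:                  # optionally skip pools already tested
                continue
            gap = abs(np.prod(not_inf[list(pool)]) - target)
            if gap < best_gap:
                best, best_gap = pool, gap
    return best

theta_inf = [0.01, 0.5, 0.9]                     # hypothetical infection probabilities
print(select_pool(theta_inf, p_tp=0.9, p_fp=0.05, max_size=1))
print(select_pool(theta_inf, p_tp=0.9, p_fp=0.05, max_size=2))
```

In this toy example, allowing pools of size two changes the selection, because pairing the uncertain patient with an almost certainly uninfected one moves the estimate slightly closer to the target.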
The setting of the numerical simulation described in this section is as follows. Let us denote the longitudinal coupling of matrices or vectors and that have the same number of columns as . Hence, . The submatrix of given by from the -st to the -th row vectors is denoted by ; hence, . The pooling matrix for the initial stage, , is randomly generated under the constraint that the number of pools each patient belongs to and the number of patients in each pool are fixed at and , respectively. Hence, for and hold, and the relationship holds. The corresponding test result in the initial stage, , is generated as . The posterior distribution under given and is approximately calculated using the BP algorithm. For the subsequent adaptive stage, we actively choose among or based on the predictive entropy given by the posterior distribution of the initial stage. Next, we construct , so that for ; otherwise, 0. The test result is generated as , and we obtain the posterior distribution under and using the BP algorithm. This adaptive test procedure is repeated -times, where , and the state of patients is determined by the MAP estimator corresponding to , where and . The pseudocode is summarized in Algorithm 1, where indicates the calculation of the infection probability using the BP algorithm under the input and (see Appendix A).
The true state of patients is randomly generated under the constraint that . Here, we assume that the correct parameters , , and are known in advance. For more general cases where unknown parameters must be estimated, we can construct their estimators by combining the BP algorithm with the expectation-maximization method, or by introducing a hierarchical Bayes model.
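The overall adaptive procedure can be sketched for a toy problem as follows. This is not the implementation used in this study: the BP step is replaced by exact enumeration of all states, which is feasible only for a handful of patients, the pool selection uses the one-body approximation with the target value derived from the test model (assuming `p_fp < 1/2 < p_tp`), and all names and parameter values are assumptions made for illustration.

```python
import itertools
import numpy as np

def marginals(F, y, p_tp, p_fp, rho):
    """Exact posterior infection probabilities (stand-in for BP; small N only)."""
    n = F.shape[1]
    states = np.array(list(itertools.product([0, 1], repeat=n)))   # (2^n, n)
    t = states @ F.T > 0                                           # pool states per state
    p_pos = np.where(t, p_tp, p_fp)
    lik = np.prod(np.where(y == 1, p_pos, 1.0 - p_pos), axis=1)
    prior = np.prod(np.where(states == 1, rho, 1.0 - rho), axis=1)
    w = lik * prior
    return (w @ states) / w.sum()

def run_test(pool_row, x, p_tp, p_fp, rng):
    """One noisy test on the pool encoded by a 0/1 row vector."""
    positive = pool_row @ x > 0
    return int(rng.random() < (p_tp if positive else p_fp))

def adaptive_testing(x_true, F_init, n_adaptive, p_tp, p_fp, rho, rng, max_size=2):
    n = len(x_true)
    F = F_init.copy()
    y = np.array([run_test(row, x_true, p_tp, p_fp, rng) for row in F])
    target = (p_tp - 0.5) / (p_tp - p_fp)        # entropy-maximizing all-negative probability
    candidates = [c for k in range(1, max_size + 1)
                  for c in itertools.combinations(range(n), k)]
    for _ in range(n_adaptive):
        not_inf = 1.0 - marginals(F, y, p_tp, p_fp, rho)
        # One-body approximation: product of individual non-infection probabilities.
        pool = min(candidates, key=lambda c: abs(np.prod(not_inf[list(c)]) - target))
        row = np.zeros(n, dtype=int)
        row[list(pool)] = 1
        F = np.vstack([F, row])                  # append the new pool
        y = np.append(y, run_test(row, x_true, p_tp, p_fp, rng))
    theta = marginals(F, y, p_tp, p_fp, rho)
    return (theta > 0.5).astype(int), F, y       # MAP estimate and full test record

rng = np.random.default_rng(0)
x_true = np.array([0, 1, 0, 0, 0, 0])
F_init = np.array([[1, 1, 1, 0, 0, 0],
                   [0, 0, 0, 1, 1, 1]])
x_hat, F, y = adaptive_testing(x_true, F_init, n_adaptive=6,
                               p_tp=0.9, p_fp=0.05, rho=1 / 6, rng=rng)
print(x_hat, F.shape)
```

Replacing `marginals` with a BP routine would give the structure described in the text, at the cost of an approximation in the marginals.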
Fig.1 shows the -dependence of (a) TP and (b) FP at , with and . The error probabilities are set at and , and the group size in the initial stage is . and in the figure denote the results of the active pooling in the spaces and , respectively [24]. For comparison, the results of random pooling are shown, where the tests in steps are performed on random pools generated by the same rule as the tests of the initial stage. Each data point represents the value averaged over 100 realizations of , and . For any region of , TP under random testing cannot exceed , which is indicated by the horizontal line in Fig.1 (a). The adaptive test improves TP and achieves when for the case and for the case. As shown in Fig.1 (b), FP is smaller than even when the pooling is randomly determined, but the adaptive test can further decrease FP.
The performance of the adaptive test depends on the number of initial random tests . Fig.2 shows the -dependence of (a) TP and (b) FP at , , , and . The pool size in the initial stage is . The results for , 400, and 500 are shown. The horizontal dashed line in (a) indicates 0.9, which is the TP probability of the test. As increases, the adaptive testing leads to high TP close to 1. Moreover, for large such as , the possible pooling space does not influence the performance significantly in terms of TP and FP. Meanwhile, for small , the result of TP depends on the pooling space, and more accurate identification is achieved by .
The -dependence shown in Fig.2 can be understood intuitively: inference under small leaves large uncertainty in identifying the infected patients, and hence a larger pooling space is required to effectively modify the posterior in the adaptive stage. We quantify the uncertainty remaining in the posterior using the predictive entropy. We consider the expected predictive entropy , where denotes the expectation with respect to test results and pools whose sizes are . The possible maximum value of is given by the prior information before any tests are performed, and its value is obtained by replacing with . Meanwhile, the minimum value of is achieved when the infected patients are known, and is given by
where is the binary entropy. The first and second terms of eq.(14) correspond to the uncertainty caused by the test errors on the positive pools and the negative pools, respectively. Eq.(14) does not contain any uncertainty with respect to the identification of the infected patients. The effectiveness of the active pooling approach can be understood from how closely the tests on the pools reduce to the minimum value. Fig.3 shows at , , , and for different values of ; (a) and (b) . The corresponding values of TP and FP are shown in Fig.2. For comparison, after the initial stage is shown by the solid line labeled ‘Initial Stage’. In the case, active pooling among gives a larger decrease in ; hence, the larger pooling space containing -pools is expected to lead to more accurate and efficient estimation. In the case of , the active stage leads to almost the same value as the possible minimum regardless of the pooling space. Therefore, no particular effort in setting the pooling space is required for large .
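The structure of the minimum value described above, one term for positive pools and one for negative pools, can be illustrated numerically. The sketch below assumes that the expectation over pools is uniform over all pools of a fixed size and that the infected patients are known exactly, so that only the test errors contribute; the function names and the example state are hypothetical, and the expression is not claimed to reproduce eq.(14) term by term.

```python
import itertools
import math

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli variable with parameter p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def min_expected_entropy(x_true, pool_size, p_tp, p_fp):
    """Expected predictive entropy when the infected patients are known:
    positive pools contribute H(p_tp), negative pools contribute H(p_fp)."""
    pools = list(itertools.combinations(range(len(x_true)), pool_size))
    # Fraction of candidate pools that contain at least one infected patient.
    frac_pos = sum(any(x_true[i] for i in pool) for pool in pools) / len(pools)
    return frac_pos * binary_entropy(p_tp) + (1.0 - frac_pos) * binary_entropy(p_fp)

x_true = [1, 0, 0, 0]                            # hypothetical known state
print(min_expected_entropy(x_true, 1, p_tp=0.9, p_fp=0.05))
print(min_expected_entropy(x_true, 2, p_tp=0.9, p_fp=0.05))
```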
The active pooling method is robust to the errors in the test, as compared to the random pooling. Fig.4 shows (a) the -dependence of TP for and (b) the -dependence of TP for at , , , and . The random tests in the initial stage are performed on pools of size . For the random pooling case, is achieved only when is sufficiently small such as . The adaptive test improves TP, and the parameter region where is extended in particular for the case .
These results indicate the efficiency of the active pooling design based on the predictive distribution in group testing. A disadvantage of this approach is its higher computational cost compared with the non-adaptive approach. We repeat the estimation of the infection probability by the BP algorithm for steps; hence, roughly speaking, the computational cost of the adaptive approach is times larger than that of the non-adaptive approach. However, the adaptive approach achieves accurate estimation using a smaller number of tests. The trade-off between the reduction of the operating cost of the tests and the increase in the computation time of inference should be considered for the practical use of the adaptive approach.
V Summary and discussion
In this study, we proposed an active pooling design for adaptive group testing, where the pool for the subsequent stage is determined based on the Bayesian posterior predictive distribution under the test outcomes of the previous stages. The proposed method was implemented using the BP algorithm, and it was shown that the identification of infected patients using adaptive tests is more accurate than that using randomly designed pools. In particular, the active pooling design reduced the number of tests required to achieve . Further, the proposed method is robust to test errors, and holds at smaller and larger than with randomly designed pooling.
In the current study, we restrict the possible pooling space to and . Mathematically, more uncertain pools can be taken into account by removing this restriction, and further improvement in the TP and FP rates is expected. However, the straightforward calculation of the predictive entropy for all possible is computationally intractable. Hence, some approximation will be required. An efficient sampling method, such as the Markov chain Monte Carlo method, should be developed to find uncertain pools in .
We focused on the MAP estimator to convert the estimated infection probability, which is a variable, into the state of the patients, which is a variable, because of its simplicity. However, it is known that changing the decision threshold from 0.5 can improve the TP rate. For example, an estimate using a confidence interval constructed on the basis of the bootstrap method has been proposed, and it achieves a higher TP rate than the MAP estimator, but its computational cost is too high to accompany the active pooling procedure. Receiver operating characteristic (ROC) analysis is a promising method for understanding the appropriate decision threshold [25, 26]. Along with the ROC analysis, the mathematical background of the active pooling proposed in this paper is expected to be established.
The MATLAB code used in this study is available on GitHub at https://github.com/AyakaSakata/GroupTesting.
Acknowledgements. This work was accomplished thanks to pleasant discussions with Yukito Iba. Further, the author thanks Koji Hukushima, Yoshiyuki Kabashima, and Satoshi Takabe for helpful comments and discussions. This research was partially supported by Grant-in-Aid for Scientific Research 19K20363 from the Japan Society for the Promotion of Science (JSPS) and JST PRESTO Grant Number JPMJPR19M2, Japan.
Appendix A BP algorithm for group testing
We denote and as the indices of the patients in the -th pool and those of the pools in which the -th patient is included, respectively. For the edge that connects the -th factor (test) and the -th variable (patient), two types of messages and are defined. Intuitively, the messages and represent the marginal distributions of before and after the -th test is performed, respectively. The variable is binary. Hence, the messages can be expressed by the Bernoulli probability and given by
where , and
Appendix B Calculation of susceptibility using the BP algorithm
A general approach to computing the susceptibility within the BP framework is susceptibility propagation [27, 28], where a recursive update of tensors that give the susceptibility is introduced on the basis of linear-response theory. In the current problem setting, the variables to be estimated obey the Bernoulli probability. Hence, we can compute the susceptibility in a simpler way.
Let us denote the expectation of under the posterior conditional distribution as . This expectation value is evaluated using the BP algorithm by fixing and for . The conditional expectation value obtained using the BP algorithm is denoted as . Thus, the susceptibility is given by . We can show that the symmetry holds.
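The relation between the susceptibility and a conditional expectation can be verified on a small example. The sketch below assumes that the susceptibility is the posterior covariance of two patient states and that the conditioning fixes the second patient to the infected state. It uses a randomly generated joint distribution over three binary variables as a stand-in for the posterior, with exact sums in place of BP. All names are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
states = np.array(list(itertools.product([0, 1], repeat=3)))   # all states of 3 patients
prob = rng.random(len(states))
prob /= prob.sum()                                             # hypothetical joint posterior

def susceptibility_direct(i, j):
    """Covariance of X_i and X_j under the joint distribution."""
    mean = prob @ states
    return prob @ (states[:, i] * states[:, j]) - mean[i] * mean[j]

def susceptibility_conditional(i, j):
    """Same quantity from the expectation of X_i conditioned on X_j = 1."""
    mean = prob @ states
    mask = states[:, j] == 1
    cond_mean_i = prob[mask] @ states[mask, i] / prob[mask].sum()
    return mean[j] * (cond_mean_i - mean[i])

print(susceptibility_direct(0, 1), susceptibility_conditional(0, 1))
print(susceptibility_conditional(1, 0))                        # symmetric in i and j
```

With exact conditional expectations the two expressions agree and are symmetric; with BP estimates, agreement holds only up to the approximation error.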
To check the accuracy of the susceptibility derived using the BP algorithm, we compute the exact posterior distribution by enumerating all configurations in . Examples of the exact susceptibility and the approximated one are shown in Fig.5(a) at , , , , and , where the -dependence of is shown for two different realizations of , and . Here, the pooling matrix is randomly generated to be and . The difference between and is quantified by , whose behavior is shown in Fig.5 (b) at and different values of , , and . For any parameter region, is . Therefore, we consider that the BP algorithm provides a reasonable approximation of the susceptibility and expect that it is also applicable for larger .
In Fig.6, (a) TP and (b) FP are shown for the cases when the susceptibility is considered (denoted by ‘: with ’) and not considered (denoted by ‘: without ’); namely, eq.(12) is used to determine the pool in the subsequent stage by substituting calculated by BP into at , , , and . Each data point is averaged over 50 samples of , , and . The initial stage consists of random tests with and . The case is compared to the random case with the same test at the initial stage. Considering the susceptibility, a slight improvement in TP is observed.
Following the same procedure, we can compute a higher order correlation in principle. For example, is obtained by fixing , , and , for and .
- [1] R. Dorfman, Ann. Math. Statist. 14, 436 (1943).
- [2] D.-Z. Du and F. K. Hwang, Combinatorial Group Testing and Its Applications (World Scientific, 2000).
- [3] M. Sobel and P. A. Groll, Bell System Technical Journal 38, 1179 (1959).
- [4] J. K. Wolf, IEEE Transactions on Information Theory 31, 185 (1985).
- [5] M. Sobel and P. A. Groll, Bell System Technical Journal 38, 1179 (1959).
- [6] M. Sobel and P. A. Groll, Technometrics 8, 631 (1966).
- [7] F. K. Hwang, Journal of the American Statistical Association 67, 605 (1972).
- [8] V. V. Fedorov, Theory of Optimal Experiments (Academic Press (New York), 1972).
- [9] F. Pukelsheim, Optimal Design of Experiments (Academic Press (New York), 1972).
- [10] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, Journal of Artificial Intelligence Research 4, 129 (1996).
- [11] B. Settles, Active learning literature survey, Tech. Rep. (University of Wisconsin-Madison Department of Computer Sciences, 2009).
- [12] E. Brochu, V. M. Cora, and N. de Freitas, A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning, https://arxiv.org/abs/1012.2599 (2010).
- [13] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, Proceedings of the IEEE 104, 148 (2015).
- [14] X. Zhu, J. Lafferty, and Z. Ghahramani, in ICML-2003 Workshop on The Continuum from Labeled to Unlabeled Data (2003) pp. 58–65.
- [15] S. Tong and D. Koller, Journal of Machine Learning Research 2, 45 (2001).
- [16] M. Mézard and A. Montanari, Information, Physics, and Computation (Oxford University Press, 2009).
- [17] M. Mézard, M. Tarzia, and C. Toninelli, Journal of Physics: Conference Series 95, 012019 (2008).
- [18] D. Sejdinovic and O. Johnson, in 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton) (IEEE, 2010) pp. 998–1003.
- [19] T. Kanamori, H. Uehara, and M. Jimbo, Journal of Statistical Theory and Practice 6, 220 (2012).
- [20] A. Sakata, Journal of the Physical Society of Japan 89, 084001 (2020).
- [21] G. Kitagawa, Communications in Statistics - Theory and Methods 26, 2223 (1997).
- [22] D. D. Lewis and W. A. Gale, in ACM SIGIR Conference on Research and Development in Information Retrieval (ACM/Springer, 1994) pp. 3–12.
- [23] D. D. Lewis and J. Catlett, in International Conference on Machine Learning (ICML) (Morgan Kaufmann, 1994) pp. 148–156.
- [24] We note the heuristics used in the simulation. When is sufficiently small such as , BP for the active pooling sometimes does not converge. This is due to overlapping pools; that is, certain pools are selected several times in the adaptive stage. It is known that rank deficiency can cause instability of BP. To avoid this problem, we exclude already existing pools from the candidates in the subsequent stage for the small- case.
- [25] R. Kumar and A. Indrayan, Indian Pediatr 48, 277 (2011).
- [26] K. Hajian-Tilaki, Caspian J Intern Med 4, 627 (2013).
- [27] M. Mézard and T. Mora, J Physiol Paris 103, 107 (2009).
- [28] M. Yasuda and K. Tanaka, Physical Review E 87, 012134 (2013).