Bayesian inference of infected patients in group testing with prevalence estimation

04/28/2020 ∙ by Ayaka Sakata, et al. ∙ 0

Group testing is a method of identifying infected patients by performing tests on a pool of specimens collected from patients. For the case in which the test returns a false result with finite probability, we propose Bayesian inference and a corresponding belief propagation (BP) algorithm to identify the infected patients from the results of tests performed on the pool. We show that the true-positive rate is improved by taking into account the credible interval of a point estimate of each patient. Further, the prevalence and the error probability in the test are estimated by combining an expectation-maximization method with the BP algorithm. As another approach, we introduce a hierarchical Bayes model to identify the infected patients and estimate the prevalence. By comparing these methods, we formulate a guide for practical usage.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 9

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

In clinical testing methods such as blood tests and polymerase chain reaction (PCR) tests, discovering infected patients from a large population requires significant operating costs. Because of limitations in the number of available devices, reagents, and technologists, a high demand exists for more efficient methods of testing. Group testing is one of the approaches for the reduction of the operating costs by performing tests on pools of specimens obtained from patients [1, 2]. It is known that with the assumption that the rate of the infected patients in the population is sufficiently small, in principle, one can identify the infected patients from the tests on pools whose number is smaller than that of the population. Originally, group testing was developed for blood testing during World War II, and is now applied to various fields such as quality control in product testing [3], estimation of the content of genetically mutated organisms in maize grains [4], and multiple access communication [5].

The identification of the infected patients from the results of group test is mathematically formulated as a channel coding problem to reconstruct the original signal from a codeword transferred through noisy channel [6], where the original signal, codeword, and noisy channel correspond to the state of patients, states of the pools, and errors in the tests, respectively. Further, group testing can be regarded as a variant of compressed sensing [7, 8] with discrete variables and logical sums. Hence, the progress in the last decade in sparse estimation including compressed sensing has revived interest in group testing, and information-theory approaches have achieved bound evaluation of group testing at a limiting case [9, 10].

Recently, in response to the epidemic infection of COVID-19 that requires testing on large populations, the idea of group testing has attracted increasing attention [11, 12, 13] from the viewpoint of practical application rather than mathematics. In practice, clinical testing sometimes results in errors even when the operation is precise. For example, in PCR tests, false negative probabilities of up to 3% and false positive (FP) probabilities of up to 5% have been observed [14]. Moreover, the bacterial or viral load in the specimen depends on the specimen collection method and timing [15]. Therefore, a specimen sometimes does not contain a sufficient amount of the pathogen to exceed the detection limit, even when the patient is infected. Further, post-collection contamination of pathogens into a specimen can cause a positive result even when the patient is not infected. Statistical inference can contribute to the correction of errors by estimating the true state of patients from noisy test data, and quantifying the credibility of the estimation.

In this paper, we introduce Bayesian inference to identify the infected patients in the group testing problem considering the finite false probabilities in the test. The infection probability of each patient is approximately calculated by a belief propagation (BP) algorithm because of its low computational cost [16], although the BP algorithm does not achieve the information theoretic bound [10]. Our contributions with regard to the BP algorithm are as follows:

(i) The true positive (TP) ratio, namely the ratio of infected patients reconstructed as positive, is improved to be greater than the TP probability in the test by considering the confidence interval of the point estimate, which corresponds to the infection probability. We introduce a bootstrap method to construct the confidence interval.

(ii) In our framework, prevalence, which is the fraction of the infected patients in the population, is introduced into the prior distribution of the patients’ state. Prevalence is one of the fundamental measures in epidemiology, and group testing has been applied to its estimation [17, 18, 19]. We construct the estimator of prevalence in addition to the identification of the infected patients by combining an expectation-maximization (EM) method with a BP algorithm. Following the same procedure, the TP and FP probabilities of the test are also estimated.

(iii) As another approach, we introduce the hierarchical Bayes model to identify the infected patients and estimate the prevalence. We apply the BP algorithm to the hierarchical Bayes model and evaluate the performance in comparison with the approach described in (ii), and find that the computational cost of the hierarchical Bayes approach is lesser than that required in (ii).

The remainder of the paper is organized as follows. In Section II, we explain the problem setting of group testing and introduce Bayesian inference. In Section III, we introduce the BP algorithm and show its reconstruction performance under finite false probabilities. In Section IV, we propose an estimator based on the confidence interval of the estimated infection probability. In Section V, we discuss the estimation of the unknown parameters in group testing by using the likelihood calculated by the BP algorithm. In section VI, we introduce the hierarchical Bayes model for group testing and discuss the estimation of the prevalence by applying the BP algorithm to the hierarchical model. Section VII presents a summary and discussion, and supplementary comments are provided in Section VIII.

Ii Problem setting

Figure 1: Matrix representation of group testing, where each pool size is and each overlap is . The summation in the usual matrix product is replaced with a logical sum.

We consider a population of patients and groups on which the test is performed. Let us denote the state of -patients by , where and means that -th patient is infected and not infected, respectively. The grouping of the patients is determined by the pooling matrix , where and means that the -th patient is in the -th group and not, respectively. The true state of the -th group, denoted by , is given by

(1)

where denotes the logical sum of components. Namely, when -th pool contains at least one infected patient, the state of the -th pool is 1 (positive), and 0 (negative) otherwise.

The test returns a false result with finite probability. We assume that the errors in the tests are independent from each other and model the observation (result of the test) as

(2)

where is the probabilistic function whose behavior is given by [10]

(3)
(4)

Here, and correspond to the TP and FP probabilities in the test, respectively, and these values are common for all tests. Fig.1 shows matrix representation of the group testing, where we note that the summations in the usual matrix factorization are replaced with a logical sum. We focus on the case , where the number of tests is smaller than that of the patients. Hereafter, we consider that the size of group is fixed to . Further, the overlap, which is the number of groups that each patient belongs to, is fixed at ; hence, the relationship holds.

ii.1 Bayesian inference

From the property of , the generative model of is given by

(5)

where

(6)

and . The purpose is to infer the true states of patients from the observation .

In general, the true generative process of

is unknown, but it is reasonable to assume that the process is expressed by the conditional Bernoulli distribution eq.(

5), as the variables and are binary, although the value of true parameters and are not known in advance. Bayesian inference is a preferred method in the presence of the reasonable model. As prior distribution of the patient states, we use following distribution:

(7)

where is the prevalence, which is not known in advance. Following the Bayes rule, the posterior distribution is given by

(8)

where , and are the assumed TP probability, FP probability, and prevalence, respectively. The -th patient’s state is identified on the basis of the marginal distribution given by

(9)

where denotes components of other than . As the variable is binary, we can represent the marginal distribution using a Bernoulli probability as

(10)

and corresponds to the infection probability, namely, the probability that . The simplest estimate of is the maximum a posteriori (MAP) estimator given by

(11)

where is the indicator function whose value is 1 when is true, and 0 otherwise.

Iii Belief propagation

Figure 2: Factor graph representation of group testing for , , and .

The computation of the marginal distribution requires an exponential order of the sums, and thus is intractable. We approximately calculate the marginal distribution using the BP algorithm on the factor graph representation of the group testing [10, 20]. Fig.2 shows the factor graph representation of the group testing for , , and . Here, we denote and as the indices of the patients in the -th pool, and that of the pools in which the -th patient is included, respectively. The conditional probability depends on  (); hence, the posterior distribution can be expressed as a bipartite graph, as shown in Fig.2. For the edge that connects the -th factor (test) and the -th variable (patient), two kinds of messages are defined as

(12)
(13)

which correspond to posterior information and output information, respectively. Intuitively, the messages and represent marginal distributions of before and after the -th test is performed, respectively. As , these messages are represented by one parameter as

(14)
(15)

where and are given by

(16)
(17)

and

(18)
(19)
(20)
(21)

Using these messages, we can approximate the marginal distribution as

(22)

and thus the infection probability is approximated as

(23)

and the MAP estimator is given by

(24)
1: and
2:
3: initial values from
4: initial values from
5:
6:
7:for  do
8:     for all combinations of such that  do
9:         
10:         
11:         
12:         
13:     end for
14:end for
15:for  do
16:     
17:end for
Algorithm 1 BP for Bayesian Group Testing

First, we consider the case where we know the correct parameters; , and . The pseudocode of the BP algorithm for group testing with known parameters is shown in Algorithm 1. We check the performance of BP algorithm for randomly constructed pooling matrix under the constraint as and . The true state of patients is also randomly generated under the constraint that . The accuracy of the MAP estimator is measured by the TP rate and FP rate given by

(25)
(26)

respectively. A TP value larger than and an FP value smaller than indicates that the performance of the BP-based identification is better than the parallel test of -patients. Fig.3 shows -dependence of (a) TP and (b) FP at , , and , respectively. Each data point represents the averaged value with respect to 100 realizations of and . The horizontal lines in (a) and (b) indicate 0.95 and 0.02, which are the TP and FP probabilities of the test, respectively. As increases, that is, as the number of tests increases, TP increases and the region where TP is larger than extends. FP also has a smaller value than for small values of , and this success region extends as increases. For large , converges to because , and both TP and FP decrease to zero. A similar tendency is shown for different values of and . As an example, we show TP and FP for , , , and in Fig.4. The FP probability has large influence on TP, which is obvious from the comparison of Figs. 3 and 4. Intuitively, an increase in (or decrease in ) causes uncertainty of the identification and decreases ; hence, the MAP estimator tends to be zero.

Figure 3: -dependence of (a) TP and (b) FP at and for various . Error probabilities on the test are fixed at and . The horizontal lines in (a) and (b) represent and , respectively.
Figure 4: -dependence of (a) TP and (b) FP at and for various . Error rates are fixed at and . The horizontal line in (a) represents . Note that the FP region shown in (b) is below .

The dependence of TP on and is shown in Fig.5(a) at , (), and . The solid line indicates , and a TP value over the solid line means the reconstruction by the BP algorithm achieves a higher TP than the parallel test of -patients, and this situation is achieved at a sufficiently small FP probability in this parameter region. The reconstruction performance also depends on . Fig.5 (b) Shows the -dependence of TP at , , , and . TP increases as increases without increasing the number of tests.

Figure 5: (a) -dependence of TP at , and for different values of . (b) -dependence of TP at and . The solid line indicates .

In practical testing, one of the objective is to identify the infected patients in order to prevent spreading of the disease. Therefore, increasing TP is a priority issue, and we mainly focus on the improvement of TP using the BP algorithm.

iii.1 BP algorithm needs “decision threshold”

Before proceeding to the improvement of TP, we discuss the trivial fixed points of the BP algorithm and introduce the idea of “decision threshold” for the identification of infected patients.

From eq.(23), when for , we obtain irrespective of the value of . This situation arises when for at . Therefore, is achieved when the patients are estimated as negative before -th test is performed, where . This is the case in which the -th patient is trivially identified as positive. In other words, the BP algorithm does not return for general cases; hence, to determine the infected patients from an estimate at the BP fixed point , we need a “decision threshold” such as a MAP estimator, where is the threshold for determining the infected patients. TP and FP depend on this threshold, and our strategy for the improvement of TP is the appropriate choice of the threshold as discussed in next section.

The threshold at is expected to lead conservative result, but it is not appropriate for general value of . Following similar logic, is obtained when at least one of the components of among takes the value 0, which is achieved at and or and . The former case means that all patients belong to the -th test are negative when and . In the latter case, means because ; Hence, all patients belonging to the positive test are negative. In other words, is always larger than zero when is less than 1 or larger than 1. Therefore, all of the patients are judged as positive under the threshold at , which corresponds to .

Iv Improvement of true-positive rate considering fluctuation of the estimates

The estimated Bernoulli probability is a function of , and fluctuates depending on the probabilistic observation. The quantification of the credibility of helps in determining the infected patients under conditions of noisy observation data. The confidence interval is one of the guides in inference considering the input fluctuation [21]. For convenience, we introduce the following statistic:

(27)

which gives the MAP estimator as

(28)

Here, we assume that the generative model has the corresponding “true value” . Following the normal theory, the confidence interval of the true value of is constructed as

(29)

where is the quantile of the standard normal distribution, and

is the estimate of the standard error

. We resort to the nonparametric bootstrap method to estimate the standard error [22]. We generate bootstrap samples and as

(30)

where is the

-th row vector of

and is the empirical distribution of given and defined by

(31)

We perform the BP algorithm for every bootstrap sample, and denote the estimate under the -th bootstrap sample as . Using , the estimate of the standard error of is given by

(32)

where is the average over the bootstrap samples.

We define the bootstrap estimate of the -th patient’s state as

(33)

which means that patients whose confidence interval runs over the region are regarded as infected. In comparison with the MAP estimator eq.(28), the decision threshold over which the patients are estimated as infected is lower by . Further, when , always; Hence, eq.(33) can change the result of patients who are judged by the MAP estimator as non-infected. Fig.6 shows the (a) TP and (b) FP of the bootstrap estimate at , , , and , . We generate bootstrap samples for each set of sample , and each point is averaged over 100 samples. The MAP estimator cannot achieve a higher TP than for any , but the bootstrap estimator improves TP to be greater than . Meanwhile, FP of the bootstrap estimator is higher than that of the MAP estimator, which is caused by the reduced decision threshold compared with that of the MAP estimator. However, FP is smaller than for sufficiently small ; hence, we consider the bootstrap estimator as practicable. The situation is the same for other parameter region. As an example, we show TP and FP of bootstrap estimator at , , , , in Fig.7.

Figure 6: -dependence of (a) TP and (b) FP bootstrap estimator for , , , , .
Figure 7: -dependence of (a) TP and (b) FP bootstrap estimator for , , , , .
Figure 8: Examples of bootstrap distribution at , , , , and . The width of the bin is set at 0.5. The solid line and dashed lines represent and the confidence interval, respectively. (a) Bootstrap distribution of for an infected patient where and . (b) Bootstrap distribution of for a non-infected patient where and .

For the intuitive understanding of the bootstrap estimator, we show examples of the bootstrap distributions of in Fig.8 at , , , , and , where the solid line represents and the two dashed lines indicate the confidence interval. This histogram was obtained from 1000 bootstrap samples; note that the confidence interval eq.(29) is not that for the bootstrap distribution. Fig.8(a) shows the bootstrap distribution of an infected patient who is judged as non-infected by the MAP estimator and as infected by the bootstrap estimator. Fig.8(b) shows the same for a non-infected patient. The patients shown in Fig.8 (a) and (b) contribute to the increase of TP and FP of bootstrap estimate, respectively.

V Estimation of unknown parameters by expectation maximization

In this section, we consider the estimation of the unknown parameters: prevalence , TP probability , and FP probability . We construct their estimator by the maximum likelihood method, where the likelihood is given by

(34)

and the estimators are given by

(35)
(36)
(37)

An approximation of the log-likelihood is given by the BP algorithm as Bethe free entropy [20], defined as

(38)

where

(39)
(40)
(41)

and

(42)

We derive the maximum-likelihood estimator by the stationary condition of the Bethe free entropy [23]. After the calculation shown in the appendix, we obtain

(43)
(44)
(45)

where denotes the expectation of according to the posterior distribution with respect to the -th test defined by

(46)

and

(47)
(48)
(49)
(50)

Eqs. (44)–(45) always have trivial fixed points at and , and to avoid these solutions, we solve following expressions:

(51)
(52)

which are equivalent to eqs. (44)–(45). Note that the extremization conditions of eqs.(43)-(45) correspond to the Nishimori line [24, 25].

We calculate the infected probability of patients and the estimators of the unknown parameters by the expectation-maximization (EM) method; in the E-step, the fixed point of BP, and , is achieved by recursive updating under a fixed and ; in the M-step, these parameters are updated according to the extremization conditions of eqs. (43)-(45). We term this method the BP+EM algorithm, which is summarized in Algorithm 2.

Fig.9 show the comparison between estimated parameters and true parameters at , , and for (a) at and , (b) at and , and (c) at and . The gradient of the diagonal lines is 1; Hence, the point on this line indicates that accurate estimation of the unknown parameters is achieved. In all of the figures, parameters that are not shown in the figure are also estimated at the same time. For the whole parameter region, the M-step converges to the true parameter. TP rate and FP rate of the BP+EM algorithm are the same as those of BP algorithm where the parameters are known.

We note that the behavior of the BP+EM algorithm heavily depends on the initial condition of and . When the initial conditions of and are close to their true values, the BP+EM algorithm are stable; hence, the proposed method should be treated as a correction of the experimentally estimated values. Meanwhile, the estimation of is insensitive to the initial condition, which is smaller than .

Figure 9:

Comparison between true values and estimated values of hyperparameters. The gradient of the diagonal lines is 1, and system sizes are

, . (a) vs. plot at and . The parameters and are also estimated by the EM method. (b) vs. plot at , and , where and are also estimated by the EM method. (c) vs. plot at and , where and are also estimated by the EM method.
1: and
2:
3:Initialize:
4:      { initial value from initial value from initial value from
5:for  do
6:     for  do
7:         for all combinations of such that  do
8:              
9:              
10:              
11: