# Bayesian inference for asymptomatic COVID-19 infection rates

To strengthen inferences, meta-analyses are commonly used to summarize information from a set of independent studies. In some cases, though, the data may not satisfy the assumptions underlying the meta-analysis. Using three Bayesian methods that have a more general structure than the common meta-analytic ones, we can show the extent and nature of the pooling that is justified statistically. In this paper, we re-analyze data from several reviews whose objective is to make inference about the COVID-19 asymptomatic infection rate. When it is unlikely that all of the true effect sizes come from a single source, researchers should be cautious about pooling the data from all of the studies. Our findings and methodology are applicable to other COVID-19 outcome variables, and more generally.


## 1 Introduction

Meta-analyses are commonly used to summarize information from a set of independent experiments, observational studies or sample surveys. Doing this may strengthen inferences when there are deficiencies in the individual studies such as small sample sizes. Methodology for combining findings from repeated research studies has a long history and, in particular, meta-analyses have become very popular over the past thirty years. From an online search for ‘books meta-analysis’ we found forty-nine books. Thus, it was natural that early in 2020 several meta-analyses were conducted (and subsequently published) about infection rates from the novel coronavirus. Looking at several early review papers we were concerned whether the meta-analyses were carried out in an appropriate manner. Even after careful evaluation to include only studies thought to be comparable, there may be subsets of the collection of studies where the true (subset) effects are very different. If this is so, pooling the data from all of the studies may result in misleading conclusions. Borenstein et al. (2010) add: “If the variation is substantial, then we might want to shift our focus …. Rather it should be on the fact that the … effect differs from study to study. Hopefully, it would be possible to identify reasons … that might explain the dispersion.”

In this paper we consider three Bayesian methods that have a more general structure. One can use these methods to check the validity of the more standard approaches by investigating whether the set of true effect sizes comes from a common source. If the assumptions underlying the standard approaches are not met, our proposed methodology will lead to more appropriate inferences.

From five review papers we selected for further analysis several studies that have different features. In each of these cases the objective is to make inference about the asymptomatic infection rate.

Please note that we are not evaluating specific meta-analytic methods. Our concern is about appropriate aggregation of possibly disparate data.

Buitrago-Garcia et al. (2020) explain the importance of a review: “Accurate estimates of the proportions of true asymptomatic and presymptomatic infections are needed urgently because their contribution to overall SARS-CoV-2 transmission at the population level will determine the appropriate balance of control measures. If the predominant route of transmission is from people who have symptoms, then strategies should focus on testing, followed by isolation of infected individuals and quarantine of their contacts. If, however, most transmission is from people without symptoms, social distancing measures that reduce contact with people who might be infectious should be prioritized, enhanced by active case-finding through testing of asymptomatic people.” Referring to a narrative review report (Oran and Topol, 2020) that presents a range (over studies) of 6% to 96% for the proportion of individuals positive for SARS-CoV-2 but asymptomatic, the authors point out the need for a careful review.

Standard meta-analyses typically assume that the true effect sizes come from a common source. Even after including only those studies thought to be comparable, the set of true effect sizes may be composed of distinct subsets, each with a different underlying distribution. This seems likely for some of the reviews, e.g., the seventy-nine rates in Buitrago-Garcia et al. (2020) that range from 0.01 to 0.92. To make appropriate inferences the three Bayesian methods have a more general structure than that assumed in a standard meta-analysis. The principal method, termed uncertain pooling, is flexible in that it can identify distinct subsets of the true effect sizes, each subset having its own underlying true effect. When such subsets exist, pooling the data from all of the studies may lead to misleading inferences. This methodology will also indicate when true effect sizes have a common source, thus leading to an appropriate inference. The more general structure should ensure greater concordance of the data with our model than with a more restricted model. A better fitting model should lead to better inference. Specifically, only similar studies will be combined. It is not surprising that there is strong statistical evidence that in three of the four data sets that we analyze (Section 4) the true effect sizes do not come from a single source. In such cases the analyst should be cautious about combining the data from all of the studies.

For a general discussion of Bayesian methods for meta-analysis see Schmid, Carlin and Welton (2021). Borenstein et al. (2010) is a basic treatment of fixed and random effects models for a meta-analysis while Rice, Higgins and Lumley (2018) re-evaluate fixed effect(s) meta-analysis.

Section 2 has brief descriptions of the data sets that we analyze, together with background information. The methodology is introduced in Section 3 while the results are summarized in Section 4. Section 5 has a brief summary and an extension that accommodates study level covariates with notes about the availability of covariates in the reviews we investigate.

## 2 Study Descriptions

This section has brief descriptions of the meta-analyses that we have analyzed together with some background information. The definitions of asymptomatic infection rate and the conditions required for including individual studies in the meta-analysis differ considerably, and are too detailed to present all of them in this paper.

The first paper that we considered was by He, Yi and Zhu (2020). Using data from six studies they obtain estimates of the asymptomatic infection rate, noting that these measures differ considerably over the six studies, and explaining that this may be due to “differences in data collection, sample size, and the conditions.” Since the information from one of the six studies is inconsistent with that in the other five studies we include only the latter in our analyses. As seen in Table 1 the sample proportions range from 0.22 to 0.78 with little or no clustering.

The first meta-analysis concerning only the asymptomatic coronavirus disease rate is He et al. (2021). In their Section 1 the authors note the importance of studying the asymptomatic rate and that this rate is not “well characterized.” They conduct meta-analyses for all 41 studies and five subsets. The rates are markedly heterogeneous (proportions from 0.02 to 0.75 and total numbers of cases from 4 to 44,627), suggesting concern about the aggregation. Our analysis is for the children subgroup, with eleven studies. As seen in Table 2, the sample proportions range from 0.11 to 0.57 while the SEs range from 0.01 to 0.16. Unlike the first meta-analysis, there is some apparent clustering.

The third data set is a subset of six of the eleven studies in He et al. (2021). These six studies were chosen to illustrate properties when there is considerable separation. From Table 3, there are two apparent clusters with cluster proportions of about 0.13 and 0.55; overall, the SEs range from 0.01 to 0.16.

Buitrago-Garcia et al. (2020) has several important features, i.e., the authors consider rates associated with both asymptomatic and presymptomatic cases and only include studies that document follow-up and symptom status at the beginning and end of follow-up or modeling. Their meta-analyses are based on seventy-nine studies, summarized for the entire set and seven subsets. The seventy-nine rates are markedly heterogeneous (proportions from 0.01 to 0.92 and total numbers of cases from 2 to 1012). Our analysis is for the screening subgroup with seven studies, noted by Buitrago-Garcia et al. (2020) as being of special interest. Here, the proportions range from 0.17 to 0.50 while the SEs range from 0.04 to 0.35.

For additional background we give the criteria that He et al. (2021) used to select the studies that they used in their meta-analyses. Then we summarize features of the eleven studies of children, from He et al. (2021), that we have analyzed. He et al. (2021) searched two databases, PubMed and Embase, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline. They included the following items: “COVID-19” and analogous phrases and “Asymptomatic.” They included “articles reporting a specific number of asymptomatic infection cases in confirmed COVID-19 patients, information describing the epidemiological and clinical features of COVID-19.” There is no evidence of a risk-of-bias assessment or consideration of a sufficient follow-up period. Byambasuren et al. (2020) identify these as characteristics essential when making decisions about which studies to include in a meta-analysis. Since He et al. (2021) is the first meta-analysis to make inferences about the asymptomatic infection rate, one may conjecture that they were motivated to publish their results quickly.

Of the eleven studies of children five papers are published in Chinese, so there are only summaries in English. Only three of the eleven papers give the age distribution of the children, although most give the mean age and some give the range. Seven of the papers give the sex distribution. In five studies all cases were associated with a single hospital while four studies summarized the results from many hospitals. There was no information for two studies. Several papers noted that most of the patients had a history of close contact with adults with COVID-19.

## 3 Methodology

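A common assumption in situations where combining data is plausible is: For $i = 1, \dots, L$ the $\bar{Y}_i$ are independent with

$$\bar{Y}_i \sim N(\mu_i, \sigma_i^2/n_i) \tag{1}$$

where $\mu = (\mu_1, \dots, \mu_L)$ is the vector of true effect sizes, the $\sigma_i^2$ are known, $L$ is the number of studies and $n_i$ is the number of replicates in study $i$. Note that all of the analyses of COVID-19 data that we consider make assumptions like (1).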

### 3.1 Uncertain pooling

The first method, uncertain pooling, is based on Malec and Sedransk (1992) and Evans and Sedransk (2001). Since this method may be unfamiliar, we describe it in some detail. They showed that a prior for $\mu$ can be selected to reflect the beliefs that there are subsets of $\mu$ such that the $\mu_i$ in each subset are “similar”, and that there is uncertainty about the composition of such subsets of $\mu$. Let $G$ be the total number of partitions of the set $\{1, \dots, L\}$. Denote a particular partition by $g$, let $d(g)$ denote the number of subsets in the $g$th partition, and let $S_k(g)$ denote the set of study labels in subset $k$ for $k = 1, \dots, d(g)$. For our analyses $L = 5, 6, 7$ and $11$ with $G = 52, 203, 877$ and $678570$, respectively. For other values of $L$ the total number of partitions of an $L$-element set is given by the Bell number, $B_L$. Recent work (e.g., Dahl, Day and Tsai, 2017) proposed using prior information to place a prior on the set of partitions. This will increase the complexity of the computations but avoid the need to consider the partitions explicitly.
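To make the size of the partition space concrete, the following is a minimal sketch (not the authors' code) that enumerates all partitions of a small set of study labels and computes Bell numbers with the Bell triangle. The function names `set_partitions` and `bell_number` are illustrative choices, not taken from any of the cited papers.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing block in turn ...
        for k in range(len(partition)):
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition


def bell_number(n):
    """Number of partitions of an n-element set, via the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]          # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


# Usage: all partitions of five study labels, and the count for eleven studies.
partitions_of_five = list(set_partitions(range(1, 6)))
n_partitions_eleven = bell_number(11)
```

Explicit enumeration is feasible for five to seven studies; for eleven studies the list is large, which is one reason the grid in Section 4 is reduced before sampling.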

To specify a prior for $\mu$ first condition on $g$. Malec and Sedransk (1992) and Evans and Sedransk (2001) assume that there is independence between subsets, and within $S_k(g)$ the $\mu_i$ are independent with

$$\mu_i \mid \nu_k(g) \sim N(\nu_k(g), \delta_k^2(g)), \quad i \in S_k(g). \tag{2}$$

Also, the $\nu_k(g)$ are mutually independent with

$$\nu_k(g) \mid \theta_k(g) \sim N(\theta_k(g), \gamma_k^2(g)).$$

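Conditioning on the variances above (but suppressing them in our notation), and letting $\gamma_k^2(g) \to \infty$ leads to the following expected results. Defining $\Delta$ as the collection of the $\delta_k^2(g)$ and letting $y$ denote the observed data,

$$E(\mu_i \mid y, g, \Delta) = \{\lambda_i(g)\}\hat{\mu}_i + \{1 - \lambda_i(g)\}\hat{\mu}_k(g), \quad i \in S_k(g) \tag{3}$$

and

$$\operatorname{cov}(\mu_i, \mu_t \mid y, g, \Delta) = \begin{cases} \delta_k^2(g)\{1 - \lambda_i(g)\} + \dfrac{\{1 - \lambda_i(g)\}^2 \delta_k^2(g)}{\sum_{i \in S_k(g)} \lambda_i(g)}, & i = t;\ i, t \in S_k(g) \\[2ex] \dfrac{\{1 - \lambda_i(g)\}\{1 - \lambda_t(g)\}\delta_k^2(g)}{\sum_{i \in S_k(g)} \lambda_i(g)}, & i \neq t;\ i, t \in S_k(g) \\[2ex] 0, & i \in S_{k_1}(g),\ t \in S_{k_2}(g),\ k_1 \neq k_2, \end{cases} \tag{4}$$

where

$$\lambda_i(g) = \frac{\delta_k^2(g)}{\delta_k^2(g) + \sigma_i^2/n_i}, \quad \hat{\mu}_k(g) = \frac{\sum_{t \in S_k(g)} \lambda_t(g)\hat{\mu}_t}{\sum_{t \in S_k(g)} \lambda_t(g)}, \quad \text{and} \quad \hat{\mu}_i = \bar{Y}_i. \tag{5}$$

The posterior distribution of $\mu$ is

$$f(\mu \mid y) = \int f(\mu \mid y, g, \Delta) f(g, \Delta \mid y)\, dg\, d\Delta \tag{6}$$

where the notation is simplified by using integration rather than summation for $g$.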
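The conditional moments in (3) to (5) can be sketched numerically as follows. This is an illustration under stated assumptions rather than the authors' implementation: it assumes a common variance `delta2` for every subset (the simplification the paper adopts later), takes the known sampling variances $\sigma_i^2/n_i$ as the array `s2`, and uses 0-based study indices. The function name `conditional_moments` is hypothetical.

```python
import numpy as np


def conditional_moments(ybar, s2, partition, delta2):
    """Posterior mean and covariance of mu given a partition and delta^2.

    ybar      : study estimates (the Ybar_i)
    s2        : known sampling variances sigma_i^2 / n_i
    partition : list of blocks, each a list of 0-based study indices
    delta2    : within-subset variance, assumed common to all subsets
    """
    ybar = np.asarray(ybar, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    n_studies = len(ybar)
    lam = delta2 / (delta2 + s2)                      # lambda_i in (5)
    mean = np.empty(n_studies)
    cov = np.zeros((n_studies, n_studies))            # zero across subsets, as in (4)
    for block in partition:
        idx = np.array(block)
        lam_b = lam[idx]
        pooled = np.sum(lam_b * ybar[idx]) / np.sum(lam_b)   # mu_hat_k in (5)
        mean[idx] = lam_b * ybar[idx] + (1.0 - lam_b) * pooled   # equation (3)
        shrink = 1.0 - lam_b
        block_cov = np.outer(shrink, shrink) * delta2 / np.sum(lam_b)
        block_cov[np.diag_indices(len(idx))] += delta2 * shrink  # extra diagonal term in (4)
        cov[np.ix_(idx, idx)] = block_cov
    return mean, cov
```

A useful sanity check is that a study placed in a subset by itself is not shrunk at all: its conditional mean is its own estimate and its conditional variance reduces to $\sigma_i^2/n_i$.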

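To evaluate (6) we need $f(g, \Delta \mid y)$. One must be careful about specifying how the $\gamma_k^2(g) \to \infty$ because the models corresponding to the partitions have different numbers of parameters. We use a method described in Section 3 of Evans and Sedransk (2001) that postulates little prior, relative to sample, information about the $\nu_k(g)$ and is invariant to changes in the scale of $Y$. The limit is taken subject to a constraint that holds the Kullback-Leibler information constant; see Evans and Sedransk (2001) for the exact definitions. With prior $f(g) f(\Delta)$ this gives

$$f(g, \Delta \mid y) \propto f(\Delta) f(g) \exp\left\{-\frac{d(g)}{2}\right\} \prod_{k=1}^{d(g)} \prod_{i \in S_k(g)} \{1 - \lambda_i(g)\}^{1/2} \times \exp\left[-\frac{1}{2} \sum_{k=1}^{d(g)} \sum_{i \in S_k(g)} \left\{\frac{\lambda_i(g)}{\delta_k^2(g)}\right\} \{\hat{\mu}_i - \hat{\mu}_k(g)\}^2\right]. \tag{7}$$

The factor in the exponent,

$$Q\{d(g)\} = \sum_{k=1}^{d(g)} \sum_{i \in S_k(g)} \left\{\frac{\lambda_i(g)}{\delta_k^2(g)}\right\} \{\hat{\mu}_i - \hat{\mu}_k(g)\}^2,$$

a consequence of the limit process just described, is the usual within sum of squares from a conventional, weighted, analysis of variance. Now, $Q\{d(g)\}$ is likely to decrease as $d(g)$ increases, for example for a new partition obtained by creating subsets of the existing $S_k(g)$. Since $\exp[-Q\{d(g)\}/2]$ increases as $Q\{d(g)\}$ decreases, it is helpful to have the second term, $\exp\{-d(g)/2\}$, that penalizes partitions with larger values of $d(g)$.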
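The trade-off between the within sum of squares and the penalty on the number of subsets can be seen in a small numerical sketch of the right side of (7). The code below is an illustration, not the authors' implementation: it assumes a common `delta2` across subsets, folds the log prior for the partition and for the variance into a single `log_prior` argument, and uses made-up inputs in the usage lines. The name `log_weight` is hypothetical.

```python
import numpy as np


def log_weight(ybar, s2, partition, delta2, log_prior=0.0):
    """Log of the right side of (7), up to an additive constant.

    ybar, s2  : study estimates and known sampling variances sigma_i^2 / n_i
    partition : list of blocks of 0-based study indices
    delta2    : within-subset variance, assumed common to all subsets
    log_prior : log f(g) + log f(Delta), supplied by the caller
    """
    ybar = np.asarray(ybar, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    lam = delta2 / (delta2 + s2)
    total = log_prior - 0.5 * len(partition)          # penalty exp{-d(g)/2}
    for block in partition:
        idx = np.array(block)
        lam_b = lam[idx]
        pooled = np.sum(lam_b * ybar[idx]) / np.sum(lam_b)
        total += 0.5 * np.sum(np.log1p(-lam_b))       # product of (1 - lambda_i)^(1/2)
        total -= 0.5 * np.sum(lam_b / delta2 * (ybar[idx] - pooled) ** 2)  # -Q/2
    return total


# Usage with made-up numbers (two well-separated pairs of studies).
y_demo = [0.10, 0.12, 0.90, 0.92]
s2_demo = [0.001] * 4
w_pool_all = log_weight(y_demo, s2_demo, [[0, 1, 2, 3]], 0.001)
w_two_sets = log_weight(y_demo, s2_demo, [[0, 1], [2, 3]], 0.001)
w_singletons = log_weight(y_demo, s2_demo, [[0], [1], [2], [3]], 0.001)
```

In this toy case the two-subset partition receives the largest weight: pooling everything inflates the within sum of squares, while splitting into singletons drives it to zero but pays the penalty for four subsets.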

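For our analysis we take $\delta_k^2(g) = \delta^2$ and write $\Delta = \delta^2$. Inference for $\mu$ is made using (6) and (7) with

$$\mu \mid y, g, \delta^2 \sim N\left(E(\mu \mid y, g, \delta^2), V(\mu \mid y, g, \delta^2)\right) \tag{8}$$

where the conditional posterior moments of $\mu$ are given in (3) and (4).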

Our analyses will indicate whether the true effect sizes come from a common source. If so, then using a standard meta-analysis will provide appropriate inference. If not, several alternatives should be considered, as discussed below.

If a prior evaluation indicates that one of the studies, $i$, can be regarded as a gold standard we can consider the posterior distribution corresponding to study $i$ to be the object of inference. Then, using the posterior expected value for illustration,

$$E(\mu_i \mid y) = E_{\delta^2, g \mid y}\, E(\mu_i \mid y, g, \delta^2)$$

where $E(\mu_i \mid y, g, \delta^2)$ is defined in (3). Thus, inference for $\mu_i$ is a function of $\bar{Y}_i$ together with data from the other studies as determined by the form of (3), and, critically, by the likelihood associated with the set of subsets, $\{S_k(g)\}$, containing study $i$. See Evans and Sedransk (2001) for additional details and an application to a notable study of the effect of using aspirin by patients following a myocardial infarction.

Otherwise, one must rely on substantive evaluation to decide whether any distinct subsets identified in the analysis should be analyzed separately, e.g., by separate standard meta-analyses. If there are no covariates that can distinguish the subsets, then the distribution of the true effect for study $i$ is a mixture distribution with unknown probabilities associated with the components.

If there are distinct subsets and a single analysis is presented it is important to include a credible interval for the overall true effect. Taken together with the presence of distinct subsets, a very wide interval for the overall true effect would be a strong indication that the single summary rate is not informative.

### 3.2 Dirichlet process mixture

An alternative to the uncertain pooling method is to use a Dirichlet process mixture (DPM), one of the most popular nonparametric Bayesian methods. This methodology, presented in detail in Sections 2.1 and 2.2 of Müller et al. (2015), is outlined below, paraphrasing the text in Müller et al. (2015). Suppose that there is an observed i.i.d. sample

$$y_i \mid H \overset{iid}{\sim} H, \quad i = 1, \dots, n.$$

For Bayesian inference a prior probability model on $H$ is needed. Starting with the basic Dirichlet process (DP), let $M > 0$ and $H_0$, the base distribution, be a probability measure defined on the sample space. A DP with parameters $(M, H_0)$ is a random probability measure $H$ defined on the same space which assigns probability $H(B)$ to every measurable set $B$. The DP is an infinite-dimensional analogue of the finite-dimensional Dirichlet prior. In particular, $E\{H(B)\} = H_0(B)$. If $M$, the precision parameter, is large, $H$ is highly concentrated around $H_0$; as $M \to 0$ the process is essentially a single point mass.

With a DP prior, $H$ is a discrete measure, so it is typical to extend the DP to a DPM by using a mixture over a simple parametric form such as a pdf. Let $\Theta$ be a finite dimensional parameter space. For each $\theta \in \Theta$, let $f_\theta$ be a continuous pdf. Given a probability distribution $H$ defined on $\Theta$, a mixture of $f_\theta$ with respect to $H$ has the pdf

$$f_H(y) = \int f_\theta(y)\, dH(\theta).$$

This mixture model can be expressed as an equivalent hierarchical model, especially relevant for our application, i.e.,

$$y_i \mid \theta_i \overset{ind}{\sim} f_{\theta_i} \tag{9}$$

$$\theta_i \mid H \overset{iid}{\sim} H \tag{10}$$

with $H \sim DP(M, H_0)$.
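The role of the precision parameter in (9) and (10) can be illustrated with the standard Chinese restaurant process representation of the clustering induced by a Dirichlet process. This representation is not described in the text above and is used here only as a sketch; the function names are hypothetical. Under it, the prior expected number of distinct clusters among $n$ draws is $\sum_{i=0}^{n-1} M/(M+i)$.

```python
import numpy as np


def expected_clusters(M, n):
    """Prior expected number of distinct clusters among n draws, precision M."""
    return sum(M / (M + i) for i in range(n))


def sample_crp_labels(M, n, rng):
    """Draw cluster labels for n units from a Chinese restaurant process."""
    labels = [0]
    counts = [1]                     # current cluster sizes
    for i in range(1, n):
        # Join an existing cluster with probability proportional to its size,
        # or open a new cluster with probability proportional to M.
        probs = np.array(counts + [M], dtype=float) / (i + M)
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels


# Usage: five studies with a small and a moderate precision parameter.
few = expected_clusters(0.2, 5)
more = expected_clusters(5.0, 5)
labels = sample_crp_labels(0.2, 5, np.random.default_rng(0))
```

With five units, a precision of 1/5 gives a prior expectation of fewer than two clusters, while a precision of 5 gives between three and four, so small values favor heavy clustering of the true effects.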

For our analyses we have used the R function DPmeta from the package DPpackage: see Jara et al. (2011) for additional details.

The model in DPmeta, for $M$ fixed, is (9) and (10) with $f_{\theta_i}$ the pdf of a $N(\theta_i, \sigma_i^2/n_i)$ random variable, and a $N(\eta, \tau^2)$ base distribution.

The (independent) hyperparameters are

$$M \mid a_0, b_0 \sim \text{Gamma}(a_0, b_0)$$

$$\eta \mid \eta_b, S_b \sim N(\eta_b, S_b)$$

$$\tau^{-2} \mid \phi_1, \phi_2 \sim \text{Gamma}(\phi_1/2, \phi_2/2).$$

The uncertain pooling method requires only that one specify a prior distribution for $g$ and $\delta^2$. By contrast DPmeta requires substantial prior input, i.e., values for $a_0$, $b_0$, $\eta_b$, $S_b$, $\phi_1$ and $\phi_2$. We have concerns about the sensitivity of inferences to some of the choices of distributions in DPmeta and possible over-fitting since there are more quantities to be specified than in the uncertain pooling method. Moreover, our analyses include data from only five, six, seven and eleven studies. So, we have omitted the Gamma specification for $M$ and made inferences for a selected set of values of $M$ as suggested by Escobar (1994). Also, we omitted the hyperprior step for $\eta$ and $\tau^{-2}$, and replaced $\eta$ and $\tau^2$ with their maximum likelihood estimates.

### 3.3 Binomial-beta model and reversible jump Markov chain Monte Carlo

As noted by a reviewer, a limitation of our uncertain pooling method is that the sample standard errors are assumed to be known, as is typically done. There is contemporary research (Yao et al., 2021) that models both the sample mean and the log of the sample standard error. However, they assume a bivariate normal distribution for these two statistics, a questionable assumption for our (binomial) case. An alternative that we have investigated assumes a binomial likelihood together with beta and uniform prior distributions in a hierarchical model. The limitation, here, is that one must use Reversible Jump Markov Chain Monte Carlo (RJMCMC) for the computations, and implementation is substantially more difficult than the method presented in Section 3.1.

Recall that the methodology used in Section 3.1 is based on Evans and Sedransk (2001) who use a constraint in the limit process to overcome the problem that the partitions, $g$, have different sizes, i.e., that $d(g)$ varies with $g$. Without this adjustment the inference would not be invariant to changes in the scale of the outcome variable, $Y$. RJMCMC addresses the problem of varying $d(g)$ by introducing additional random variables that enable the matching of parameter space dimensions across the partitions. For a concise outline of the RJMCMC method see Gelman et al. (2004). The pioneering paper (Green, 1995) provides the theoretical background for RJMCMC, but also includes, as an example, an application to the uncertain pooling methodology. We have used this model and RJMCMC to analyze the data in the four tables in Section 4.

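Assume independent responses, $y_i$, with $y_i \sim \text{bin}(n_i, \theta_i)$. Within $S_k(g)$,

$$\theta_i \overset{iid}{\sim} \text{beta}(q\alpha_k, q(1 - \alpha_k)), \quad i \in S_k(g);\ k = 1, \dots, d(g). \tag{11}$$

The group mean parameters, $\alpha_k$, are drawn independently from $U(0,1)$ while $\log q$ is uniformly distributed on a bounded interval.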

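Then the joint distribution of all the variables is $p^*$, with

$$\begin{aligned} p^* &= p(g)\, p(\alpha, q \mid g)\, p(\theta \mid g, \alpha, q)\, p(y \mid g, \alpha, q, \theta) \\ &= p(g)\, p(\alpha \mid g)\, p(q)\, p(\theta \mid g, \alpha, q)\, p(y \mid \theta) \\ &= p(g) \left[\prod_{k=1}^{d(g)} 1\right] p(q) \prod_{k=1}^{d(g)} \prod_{i \in S_k(g)} \frac{\theta_i^{q\alpha_k - 1}(1 - \theta_i)^{q(1 - \alpha_k) - 1}}{B(q\alpha_k, q(1 - \alpha_k))} \prod_{i=1}^{L} \binom{n_i}{y_i} \theta_i^{y_i}(1 - \theta_i)^{n_i - y_i}, \end{aligned}$$

where the $1$ in the last expression is the pdf of the (assumed) uniform distribution for $\alpha_k$.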
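For readers who want to check the last line of the joint distribution numerically, the following is a minimal sketch of its logarithm. It is not the authors' RJMCMC code: the prior terms for the partition and for $q$ are passed in as the hypothetical arguments `log_p_g` and `log_p_q`, and study indices are 0-based.

```python
from math import lgamma, log


def log_beta_fn(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_binom_coef(n, y):
    """Log of the binomial coefficient n choose y."""
    return lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)


def log_joint(partition, alpha, q, theta, y, n, log_p_g=0.0, log_p_q=0.0):
    """Log of p* for the binomial-beta model.

    partition : list of blocks of 0-based study indices
    alpha     : group means, one per block, each in (0, 1)
    q         : common precision parameter of the beta distributions
    theta     : study-level proportions, each in (0, 1)
    y, n      : asymptomatic counts and numbers of observations
    """
    total = log_p_g + log_p_q        # uniform alpha_k contribute log(1) = 0
    for k, block in enumerate(partition):
        a, b = q * alpha[k], q * (1.0 - alpha[k])
        for i in block:
            # beta density of theta_i within its group, equation (11)
            total += (a - 1.0) * log(theta[i]) + (b - 1.0) * log(1.0 - theta[i])
            total -= log_beta_fn(a, b)
            # binomial likelihood of y_i
            total += log_binom_coef(n[i], y[i])
            total += y[i] * log(theta[i]) + (n[i] - y[i]) * log(1.0 - theta[i])
    return total
```

Working on the log scale avoids underflow when many studies are multiplied together, which matters for the acceptance ratios in any sampler built on this density.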

Section 6.2 of Green (1995) gives the full conditionals for the parameters within a given partition. The step involving a possible move from partition $g$ to a new partition is much more complicated. Green (1995) uses a process that jumps between partitions making only the changes of splitting a group (a birth) and combining two groups (a death). There is an algorithm to select the groups to split and merge. Births and deaths are then attempted with specified probabilities.

Jumping to a new partition requires a change in the vector $\alpha$ since its length must increase or decrease by one unit. Several steps are required to develop the associated proposal, finally leading to a complicated acceptance probability.

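For the pool-all partition Hamza (2008) suggest

$$y_i \overset{ind}{\sim} \text{bin}(n_i, \theta_i)$$

with $\eta_i = \log\{\theta_i/(1 - \theta_i)\}$ and

$$\eta_i \overset{iid}{\sim} N(\xi, \kappa^2).$$

As an alternative it may be feasible to adapt this to the specification in (11), and implement it using RJMCMC.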

## 4 Results and Discussion

Our inferences are for the asymptomatic rates, i.e., population proportions, and Tables 1-4 use this representation. That is, in Tables 1-4 we present the sample proportions and the standard errors (SEs). The posterior means and intervals from basic uncertain pooling (Section 3.1) and summaries from DPmeta (Section 3.2) and RJMCMC (Section 3.3) are for the population proportion. However, for the computations for basic uncertain pooling and DPmeta we have used two different representations of $\bar{Y}_i$ in (1), i.e., $\bar{Y}_i$ is the sample proportion, $\hat{p}_i$, and $\bar{Y}_i$ is the sample log odds, $\log\{\hat{p}_i/(1 - \hat{p}_i)\}$. While we only present results for the latter representation, our findings are similar for both choices. From (1), $(\bar{Y}_i, \mu_i, \sigma_i^2/n_i)$ is replaced by

$$\left(\log[\hat{p}_i/(1 - \hat{p}_i)],\ \log[p_i/(1 - p_i)],\ 1/[n_i p_i(1 - p_i)]\right). \tag{12}$$

As is typically done in applications such as this, e.g., DerSimonian and Laird (1986), we have replaced $p_i$ in the variance with an estimate from the sample. For each of the meta-analyses we give in the Appendix the number of asymptomatic cases and observations for each of the component studies together with references where these data can be found.
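The log odds representation in (12), with the sample proportion plugged into the variance, can be sketched as below. The function names are illustrative, and the counts in the test are made up rather than taken from the Appendix. The sketch assumes every study has at least one asymptomatic and one symptomatic case, since the log odds is undefined otherwise.

```python
import numpy as np


def logit_summary(cases, n):
    """Sample log odds and plug-in variance 1 / [n p_hat (1 - p_hat)], as in (12)."""
    cases = np.asarray(cases, dtype=float)
    n = np.asarray(n, dtype=float)
    p_hat = cases / n
    log_odds = np.log(p_hat / (1.0 - p_hat))
    variance = 1.0 / (n * p_hat * (1.0 - p_hat))
    return log_odds, variance


def inv_logit(x):
    """Map a log odds value back to a proportion."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))
```

Posterior draws produced on the log odds scale can be passed through `inv_logit` to report means and intervals for the population proportion.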

For uncertain pooling inference for $\mu$ is made using (6). To start, evaluate the right side of (7) for

$$\{g : g = 1, \dots, G;\ D \text{ grid points for } \delta^2\},$$

then standardize by dividing the individual terms in the grid by their sum. This provides an approximation for $f(g, \delta^2 \mid y)$. Then select a random sample from the normalized values of $f(g, \delta^2 \mid y)$. For each selection, sample $\mu$ from $f(\mu \mid y, g, \delta^2)$ in (8). For the logit case, transform $\mu_i$ to $p_i$ at each step. Starting with a grid for $g$ and $\delta^2$ with a very large range for $\delta^2$, we reduced the space to make the 2D-grid sampler faster. Specifically, we retained 99.2% of the probability associated with the extensive grid. The number of generated values of $\mu$ differed between the eleven study case and the other cases. Finally, note that approximations for the marginal posterior distributions, i.e., $f(g \mid y)$ and $f(\delta^2 \mid y)$, can be obtained directly from the grid approximation of $f(g, \delta^2 \mid y)$.
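The normalize, reduce and sample steps of the grid procedure can be sketched as follows, taking as input an array of unnormalized log weights with one row per partition and one column per grid point for the variance. How the authors reduced the grid is not stated; dropping the lowest-probability cells until the retained mass reaches the target is an assumption made here for illustration, and all function names are hypothetical.

```python
import numpy as np


def normalize_grid(log_w):
    """Turn unnormalized log weights on the grid into probabilities."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())      # subtract the max for numerical stability
    return w / w.sum()


def trim_grid(prob, keep=0.992):
    """Keep the highest-probability cells holding at least `keep` mass; renormalize."""
    flat = prob.ravel()
    order = np.argsort(flat)[::-1]       # cells from most to least probable
    cum = np.cumsum(flat[order])
    n_keep = int(np.searchsorted(cum, keep)) + 1
    trimmed = np.zeros_like(flat)
    trimmed[order[:n_keep]] = flat[order[:n_keep]]
    return (trimmed / trimmed.sum()).reshape(prob.shape)


def sample_cells(prob, size, rng):
    """Sample (partition index, variance index) pairs from the grid."""
    flat_idx = rng.choice(prob.size, size=size, p=prob.ravel())
    return np.unravel_index(flat_idx, prob.shape)


def marginals(prob):
    """Marginal grid approximations: over partitions and over the variance grid."""
    return prob.sum(axis=1), prob.sum(axis=0)
```

Each sampled cell identifies a partition and a variance value, from which a draw of the true effects can be generated with the normal distribution in (8).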

As noted in Section 1, our study was motivated by a meta-analysis by He, Yi and Zhu (2020) that included early studies of the COVID-19 asymptomatic infection rate. He, Yi and Zhu (2020) carried out a standard meta-analysis using a normal-based random effects model with the effect sizes as the outcome random variable and assuming fixed SEs. Since there is considerable variation in the effect sizes it is prudent to be cautious and investigate whether the true effect sizes are from a single source.

Our analysis starts with the basic uncertain pooling method (Section 3.1) and DPmeta (Section 3.2). We summarize the results using the binomial-beta model and RJMCMC (Section 3.3) at the end of this section. For the basic uncertain pooling method we assumed, a priori, that all 37 partitions have equal probability. For the prior on $\delta^2$, independent of $g$, we used two distributions, (a) and (b).

For a standard random effects model, Gelman (2006) recommends using a half-Cauchy prior distribution for $\delta$. While our situation is very different, i.e., many partitions and weighting depending on $\delta^2$ (not $\delta$), we have adopted this suggestion in (a) by transforming the half-Cauchy pdf to obtain the Inverse Beta (InvBeta) pdf of $\delta^2$. Previous research has shown that there are benefits to having the prior distribution for $\delta^2$ concentrated near 0, leading to the choice in (b). In the following we present results only for (a) as those for (b) are similar.

As described in Section 3.2, a complete specification of DPmeta requires estimates of many hyperparameters. This seems inappropriate for meta-analyses such as these. So, we have replaced $\eta$ and $\tau^2$ with their maximum likelihood estimates, thus eliminating the need to specify the other hyperparameters. We have adopted the suggestion of Escobar (1994) to use a small set of values for $M$, typically augmented by one much smaller value and one larger value.

We start by discussing the meta-analysis that motivated our investigation, i.e., He, Yi and Zhu (2020). We analyze the data (sample proportions, numbers of observations) from five of their studies, omitting one (Mizumoto et al., 2020) whose estimates are based on extensive modelling. As such, Buitrago-Garcia et al. (2020) “considered … [this] study … separately, because of the different method of analysis,” and He et al. (2021) excluded it entirely. Note that the numbers of asymptomatic cases and observations are included in the Appendix.

In Table 1 we present for each of the five studies the sample effect size (sample proportion, $\hat{p}_i$), posterior expected value of the true effect size, standard error (SE), and 95% credible interval for the true effect size. Here, SE $= \sqrt{\hat{p}_i(1 - \hat{p}_i)/n_i}$ where $n_i$ is the total number of observations. From DPmeta, there are the posterior expected values of the true effect size corresponding to $M = 1/5$ and $M = 5$ (a good representation of the six values we used in our analysis). The remaining columns give the posterior means and credible intervals obtained by using the Reversible Jump MCMC method. We summarize our results from the basic uncertain pooling method and DPmeta first, adding brief comments about the results obtained by using the RJMCMC method at the end of this section. In general, the results are consistent.

The posterior probability associated with the pool-all partition is minuscule. Now suppose that we consider the probability that there is any single large cluster. That is, we include all partitions that consist of one large cluster together with singleton subsets. The sum of these probabilities is a very small quantity.

A standard way (Gelman, 2006) to assess the likelihood that the true effect sizes come from a common source is to assume the pool-all model and evaluate the posterior predictive p-value using a standard (6.4 of Gelman, 2006) discrepancy measure. Let $\nu$ denote the overall effect and $\delta^2$ the variance in (2), where $\bar{Y}_i$ is now the sample log odds and $\hat{\sigma}_i^2/n_i$ is as in (12). The discrepancy measure,

$$T(y, \nu, \delta^2) = \sum_{i=1}^{L} \frac{(\bar{Y}_i - \nu)^2}{(\hat{\sigma}_i^2/n_i) + \delta^2}, \tag{13}$$

is based on the pool-all model, as defined in (1) and (2). Then the posterior predictive p-value is

$$\Pr\{T(y^{rep}, \nu, \delta^2) \geq T(y^{obs}, \nu, \delta^2) \mid y^{obs}\}$$

with $y^{rep}$ denoting a replication from the pool-all model. For these data the p-value shows that the observed data are not concordant with the pool-all model. These results show that it is highly unlikely that the five true effect sizes come from a single source.
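A Monte Carlo approximation to this posterior predictive p-value can be sketched as follows. The sketch assumes that posterior draws of the overall effect and of the variance under the pool-all model are already available (how they are obtained is outside this example), and it generates each replicate from the marginal normal distribution implied by (1) and (2). The function names are hypothetical.

```python
import numpy as np


def discrepancy(y, nu, delta2, s2):
    """Discrepancy measure (13) for estimates y with sampling variances s2."""
    return np.sum((y - nu) ** 2 / (s2 + delta2))


def posterior_predictive_pvalue(y_obs, s2, nu_draws, delta2_draws, rng):
    """Share of posterior draws for which T(y_rep) >= T(y_obs)."""
    y_obs = np.asarray(y_obs, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    exceed = 0
    for nu, d2 in zip(nu_draws, delta2_draws):
        # Replicate the study estimates under the pool-all model:
        # marginally Ybar_i ~ N(nu, sigma_i^2/n_i + delta^2).
        y_rep = rng.normal(nu, np.sqrt(s2 + d2))
        if discrepancy(y_rep, nu, d2, s2) >= discrepancy(y_obs, nu, d2, s2):
            exceed += 1
    return exceed / len(nu_draws)
```

A p-value near zero indicates that the observed estimates are more dispersed than replicates from the pool-all model, which is the pattern reported for the data sets analyzed here.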

Another way to analyze these data is to construct a similarity matrix. For each pair of studies, $i$ and $j$, the similarity matrix gives the posterior probability that $\mu_i$ and $\mu_j$ are in the same cluster. We present in Figure 1 (top) a visual representation of the similarity matrix using the basic uncertain pooling method (Section 3.1) while Figure 1 (bottom) is the corresponding similarity plot using the RJMCMC methodology in Section 3.3. The study numbers are given on both the $x$ and $y$ axes and there is a legend showing, for each cell, the value of the pairwise probability. Figure 1 (top) shows limited strong clustering, {2,4}, while Figure 1 (bottom) shows some additional clustering, {1,3}, {2,4} and {4,5}. Thus, limited aggregation seems appropriate. This may not be surprising since the He, Yi and Zhu (2020) review only included observations from the very early part of the COVID-19 pandemic.
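Given posterior probabilities for a set of partitions, the similarity matrix can be assembled as in the sketch below. The inputs in the test are made up for illustration, study indices are 0-based, and the function name is hypothetical.

```python
import numpy as np


def similarity_matrix(partitions, probs, n_studies):
    """Posterior probability that each pair of studies shares a cluster.

    partitions : list of partitions, each a list of blocks of 0-based indices
    probs      : posterior probability of each partition (summing to 1)
    """
    sim = np.zeros((n_studies, n_studies))
    for partition, p in zip(partitions, probs):
        for block in partition:
            idx = np.array(block)
            sim[np.ix_(idx, idx)] += p   # every pair inside a block co-clusters
    return sim
```

The same function applies to RJMCMC output if each visited partition is given weight equal to its share of the retained iterations.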

From Table 1 the 95% credible intervals are very wide, a further indication of considerable uncertainty. Moreover, the 95% credible interval for the overall true effect size is wide, a strong indication that a single summary value such as its posterior mean would not be informative.

The results from applying DPmeta provide additional insight. When $M$ is small, e.g., $M$ = 1/5, the posterior expected values show considerable clustering, i.e., {2,4,5} and {1,3}, similar to those seen in Figure 1 (bottom). This clustering is not surprising since choosing a very small value of $M$ means that there is very limited sampling from the parametric centering measure. Using a moderate choice, $M$ = 5, the five posterior expected values are close to those obtained using the basic uncertain pooling methodology as are the 95% credible intervals (intervals are not shown in Table 1).

The results in Table 1 and Figure 1 suggest that one should not pool the data from these five studies. The next review that we consider, He et al. (2021) also shows the same issue, i.e., questionable pooling of the data from all of the eleven studies. However, in this case, our analysis provides evidence of considerable clustering. Looking at the characteristics of the eleven studies could reveal the reasons for this clustering, and the direction to take to make appropriate inferences.

The analysis presented below uses a subset of the data in He et al. (2021), namely the asymptomatic infection rate in eleven studies of children. The results are summarized in Table 2 and Figure 2, analogous to Table 1 and Figure 1.

From Table 2 and Figure 2 it is apparent that there are several distinct subsets. Further investigation could reveal features that separate these subsets, leading to advances in understanding the differences in the asymptomatic rates.

The posterior probability associated with the pool-all partition is minuscule. As described for the He, Yi and Zhu (2020) study, we also computed the sum of the posterior probabilities associated with partitions having only a single cluster and 0, 1, 2, 3, 4 or 5 singleton subsets. This result suggests that it is unlikely that the eleven true effect sizes come from a single source. Moreover, using the discrepancy measure in (13), the posterior predictive p-value shows that the observed data are not concordant with the pool-all model. Finally, the 95% credible interval for the overall true effect size, together with the clustering, is a strong indication that a single summary value such as its posterior mean would not be informative.

From Figure 2 the most likely cluster is {3,6,7,11}, while the next most likely one is {1,2,5,8,9,10}. In this case, there is considerable clustering but it does not extend to all eleven studies. Without additional evidence this analysis suggests conducting separate standard meta-analyses for the two large subsets.

In Table 2 the posterior expected values for the individual studies in cluster {3,6,7,11} are, essentially, an average of the corresponding four sample effect sizes, but reduced in magnitude. This reduction reflects the contributions from the data from the other seven studies.

The clustering seen in Figure 2 is also evident in the posterior expected values from DPmeta with small ; see Table 2 with . With there is good agreement between the posterior expected values for the two methods. For most of the eleven studies there is good agreement in the credible intervals for the two methods with for DPmeta (not shown in Table 2).

The next analysis uses the data from a subset of six of the eleven studies in He et al. (2021). These six studies were chosen to illustrate properties of the methodology when there is considerable separation. The results are summarized in Table 3 and Figure 3.

From Figure 3 and Table 3 it is apparent that there are two distinct subsets, i.e., and . Presumably this separation reflects different characteristics of the two populations and/or different ways that the studies were carried out. As expected, is minuscule, i.e., Proceeding as described for the He et al. (2021) study, the sum of the posterior probabilities associated with partitions having only a single large cluster (i.e., with at least four members) is . Thus, there is no evidence that the true effect sizes from these six studies come from a single source. Moreover, using the discrepancy measure in (13), the posterior predictive p-value is , showing that the observed data are not concordant with the pool-all model. Finally, a 95% credible interval for the overall true effect size, , is . This wide interval, together with the clustering, is a strong indication that a single summary value such as the posterior mean of would not be informative. The results from DPmeta are consistent with those given above. For a wide range of values of the posterior expected values for {1,2,5} are about 0.14 while those for {6,7,11} are about 0.50. With the posterior expected values and intervals for {1,2,5} are quite close to those obtained by using the basic uncertain pooling methodology. For {6,7,11} almost all of the expected values and interval endpoints are within 0.06 of those from the uncertain pooling methodology.

The final analysis uses the data from the seven study screening subset in Buitrago-Garcia et al. (2020), identified as being of special interest. The results are summarized in Table 4 and Figure 4. In Table 4 the sample SEs play a large role. That is, for the studies {1,2} with the large SEs the posterior expected values are much smaller than the sample effect sizes. For the studies {5,6,7} with the small SEs the posterior expected values and sample effect sizes are approximately equal. Thus, apart from study 4, the posterior expected values suggest greater commonality than was apparent from the corresponding set of sample effect sizes. However, the individual posterior credible intervals are quite different. Using the discrepancy measure in (13), the posterior predictive p-value is , indicating that there may be a common source for the true effects. This result is supported by Figure 4 which suggests relatively uniform clustering (except for studies and ).

Since the data in each of Tables 2-4 are for a subset of a much larger set of studies they are likely to be substantially more homogeneous than the data in the full set of studies. For example, the seven screening studies (Table 4) are a subset of seventy-nine studies with sample proportions ranging from 0.01 to 0.92.

Results from using DPmeta are, for the most part, consistent with these observations. For all of the posterior expected values are approximately 0.31. For the posterior expected values range from 0.25 to 0.35, similar to the posterior means in Table 4. For the intervals corresponding to {5,6,7}, the studies with the smallest SEs, are similar to those from the basic uncertain pooling methodology.

With a few modifications we have implemented the RJMCMC procedure outlined in Sections 6.2 and 6.3 of Green (1995), but expand the range for the prior for by taking log (log 100, log 1000) throughout. We summarize by using the posterior means and 95% credible intervals (bottom of Tables 1-4), and similarity plots (bottom of Figures 1-4).

Since the sample likelihoods and prior distributions differ between the two approaches, i.e., those based on the likelihoods and priors in Sections 3.1 and 3.3, comparisons of the results may not be especially meaningful. However, examining Tables 1-4 it is notable that the summaries (posterior means and intervals) from the two approaches are consistent, generally close when the standard errors (SEs) are small, less so when the SEs are very large. There are no major differences between the comparable similarity plots corresponding to basic uncertain pooling and RJMCMC.
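To make the comparison of similarity plots concrete, the sketch below shows one way a similarity matrix can be computed, assuming that each plot displays, for every pair of studies, the posterior probability that the two studies fall in the same cluster. That interpretation is an assumption, as is the hypothetical function name `similarity_matrix`. The input may be a set of enumerated partitions with their posterior probabilities, or equally weighted partitions sampled by RJMCMC.

```python
def similarity_matrix(partitions, n_studies, weights=None):
    """Pairwise co-clustering probabilities.

    partitions: list of partitions; each partition is a list of clusters,
                and each cluster is a list of 1-based study labels.
    weights:    optional posterior probabilities of the partitions; if omitted,
                the partitions are treated as equally weighted MCMC draws.
    """
    if weights is None:
        weights = [1.0] * len(partitions)
    total = sum(weights)
    sim = [[0.0] * n_studies for _ in range(n_studies)]
    for part, w in zip(partitions, weights):
        for cluster in part:
            for i in cluster:
                for j in cluster:
                    # Studies i and j share a cluster in this partition.
                    sim[i - 1][j - 1] += w / total
    return sim
```

Entry `sim[i - 1][j - 1]` is then the estimated probability that studies `i` and `j` share a common true effect size, so matrices from the two approaches can be compared entry by entry.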

## 5 Concluding Remarks

The importance of good inferences for the COVID-19 asymptomatic infection rates is clear, as noted in the quotation from Buitrago-Garcia et al. (2020) in Section 1. Conducting meta-analyses is a common, often useful, way to summarize information from a collection of studies. However, inference will be misleading if there is pooling of data from studies that are not concordant. For example, Byambasuren et al. (2020) note: “A recent review by the Centre for Evidence Based medicine in Oxford found a range of estimates of asymptomatic COVID-19 cases which ranged from 5% to 80%. However, many of the identified studies were either poorly executed or poorly documented, making the validity of these estimates questionable.”

In this paper, we re-analyze data from three review papers, using three Bayesian methods that have a more general structure than the common meta-analytic ones. This methodology shows, in a principled manner, the extent and nature of the pooling that can be justified statistically. The more general structure should ensure greater concordance of the data with our model than with a more restricted model.

If the authors of a review have screened the studies so that the ones remaining for analysis have no markedly aberrant characteristics then an analysis showing distinct clusters should prompt a further review, and careful consideration of the inferences to make and the method to use.

In some situations there may be covariates associated with the studies that may help to explain differences in the outcomes. To illustrate, use the basic notation and a linear regression offset. That is, replace (1) with

$$\bar{Y}_i \sim N(\mu_i + x_i^t \beta, \sigma_i^2).$$

Then inference for can be made using the extension of (6)

$$f(\mu, \beta \mid y) = \int f(\beta \mid \mu, y, g, \delta^2)\, f(\mu \mid y, g, \delta^2)\, f(g, \delta^2 \mid y)\, d\delta^2\, dg$$

where it is easily shown that

$$\beta \mid \mu, y, \delta^2, g \sim \mathrm{MVN}(d, A^{-1})$$

with an matrix of covariates, a vector of regression coefficients, an diagonal matrix with th element , , and

In the reviews we have considered, only He et al. (2021) gives more than one covariate for each study, i.e., the number of confirmed cases and percent male. Using the covariates, an exploratory analysis of the residuals showed that such an augmented analysis will not improve inferences.
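For readers who want to see how a conditional distribution of the form $\mathrm{MVN}(d, A^{-1})$ for $\beta$ can arise, the following is a sketch under stated assumptions: the studies are independent, the prior on $\beta$ is flat and independent of the other parameters, and the covariate matrix has full column rank. The symbols $X$ (the matrix whose $i$th row is $x_i^t$), $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots)$ and $r = y - \mu$ are introduced here for illustration and may not match the notation or prior used in the paper. Given $\mu$, the likelihood as a function of $\beta$ is

$$f(y \mid \mu, \beta) \propto \exp\left\{ -\tfrac{1}{2} (r - X\beta)^t \Sigma^{-1} (r - X\beta) \right\},$$

and completing the square in $\beta$ gives

$$A = X^t \Sigma^{-1} X, \qquad d = A^{-1} X^t \Sigma^{-1} (y - \mu).$$

Under these assumptions $d$ is the weighted least squares estimate from regressing the residuals $y - \mu$ on the covariates, with weights $1/\sigma_i^2$.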

Software exists or is being developed for all three methods. DPmeta is available in an R package (DPpackage), and the code for DPmeta is included as Supplementary Material. R packages are being developed for the two uncertain pooling methods, i.e., basic uncertain pooling and RJMCMC.

Finally, for basic uncertain pooling and DPmeta both the sample proportion and logit of the sample proportion are used in applications. While we have presented results only for the latter, our findings are similar for both choices.
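The two effect-size scales mentioned above can be computed from study counts as in the Python sketch below. It uses the usual large-sample (delta method) standard error for the logit; the function names and the choice to raise an error at the boundary are illustrative assumptions, not the paper's implementation.

```python
import math


def proportion_effect(x, n):
    """Sample proportion x/n and its large-sample standard error."""
    p = x / n
    return p, math.sqrt(p * (1.0 - p) / n)


def logit_effect(x, n):
    """Logit of the sample proportion x/n and its delta-method standard error."""
    if not 0 < x < n:
        # The logit is undefined at 0 or n; a continuity correction would be needed.
        raise ValueError("logit requires 0 < x < n")
    p = x / n
    return math.log(p / (1.0 - p)), math.sqrt(1.0 / x + 1.0 / (n - x))


def inv_logit(z):
    """Map a value on the logit scale back to a proportion."""
    return 1.0 / (1.0 + math.exp(-z))
```

Posterior summaries obtained on the logit scale can be mapped back to proportions with `inv_logit` for reporting.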

## References

• Borenstein et al. (2010) Borenstein M, Hedges L, Higgins J, Rothstein H. A basic introduction to fixed-effect and random-effects models for meta-analysis. Res Synth Methods. 2010;1:97-111. doi: 10.1002/jrsm.12
• Buitrago-Garcia et al. (2020) Buitrago-Garcia D, Egli-Gany D, Counotte M, et al. Occurrence and transmission potential of asymptomatic and presymptomatic SARS-CoV-2 infections: A living systematic review and meta-analysis. PLOS Medicine.
• Oran and Topol (2020) Oran D, Topol E. Prevalence of asymptomatic SARS-CoV-2 infection: a narrative review. Annals of Internal Medicine.
• Schmid, Carlin and Welton (2021) Schmid C, Carlin B, Welton N. Bayesian methods for meta-analysis. In Handbook of Meta-Analysis (Schmid, Stijnen, White, eds). Boca Raton FL: Chapman and Hall/CRC Press, 2021;91-127.
• Rice, Higgins and Lumley (2018) Rice K, Higgins J, Lumley T. A re-evaluation of fixed effect(s) meta-analysis. J. Royal Statistical Society, A. 2018;181(1):205-227.
• He, Yi and Zhu (2020) He W, Yi G, Zhu Y. Estimation of the basic reproduction number, average incubation time, asymptomatic infection rate, and case fatality rate for COVID-19: Meta-analysis and sensitivity analysis. J Med Virol. 2020;92(11):2543-2550. doi: 10.1002/jmv.26041
• He et al. (2021) He J, Guo Y, Mao R, Zhang J. Proportion of asymptomatic coronavirus disease 2019: A systematic review and meta-analysis. J Med Virol. 2021;93(2):820–830. doi: 10.1002/jmv.26326.
• Byambasuren et al. (2020) Byambasuren O, Cardona M, Bell K, Clark J, McLaws M, Glasziou P. Estimating the extent of true asymptomatic COVID-19 and its potential for community transmission: systematic review and meta-analysis. Official Journal of the Association of Medical Microbiology and Infectious Disease Canada. 2020. doi: 10.3138/jammi-2020-0030.
• Malec and Sedransk (1992) Malec D, Sedransk J. Bayesian methodology for combining the results from different experiments when the specifications for pooling are uncertain. Biometrika. 1992;79(3):593-601.
• Evans and Sedransk (2001) Evans R, Sedransk J. Combining data from experiments that may be similar. Biometrika. 2001;88(3):643-656.
• Dahl, Day and Tsai (2017) Dahl D, Day R, Tsai J. Random partition distribution indexed by pairwise information. Journal of the American Statistical Association. 2017; 112: 721-732.
• Müller et al. (2015) Müller P, Quintana FA, Jara A, Hanson T. Bayesian Nonparametric Data Analysis, Springer Series in Statistics; 2015.
• Jara et al. (2011) Jara A, Hanson T, Quintana FA, Müller P, Rosner G. DPpackage: Bayesian semi- and nonparametric modeling in R, J. Stat. Softw. 2011;40(5). doi: 10.18637/jss.v040.i05.
• Escobar (1994) Escobar M. Estimating normal means with a Dirichlet process prior. J Am Stat Assoc. 1994;89(425):268-277.
• Yao et al. (2021) Yao Y, Ogden T, Zeng C, Chen Q. Bivariate hierarchical Bayesian model for combining summary measures and their uncertainties from multiple sources. 2021. arXiv:2109.07560[stat.ME].
• Gelman et al. (2004) Gelman A, Carlin B, Stern H, Rubin D. Bayesian Data Analysis. 2nd ed. Boca Raton FL: Chapman and Hall/CRC; 2004.
• Green (1995) Green P. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82(4):711-732.
• Hamza (2008) Hamza T, van Houwelingen H, Stijnen T. The binomial distribution of meta-analysis was preferred to model within-study variability. Jour. Clinical Epidemiology. 2008; 61: 41-51.
• DerSimonian and Laird (1986) DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials. 1986;7(3):177-88. doi: 10.1016/0197-2456(86)90046-2. PMID: 3802833.
• Gelman (2006) Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Anal, 2006;1(3):515-533.
• Mizumoto et al. (2020) Mizumoto K, Kagaya K, Zarebski A, Chowell G. Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship, Yokohama, Japan, 2020. Euro Surveill. 2020: 25(10). doi: 10.2807/1560-7917.
• Nishiura et al. (2020) Nishiura H, Kobayashi T, Miyama T, et al. Estimation of the asymptomatic ratio of novel coronavirus infections (COVID-19). Int J Infect Dis 2020; 94: 154-155. doi: 10.1016/j.ijid.2020.03.020.
• Kimball et al. (2020) Kimball A, Hatfield K, Arons M, et al. Asymptomatic and pre-symptomatic SARS-CoV-2 infections in residents of a long-term care skilled nursing facility - King County, Washington, March 2020. MMWR Morbidity Mortality Weekly Report 2020; 69: 377-381.
• Song et al. (2020) Song H, Xiao J, Qiu J, et al. A considerable proportion of individuals with asymptomatic SARS-CoV-2 infection in Tibetan population MedRxiv. 2020 doi: 10.1101/2020.03.27.20043836.
• Serra (2020) Serra M. Coronavirus, Castiglione d’Adda è un caso di studio: ‘Il 70% dei donatori di sangue è positivo’. Top News. Lastampa, Italy. 2020. https://it.notizie.yahoo.com/coronavirus-castiglione-d-adda-il-065252105.html?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAAI4TytcLgIOu9fy3U2v5N8NhGzb0_B0BAu-9y4CITl4pJ9JfDLdo0UQ0dBAUKbQoLOr94nKkh6l5fFEUlhjTYiu4YtSQ-UgXuQle0TpbgPhCk3UbhlnqtHcRmXZKxtJFUcsawQ1hcXL94MnTucZEc1RaUpuct9OURAMuirKvhWFz
• Day (2020) Day M. COVID-19: four fifths of cases are asymptomatic, China figures indicate. BMJ 2020; 369: m1375. doi: 10.1136/bmj.m1375.

## Acknowledgments

The authors are grateful to the reviewers for their comments which have improved the focus of the paper and motivated further methodological development. They are also grateful to Professor Peter Green for his assistance in applying Reversible Jump MCMC to our data. They also appreciate research allocation grants from XSEDE’s Pittsburgh Supercomputing Center.

None reported.

### Conflict of interest

The authors declare no potential conflicts of interest.