# Standard error estimation in meta-analysis of studies reporting medians

We consider the setting of an aggregate data meta-analysis of a continuous outcome of interest. When the distribution of the outcome is skewed, it is often the case that some primary studies report the sample mean and standard deviation of the outcome and other studies report the sample median along with the first and third quartiles and/or minimum and maximum values. To perform meta-analysis in this context, a number of approaches have recently been developed to impute the sample mean and standard deviation from studies reporting medians. Then, standard meta-analytic approaches with inverse-variance weighting are applied based on the (imputed) study-specific sample means and standard deviations. In this paper, we illustrate how this common practice can severely underestimate the within-study standard errors, which results in overestimation of between-study heterogeneity in random effects meta-analyses. We propose a straightforward bootstrap approach to estimate the standard errors of the imputed sample means. Our simulation study illustrates how the proposed approach can improve estimation of the within-study standard errors and between-study heterogeneity. Moreover, we apply the proposed approach in a meta-analysis to identify risk factors of a severe course of COVID-19.

• 6 publications
• 1 publication
• 1 publication
• 5 publications
• 7 publications
• 4 publications
03/25/2019

### Estimating the sample mean and standard deviation from commonly reported quantiles in meta-analysis

Researchers increasingly use meta-analysis to synthesize the results of ...
09/05/2018

### Two-sample aggregate data meta-analysis of medians

We consider the problem of meta-analyzing two-group studies that report ...
09/24/2020

### A new multivariate meta-analysis model for many variates and few studies

Studies often estimate associations between an outcome and multiple vari...
07/16/2019

### Outliers in meta-analysis: an asymmetric trimmed-mean approach

The adaptive asymmetric trimmed mean is a known way of estimating centra...
10/01/2019

### Generalized inferential models for meta-analyses based on few studies

Meta-analysis based on only a few studies remains a challenging problem,...
04/05/2020

### ABCMETAapp: R Shiny Application for Simulation-based Estimation of Mean and Standard Deviation for Meta-analysis via Approximate Bayesian Computation (ABC)

Background and Objective: In meta-analysis based on continuous outcome, ...
05/17/2019

### Merging versus Ensembling in Multi-Study Machine Learning: Theoretical Insight from Random Effects

A critical decision point when training predictors using multiple studie...

## 1 Introduction

Meta-analysis is a statistical approach that synthesizes data across studies addressing a common research question. We consider the setting where the outcome of interest in a meta-analysis is continuous. In this setting, primary studies typically report the sample mean and standard deviation of the outcome. However, when the distribution of the outcome is skewed, primary studies often instead report the sample median of the outcome along with the first and third quartiles and/or the minimum and maximum values [higgins2020cochrane].

A number of approaches have recently been developed to perform meta-analysis when some primary studies report sample means and others report sample medians. The most commonly applied approach is to first impute the sample mean and standard deviation from studies that report medians along with the first and third quartiles and/or minimum and maximum values. Then, one applies standard meta-analytic approaches with inverse-variance weighting based on the (imputed) study-specific sample means and standard deviations. Such approaches, which we refer to as transformation-based approaches, were first proposed and systematically evaluated by Hozo et al. [hozo2005estimating] and have been further developed by a number of authors in recent years [bland2015estimating, wan2014estimating, kwon2015simulation, luo2018optimally, mcgrath2020estimating, shi2020optimally, shi2020estimating, walter2022estimation, cai2021estimating, yang2021generalized]. Reflecting their widespread application, Google Scholar lists over 10,000 articles citing these transformation-based approaches (i.e., [hozo2005estimating, bland2015estimating, wan2014estimating, kwon2015simulation, luo2018optimally, mcgrath2020estimating, shi2020optimally, shi2020estimating, walter2022estimation, cai2021estimating, yang2021generalized]) as of May 1 2022.

A fundamental problem of the standard application of transformation-based approaches is that the sampling variability of the imputed sample means is not appropriately accounted for when performing an inverse-variance weighted meta-analysis. For instance, consider the simple case of a meta-analysis of one-group studies that report the sample median of the outcome. While the standard deviation divided by the square root of the sample size is the standard error (SE) of the sample mean, this is not necessarily equal to the SE of the mean estimator of the transformation-based approaches. In particular, such mean estimators are typically much less efficient than the sample mean because they are based on less data. Consequently, the estimated standard deviation divided by the square root of the sample size – which we refer to as the naïve SE estimator – may underestimate the SE of the mean estimator for studies reporting medians.

Underestimation of within-study SEs can have several downstream consequences in meta-analysis, which we investigate in this paper. For instance, consider the between-study variance estimator in a random effects meta-analysis. Since the total variance of the study-specific means decomposes into the sum of the within-study variance and between-study variance, systematically underestimating the within-study SEs results in overestimating the between-study variance. This in turn introduces bias in estimating other parameters that depend on the between-study variance, especially those quantifying between-study heterogeneity such as the index [higgins2002quantifying, higgins2003measuring]. More generally, poor estimation of between-study heterogeneity may strongly affect the conclusions meta-analyses, as the degree of between-study heterogeneity helps researchers understand the generalizability of their conclusions, helps guide additional analyses to identify sources of heterogeneity (e.g., subgroup analyses, meta-regression analyses), and even helps determine whether it makes sense to pool the data in a meta-analysis at all [higgins2020cochrane].

The use of the naïve SE estimators in the literature presumably arose because most of the papers proposing transformation-based approaches suggested that data analysts use the imputed study-specific sample means and standard deviations in place of the actual, unreported sample means and standard deviations in standard meta-analytic approaches. Even more explicitly, Hozo et al. [hozo2005estimating], Luo et al. [luo2018optimally], McGrath et al. [mcgrath2020estimating], and Shi et al. [shi2020estimating] used the the naïve SE estimator in data applications illustrating how to apply their proposed transformation-based approaches, for instance. While some have raised concerns over potential issues of the naïve SE estimator (e.g., see the open peer review report of Hozo et al. [hozo2005estimating]), the downstream consequences of using naïve SE estimators has been largely unexplored in the literature because most simulation studies evaluating the transformation-based approaches (e.g., see [hozo2005estimating, bland2015estimating, wan2014estimating, kwon2015simulation, luo2018optimally, mcgrath2020estimating, shi2020optimally, shi2020estimating, walter2022estimation, cai2021estimating]) have almost entirely focused on the performance of these methods at the study-level (i.e., for estimating the mean and standard deviation of the outcome distribution for a given study) rather than at the meta-analytic level.

Analytically deriving the SE of the mean estimator of the transformation-based approaches is case-specific and often challenging, as several of the currently best-performing approaches involve model selection and are solutions to ad-hoc optimization problems. However, since most of these transformation-based approaches assume parametric models, parametric bootstrap

[efron1994introduction] is a natural approach to consider for estimating the SE of the mean estimator, which we explore in this paper.

The methodological contributions of this paper are as follows: 1) Focusing on the recently proposed transformation-based approaches of McGrath et al. [mcgrath2020estimating] and Cai et al. [cai2021estimating], we illustrate that the standard application of transformation-based approaches severely underestimates the within-study SEs and consequently overestimates between-study heterogeneity in random effects meta-analyses. 2) We describe a parametric bootstrap approach to estimate within-study SEs in this context and systematically evaluate its performance at both the study- and meta-analytic level.

In Section 2, we describe the transformation-based approaches and the proposed bootstrap SE estimator. In Section 3, we perform a simulation study evaluating the performance of the naïve and bootstrap SE estimators at both the study- and meta-analytic level. We perform an empirical comparison of the naïve and bootstrap SE estimators in a meta-analysis of risk factors of a severe course of COVID-19 in Section 4. We conclude with a discussion in Section 5.

## 2 Methods

### 2.1 Meta-analytic models, estimands, and estimators

Consider an aggregate data meta-analysis of a continuous outcome of interest. For each primary study (), let denote the point estimate of the outcome of interest denote the estimate of its sampling variance. For instance, may represent the sample mean of the outcome of interest. The study-specific estimates are assumed to be distributed as

 yk∼Normal(μk,σ2k),k=1,…,K

where the are regarded as known quantities. The common effect (also referred to as fixed effect) model assumes that the study-specific means are identical (i.e., for , ). The random effects

model assumes that the study-specific means are random variables that are distributed as

 μk∼Normal(μpool,τ2),k=1,…,K.

We refer to as the pooled mean and as the between-study variance.

Throughout, we adopt the framework of a random effects model, although the approaches introduced in this paper are similarly applicable to the common effect model. The classic inverse-variance weighted estimator of and its estimated SE are given by

 ^μpool=∑Kk=1ykσ2k+^τ2∑Kk=11σ2k+^τ2,ˆSE(^μpool)=  ⎷1∑Kk=11σ2k+^τ2.

where denotes an estimate of the between-study variance. A number of between-study variance estimators have been proposed and systematically compared, which are discussed at length elsewhere [viechtbauer2005bias, veroniki2016methods, langan2019comparison]. A commonly used relative measure of between-study heterogeneity is the index. The index describes the proportion of the variability of the study-specific means that is due to the between-study variance rather than the within-study variance (i.e., ) [higgins2002quantifying, higgins2003measuring]. The index is typically estimated by

 ^I2=^τ2^τ2+^σ2

where is an estimator of the “typical” within-study variance. Higgins et al. [higgins2002quantifying] discuss approaches for obtaining .

### 2.2 Transformation-based approaches

As previously discussed, some primary studies in a meta-analysis may report the sample mean of the outcome of interest and other primary studies may report the median. Transformation-based approaches are conventionally applied in such settings to impute the sample means and standard deviations from primary studies reporting medians. In this subsection, we describe such transformation-based approaches. For simplicity, we consider the setting of one-group primary studies, although the methods described here are applicable to studies with multiple groups (e.g., treatment and control groups) where the outcome of interest may be the difference of means across groups.

We use the following notation to describe data arising in a primary study. Suppose we observe independent and identically distributed samples of the outcome of interest . Let denote the underlying distribution and denote the underlying model. The primary study may report the following summary statistics of the outcome of interest: minimum value (), first quartile (), median (), third quartile (), and maximum value ().

Transformation-based approaches typically consider that one of the following sets of summary statistics is reported:

 S1 ={qmin,q2,qmax,n} S2 ={q1,q2,q3,n} S3 ={qmin,q1,q2,q3,qmax,n}.

In the following subsections, we describe the transformation-based approaches proposed by McGrath et al. [mcgrath2020estimating] and Cai et al. [cai2021estimating]

. We focus on these approaches because they were developed for situations when the data were suspected to be non-normal – which is often the case when primary studies report the median of the outcome – and these approaches yielded excellent performance for estimating the mean and standard deviation in simulation studies under non-normal distributions.

#### 2.2.1 Quantile Matching Estimation (QE)

The quantile (matching) estimation (QE) approach

[mcgrath2020estimating] assumes that the underlying distribution belongs to one of following candidate parametric distributions: normal, log-normal, gamma, beta, or Weibull. This approach estimates the parameters of each candidate distribution by minimizing the distance between the sample quantiles with the distribution quantiles. More specifically, let denote the th candidate parametric distribution. This approach obtains an estimate of by minimizing where

 Sj(θj) :=(F−1j,θj(1/n)−qmin)2+(F−1j,θj(1/2)−q2)2+(F−1j,θj(1−1/n)−qmax)2 in S1 Sj(θj) :=(F−1j,θj(1/4)−q1)2+(F−1j,θj(1/2)−q2)2+(F−1j,θj(3/4)−q3)2 in S2 Sj(θj) :=(F−1j,θj(1/n)−qmin)2+(F−1j,θj(1/4)−q1)2+(F−1j,θj(1/2)−q2)2 +(F−1j,θj(3/4)−q3)2+(F−1j,θj(1−1/n)−qmax)2 in S3

and

denotes the inverse cumulative distribution function (CDF) of distribution

. The candidate distribution with the best fit (i.e., smallest value of ) is selected. The mean and standard deviation of the selected distribution are used as the estimated mean and standard deviation.

#### 2.2.2 Box-Cox (BC)

The Box-Cox (BC) approach [mcgrath2020estimating] assumes that the underlying distribution is normal after applying a suitable Box-Cox transformation [box1964analysis]. A Box-Cox transformation with power parameter is defined as

 gλ(x):={xλ−1λif λ≠0logxif λ=0

for . In , the BC approach estimates by finding the value such that the transformed minimum and maximum values (i.e., and ) are equidistant from the transformed median (i.e., ). Similarly, in , is estimated such that the transformed first and third quartiles are equidistant from the transformed median. Since there does not necessarily exist a value of such that both (i) the transformed minimum and maximum values are equidistant from the transformed median, and (ii) the transformed first and third quartiles are equidistant from the transformed median, the BC approach estimates in by

 ^λ=argminλ{[(gλ(q3)−gλ(q2))−(gλ(q2)−gλ(q1))]2+[(gλ(q3)−gλ(q2))−(gλ(q2)−gλ(q1))]2}.

Then, the methods of Luo et al. [luo2018optimally] and Wan et al. [wan2014estimating] are applied to estimate the mean and standard deviation of the transformed data, respectively. Last, the inverse transformation is applied to estimate the mean and standard deviation of the original, untransformed data.

#### 2.2.3 Method for Unknown Non-Normal Distributions (MLN)

Like the BC approach, the Method for Unknown Non-Normal Distributions (MLN) [cai2021estimating] assumes that the underlying distribution is normal after applying a suitable Box-Cox transformation (i.e., ). The following maximum likelihood approach is used to estimate . For , let denote the estimated mean and standard deviation based on the methods of Luo et al. [luo2018optimally] and Wan et al. [wan2014estimating] applied to the Box-Cox transformed quantiles of summary statistics with power parameter (which we omit for notational simplicity). Let denote the CDF and

denote the probability density function of a

distribution, . The conditional likelihood in is given by

 L(λ|S1,~μ1,~σ1)∝(F1(q2)−F1(qmin))n/2−1(F1(qmax)−F1(q2))n/2−1f1(qmin)f1(q2)f1(qmax).

Similarly, the conditional likelihood in is given by

 L(λ|S2,~μ2,~σ2)∝F2(q1)n/4(F2(q2)−F2(q1))n/4−1(F2(q3)−F2(q2))n/4−1(1−F2(q3))n/4f2(q1)f2(q2)f2(q3)

and in is given by

 L(λ|S3,~μ3,~σ3) ∝(F3(q1)−F3(qmin))n/4−1(F3(q2)−F3(q1))n/4−1(F3(q3)−F3(q2))n/4−1(F3(qmax)−F3(q3))n/4−1 ×f3(qmin)f3(q1)f3(q2)f3(q3)f3(qmax).

This approach estimates by minimizing the log conditional likelihood. Then, this approach finds the maximum likelihood estimate of and conditional on the sample quantiles and . Last, the inverse transformation is applied to estimate the mean and standard deviation of the untransformed data.

### 2.3 Standard error estimation of transformation-based approaches

As a slight abuse of notation, in this subsection we let denote the standard deviation of the underlying distribution of the outcome in a given study. Let denote the sample standard deviation of the outcome and denote the estimated standard deviation of the outcome based on an arbitrary transformation-based approach.

#### 2.3.1 Naïve approach

Recall from Section 2.1 that standard meta-analytic approaches require that each primary study contributes a point estimate of the outcome of interest and an estimate of its SE. When a primary study reports the sample mean of the outcome of interest, the usual SE estimate is (as the true SE is ). When primary studies report medians, data analysts typically apply transformation-based approaches and use the estimated mean as the point estimate and as its SE estimate.

However, is not necessarily equal to the true SE of the mean estimator of the transformation-based approaches. That is, is the wrong target parameter for the within-study SE when using transformation-based approaches. Unlike the sample mean, the mean estimator of the transformation-based approaches is not based on the full data set. Consequently, one may expect the true SE of the mean estimator of transformation-based approaches to be systematically higher than . Further, one may expect this discrepancy to be particularly large for the more complicated transformation-based approaches involving model selection and/or solving various optimization problems, such as those described in the previous subsections. For these reasons, we refer to as the naïve SE estimate of the transformation-based approaches.

As a simple illustration of the bias of the naïve SE estimates, consider the following example. We generated 1,000 independent data sets of size under a distribution. For each data set, we calculated summary statistics and applied the QE, BC, and MLN approaches to the summary data. Figure 1 displays the distributions of the naïve SE estimates along with the true SEs of the three approaches (obtained by Monte Carlo integration with samples). The naïve SE estimators of all three approaches severely underestimated the true SEs. This bias is not due to poor estimation of : The value of (i.e., representing the best-case scenario for the naïve approaches) is approximately 1.2, which is considerably smaller than the true SEs. Similar results hold when applying other transformation-based approaches to this example (see Appendix A).

#### 2.3.2 Parametric bootstrap

To remedy the problems of the naïve approaches, we propose a straightforward application of parametric bootstrap [efron1994introduction] to estimate the SEs of the transformation-based approaches. For the sake of concreteness, we describe the parametric bootstrap approach for the MLN method under , noting that only trivial modifications are needed for the other transformation-based methods and scenarios (i.e., , ).

First, estimate the parameters of the underlying distribution by applying the MLN approach to the summary data. Let denote the fitted distribution. For where is some large number (e.g., , perform the following two steps: (1) Simulate i.i.d. observations under , (2) Apply the MLN approach to obtain an estimate of the mean () based on minimum value, median, and maximum value of the simulated data in the first step. The parametric bootstrap estimate of the SE of the MLN approach is the sample standard deviation of the (i.e., where ).

The distributions of the SE estimates when applying the parametric bootstrap approach to the illustrative example in the previous subsection are given in Appendix A.

## 3 Simulations

We systematically evaluated the performance of the naïve and parametric bootstrap SE estimators of the QE, BC, and MLN approaches in a simulation study. We performed simulations at both the study-level (Section 3.1) and at the meta-analytic level (Section 3.2). The code used for these simulations is available at https://github.com/stmcg/se-est.

### 3.1 Study-level simulations

#### 3.1.1 Simulation Settings

For consistency with previous simulation studies evaluating the performance of the QE, BC, and MLN methods [mcgrath2020estimating, cai2021estimating], we generated independent and identically distributed data sets of size under the following distributions: , , , and

. Unlike previous studies, we also included the half-normal distribution with a mean of 10 to evaluate the performance of the all three approaches under model misspecification. The probability density functions of these five distributions are illustrated in Appendix

B. We also varied the sample size () and the summary statistics reported (i.e., , , or ). We performed repetitions for each of the combinations of distribution, sample size, and summary statistics reported.

We applied the QE, BC, and MLN approaches to obtain the naïve and parametric bootstrap SE estimates. We used bootstrap samples.

We evaluated the median percent error, mean percent error, and root mean squared error (RMSE) of the SE estimators. Letting denote an estimate of the SE in the th repetition (), the median percent error is defined to be the sample median of the and the mean percent error is defined to be the sample mean of the . The true values of the SEs were obtained by Monte Carlo integration with samples.

#### 3.1.2 Results

The median percent error of the naïve and bootstrap SE estimators in scenarios , , and are given in Tables 1, 2, and 3, respectively. The naïve approaches underestimated the SEs in all simulation settings. These approaches generally performed worse as increased, which may be expected since these approaches estimate the wrong target parameter. In one of the least favorable simulation settings for the naïve approaches, the naïve SE estimators of the QE, BC, and MLN approaches had median percent error of -70%, -59%, and -54%, respectively, under the in scenario when .

The bootstrap SE estimators had considerably smaller median percent error compared to the naïve approaches in most simulation settings. These approaches also generally performed better as increased. There was not a clear trend in the performance of these methods when varying the summary statistics reported (i.e., , , ). In general, the bootstrap SE estimator of the MLN approach performed best. For instance, this approach had median percent error smaller than 2% in most simulation settings.

The simulation results for the mean percent error and RMSE are given in Appendix B for parsimony. We observed similar trends: the bootstrap approaches often performed better than the naïve approaches. However, there were occasionally very large bootstrap SE estimates which resulted in the bootstrap approaches having larger mean percent error and RMSE compared to the naïve approach in some simulation settings. This occurred most often for the QE approach (especially in with small ).

### 3.2 Meta-analysis simulations

#### 3.2.1 Simulation settings

We simulated meta-analyses with primary studies reporting sample medians of the outcome of interest as follows. We simulated the outcome of interest for the th individual in the th primary study by

 Xik ∼Log-Normal(5,0.252)+γk,i=1,…,nk,k=1,…,K

where denotes a study-specific random effect. We let for where was set to 6. This value of was set to obtain a reasonable degree of between-study heterogeneity (e.g.,

if all studies report sample means). The study-specific sample sizes were drawn from a discrete uniform distribution with a minimum value of

and a maximum value of .

We included 20 different simulation settings by varying number of primary studies the proportion of primary studies reporting the sample median of the outcome () and the summary statistics reported by the primary studies reporting medians (i.e., ). When primary studies did not report the sample median of the outcome, they reported the sample mean, standard deviation, and sample size. We performed repetitions in each of the 20 simulation settings.

We applied the QE, BC, and MLN approaches to all primary studies reporting medians. We applied each of the three transformation-based approaches with the naïve SE estimator and the parametric bootstrap SE estimator with bootstrap samples. After obtaining all study-specific points estimates and estimates of their SEs, we applied the standard inverse-variance approach to estimate the pooled mean. We used the Restricted Maximum Likelihood (REML) approach to estimate the between-study variance. To construct 95% CIs around the pooled mean and between-study variance estimates, we used the Wald method for the pooled mean and the Q-profile method for the between-study variance, as implemented in the metafor R package [metafor].

We evaluated the bias, variance, and coverage of the 95% confidence intervals (CIs) of the pooled mean and between-study variance estimators. We also evaluated the bias of the

estimators. The true values of were obtained by Monte Carlo integration with samples.

Note that we included two simulation settings where all primary studies report the sample mean of the outcome (i.e., settings with ), in which case transformation-based approaches are not applicable. We included these settings to clarify the extent to which sub-optimal performance of the estimators (e.g., any bias or below nominal coverage of 95% CIs) may be attributed to sources such as a small number of primary studies or small within-study sample sizes rather than poor performance of the transformation-based approaches.

#### 3.2.2 Results

##### Pooled mean estimators

The bias, variance, and coverage of the 95% CIs of the pooled mean estimators are given in Appendix C. The pooled mean estimators using the transformation-based approaches with bootstrap SE estimates often had similar bias, variance, and coverage compared to when using the naïve SE estimates. The main exception to these trends was the QE approach in , which had smaller bias and variance as well as better coverage when using bootstrap SE estimates compared to the naïve SE estimates.

##### Between-study variance estimators

Tables 4 and 5 give the bias and coverage of the 95% CIs of the between-study variance estimators. The variance of the between-study variance estimators is given in Appendix C.

The naïve approaches overestimated the between-study variance in all simulation scenarios, especially when the proportion of studies reporting medians was large. Consequently, they did not usually attain nominal coverage of their 95% CIs for between-study variance. As one may expect, the coverage of the naïve approaches decreased as the total number of primary studies and the proportion of primary studies reporting medians increased.

The bootstrap approaches performed considerably better than the naïve approaches for estimating the between-study variance. These approaches had smaller bias and variance, and they usually attained nominal or near-nominal coverage of their 95% CIs. There were not clear trends in the performance of these approaches when varying the summary statistics reported (i.e., , , ), the number of primary studies, or the proportion of primary studies reporting medians. In general, the MLN approach with bootstrap SEs performed best for estimating the between-study variance.

##### I2 estimators

The bias of the estimators are given in Table 6. We observed similar trends as those for estimating the between-study variance.

More specifically, the naïve approaches overestimated in all scenarios, especially when the proportion of studies reporting medians was large and summary statistics were reported. For instance, the QE, BC, and MLN naïve approaches overestimated by 58, 30, and 22 percentage points, respectively, when the number of primary studies was 30 and all primary studies reported summary statistics. The bootstrap approaches often performed considerably better. These approaches had bias smaller than 5 percentage points in most scenarios.

There were some scenarios (e.g., often when ) in which the MLN approach with naïve SEs had smaller bias for estimating compared to when using the bootstrap SEs. This typically occurred when the bias of both the naïve the bootstrap MLN approaches was small. When evaluating the median bias of the estimators (i.e., median of ), this phenomenon rarely occurred. See Appendix C for details.

## 4 Data application

We performed an empirical comparison of the naïve and parametric bootstrap SE estimators in a meta-analysis of clinical indicators of a severe course of COVID-19 [katzenschlager2021can]. One of the primary analyses of Katzenschlager et al. [katzenschlager2021can] compared demographic variables, clinical values, and laboratory values between COVID-19 infected patients who died and those who survived. The original analyses estimated the pooled difference of medians of 26 continuous outcomes across the two groups of patients and reported estimates of to describe between-study heterogeneity.

In our analyses, we applied transformation-based approaches to estimate the pooled difference of means of continuous outcomes between COVID-19 infected patients who died and those who survived. All primary studies reported summary statistics or reported the sample mean, standard deviation, and sample size for continuous outcomes. A detailed description of our data processing is given in Appendix D.

We meta-analyzed all outcomes that had at least 6 primary studies reporting appropriate summary data after performing the data processing. We applied the QE, BC, and MLN approaches with the naïve and bootstrap approaches for estimating the within-study SEs. As used in the simulation study, we used bootstrap samples and used REML to estimate between-study variance. For each outcome, we estimated the pooled difference of means and its 95% CI, between-study variance, and . The data and code used for these analyses are available at https://github.com/stmcg/se-est.

The estimated pooled difference of means of all outcomes based on the MLN method are given in Table 7 and those based on the QE and BC methods are given in Appendix D for parsimony. Figure 2 illustrates the extent to which the values changed when using within-study bootstrap SEs compared to the naïve SEs. The estimates of and the are listed in Appendix D.

For most outcomes, the pooled estimates and their 95% CIs were very similar when using bootstrap estimates of the within-study SEs compared to the naïve estimates. The estimated between-study heterogeneity was nearly always smaller when using bootstrap estimates of the within-study SEs compared to the naïve estimates. The estimates of between-study variance decreased on average by 25% for the QE method, 30% for the BC method, and 13% for the MLN method when using bootstrap SEs compared to when using the naïve SEs. Moreover, the values decreased on average by 15 percentage points for the QE method, 19 percentage points for the BC method, and 8 percentage points for the MLN method when using bootstrap SEs compared to when using the naïve SEs.

The use of bootstrap SEs was most consequential in the analyses of the Interleukin-6 (IL-6) outcome. The estimates of the pooled difference of mean IL-6 (in pg/mL) between patients who died and patients who survived were 47.92 [13.66, 82.18], 41.73 [7.82, 75.64], and 40.64 [7.23, 74.05] when using the QE, BC, and MLN approaches, respectively, with naïve SEs. However, when using bootstrap SEs, the estimates of the pooled difference of mean IL-6 were 4.56 [3.13, 5.99], 4.34 [3.09, 5.60], and 4.77 [3.26, 6.28], respectively. The reason for this large discrepancy is that studies with large differences of mean IL-6 values had considerably smaller weights when using bootstrap SEs (Table 8). Estimation of heterogeneity was also strongly affected when using bootstrap SEs. The estimates of between-study variance were greater than 2,000 for the three transformation-based approaches when using the naïve SEs and were close to 1 when using bootstrap SEs. Similarly, the values were greater than 99% for these three transformation-based approaches when using the naïve SEs and were 34%, 20%, and 45%, respectively, when using bootstrap SEs. These results may be attributed to the study-specific SEs being generally larger when using the bootstrap approach compared to the naïve approach.

## 5 Discussion

In this paper, we illustrated that the standard application of transformation-based approaches in meta-analysis can severely underestimate within-study SEs, which in turn can result in poor inference at the meta-analytic level. We proposed a straightforward application of parametric bootstrap to estimate within-study SEs in this context. Focusing on the recently proposed transformation-based approaches of McGrath et al. [mcgrath2020estimating] and Cai et al. [cai2021estimating], we showed in a simulation study that (i) the bootstrap SE estimators often perform considerably better than the naïve SE estimators and (ii) inference on between-study heterogeneity improved when using the study-specific bootstrap SE estimators. The data application illustrated the extent to which conclusions may change when using the bootstrap SE estimators in real data applications. For instance, if one were to use the naïve SEs in our data application, one may have questioned whether it makes sense to meta-analyze some of the outcomes (e.g., Interleukin-6) due to the extremely large heterogeneity. However, when using the bootstrap SEs in subsequent analyses, we found that the large degree of heterogeneity was likely an artifact of the bias of the naïve SE estimators.

The performance of the pooled mean estimators was not strongly affected by whether the naïve or bootstrap within-study SE estimator was used, which may be expected. The bias of the pooled mean estimators was not strongly affect since the pooled mean estimator is unbiased for any choice of study-specific weights under the assumptions of the meta-analytic model. Moreover, the coverage of the 95% CIs for the pooled mean was not strongly affected since the CIs depend on the total variance of the study-specific means.

While we focused on applying the proposed bootstrap SE estimators to meta-analyze the mean of an outcome of interest and meta-analyze the difference of means an outcome across two groups, the bootstrap SE estimators are applicable in much more general settings. In particular, these approaches can be directly applied for other effect measures (e.g., the standardized difference of means, ratio of means) and in more general meta-analytic settings (e.g., multivariate meta-analysis, network meta-analysis, multi-level meta-analysis, meta-regression). Exploring the impact of using the proposed parametric bootstrap SE estimators in these more complex settings would be an interesting area of further research.

We considered some variations of the parametric bootstrap SE estimators in preliminary analyses. For instance, we considered estimating the standard deviation of the distribution of the bootstrap replicates based on the median absolute deviation (with appropriate scaling to ensure consistency under normality) instead of the sample standard deviation to gain robustness to outliers. Since these modifications did not clearly improve the performance of the estimator in our simulations, we did not include them for parsimony.

The proposed parametric bootstrap SE estimators are of course applicable to other transformation-based approaches, which may be another interesting avenue of further research. We focused on the methods of McGrath et al. [mcgrath2020estimating] and Cai et al. [cai2021estimating] because they have been shown to perform well for estimating the mean and standard deviation when data are skewed, which is often the case when primary studies report medians. For simpler transformation-based approaches whose mean and standard deviation estimators are linear combinations of sample quantiles, the SE can be analytically derived under parametric assumptions (e.g., see [yang2021generalized]).

This study has a few limitations that are important to consider. First, the proposed bootstrap SE estimators may perform poorly when applied to primary studies with small sample sizes (e.g., less than 50). Second, since the proposed bootstrap SE estimators make parametric assumptions, one may expect some degree of model misspecification in practice. While the bootstrap SE estimators often performed reasonably well under model misspecification in our simulations, results may vary in different settings.

Owing to their broad applicability, transformation-based approaches are most commonly applied when a meta-analysis includes primary studies that report the median of the outcome. However, it should be noted that a number of other approaches have been proposed in more case-specific contexts. One line of approaches estimates a pooled median or the difference of medians across two groups [mcgrath2019one, mcgrath2020meta, ozturk2020meta]. Another approach estimates a global location parameter or the difference in global location parameters under a suitable location-scale model [yang2021generalized].

In summary, we recommend using the proposed bootstrap SE estimator when applying transformation-based approaches to meta-analyze studies reporting medians. While our results suggest that the MLN approach may be the preferred transformation-based approach when data are suspected to be non-normal, we encourage data analysts to perform sensitivity analyses evaluating the extent to which conclusions differ when applying other transformation-based approaches. To facilitate their application, we implemented the QE, BC, and MLN approaches with parametric bootstrap SEs in the estmeansd R package available on the Comprehensive R Archive Network (CRAN) [estmeansd] and in the Shiny application available at https://smcgrath.shinyapps.io/estmeansd/.

## Acknowledgements

The authors thank Claudia Denkinger for helping collect the data set used in the data application and Siyu Cai for providing code implementing the MLN approach. The simulations in this work were run on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University. This work was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1745303. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

## Appendix A Additional results for the illustrative example

In this section, we give additional simulation results for the illustrative example in Section 2.3.1.

### a.1 Naïve standard errors when applying other transformation-based approaches

We applied the transformation-based approaches suggested by Hozo et al. [hozo2005estimating], Wan et al. [wan2014estimating], Luo et al. [luo2018optimally], Shi et al. [shi2020estimating], and Yang et al. [yang2021generalized] to the illustrative example. The approach of Yang et al. [yang2021generalized] requires specifying a location-scale distribution for the outcome. We applied the approach of Yang et al. [yang2021generalized] under the assumption of normality, as implemented in the metaBLUE R package [metaBLUE].

Figure 3 gives the distribution of the naïve SE estimates along with the true SEs when applying these transformation-based approaches. As one may expect, the naïve SE estimator underestimated the true SE when applying the approaches of Wan et al. [wan2014estimating], Luo et al. [luo2018optimally], Shi et al. [shi2020estimating], and Yang et al. [yang2021generalized]. However, naïve SE estimator overestimated the true SE when applying the approach of Hozo et al. [hozo2005estimating]. This is simply due to the approach of Hozo et al. [hozo2005estimating] severely overestimating the standard deviation of the underlying distribution. This can be seen more explicitly by noting that the standard deviation of the underlying distribution divided by (i.e., the best-case scenario for this approach) is approximately 1.2, which is smaller than the true SE of the approach.

### a.2 Parametric bootstrap standard errors when applying the QE, BC, and MLN approaches

Figure 4 includes the distributions of bootstrap SE estimates of the QE, BC, and MLN approaches.

## Appendix B Additional results for the study-level simulations

Figure 5 illustrates the probability density functions of the distributions included in the simulations. Tables 9, 10, and 11 give the study-level simulation results for the mean percent error. Tables 12, 13, and 14 give the study-level simulation results for the RMSE.

## Appendix C Additional results for the meta-analysis simulations

Tables 15, 16, and 17 gives the meta-analysis simulation results for the bias, variance, and coverage of the 95% CIs of the pooled mean estimators. Table 18 gives the meta-analysis simulation results for the variance of the estimators. Table 19 gives the median bias (i.e., median of ) of the estimators