Empirical Likelihood Based Summary ROC Curve for Meta-Analysis of Diagnostic Studies

03/11/2018 ∙ by ShengLi Tzeng, et al.

Objectives: This study provides an effective model selection method based on the empirical likelihood approach for constructing summary receiver operating characteristic (sROC) curves from meta-analyses of diagnostic studies. Methods: We considered models from combinations of family indices and specific pairs of transformations, which cover several widely used methods for the bivariate summary of sensitivity and specificity. A final model was then selected using the proposed empirical likelihood method. Simulation scenarios were based on different numbers of studies and different population distributions for the disease and non-disease cases. The performance of our proposal was compared with that of other model selection criteria. Results: Although parametric likelihood-based methods are often applied in practice because of their asymptotic properties, they fail to consistently choose appropriate summary models when the number of studies is limited. In these situations, our proposed method almost always performs better. Conclusion: When the number of studies is as small as 10 or 5, we recommend choosing a summary model via the proposed empirical likelihood method.




1 Introduction

Summarizing information on test performance metrics, such as sensitivity, specificity, and diagnostic odds ratios (DOR), is an important part of a systematic review of a medical test's performance on clinical outcomes. Through a meta-analysis of clinical studies of diagnostic tests, we may investigate hypotheses about test performance that cannot be answered by an individual study.

Sotiriadis et al. (2016) proposed a guideline for systematic reviews of diagnostic test accuracy studies, as a counterpart to the Cochrane Handbook (Higgins and Green, 2011) and PRISMA (Shamseer et al., 2015), which are widely used in general systematic reviews.

Most diagnostic tests are used to separate patients into two groups as test positive and test negative (i.e., T+/T-). The fundamental diagnostic result can be a real value (continuous output), in which case the classification boundary between the two groups must be determined by a threshold value, usually based on a receiver operating characteristic (ROC) curve (Tripepi et al., 2009). Accordingly, there are four possible outcomes from a dichotomized test when a gold standard is available. If the true disease status of a subject is positive (D+), a T+ classification is called a true positive (TP), while a T- result is called a false negative (FN). Conversely, given a true negative disease status of a subject (D-), a T- classification results in a true negative (TN) and a T+ result gets a false positive (FP).

When a gold standard exists, test accuracy is estimated as the proportion of diseased individuals who are “test positive” (sensitivity) and the proportion of non-diseased individuals who are “test negative” (specificity); see Honest and Khan (2002), Irwig et al. (1995), and Altman (2001). For systematic reviews of dichotomous diagnostic studies, the “data” merely consist of the counts nTP, nFN, nFP, and nTN for each study, i.e., the numbers of subjects classified as TP, FN, FP, and TN, respectively. The corresponding results are usually reported based on a particular threshold, as used in clinical practice (Gatsonis and Paliwal, 2006). It is improper to simply sum these four numbers across studies to derive summary estimates of sensitivity, specificity, and DOR, because the summary statistics would then be dominated by the few studies with the largest sample sizes.
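The quantities above can be illustrated with a minimal Python sketch. The `Study` class, the function names, and the two sets of counts are hypothetical and chosen only for illustration; they are not data from any study cited here. The sketch computes sensitivity, specificity, and DOR per study, and shows how summing counts across studies lets the larger study dominate the pooled sensitivity.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """Counts from one dichotomized diagnostic study (hypothetical container)."""
    n_tp: int
    n_fn: int
    n_fp: int
    n_tn: int


def sensitivity(s: Study) -> float:
    # Proportion of diseased subjects classified as test positive.
    return s.n_tp / (s.n_tp + s.n_fn)


def specificity(s: Study) -> float:
    # Proportion of non-diseased subjects classified as test negative.
    return s.n_tn / (s.n_tn + s.n_fp)


def diagnostic_odds_ratio(s: Study) -> float:
    # Odds of T+ among the diseased divided by odds of T+ among the non-diseased.
    return (s.n_tp * s.n_tn) / (s.n_fp * s.n_fn)


# Made-up counts: one small study and one much larger study.
small = Study(n_tp=18, n_fn=2, n_fp=5, n_tn=25)
large = Study(n_tp=450, n_fn=550, n_fp=100, n_tn=900)

# Naive pooling: sum the counts over studies, then compute sensitivity.
pooled_sensitivity = (small.n_tp + large.n_tp) / (
    small.n_tp + small.n_fn + large.n_tp + large.n_fn
)

print(sensitivity(small), sensitivity(large), pooled_sensitivity)
print(specificity(small), diagnostic_odds_ratio(small))
```

With these made-up counts the pooled sensitivity lies very close to that of the large study, even though the small study reports a much higher value.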

Another naive summary is to pool sensitivity and specificity separately using standard meta-analyses for proportions. However, Walter and Jadad (1999) and Moses et al. (1993) showed that sensitivity and specificity are often negatively correlated, usually because studies use different thresholds to define T+ and T-. Even though this “separate summary” method is sometimes recommended (e.g., Chappell et al., 2009, Trikalinos et al., 2012), ignoring the correlation between sensitivity and specificity can result in biased inference or even misleading claims.

Ideally, each study has its own (empirical) ROC curve, and a summary ROC curve provides an overall description of the test performance, as in the problems dealt with in Kester and Buntinx (2000). But as is often the case, many studies report merely a table consisting of nTP, nFN, nFP, and nTN. It is then hard to distinguish the following three sources of uncertainty in the observed sensitivities and specificities across studies: (a) the dispersion within a study due to sampling variability, (b) heterogeneity from varying cutoff values between studies, and (c) different characteristics of the populations of individual studies; refer to Gatsonis and Paliwal (2006), Chappell et al. (2009) and Chu et al. (2010) for detailed discussions. Consequently, it is almost impossible to recover each individual ROC curve from such limited data points without additional assumptions.

Chappell et al. (2009) and Trikalinos et al. (2012) discussed three helpful ways of summarizing medical test studies: “separate summary”, “summary point”, and “summary line”. They also advised on when to use which kind of summary representation. These procedures, however, partly rely on the effectiveness of the bivariate ROC model of Reitsma et al. (2005), which allows for either a “summary point” or a “summary line”. In fact, various models have been proposed in the literature for a meaningful summary “point” or “line” across all studies (e.g., Moses et al., 1993, Reitsma et al., 2005, Rutter and Gatsonis, 2001, and Holling et al., 2012). The models of Reitsma et al. (2005) and Rutter and Gatsonis (2001) have become almost the de facto standard for a “summary point” or a “summary line”, and Harbord et al. (2006) showed their equivalence when no covariates are included. If more studies are available (usually more than 30), some sophisticated extensions attempt to incorporate other sources of heterogeneity, such as disease prevalence (Chu et al., 2009), latent subgroups (Schlattmann et al., 2015) or measurement errors (Guolo, 2017).

In view of so many alternative methods, a natural and important question is how to select a suitable model. Recently, Doebler et al. (2012) tried to integrate a wide range of models into a unified parametric linear mixed model framework after transformation, within which maximum likelihood can be used to estimate model parameters. It covers Chu et al. (2010), Reitsma et al. (2005), Rutter and Gatsonis (2001), and Holling et al. (2012) as special cases. Furthermore, some likelihood-based criteria can be used for selecting a “best” model, among which the Akaike information criterion (AIC), with its nice asymptotic properties, is the most often used; see Burnham and Anderson (2003) for more details. When a set of candidate models is considered, we may choose the model with the smallest AIC value, and then make statistical inference based on it.

Nevertheless, in practice such meta-analyses are more often than not restricted to a small sample size, where the sample size is the number of studies under consideration. For the same medical test design, there are usually not many compatible studies, and hence the number of data points is too limited for the relevant asymptotic theories to apply. Vaida and Blanchard (2005), Liang et al. (2008), and Greven and Kneib (2010) proposed conditional AIC (cAIC) in linear mixed models, which is a model selection method tailored to small samples. In addition, an empirical likelihood (EL) method analogous to Owen (1990) can also be used for small samples in practice. In this study, we focus on the issue of model selection for “summary line” situations. The key questions in this article are: (a) whether AIC gives an acceptable result, (b) which selection criterion (e.g., AIC, cAIC, or EL) performs better, and (c) whether there exists a criterion that performs satisfactorily under various situations, especially when the number of studies is small.

The rest of the article is organized as follows. Section 2 first reviews several commonly used models and then describes some existing model selection criteria, followed by our proposed criterion. The effectiveness of our proposal is shown in Section 3 through simulations comparing it with other criteria. An example of its application to colorectal cancer detection is given in Section 4. Finally, we conclude with Section 5.

2 Method

2.1 Two families of models and their special cases

This section briefly describes the two model families under transformation as illustrated in Doebler et al. (2012) and the relations with other approaches. The details of models can be found in the original literature.

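Doebler et al. (2012) introduced a class of monotonic transformation functions $t_\alpha$, controlled by a parameter $\alpha \in [0, 2]$ and given for $x \in (0, 1)$ as

$$t_\alpha(x) = \alpha \log(x) - (2 - \alpha) \log(1 - x).$$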

Let and be the unobserved true sensitivity and false positive rate (1-specificity) for the -th study, respectively. With a pair of transformation parameters , the two transformed variables

are then assumed to follow a bivariate normal distribution with mean

and covariance matrix


Following Doebler et al. (2012), we consider two families of models by setting and in (1) to different values. The first family of models uses fixed

while the second family of models takes study heteroscedasticity into account with


, i.e., estimated variances of sensitivity and specificity for individual studies.

Doebler et al. (2012) pointed out that

respectively corresponds to logit transformation with

and log transformation with . Moreover, is also approximately proportional to the complementary logarithmic function when is around 0.6. On the other hand, can be regarded as and complementary if and , respectively.
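As a numerical illustration of the transformation family, the following minimal Python sketch assumes the form $t_\alpha(x) = \alpha \log(x) - (2 - \alpha)\log(1 - x)$ with $\alpha \in [0, 2]$, as given in Doebler et al. (2012). The function names are hypothetical, and the inverse is obtained by simple bisection, which is an implementation choice of this sketch rather than anything prescribed by the source.

```python
import math


def t_alpha(x: float, alpha: float) -> float:
    """Assumed transformation: alpha*log(x) - (2 - alpha)*log(1 - x)."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    if not 0.0 <= alpha <= 2.0:
        raise ValueError("alpha must lie in [0, 2]")
    return alpha * math.log(x) - (2.0 - alpha) * math.log(1.0 - x)


def t_alpha_inverse(y: float, alpha: float, iters: int = 200) -> float:
    """Invert t_alpha by bisection; t_alpha is strictly increasing in x.

    Assumes y lies in the range of t_alpha on the bracketing interval.
    """
    lo, hi = 1e-15, 1.0 - 1e-15
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if t_alpha(mid, alpha) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


x = 0.3
logit = math.log(x / (1.0 - x))
print(t_alpha(x, 1.0), logit)              # alpha = 1 reproduces the logit
print(t_alpha(x, 2.0), 2.0 * math.log(x))  # alpha = 2 is a multiple of log(x)
print(t_alpha_inverse(t_alpha(x, 0.6), 0.6))
```

Under this assumed form, the transformation at $\alpha = 1$ coincides with the logit, and at $\alpha = 2$ it is twice the logarithm.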

When , the first family of models corresponds to the Lehmann family or proportional hazard models of Holling et al. (2012). If and are assumed to follow a bivariate normal distribution, the first family of models with coincides with the summary ROC method of Moses et al. (1993) but based on different parameterizations. Hereafter, the model corresponding to Moses et al. (1993) is called MSL method.

When , the second family of models is equivalent to bivariate models of Reitsma et al. (2005) and the hierarchical summary ROC method (HSROC) of Rutter and Gatsonis (2001). Furthermore, the second family of models approaches the complementary logarithmic models of Chu et al. (2010) when . Apart from , there are five common parameters involved in each family. To estimate model parameters, maximum likelihood (ML) or restricted maximum likelihood (REML) methods can be used. For a larger , can also be estimated as additional parameters.

2.2 Existing model selection criteria and our proposal

The work of Doebler et al. (2012) reviewed in the previous subsection generalized several widely used models. They also showed that, when enough studies are available, it is possible to recover the transformation parameters by treating them as free parameters. Nevertheless, they admitted that these parameters are hard to estimate from a limited number of studies, and they suggested that the two quantities be fixed when the number of studies is small. In practice, treating the transformation parameters as fixed does not make the problem easier, nor does treating them as free make it harder; an analyst still needs a way to determine suitable values of them for the transformation. Additionally, although they proposed two useful families of models, little is known about how to select among them, especially when the sample size is small. Therefore, a good model selection strategy is important.

Model selection can be viewed as a selection of both the model assumptions and the estimated parameters, which amounts to a choice of the underlying probabilistic mechanism. Model selection (or variable selection) in linear regression and generalized linear models has been studied extensively (Burnham and Anderson, 2003, Claeskens et al., 2008, and Fan and Lv, 2010). Unfortunately, the models considered here have no variables to be selected, and the key structure, the covariance matrix, is heavily affected by the transformation parameters. Selection of the covariance structure in a linear mixed model is still a very open research area. Moreover, the two families in Doebler et al. (2012) are linear mixed models only after transformation, which raises another challenge. Therefore, the results would be doubtful if one directly applied the existing model selection methods to select among the two families.

In what follows, a “model” indicates a triplet consisting of the two transformation parameters and an index of family (i.e., 1st or 2nd), so the transformation parameters are no longer free parameters. We shall consider several model selection criteria and compare their performance based on simulations. The first one is AIC (Akaike, 1998), which was also inspected in Doebler et al. (2012). Let $\ell_M(\theta)$ be the log-likelihood for a model $M$ and $\hat{\theta}_M$ be the corresponding estimates of the model parameters. Then AIC is defined as $-2\ell_M(\hat{\theta}_M) + 2k_M$, with $k_M$ being the number of parameters in the model $M$. Since each model has 5 parameters, selecting the minimum-AIC model amounts to choosing the model having the largest $\ell_M(\hat{\theta}_M)$, where $\hat{\theta}_M$ is obtained by ML or REML. Note that AIC has been shown to have nice asymptotic properties for model selection (Burnham and Anderson, 2003), but the focus in this work is the small-sample problem. Although the criterion of Cavanaugh (1997) is a corrected version of AIC whose penalty depends on the sample size rather than on the number of parameters alone, it selects the same model as AIC among the models considered here.
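The equivalence between minimizing AIC and maximizing the log-likelihood when all candidates have the same number of parameters can be checked with a short Python sketch. It uses the standard definition of AIC (minus twice the maximized log-likelihood plus twice the number of parameters). The model names and log-likelihood values are made up for illustration and do not come from the paper.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: -2 * log-likelihood + 2 * n_params."""
    return -2.0 * log_likelihood + 2.0 * n_params


# Hypothetical maximized log-likelihoods for three candidate models.
max_log_likelihoods = {"model_a": -12.4, "model_b": -10.9, "model_c": -11.7}
N_PARAMS = 5  # every candidate model has five parameters, as stated in the text

aic_values = {name: aic(ll, N_PARAMS) for name, ll in max_log_likelihoods.items()}

best_by_aic = min(aic_values, key=aic_values.get)
best_by_loglik = max(max_log_likelihoods, key=max_log_likelihoods.get)

print(aic_values)
print(best_by_aic, best_by_loglik)
```

Because the penalty term is the same constant for every candidate, the ordering by AIC is exactly the reverse of the ordering by log-likelihood.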

The second and third one follow the conditional AIC studied in Vaida and Blanchard (2005), Liang et al. (2008), and Greven and Kneib (2010). Let ;

be the vector of observed sensitivities and 1-specificities. For a specific model

, define , where

and . Thus is the empirical best linear unbiased predictor of based on the model . Then the conditional AIC (cAIC) is defined as + penalty, where with being the log-likelihood for a model evaluated at and the observations replaced by , and the penalty in cAIC was discussed in Vaida and Blanchard (2005), Liang et al. (2008), and Greven and Kneib (2010). In particular, Vaida and Blanchard (2005) assumed to be known, while Liang et al. (2008) and Greven and Kneib (2010) took the uncertainty of estimation into consideration. The major difference between Liang et al. (2008) and Greven and Kneib (2010) is that the former calculated the penalty approximately, while the latter provided an exact method. In our simulation studies, we will compare the performance of Vaida and Blanchard (2005) and Greven and Kneib (2010), and refer to them as cAIC-VB and cAIC-GK, respectively. Also note that a model in the first family does not consider the random effect, hence cAIC reduces to AIC in this case.

The fourth criterion is EL approach (Owen, 1990), which was primarily a method for constructing a confidence region for mean parameters, and Baggerly (1998) pointed out its connection to goodness-of-fit measures. Denote for an arbitrary , and hence is the (back-transformed) mean of summarized sensitivity and 1-specificity for a model . The empirical likelihood for a model having as mean parameters is given by


under the constraints , , and . The empirical likelihood for the saturated model is

under the constraints , and . To assess the hypothesis that is the mean of independent data , we should first find the weight of each datum with (2). Then,

can be obtained, where is a constant and

has a chi-square limiting distribution with a degree of freedom equal to the rank of

(Owen, 1990). Thus, a larger value of indicates a model’s deficiency. We refer to this empirical likelihood method as EL-fix.
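To make the mechanics of the empirical likelihood ratio concrete, here is a minimal Python sketch for the simplest case: a one-dimensional mean. This is a deliberate simplification, since the criterion described above constrains a bivariate mean. The function name and the data values are hypothetical. The sketch uses the standard Lagrange-multiplier form of the weights, $w_i = 1 / \{n (1 + \lambda (x_i - \mu))\}$, and finds $\lambda$ by bisection.

```python
import math


def el_statistic(data, mu, iters=200):
    """Return (-2 log empirical likelihood ratio, weights) for a scalar mean mu.

    The weights maximize prod(n * w_i) subject to sum(w_i) = 1 and
    sum(w_i * (x_i - mu)) = 0. If mu lies outside the range of the data,
    no valid weights exist and the statistic is infinite.
    """
    n = len(data)
    d = [x - mu for x in data]
    d_min, d_max = min(d), max(d)
    if not d_min < 0.0 < d_max:
        return math.inf, []

    def score(lam):
        # Strictly decreasing in lam on the interval where all weights are positive.
        return sum(di / (1.0 + lam * di) for di in d)

    lo, hi = -1.0 / d_max, -1.0 / d_min  # open interval of admissible multipliers
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)

    weights = [1.0 / (n * (1.0 + lam * di)) for di in d]
    statistic = 2.0 * sum(math.log(1.0 + lam * di) for di in d)
    return statistic, weights


observations = [0.2, 0.5, 0.7, 0.9, 1.1]  # made-up data
stat_at_mean, _ = el_statistic(observations, sum(observations) / len(observations))
stat_off_mean, w = el_statistic(observations, 0.5)
print(stat_at_mean, stat_off_mean)
```

At the sample mean the weights are all equal and the statistic is essentially zero; it grows as the hypothesized mean moves away from the data, which is why a larger value signals a poorer fit.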

Note that EL-fix cannot differentiate the covariance structure with merely . We propose a simple modification as the fifth criterion in the following. Based on an idea similar to aforementioned cAIC, we incorporate the information of ; , into the above empirical likelihood method. Specifically, for calculating (2) the original constraint remains the same for the first family, but we simply replace it with for a model in the second family. Our modification is referred to as EL-blup. We will show the effectiveness of our proposed method through simulation studies.

3 Simulation Studies

3.1 Setup

We shall compare the criteria described in the last section via simulation studies. It would not be fair to generate the simulated data from any of the meta-analysis models in Section 2.1. Generating data from a certain model would favor a specific approach, e.g., when nTP and nFP come from a bivariate binomial distribution, or when logit(sensitivity) and logit(1-specificity) come from a bivariate normal distribution. Instead, we imitate a typical data collection process, and set up simulations similar to the common demonstrations of methodologies for the ROC curve of a single study, as in Ren et al. (2004), Du and Tang (2009), or Rufibach (2012).

We consider meta-analyses of 5 or 10 primary studies of the diagnostic test, and for simplicity, candidate models are restricted to a finite set of transformation-parameter pairs in combination with the two model families. Therefore, there are 50 candidate models under this setting. To generate data for each study, the primary test values of the non-diseased participants are drawn independently and identically from one distribution, and the values of the diseased participants from another, where the two group sizes are integers sampled from Poisson distributions with means 160 and 40, respectively. Then a threshold is determined by maximizing Youden’s index (Youden, 1950) over the participants, and we obtain the corresponding observed sensitivity and 1-specificity. Based on these observations from all studies, the standard estimation procedure for a model (a triplet of the two transformation parameters and the index of family) is applied, and the competing model selection criteria introduced in Section 2.2 are evaluated. For each criterion, a “best” model is chosen, i.e., the model with the smallest value of the criterion among all the candidate models.
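The data-generating step for a single primary study can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses normal distributions with parameters (0, 1) and (1.5, 1.2) for the two groups, takes larger test values to indicate disease, and assigns the Poisson mean of 160 to the non-diseased group and 40 to the diseased group. The function names and the seed are hypothetical.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def simulate_study(rng, mean_n0=160, mean_n1=40):
    """Simulate one study and dichotomize it at the Youden-optimal cutoff."""
    n0 = max(1, sample_poisson(rng, mean_n0))  # non-diseased group size
    n1 = max(1, sample_poisson(rng, mean_n1))  # diseased group size
    non_diseased = [rng.gauss(0.0, 1.0) for _ in range(n0)]
    diseased = [rng.gauss(1.5, 1.2) for _ in range(n1)]

    best = None
    # Try every observed value as a cutoff; a subject is T+ if value >= cutoff.
    for cutoff in sorted(non_diseased + diseased):
        n_tp = sum(value >= cutoff for value in diseased)
        n_fp = sum(value >= cutoff for value in non_diseased)
        youden = n_tp / n1 - n_fp / n0  # sensitivity + specificity - 1
        if best is None or youden > best["youden"]:
            best = {"cutoff": cutoff, "youden": youden,
                    "n_tp": n_tp, "n_fn": n1 - n_tp,
                    "n_fp": n_fp, "n_tn": n0 - n_fp}
    best["sensitivity"] = best["n_tp"] / n1
    best["fpr"] = best["n_fp"] / n0
    return best


study = simulate_study(random.Random(2018))
print(study)
```

Repeating `simulate_study` once per primary study yields the pairs of observed sensitivity and 1-specificity to which a candidate model would then be fitted.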

We assess a selection criterion as follows. Consider the theoretical ROC curve in ROC space and the corresponding area under the curve (AUC). For each model, the estimated sROC curve and the estimated AUC are compared with their theoretical counterparts based on the following four measures,

  1. RMSE: root mean squared error of the estimated AUC,

  2. rank1: the ascending ranking of RMSE among the 50 candidate models,

  3. MIAE: the mean integrated absolute error between the estimated and the theoretical ROC curves,

  4. rank2: the ascending ranking of MIAE among the 50 candidate models.

The simulation experiment is replicated 500 times for each combination of the number of studies and the two population distributions, where the non-diseased and diseased populations are considered under the following four scenarios:

  • LD: logistic distribution with location and scale parameters (0,1) for the non-diseased and (1.8,1.2) for the diseased,

  • ND: normal distribution with mean and standard deviation parameters (0,1) for the non-diseased and (1.5,1.2) for the diseased,

  • SND: skew normal distribution with location, scale and shape parameters (0,1,1) for the non-diseased and (0.25,2,5) for the diseased,

  • TND: truncated normal distribution with mean and standard deviation parameters (0,1) for the non-diseased and (1,1.25) for the diseased, where the truncated minimum and maximum are a standard deviation from the mean.

ND is the most popular choice; in addition, LD is a heavy-tailed distribution, TND is a short-tailed distribution, and SND is asymmetric. These distributions are used to generate the participants' test values and to test the performance of the various model selection criteria.
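For the normal scenario, the theoretical ROC curve and its AUC are available in closed form, which makes the assessment measures easy to illustrate. The Python sketch below assumes that larger test values indicate disease and uses the (0, 1) and (1.5, 1.2) parameters listed above. The function names, the trapezoidal grid, and the “estimated” curve are hypothetical. It computes the integrated absolute error for one fitted curve; MIAE would be the average of this quantity over replications.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def binormal_roc(fpr, mean0=0.0, sd0=1.0, mean1=1.5, sd1=1.2):
    """Theoretical sensitivity at a given false positive rate."""
    if fpr <= 0.0:
        return 0.0
    if fpr >= 1.0:
        return 1.0
    # Cutoff whose exceedance probability among the non-diseased equals fpr.
    cutoff = mean0 + sd0 * STANDARD_NORMAL.inv_cdf(1.0 - fpr)
    return 1.0 - NormalDist(mean1, sd1).cdf(cutoff)


def integrate(func, n_steps=2000):
    """Trapezoidal rule on [0, 1]."""
    h = 1.0 / n_steps
    total = 0.5 * (func(0.0) + func(1.0))
    total += sum(func(i * h) for i in range(1, n_steps))
    return total * h


def integrated_abs_error(curve_a, curve_b, n_steps=2000):
    """Integrated absolute difference between two ROC curves."""
    return integrate(lambda u: abs(curve_a(u) - curve_b(u)), n_steps)


def hypothetical_estimate(fpr):
    # Stand-in for a fitted sROC curve; not produced by any real model fit.
    return binormal_roc(fpr, mean1=1.3)


true_auc = integrate(binormal_roc)
closed_form_auc = STANDARD_NORMAL.cdf(1.5 / math.sqrt(1.0 ** 2 + 1.2 ** 2))
iae = integrated_abs_error(binormal_roc, hypothetical_estimate)
print(true_auc, closed_form_auc, iae)
```

The numerical AUC can be checked against the binormal closed form, the standard normal distribution function evaluated at the mean difference divided by the square root of the sum of the two variances.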

3.2 Results

Since Doebler et al. (2012) concluded that the ML estimator of the covariance is always biased, we report only results based on REML estimators; results based on ML estimators lead to the same conclusion. As reference points for the lower and upper bounds of performance, we also calculate the four assessment measures for “AIC-noJ” and “BEST”, where AIC-noJ is a criterion similar to AIC but omitting the Jacobian of the transformation, and BEST picks, for each replication, the model having the smallest integrated absolute error among the 50 candidate models. As expected, AIC-noJ performs the worst while BEST is superior to the others. Also note that AUC values can be close even if the shapes of two ROC curves are very different, and thus the comparison of curves is more meaningful for the “summary line” situation.

Tables 1 and 2 summarize the performance of the different criteria, one table for each number of studies considered. Our proposed method, EL-blup, is always among the best two criteria for RMSE, and outperforms the others for MIAE. In contrast, AIC tends to choose worse models in many scenarios, and cAIC corrects this to some extent. Note that a random selection mechanism would result in rankings of RMSE and MIAE with an average value of 25.5. In practice, we have little knowledge about the underlying true distributions of the non-diseased and diseased populations, so a stable and robust method is critical. We notice that our proposed method, EL-blup, is the only criterion steadily beating a random selection.

| Distributions | Measure | AIC-noJ | AIC | cAIC-VB | cAIC-GK | EL-fix | EL-blup | BEST |
|---|---|---|---|---|---|---|---|---|
| LD(0,1) vs. LD(1.8,1.2) | RMSE | 10.98% (0.12%) | 4.18% (0.19%) | 6.63% (0.21%) | 6.98% (0.15%) | 8.44% (0.17%) | 5.57% (0.18%) | 3.89% (0.13%) |
| | Rank1 | 47.98 (0.24) | 16.27 (0.94) | 25.73 (1.04) | 29.09 (0.5) | 37.28 (0.77) | 21.48 (0.73) | 15.43 (0.63) |
| | MIAE | 11.04% (0.11%) | 8.18% (0.17%) | 8.52% (0.17%) | 8.13% (0.18%) | 9.24% (0.16%) | 7.71% (0.2%) | 5.12% (0.14%) |
| | Rank2 | 44.50 (0.61) | 26.25 (0.81) | 29.14 (0.76) | 25.08 (0.64) | 34.04 (0.86) | 20.99 (0.82) | 1.00 (0) |
| ND(0,1) vs. ND(1.5,1.2) | RMSE | 7.55% (0.09%) | 4.07% (0.23%) | 4.7% (0.16%) | 4.85% (0.13%) | 5.99% (0.12%) | 4.13% (0.14%) | 2.95% (0.11%) |
| | Rank1 | 46.12 (0.45) | 21.00 (1.02) | 25.14 (1.01) | 28.04 (0.68) | 36.92 (0.75) | 22.91 (0.75) | 17.14 (0.58) |
| | MIAE | 7.67% (0.09%) | 7.51% (0.22%) | 7.31% (0.21%) | 6.65% (0.2%) | 6.98% (0.16%) | 6.24% (0.22%) | 3.85% (0.11%) |
| | Rank2 | 40.31 (0.81) | 32.96 (0.8) | 32.58 (0.79) | 22.16 (0.77) | 32.09 (0.88) | 20.27 (0.85) | 1.00 (0) |
| SND(0,1,1) vs. SND(0.25,2,5) | RMSE | 7.78% (0.12%) | 4.46% (0.2%) | 4.93% (0.16%) | 4.83% (0.14%) | 6.22% (0.14%) | 4.33% (0.15%) | 3.63% (0.11%) |
| | Rank1 | 47.22 (0.46) | 24.98 (0.97) | 27.51 (0.95) | 27.34 (0.69) | 38.02 (0.72) | 23.33 (0.69) | 21.66 (0.56) |
| | MIAE | 8.21% (0.12%) | 7.81% (0.17%) | 7.64% (0.16%) | 6.57% (0.16%) | 6.68% (0.13%) | 6.34% (0.2%) | 4.47% (0.11%) |
| | Rank2 | 37.01 (0.82) | 35.94 (0.86) | 31.69 (0.91) | 18.70 (0.76) | 25.69 (0.92) | 16.03 (0.83) | 1.00 (0) |
| TND(0,1) vs. TND(1,1.25) | RMSE | 7.21% (0.14%) | 6.35% (0.29%) | 5.56% (0.23%) | 5.11% (0.17%) | 6.03% (0.15%) | 4.78% (0.19%) | 3.48% (0.13%) |
| | Rank1 | 40.78 (0.73) | 34.61 (0.95) | 28.15 (0.88) | 26.45 (0.86) | 29.94 (0.86) | 23.58 (0.85) | 19.97 (0.66) |
| | MIAE | 7.39% (0.16%) | 9.39% (0.24%) | 8.27% (0.23%) | 7.04% (0.26%) | 6.71% (0.2%) | 6.36% (0.17%) | 4.36% (0.16%) |
| | Rank2 | 28.17 (0.87) | 34.86 (0.94) | 29.59 (0.99) | 19.92 (0.91) | 18.06 (0.8) | 17.24 (0.84) | 1.00 (0) |

Table 1: Comparison between several model selection criteria for four different population distributions (the values in parentheses are the corresponding standard errors).

| Distributions | Measure | AIC-noJ | AIC | cAIC-VB | cAIC-GK | EL-fix | EL-blup | BEST |
|---|---|---|---|---|---|---|---|---|
| LD(0,1) vs. LD(1.8,1.2) | RMSE | 11.09% (0.08%) | 4.6% (0.19%) | 5.88% (0.18%) | 6.04% (0.17%) | 8.38% (0.13%) | 5.56% (0.17%) | 3.00% (0.09%) |
| | Rank1 | 47.63 (0.17) | 16.80 (0.85) | 22.10 (0.87) | 23.89 (0.82) | 36.68 (0.61) | 19.83 (0.7) | 12.02 (0.44) |
| | MIAE | 11.1% (0.08%) | 6.83% (0.14%) | 7.20% (0.12%) | 6.94% (0.11%) | 8.63% (0.13%) | 6.81% (0.16%) | 4.04% (0.09%) |
| | Rank2 | 46.90 (0.23) | 22.50 (0.75) | 24.90 (0.69) | 24.06 (0.54) | 34.20 (0.73) | 20.58 (0.8) | 1.00 (0) |
| ND(0,1) vs. ND(1.5,1.2) | RMSE | 7.80% (0.07%) | 3.66% (0.12%) | 4.44% (0.14%) | 4.53% (0.11%) | 5.73% (0.09%) | 4.13% (0.14%) | 2.36% (0.06%) |
| | Rank1 | 47.50 (0.27) | 21.96 (0.88) | 25.23 (0.96) | 26.86 (0.6) | 35.82 (0.64) | 23.94 (0.76) | 15.76 (0.41) |
| | MIAE | 7.79% (0.07%) | 6.26% (0.14%) | 6.03% (0.12%) | 5.36% (0.13%) | 6.11% (0.13%) | 5.29% (0.18%) | 3.10% (0.06%) |
| | Rank2 | 45.38 (0.42) | 30.92 (0.78) | 29.40 (0.72) | 18.96 (0.76) | 30.38 (0.75) | 18.76 (0.88) | 1.00 (0) |
| SND(0,1,1) vs. SND(0.25,2,5) | RMSE | 7.85% (0.08%) | 4.33% (0.14%) | 4.66% (0.13%) | 4.57% (0.11%) | 6.04% (0.1%) | 4.07% (0.13%) | 3.21% (0.07%) |
| | Rank1 | 45.78 (0.19) | 24.40 (0.86) | 27.20 (0.78) | 26.89 (0.56) | 37.38 (0.52) | 22.91 (0.74) | 21.01 (0.41) |
| | MIAE | 7.83% (0.08%) | 6.75% (0.13%) | 6.09% (0.09%) | 5.28% (0.09%) | 6.14% (0.10%) | 5.24% (0.11%) | 3.98% (0.06%) |
| | Rank2 | 41.86 (0.5) | 27.62 (0.98) | 23.55 (0.81) | 14.36 (0.78) | 25.71 (0.85) | 13.14 (0.7) | 1.00 (0) |
| TND(0,1) vs. TND(1,1.25) | RMSE | 7.1% (0.1%) | 4.72% (0.2%) | 4.97% (0.13%) | 4.66% (0.14%) | 5.73% (0.12%) | 4.57% (0.15%) | 2.9% (0.06%) |
| | Rank1 | 45.04 (0.39) | 27.36 (0.9) | 27.51 (0.78) | 26.87 (0.86) | 36.28 (0.71) | 22.43 (0.94) | 19.59 (0.52) |
| | MIAE | 7.46% (0.09%) | 7.13% (0.19%) | 6.35% (0.20%) | 6.10% (0.13%) | 5.88% (0.12%) | 5.84% (0.11%) | 3.84% (0.07%) |
| | Rank2 | 33.12 (0.63) | 27.74 (1.01) | 21.61 (0.88) | 20.75 (0.84) | 19.66 (0.83) | 18.44 (0.78) | 1.00 (0) |

Table 2: Comparison between several model selection criteria for four different population distributions (the values in parentheses are the corresponding standard errors).

4 Example in real applications

In this section, we demonstrate the various methods on a dataset provided in Zhou et al. (2013), who evaluated the ability of microRNAs to detect colorectal cancer. There were only 13 studies, and the authors applied the MSL method and pooled these studies together in their meta-analysis. Here, we conjecture that the precision of the diagnostic tools may keep improving with time. Thus, the 13 studies are separated into two groups by publication year in our analysis (an earlier and a later period). Results are shown in Figure 1.

Even though the shapes of the sROC curves are very different, the AUC values are similar. In the first group, HSROC and EL-blup gave almost identical results, while the model selected by AIC, MSL, and the Lehmann model gave slightly broader confidence and prediction regions. In the second group, the model selected by AIC produced a curve close to that of the HSROC model. However, both failed to capture the characteristics of the data points, because they gave confidence and prediction regions for sensitivity and 1-specificity that were too narrow, and both estimated a very strong negative correlation between them. In contrast, the EL-blup, MSL, and Lehmann methods gave more reasonable confidence and prediction regions. Results from the EL-blup method somewhat support our conjecture, though the difference in precision between the two groups is not significant.

Figure 1: Summary ROC curves based on different methods for the two time periods. Left panel: first period; right panel: second period. Values in parentheses are the corresponding AUC values.

5 Conclusion

The model selection problem for meta-analyses of diagnostic studies can be very difficult, not only because of the small sample size, but also due to the probabilistic mechanism of models not perfectly coinciding with the data collection process. We can almost conclude that there is no true model. The common criteria based on asymptotic theories do not have acceptable performance in such challenging cases. Our method can provide a more credible inference as demonstrated in the simulation studies and the real data example, even though we do not know the underlying distributions.


  • Akaike (1998) Akaike, H. (1998). Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer.
  • Altman (2001) Altman, D. G. (2001). Systematic reviews in health care: Systematic reviews of evaluations of prognostic variables. BMJ, 323(7306):224.
  • Baggerly (1998) Baggerly, K. A. (1998). Empirical likelihood as a goodness-of-fit measure. Biometrika, 85(3):535–547.
  • Burnham and Anderson (2003) Burnham, K. P. and Anderson, D. R. (2003). Model selection and multimodel inference: a practical information-theoretic approach. Springer Science & Business Media.
  • Cavanaugh (1997) Cavanaugh, J. E. (1997). Unifying the derivations for the akaike and corrected akaike information criteria. Stat Probab Lett, 33(2):201–208.
  • Chappell et al. (2009) Chappell, F., Raab, G., and Wardlaw, J. (2009). When are summary roc curves appropriate for diagnostic meta-analyses? Stat Med, 28(21):2653–2668.
  • Chu et al. (2010) Chu, H., Guo, H., and Zhou, Y. (2010). Bivariate random effects meta-analysis of diagnostic studies using generalized linear mixed models. Med Decis Making, 30(4):499–508.
  • Chu et al. (2009) Chu, H., Nie, L., Cole, S. R., and Poole, C. (2009). Meta-analysis of diagnostic accuracy studies accounting for disease prevalence: alternative parameterizations and model selection. Stat Med, 28(18):2384–2399.
  • Claeskens et al. (2008) Claeskens, G., Hjort, N. L., et al. (2008). Model selection and model averaging, volume 330. Cambridge University Press Cambridge.
  • Doebler et al. (2012) Doebler, P., Holling, H., and Böhning, D. (2012). A mixed model approach to meta-analysis of diagnostic studies with binary test outcome. Psychol Methods, 17(3):418.
  • Du and Tang (2009) Du, P. and Tang, L. (2009). Transformation-invariant and nonparametric monotone smooth estimation of roc curves. Stat Med, 28(2):349–359.
  • Fan and Lv (2010) Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Stat Sin, 20(1):101.
  • Gatsonis and Paliwal (2006) Gatsonis, C. and Paliwal, P. (2006). Meta-analysis of diagnostic and screening test accuracy evaluations: methodologic primer. Am J Roentgenol, 187(2):271–281.
  • Greven and Kneib (2010) Greven, S. and Kneib, T. (2010). On the behaviour of marginal and conditional aic in linear mixed models. Biometrika, 97(4):773–789.
  • Guolo (2017) Guolo, A. (2017). A double simex approach for bivariate random-effects meta-analysis of diagnostic accuracy studies. BMC Med Res Methodol, 17(1):6.
  • Harbord et al. (2006) Harbord, R. M., Deeks, J. J., Egger, M., Whiting, P., and Sterne, J. A. (2006). A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics, 8(2):239–251.
  • Higgins and Green (2011) Higgins, J. P. and Green, S. (2011). Cochrane handbook for systematic reviews of interventions, volume 4. John Wiley & Sons.
  • Holling et al. (2012) Holling, H., Böhning, W., and Böhning, D. (2012). Likelihood-based clustering of meta-analytic sroc curves. Psychometrika, 77(1):106–126.
  • Honest and Khan (2002) Honest, H. and Khan, K. S. (2002). Reporting of measures of accuracy in systematic reviews of diagnostic literature. BMC Health Serv Res, 2(1):4.
  • Irwig et al. (1995) Irwig, L., Macaskill, P., Glasziou, P., and Fahey, M. (1995). Meta-analytic methods for diagnostic test accuracy. J Clin Epidemiol, 48(1):119–130.
  • Kester and Buntinx (2000) Kester, A. D. and Buntinx, F. (2000). Meta-analysis of roc curves. Med Decis Making, 20(4):430–439.
  • Liang et al. (2008) Liang, H., Wu, H., and Zou, G. (2008). A note on conditional aic for linear mixed-effects models. Biometrika, 95(3):773–778.
  • Moses et al. (1993) Moses, L. E., Shapiro, D., and Littenberg, B. (1993). Combining independent studies of a diagnostic test into a summary roc curve: data-analytic approaches and some additional considerations. Stat Med, 12(14):1293–1316.
  • Owen (1990) Owen, A. (1990). Empirical likelihood ratio confidence regions. Ann Stat, 18(1):90–120.
  • Reitsma et al. (2005) Reitsma, J. B., Glas, A. S., Rutjes, A. W., Scholten, R. J., Bossuyt, P. M., and Zwinderman, A. H. (2005). Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol, 58(10):982–990.
  • Ren et al. (2004) Ren, H., Zhou, X.-H., and Liang, H. (2004). A flexible method for estimating the roc curve. J Appl Stat, 31(7):773–784.
  • Rufibach (2012) Rufibach, K. (2012). A smooth roc curve estimator based on log-concave density estimates. Int J Biostat, 8(1).
  • Rutter and Gatsonis (2001) Rutter, C. M. and Gatsonis, C. A. (2001). A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med, 20(19):2865–2884.
  • Schlattmann et al. (2015) Schlattmann, P., Verba, M., Dewey, M., and Walther, M. (2015). Mixture models in diagnostic meta-analyses--clustering summary receiver operating characteristic curves accounted for heterogeneity and correlation. J Clin Epidemiol, 68(1):61–72.
  • Shamseer et al. (2015) Shamseer, L., Moher, D., Clarke, M., Ghersi, D., Liberati, A., Petticrew, M., Shekelle, P., and Stewart, L. A. (2015). Preferred reporting items for systematic review and meta-analysis protocols (prisma-p) 2015: elaboration and explanation. BMJ, 349:g7647.
  • Sotiriadis et al. (2016) Sotiriadis, A., Papatheodorou, S., and Martins, W. (2016). Synthesizing evidence from diagnostic accuracy tests: the sedate guideline. Ultrasound Obstet Gynecol, 47(3):386–395.
  • Trikalinos et al. (2012) Trikalinos, T. A., Balion, C. M., Coleman, C. I., Griffith, L., Santaguida, P. L., Vandermeer, B., and Fu, R. (2012). Meta-analysis of test performance when there is a “gold standard”. J Gen Intern Med, 27(1):56–66.
  • Tripepi et al. (2009) Tripepi, G., Jager, K. J., Dekker, F. W., and Zoccali, C. (2009). Diagnostic methods 2: receiver operating characteristic (roc) curves. Kidney Int., 76(3):252–256.
  • Vaida and Blanchard (2005) Vaida, F. and Blanchard, S. (2005). Conditional akaike information for mixed-effects models. Biometrika, 92(2):351–370.
  • Walter and Jadad (1999) Walter, S. and Jadad, A. (1999). Meta-analysis of screening data: a survey of the literature. Stat Med, 18(24):3409–3424.
  • Youden (1950) Youden, W. J. (1950). Index for rating diagnostic tests. Cancer, 3(1):32–35.
  • Zhou et al. (2013) Zhou, X.-J., Dong, Z.-G., Yang, Y.-M., Du, L.-T., Zhang, X., and Wang, C.-X. (2013). Limited diagnostic value of micrornas for detecting colorectal cancer: a meta-analysis. Asian Pac J Cancer Prev, 14(8):4699–4704.