1 Introduction
Summarizing information on test performance metrics, such as sensitivity, specificity, and diagnostic odds ratios (DOR), is an important part of a systematic review of the performance of a medical test on clinical outcomes. Through a meta-analysis of clinical studies of diagnostic tests, we may investigate hypotheses about the test performance that cannot be answered by an individual study.
Sotiriadis et al. (2016) recommended a guideline for systematic reviews of diagnostic test accuracy studies, as a counterpart to the Cochrane Handbook (Higgins and Green, 2011) and PRISMA (Shamseer et al., 2015), which are widely used in general systematic reviews.
Most diagnostic tests are used to separate patients into two groups, test positive and test negative (i.e., T+/T−). The fundamental diagnostic result can be a real value (continuous output), in which case the classification boundary between the two groups must be determined by a threshold value, usually based on a receiver operating characteristic (ROC) curve (Tripepi et al., 2009). Accordingly, there are four possible outcomes from a dichotomized test when a gold standard is available. If the true disease status of a subject is positive (D+), a T+ classification is called a true positive (TP), while a T− result is called a false negative (FN). Conversely, given a true negative disease status of a subject (D−), a T− classification results in a true negative (TN) and a T+ result gives a false positive (FP).
When a gold standard exists, test accuracy is estimated as the proportion of diseased individuals who are “test positive” (sensitivity) and of nondiseased individuals who are “test negative” (specificity); see Honest and Khan (2002), Irwig et al. (1995), and Altman (2001). For systematic reviews of dichotomous diagnostic studies, the “data” merely consist of the numbers nTP, nFN, nFP, and nTN for each study involved, i.e., the numbers of subjects classified as TP, FN, FP, or TN, respectively. The corresponding results are usually reported based on a particular threshold, as used in clinical practice (Gatsonis and Paliwal, 2006). It is improper to simply use the sums of these four numbers across studies to derive summary estimates of sensitivity, specificity and DOR, because the summary statistics would be dominated by the few studies with the largest sample sizes.
Another naive summary is to pool sensitivity and specificity separately using standard meta-analyses for proportions. However, Walter and Jadad (1999) and Moses et al. (1993) showed that sensitivity and specificity are often negatively correlated, usually because different studies use different thresholds to define T+ and T−. Even though this “separate summary” method is sometimes recommended (e.g., Chappell et al., 2009, Trikalinos et al., 2012), ignoring the correlation between sensitivity and specificity can result in biased inference or even misleading claims.
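As a small illustration of the point about summed counts, the sketch below computes sensitivity, specificity and DOR from the four counts of each study and then compares them with the sensitivity obtained from counts summed across studies. The class name `Study`, the helper `summed_sensitivity` and all counts are hypothetical and chosen only to make the effect visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    """2x2 counts of one dichotomized diagnostic study (hypothetical container)."""
    n_tp: int
    n_fn: int
    n_fp: int
    n_tn: int

    @property
    def sensitivity(self) -> float:
        # Proportion of diseased subjects classified as test positive.
        return self.n_tp / (self.n_tp + self.n_fn)

    @property
    def specificity(self) -> float:
        # Proportion of nondiseased subjects classified as test negative.
        return self.n_tn / (self.n_tn + self.n_fp)

    @property
    def dor(self) -> float:
        # Diagnostic odds ratio; undefined when n_fp or n_fn is zero.
        return (self.n_tp * self.n_tn) / (self.n_fp * self.n_fn)


def summed_sensitivity(studies):
    """Naive summary: sensitivity computed from counts summed across studies."""
    tp = sum(s.n_tp for s in studies)
    fn = sum(s.n_fn for s in studies)
    return tp / (tp + fn)


# Made-up counts: two small studies and one much larger study.
studies = [Study(18, 2, 5, 25), Study(15, 5, 3, 27), Study(300, 200, 50, 450)]

per_study = [s.sensitivity for s in studies]
pooled = summed_sensitivity(studies)
unweighted_mean = sum(per_study) / len(per_study)
print(per_study, pooled, unweighted_mean)
```

With these made-up numbers the summed-count sensitivity lies very close to the sensitivity of the largest study, even though the two smaller studies report clearly higher values.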
Ideally, each study has its own (empirical) ROC curve, and a summary ROC curve provides an overall description of the test performance, such as the problems dealt with in Kester and Buntinx (2000). But as is often the case, many studies reported merely a table consisting of nTP, nFN, nFP, and nTN. It will be hard to distinguish the following three sources of uncertainty involved in observed sensitivities and specificities across studies: (a) the dispersion within a study due to sampling variability, (b) heterogeneity from varying cutoff values between studies, or (c) different characteristics of populations for individual studies; refer to Gatsonis and Paliwal (2006), Chappell et al. (2009) and Chu et al. (2010) for detailed discussions. Consequently, it is almost impossible to recover each individual ROC curve based on the limited data points without additional assumptions.
Chappell et al. (2009) and Trikalinos et al. (2012) discussed three useful ways of summarizing medical test studies: “separate summary”, “summary point”, and “summary line”. They also gave advice on when to use which kind of summary representation. These procedures, however, partly rely on the effectiveness of the bivariate ROC model of Reitsma et al. (2005), which allows for either a “summary point” or a “summary line”. In fact, various models have been proposed in the literature for a meaningful summary “point” or “line” across all studies (e.g., Moses et al., 1993, Reitsma et al., 2005, Rutter and Gatsonis, 2001, and Holling et al., 2012). Reitsma et al. (2005) and Rutter and Gatsonis (2001) have become almost the de facto standard for a “summary point” or a “summary line”, and Harbord et al. (2006) showed their equivalence when no covariates are included. If more studies are available (usually more than 30), some sophisticated extensions attempt to incorporate other sources of heterogeneity, such as disease prevalence (Chu et al., 2009), latent subgroups (Schlattmann et al., 2015) or measurement errors (Guolo, 2017).
In view of so many alternative methods, a natural and important question is how to select a suitable model. Recently, Doebler et al. (2012) tried to integrate a wide range of models into a unified parametric linear mixed model framework after transformation, upon which a likelihood-maximizing approach can be used to estimate the model parameters. It covers Chu et al. (2010), Reitsma et al. (2005), Rutter and Gatsonis (2001), and Holling et al. (2012) as special cases. Furthermore, likelihood-based criteria can be used for selecting a “best” model, among which the Akaike information criterion (AIC), with its nice asymptotic properties, is the most often used; see Burnham and Anderson (2003) for more details. When a set of candidate models is considered, we may choose the model with the smallest AIC value and then make statistical inference based on it.
Nevertheless, in practice such meta-analyses are more often than not restricted to a small sample size, where the sample size is the number of studies under consideration. For the same medical test design, there are usually not many compatible studies, and hence the number of data points is too limited for the relevant asymptotic theories to apply. Vaida and Blanchard (2005), Liang et al. (2008), and Greven and Kneib (2010) proposed the conditional AIC (cAIC) in linear mixed models, which is a model selection method tailored to small sample sizes. In addition, an empirical likelihood (EL) method analogous to Owen (1990) can also be used for small sample sizes in practice. In this study, we focus on the issue of model selection for “summary line” situations. The key questions in this article are: (a) whether AIC gives an acceptable result, (b) which selection criterion (e.g., AIC, cAIC, or EL) has better performance, and (c) whether there exists a criterion that performs satisfactorily under various situations, especially when the number of studies is small.
The rest of the article is organized as follows. Section 2 first reviews several commonly used models and then describes some existing model selection criteria, followed by our proposed criterion. The effectiveness of our proposal is shown in Section 3 through simulations that compare it with other criteria. An example of its application to colorectal cancer detection is given in Section 4. Finally, we conclude with Section 5.
2 Method
2.1 Two families of models and their special cases
This section briefly describes the two model families under transformation as illustrated in Doebler et al. (2012) and the relations with other approaches. The details of models can be found in the original literature.
Doebler et al. (2012) introduced a class of monotonic transformation functions, each controlled by a single transformation parameter and applied to a proportion.
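A minimal sketch of such a transformation family is given below. It assumes the form t_α(x) = α log(x) − (2 − α) log(1 − x) for x in (0, 1) and α in [0, 2], which is how the family of Doebler et al. (2012) is commonly stated (for instance in the documentation of the R package mada). The function names and the symbols α and x are notation assumed here for illustration.

```python
import math


def t_alpha(x: float, alpha: float) -> float:
    """Assumed transformation t_alpha(x) = alpha*log(x) - (2 - alpha)*log(1 - x)."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    if not 0.0 <= alpha <= 2.0:
        raise ValueError("alpha must lie in [0, 2]")
    return alpha * math.log(x) - (2.0 - alpha) * math.log(1.0 - x)


def logit(x: float) -> float:
    return math.log(x / (1.0 - x))


# Under the assumed form, alpha = 1 reproduces the logit and alpha = 2 gives 2*log(x).
sens = 0.8
print(t_alpha(sens, 1.0), logit(sens))
print(t_alpha(sens, 2.0), 2.0 * math.log(sens))
```

Under this assumed form the function is increasing in x for every admissible α, so it preserves the ordering of sensitivities and false positive rates.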
For each study, consider the unobserved true sensitivity and false positive rate (1 − specificity). With a pair of transformation parameters, one for each of the two quantities, the two transformed variables are then assumed to follow a bivariate normal distribution whose mean vector and covariance matrix are specified in (1).
Following Doebler et al. (2012), we consider two families of models that differ in how two variance terms in (1) are set. The first family of models keeps these terms fixed, while the second family of models takes study heteroscedasticity into account by setting them to the estimated variances of sensitivity and specificity for the individual studies.
Doebler et al. (2012) pointed out that respectively corresponds to logit transformation with and log transformation with . Moreover, is also approximately proportional to the complementary logarithmic function when is around 0.6. On the other hand, can be regarded as and complementary if and , respectively.
When , the first family of models corresponds to the Lehmann family or proportional hazards models of Holling et al. (2012). If and are assumed to follow a bivariate normal distribution, the first family of models with coincides with the summary ROC method of Moses et al. (1993) but is based on a different parameterization. Hereafter, the model corresponding to Moses et al. (1993) is called the MSL method.
When , the second family of models is equivalent to the bivariate model of Reitsma et al. (2005) and the hierarchical summary ROC method (HSROC) of Rutter and Gatsonis (2001). Furthermore, the second family of models approaches the complementary logarithmic models of Chu et al. (2010) when . Apart from the transformation parameters, there are five common parameters involved in each family. To estimate the model parameters, maximum likelihood (ML) or restricted maximum likelihood (REML) methods can be used. For a larger number of studies, the transformation parameters can also be estimated as additional parameters.
2.2 Existing model selection criteria and our proposal
The work of Doebler et al. (2012) reviewed in the previous subsection generalized several widely used models. For a sufficiently large number of studies, they also showed that it is possible to recover the transformation parameters by treating them as free parameters. Nevertheless, they admitted that these parameters are hard to estimate when the number of studies is small, and they suggested that the two quantities be fixed in that case. In practice, neither treatment removes the difficulty: whether the transformation parameters are fixed or free, an analyst still needs a way to determine suitable values of them for the transformation. Additionally, although they proposed two useful families of models, little is known about how to select among them, especially when the sample size is small. Therefore, a good model selection strategy is important.
Model selection can be viewed as a selection of both the model assumptions and the estimated parameters, which amounts to a choice of the underlying probabilistic mechanism. Model selection (or variable selection) in linear regression and generalized linear models has been studied extensively (Burnham and Anderson, 2003, Claeskens et al., 2008, and Fan and Lv, 2010). Unfortunately, the models considered here have no variables to be selected, and the key structure, the covariance matrix, is heavily affected by the transformation parameters. Selection of the covariance structure in a linear mixed model is still a very open research area. Moreover, the two families in Doebler et al. (2012) are linear mixed models only after transformation, which raises another challenge. Therefore, the results would be doubtful if one directly applied the existing model selection methods to select among the two families.
In what follows, a “model” indicates a triplet consisting of the two transformation parameters and an index of family (i.e., 1st or 2nd), so the transformation parameters are no longer free parameters. We shall consider several model selection criteria and compare their performance based on simulations. The first one is AIC (Akaike, 1998), which was also inspected in Doebler et al. (2012). For a given model, AIC is defined as minus twice the log-likelihood evaluated at the estimated model parameters plus twice the number of parameters in the model. Since each model has 5 parameters, selecting the minimum AIC model amounts to choosing the model having the largest maximized log-likelihood, where the estimates are obtained by ML or REML. Note that AIC has been shown to have nice asymptotic properties for model selection (Burnham and Anderson, 2003), but the focus in this work is the small-sample problem. Although the corrected version of AIC discussed in Cavanaugh (1997) uses a modified penalty for small sample sizes, it selects the same model as AIC among the models considered here.
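The remark that AIC and its small-sample correction pick the same model when all candidates have the same number of parameters can be checked with a short sketch. It uses the standard forms AIC = −2ℓ + 2k and AICc = −2ℓ + 2kn/(n − k − 1); the model labels, log-likelihood values and sample size below are hypothetical.

```python
def aic(loglik: float, k: int) -> float:
    """Akaike information criterion: -2*loglik + 2*k."""
    return -2.0 * loglik + 2.0 * k


def aicc(loglik: float, k: int, n: int) -> float:
    """Corrected AIC with penalty 2*k*n/(n - k - 1); requires n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    return -2.0 * loglik + 2.0 * k * n / (n - k - 1)


def select(scores: dict) -> str:
    """Return the label of the model with the smallest criterion value."""
    return min(scores, key=scores.get)


# Hypothetical maximized log-likelihoods for three candidate models,
# each with k = 5 parameters, fitted to n = 10 studies.
logliks = {"model_A": -12.4, "model_B": -10.9, "model_C": -11.7}
k, n = 5, 10

by_aic = select({m: aic(ll, k) for m, ll in logliks.items()})
by_aicc = select({m: aicc(ll, k, n) for m, ll in logliks.items()})
by_loglik = max(logliks, key=logliks.get)
print(by_aic, by_aicc, by_loglik)
```

Because the penalty is the same constant for every candidate, both criteria reduce to ranking the models by their log-likelihoods, which is why a correction of this kind cannot change the selection here.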
The second and third criteria follow the conditional AIC studied in Vaida and Blanchard (2005), Liang et al. (2008), and Greven and Kneib (2010). Consider the observed sensitivities and 1 − specificities of the individual studies. For a specific model, the fitted model yields an empirical best linear unbiased predictor (BLUP) for each study. The conditional AIC (cAIC) is then defined as a term based on the conditional log-likelihood plus a penalty, where the conditional log-likelihood is the log-likelihood for the model evaluated at the parameter estimates with the observations replaced by their empirical BLUPs, and the penalty in cAIC was discussed in Vaida and Blanchard (2005), Liang et al. (2008), and Greven and Kneib (2010). In particular, Vaida and Blanchard (2005) assumed the variance parameters to be known, while Liang et al. (2008) and Greven and Kneib (2010) took the uncertainty of their estimation into consideration. The major difference between Liang et al. (2008) and Greven and Kneib (2010) is that the former calculated the penalty approximately, while the latter provided an exact method. In our simulation studies, we will compare the performance of Vaida and Blanchard (2005) and Greven and Kneib (2010), and refer to them as cAICVB and cAICGK, respectively. Also note that a model in the first family does not consider the random effect, hence cAIC reduces to AIC in this case.
The fourth criterion is the EL approach (Owen, 1990), which was primarily a method for constructing a confidence region for mean parameters; Baggerly (1998) pointed out its connection to goodness-of-fit measures. For a model, consider the (back-transformed) mean of the summarized sensitivity and 1 − specificity. The empirical likelihood for a model having this quantity as its mean parameters is given by (2), in which weights attached to the individual studies are chosen under the constraints that the weights are nonnegative, that they sum to one, and that the weighted mean of the observations equals the model mean. The empirical likelihood for the saturated model is obtained under the first two constraints only. To assess the hypothesis that the model mean is the mean of the independent data, we should first find the weight of each datum with (2). Then an empirical likelihood ratio statistic can be obtained, which, up to a constant, has a chi-square limiting distribution with degrees of freedom equal to the rank of the relevant covariance matrix (Owen, 1990). Thus, a larger value of this statistic indicates a model’s deficiency. We refer to this empirical likelihood method as ELfix.
Note that ELfix cannot differentiate the covariance structure, because it uses merely the model mean. We propose a simple modification as the fifth criterion. Based on an idea similar to the aforementioned cAIC, we incorporate the information of the empirical BLUPs of the individual studies into the above empirical likelihood method. Specifically, for calculating (2) the original mean constraint remains the same for the first family, but for a model in the second family we simply replace the observations in that constraint with their empirical BLUPs. Our modification is referred to as ELblup. We will show the effectiveness of our proposed method through simulation studies.
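To make the empirical likelihood idea concrete, the following sketch computes the standard empirical likelihood ratio statistic of Owen (1990) for a hypothesized mean of bivariate data. It is a generic illustration, not the exact ELfix or ELblup computation of this article: the function name `el_statistic`, the Newton solver with step halving, and the data points are all assumptions made here.

```python
import numpy as np


def el_statistic(x, mu, max_iter=100, tol=1e-10):
    """Return (-2 log empirical likelihood ratio, weights) for H0: E[x] = mu.

    The weights have the form w_i = 1 / (n * (1 + lam' z_i)) with z_i = x_i - mu,
    where lam maximizes the concave function sum_i log(1 + lam' z_i).
    mu is assumed to lie inside the convex hull of the rows of x.
    """
    z = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    n, d = z.shape
    lam = np.zeros(d)
    for _ in range(max_iter):
        denom = 1.0 + z @ lam
        zw = z / denom[:, None]
        grad = zw.sum(axis=0)                 # gradient of sum log(1 + lam' z_i)
        if np.linalg.norm(grad) < tol:
            break
        step = np.linalg.solve(zw.T @ zw, grad)   # Newton direction
        current = np.log(denom).sum()
        t = 1.0
        while t > 1e-12:                      # halve the step until it is admissible
            trial = lam + t * step
            d_trial = 1.0 + z @ trial
            if np.all(d_trial > 0) and np.log(d_trial).sum() >= current:
                lam = trial
                break
            t /= 2.0
    denom = 1.0 + z @ lam
    weights = 1.0 / (n * denom)
    return 2.0 * np.log(denom).sum(), weights


# Made-up (sensitivity, 1 - specificity) pairs for eight studies.
data = np.array([[0.70, 0.10], [0.80, 0.20], [0.90, 0.15], [0.75, 0.30],
                 [0.85, 0.25], [0.65, 0.20], [0.80, 0.05], [0.90, 0.30]])

stat_at_mean, w_mean = el_statistic(data, data.mean(axis=0))
stat_shifted, w_shifted = el_statistic(data, [0.82, 0.18])
print(stat_at_mean, stat_shifted)
```

The statistic is zero when the hypothesized mean equals the sample mean and grows as the hypothesized mean moves away from it, which is the sense in which a larger value signals a poorer fit. Replacing the rows of `data` by study-level predictions, as the ELblup idea suggests for the second family, would only change the input to this function.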
3 Simulation Studies
3.1 Setup
We shall compare the criteria described in the last section via simulation studies. It would not be fair to generate the simulated data from any of the meta-analysis models in Section 2.1, because generating data from a certain model would favor a specific approach, e.g., when nTP and nFP come from a bivariate binomial distribution, or logit(sensitivity) and logit(1 − specificity) come from a bivariate normal distribution. Instead, we imitate a typical data collection process, and set up simulations similar to the common demonstrations of methodologies for the ROC curve of a single study, as in Ren et al. (2004), Du and Tang (2009), or Rufibach (2012).
We consider meta-analyses with two different numbers of primary studies of the diagnostic test, and for simplicity, candidate models are restricted to a finite set of transformation parameter values in combination with the two model families. Therefore, there are 50 candidate models under this setting. To generate data for each study, the primary test values of the nondiseased participants are drawn independently and identically from one distribution, and the values of the diseased participants from another, where the two group sizes are integers sampled from Poisson distributions with means 160 and 40, respectively. Then a threshold is determined by maximizing Youden’s index (Youden, 1950) for the participants, and we obtain the corresponding dichotomized results for that study. Based on the results of all studies, the standard estimation procedure for a model (a triplet of the two transformation parameters and the index of family) is applied, and the competing model selection criteria introduced in Section 2.2 are also used. For each criterion, a “best” model is chosen, i.e., the model with the smallest value of the criterion is selected among all the candidate models.
We assess a selection criterion as follows. Consider the theoretical ROC curve in the ROC space and the corresponding area under the curve (i.e., AUC). For a model, the estimated ROC curve and the estimated AUC are used for assessment based on the following four measures,

RMSE: root mean squared error of the estimated AUC,

rank1: the ascending ranking of RMSE among the 50 candidate models,

MIAE: the mean integrated absolute deviation between the estimated and the theoretical ROC curves,

rank2: the ascending ranking of MIAE among the 50 candidate models.
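The data-generating step described above for a single primary study can be sketched as follows. This is a hedged illustration: the function name `simulate_study`, the use of normal distributions with the parameters of the ND scenario, the interpretation of the second parameter as a standard deviation, and the rule "test positive when the value is at least the threshold" are assumptions made here.

```python
import numpy as np


def simulate_study(rng, mean_n0=160, mean_n1=40, f0=(0.0, 1.0), f1=(1.5, 1.2)):
    """Simulate one primary study and dichotomize it at the Youden-optimal threshold."""
    # Group sizes are Poisson draws; the max(..., 1) guard only avoids empty groups.
    n0 = max(int(rng.poisson(mean_n0)), 1)   # nondiseased participants
    n1 = max(int(rng.poisson(mean_n1)), 1)   # diseased participants
    x0 = rng.normal(f0[0], f0[1], size=n0)
    x1 = rng.normal(f1[0], f1[1], size=n1)

    # Candidate thresholds are all observed values; T+ means value >= threshold.
    candidates = np.sort(np.concatenate([x0, x1]))
    sens = (x1[None, :] >= candidates[:, None]).mean(axis=1)
    spec = (x0[None, :] < candidates[:, None]).mean(axis=1)
    youden = sens + spec - 1.0
    best = int(np.argmax(youden))
    threshold = candidates[best]

    n_tp = int((x1 >= threshold).sum())
    n_fp = int((x0 >= threshold).sum())
    return {
        "n_tp": n_tp, "n_fn": n1 - n_tp,
        "n_fp": n_fp, "n_tn": n0 - n_fp,
        "threshold": float(threshold), "youden": float(youden[best]),
    }


rng = np.random.default_rng(2024)
study = simulate_study(rng)
print(study)
```

Repeating this call once per primary study yields the table of counts on which each candidate model is then fitted.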
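The simulation experiment is replicated 500 times for each combination of the number of studies and the two underlying distributions, where the distributions of the nondiseased and diseased participants are considered under the following four scenarios:
(LD) logistic distribution with location and scale parameters (0,1) for the nondiseased participants and (1.8,1.2) for the diseased participants,
(ND) normal distribution with parameters (0,1) for the nondiseased participants and (1.5,1.2) for the diseased participants, as labelled in Tables 1 and 2,
(SND) skew normal distribution with location, scale and shape parameters (0,1,1) for the nondiseased participants and (0.25,2,5) for the diseased participants,
(TND) truncated normal distribution with mean and standard deviation parameters (0,1) for the nondiseased participants and (1,1.25) for the diseased participants, and the truncated minimum and maximum are a standard deviation from the mean.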
In addition to ND, the most popular distribution, LD is a heavy-tailed distribution, while TND is a short-tailed distribution and SND is asymmetric. These distributions are used to generate the test values of the participants under study and to test the performance of the various model selection criteria.
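For the ND scenario, the theoretical ROC curve, its AUC, and an integrated absolute deviation between two curves can be sketched with the Python standard library alone. The sketch assumes that the second parameter of each normal distribution is a standard deviation and that larger test values indicate disease; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def binormal_roc(u, f0, f1):
    """Sensitivity at false positive rate u when T+ means value above the threshold."""
    threshold = f0.inv_cdf(1.0 - u)
    return 1.0 - f1.cdf(threshold)


def binormal_auc(f0, f1):
    """Closed-form AUC, i.e. P(diseased value > nondiseased value)."""
    shift = (f1.mean - f0.mean) / math.sqrt(f0.variance + f1.variance)
    return NormalDist().cdf(shift)


def integrated_abs_dev(roc_a, roc_b, m=2000):
    """Midpoint-rule approximation of the integral of |roc_a - roc_b| over (0, 1)."""
    grid = [(i + 0.5) / m for i in range(m)]
    return sum(abs(roc_a(u) - roc_b(u)) for u in grid) / m


f0 = NormalDist(0.0, 1.0)   # nondiseased, ND(0,1)
f1 = NormalDist(1.5, 1.2)   # diseased, ND(1.5,1.2), second parameter taken as sd


def true_roc(u):
    return binormal_roc(u, f0, f1)


def chance_line(u):
    return u


auc_closed = binormal_auc(f0, f1)
auc_numeric = sum(true_roc((i + 0.5) / 2000) for i in range(2000)) / 2000
print(auc_closed, auc_numeric, integrated_abs_dev(true_roc, chance_line))
```

In a simulation of the kind described here, `integrated_abs_dev` would be evaluated between the theoretical curve and the summary curve of a fitted model, and then averaged over replications.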
3.2 Results
Since Doebler et al. (2012) concluded that the ML estimator of the covariance is always biased, we shall merely report results based on REML estimators. In fact, results based on the ML estimator lead to the same conclusion. As references for the upper and lower bounds, we also calculate the four assessment measures for “AICnoJ” and “BEST”, where AICnoJ is a criterion similar to AIC but without the Jacobian of the transformation, and BEST collects the model having the smallest MIAE value among the 50 candidate models for each replication. As expected, AICnoJ performs the worst while BEST is superior to the others. Also note that AUC values can be close even if the shapes of two ROC curves are very different, and thus a comparison of the curves themselves is more meaningful for the “summary line” situation.
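The role of the Jacobian can be illustrated with a deliberately simplified univariate sketch: when proportions are modelled as normal after a transformation, log-likelihoods are comparable across different transformations only after adding the sum of the log-derivatives of the transformation. The transformation form t_α(x) = α log(x) − (2 − α) log(1 − x) is assumed as above, and the function names, the data and the univariate normal model are hypothetical simplifications, not the bivariate mixed model of this article.

```python
import math


def t_alpha(x, alpha):
    """Assumed transformation t_alpha(x) = alpha*log(x) - (2 - alpha)*log(1 - x)."""
    return alpha * math.log(x) - (2.0 - alpha) * math.log(1.0 - x)


def dt_alpha(x, alpha):
    """Derivative of t_alpha with respect to x; positive on (0, 1) for alpha in [0, 2]."""
    return alpha / x + (2.0 - alpha) / (1.0 - x)


def normal_loglik(values, mean, sd):
    return sum(-0.5 * math.log(2.0 * math.pi * sd ** 2)
               - (v - mean) ** 2 / (2.0 * sd ** 2) for v in values)


def loglik_transformed_scale(props, alpha):
    """Normal log-likelihood of the transformed data at the ML estimates (no Jacobian)."""
    y = [t_alpha(p, alpha) for p in props]
    mean = sum(y) / len(y)
    sd = math.sqrt(sum((v - mean) ** 2 for v in y) / len(y))
    return normal_loglik(y, mean, sd)


def loglik_original_scale(props, alpha):
    """Same log-likelihood with the log-Jacobian added, comparable across alpha."""
    log_jacobian = sum(math.log(dt_alpha(p, alpha)) for p in props)
    return loglik_transformed_scale(props, alpha) + log_jacobian


# Made-up observed sensitivities from eight studies.
props = [0.70, 0.80, 0.90, 0.75, 0.85, 0.65, 0.80, 0.92]
for alpha in (0.5, 1.0, 2.0):
    print(alpha, loglik_transformed_scale(props, alpha), loglik_original_scale(props, alpha))
```

Omitting the Jacobian term, as AICnoJ does, compares likelihoods that live on different scales, so the ranking of the transformations it produces need not agree with the ranking on the original scale.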
Tables 1 and 2 summarize the performance of the different criteria for the two numbers of studies considered, respectively. Our proposed method, ELblup, is always among the two best criteria in terms of RMSE, and outperforms the others in terms of MIAE. In contrast, AIC tends to choose worse models in many scenarios, and cAIC corrects it to some extent. Note that a random selection mechanism would result in rankings of RMSE and MIAE with an average value of 25.5. In practice, we have little knowledge about the underlying true distributions of the nondiseased and diseased participants, so a stable and robust method is critical. We notice that our proposed method, ELblup, is the only criterion that steadily beats a random selection.
| Distributions | Measures | AICnoJ | AIC | cAICVB | cAICGK | ELfix | ELblup | BEST |
|---|---|---|---|---|---|---|---|---|
| LD(0,1) vs. LD(1.8,1.2) | RMSE | 10.98% (0.12%) | 4.18% (0.19%) | 6.63% (0.21%) | 6.98% (0.15%) | 8.44% (0.17%) | 5.57% (0.18%) | 3.89% (0.13%) |
| | Rank1 | 47.98 (0.24) | 16.27 (0.94) | 25.73 (1.04) | 29.09 (0.5) | 37.28 (0.77) | 21.48 (0.73) | 15.43 (0.63) |
| | MIAE | 11.04% (0.11%) | 8.18% (0.17%) | 8.52% (0.17%) | 8.13% (0.18%) | 9.24% (0.16%) | 7.71% (0.2%) | 5.12% (0.14%) |
| | Rank2 | 44.50 (0.61) | 26.25 (0.81) | 29.14 (0.76) | 25.08 (0.64) | 34.04 (0.86) | 20.99 (0.82) | 1.00 (0) |
| ND(0,1) vs. ND(1.5,1.2) | RMSE | 7.55% (0.09%) | 4.07% (0.23%) | 4.7% (0.16%) | 4.85% (0.13%) | 5.99% (0.12%) | 4.13% (0.14%) | 2.95% (0.11%) |
| | Rank1 | 46.12 (0.45) | 21.00 (1.02) | 25.14 (1.01) | 28.04 (0.68) | 36.92 (0.75) | 22.91 (0.75) | 17.14 (0.58) |
| | MIAE | 7.67% (0.09%) | 7.51% (0.22%) | 7.31% (0.21%) | 6.65% (0.2%) | 6.98% (0.16%) | 6.24% (0.22%) | 3.85% (0.11%) |
| | Rank2 | 40.31 (0.81) | 32.96 (0.8) | 32.58 (0.79) | 22.16 (0.77) | 32.09 (0.88) | 20.27 (0.85) | 1.00 (0) |
| SND(0,1,1) vs. SND(0.25,2,5) | RMSE | 7.78% (0.12%) | 4.46% (0.2%) | 4.93% (0.16%) | 4.83% (0.14%) | 6.22% (0.14%) | 4.33% (0.15%) | 3.63% (0.11%) |
| | Rank1 | 47.22 (0.46) | 24.98 (0.97) | 27.51 (0.95) | 27.34 (0.69) | 38.02 (0.72) | 23.33 (0.69) | 21.66 (0.56) |
| | MIAE | 8.21% (0.12%) | 7.81% (0.17%) | 7.64% (0.16%) | 6.57% (0.16%) | 6.68% (0.13%) | 6.34% (0.2%) | 4.47% (0.11%) |
| | Rank2 | 37.01 (0.82) | 35.94 (0.86) | 31.69 (0.91) | 18.70 (0.76) | 25.69 (0.92) | 16.03 (0.83) | 1.00 (0) |
| TND(0,1) vs. TND(1,1.25) | RMSE | 7.21% (0.14%) | 6.35% (0.29%) | 5.56% (0.23%) | 5.11% (0.17%) | 6.03% (0.15%) | 4.78% (0.19%) | 3.48% (0.13%) |
| | Rank1 | 40.78 (0.73) | 34.61 (0.95) | 28.15 (0.88) | 26.45 (0.86) | 29.94 (0.86) | 23.58 (0.85) | 19.97 (0.66) |
| | MIAE | 7.39% (0.16%) | 9.39% (0.24%) | 8.27% (0.23%) | 7.04% (0.26%) | 6.71% (0.2%) | 6.36% (0.17%) | 4.36% (0.16%) |
| | Rank2 | 28.17 (0.87) | 34.86 (0.94) | 29.59 (0.99) | 19.92 (0.91) | 18.06 (0.8) | 17.24 (0.84) | 1.00 (0) |
Performance of the model selection criteria in the simulated cases (the values in parentheses are the corresponding standard errors).
| Distributions | Measures | AICnoJ | AIC | cAICVB | cAICGK | ELfix | ELblup | BEST |
|---|---|---|---|---|---|---|---|---|
| LD(0,1) vs. LD(1.8,1.2) | RMSE | 11.09% (0.08%) | 4.6% (0.19%) | 5.88% (0.18%) | 6.04% (0.17%) | 8.38% (0.13%) | 5.56% (0.17%) | 3.00% (0.09%) |
| | Rank1 | 47.63 (0.17) | 16.80 (0.85) | 22.10 (0.87) | 23.89 (0.82) | 36.68 (0.61) | 19.83 (0.7) | 12.02 (0.44) |
| | MIAE | 11.1% (0.08%) | 6.83% (0.14%) | 7.20% (0.12%) | 6.94% (0.11%) | 8.63% (0.13%) | 6.81% (0.16%) | 4.04% (0.09%) |
| | Rank2 | 46.90 (0.23) | 22.50 (0.75) | 24.90 (0.69) | 24.06 (0.54) | 34.20 (0.73) | 20.58 (0.8) | 1.00 (0) |
| ND(0,1) vs. ND(1.5,1.2) | RMSE | 7.80% (0.07%) | 3.66% (0.12%) | 4.44% (0.14%) | 4.53% (0.11%) | 5.73% (0.09%) | 4.13% (0.14%) | 2.36% (0.06%) |
| | Rank1 | 47.50 (0.27) | 21.96 (0.88) | 25.23 (0.96) | 26.86 (0.6) | 35.82 (0.64) | 23.94 (0.76) | 15.76 (0.41) |
| | MIAE | 7.79% (0.07%) | 6.26% (0.14%) | 6.03% (0.12%) | 5.36% (0.13%) | 6.11% (0.13%) | 5.29% (0.18%) | 3.10% (0.06%) |
| | Rank2 | 45.38 (0.42) | 30.92 (0.78) | 29.40 (0.72) | 18.96 (0.76) | 30.38 (0.75) | 18.76 (0.88) | 1.00 (0) |
| SND(0,1,1) vs. SND(0.25,2,5) | RMSE | 7.85% (0.08%) | 4.33% (0.14%) | 4.66% (0.13%) | 4.57% (0.11%) | 6.04% (0.1%) | 4.07% (0.13%) | 3.21% (0.07%) |
| | Rank1 | 45.78 (0.19) | 24.40 (0.86) | 27.20 (0.78) | 26.89 (0.56) | 37.38 (0.52) | 22.91 (0.74) | 21.01 (0.41) |
| | MIAE | 7.83% (0.08%) | 6.75% (0.13%) | 6.09% (0.09%) | 5.28% (0.09%) | 6.14% (0.10%) | 5.24% (0.11%) | 3.98% (0.06%) |
| | Rank2 | 41.86 (0.5) | 27.62 (0.98) | 23.55 (0.81) | 14.36 (0.78) | 25.71 (0.85) | 13.14 (0.7) | 1.00 (0) |
| TND(0,1) vs. TND(1,1.25) | RMSE | 7.1% (0.1%) | 4.72% (0.2%) | 4.97% (0.13%) | 4.66% (0.14%) | 5.73% (0.12%) | 4.57% (0.15%) | 2.9% (0.06%) |
| | Rank1 | 45.04 (0.39) | 27.36 (0.9) | 27.51 (0.78) | 26.87 (0.86) | 36.28 (0.71) | 22.43 (0.94) | 19.59 (0.52) |
| | MIAE | 7.46% (0.09%) | 7.13% (0.19%) | 6.35% (0.20%) | 6.10% (0.13%) | 5.88% (0.12%) | 5.84% (0.11%) | 3.84% (0.07%) |
| | Rank2 | 33.12 (0.63) | 27.74 (1.01) | 21.61 (0.88) | 20.75 (0.84) | 19.66 (0.83) | 18.44 (0.78) | 1.00 (0) |
4 Example in real applications
In this section, we demonstrate the various methods on a dataset provided in Zhou et al. (2013), which evaluated the ability of microRNAs to detect colorectal cancer. There were only 13 studies, and the authors applied the MSL method and pooled these studies together in their meta-analysis. Here, we conjecture that the precision of the diagnostic tools may keep improving with time. Thus, the 13 studies are separated into two groups by their publication years in our analysis (i.e., an earlier group and a later group). Results are shown in Figure 1.
Even though the shapes of the sROC curves are very different, the values of AUC are similar. In the first group, HSROC and ELblup gave almost identical results, while the model selected by AIC, MSL and the Lehmann model gave slightly broader confidence and prediction regions. In the second group, we notice that the model selected by AIC produced a curve close to that of the HSROC model. However, both of them failed to capture the characteristics of the data points, because they gave too narrow confidence and prediction regions for sensitivity and 1 − specificity, and both models estimated very high negative correlations between them. In contrast, the ELblup, MSL and Lehmann methods gave more reasonable confidence and prediction regions. Results from the ELblup method somewhat support our conjecture, though the precision difference between the two groups is not significant.
5 Conclusion
The model selection problem for meta-analyses of diagnostic studies can be very difficult, not only because of the small sample size, but also because the probabilistic mechanism of the models does not perfectly coincide with the data collection process. We can almost conclude that there is no true model. The common criteria based on asymptotic theories do not have acceptable performance in such challenging cases. Our method can provide a more credible inference, as demonstrated in the simulation studies and the real data example, even though we do not know the underlying distributions.
References
Akaike (1998) Akaike, H. (1998). Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer.
Altman (2001) Altman, D. G. (2001). Systematic reviews in health care: Systematic reviews of evaluations of prognostic variables. BMJ, 323(7306):224.
Baggerly (1998) Baggerly, K. A. (1998). Empirical likelihood as a goodness-of-fit measure. Biometrika, 85(3):535–547.
Burnham and Anderson (2003) Burnham, K. P. and Anderson, D. R. (2003). Model selection and multimodel inference: a practical information-theoretic approach. Springer Science & Business Media.
Cavanaugh (1997) Cavanaugh, J. E. (1997). Unifying the derivations for the Akaike and corrected Akaike information criteria. Stat Probab Lett, 33(2):201–208.
Chappell et al. (2009) Chappell, F., Raab, G., and Wardlaw, J. (2009). When are summary ROC curves appropriate for diagnostic meta-analyses? Stat Med, 28(21):2653–2668.
Chu et al. (2010) Chu, H., Guo, H., and Zhou, Y. (2010). Bivariate random effects meta-analysis of diagnostic studies using generalized linear mixed models. Med Decis Making, 30(4):499–508.
Chu et al. (2009) Chu, H., Nie, L., Cole, S. R., and Poole, C. (2009). Meta-analysis of diagnostic accuracy studies accounting for disease prevalence: alternative parameterizations and model selection. Stat Med, 28(18):2384–2399.
Claeskens et al. (2008) Claeskens, G., Hjort, N. L., et al. (2008). Model selection and model averaging, volume 330. Cambridge University Press, Cambridge.
Doebler et al. (2012) Doebler, P., Holling, H., and Böhning, D. (2012). A mixed model approach to meta-analysis of diagnostic studies with binary test outcome. Psychol Methods, 17(3):418.
Du and Tang (2009) Du, P. and Tang, L. (2009). Transformation-invariant and nonparametric monotone smooth estimation of ROC curves. Stat Med, 28(2):349–359.
Fan and Lv (2010) Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Stat Sin, 20(1):101.
Gatsonis and Paliwal (2006) Gatsonis, C. and Paliwal, P. (2006). Meta-analysis of diagnostic and screening test accuracy evaluations: methodologic primer. Am J Roentgenol, 187(2):271–281.
Greven and Kneib (2010) Greven, S. and Kneib, T. (2010). On the behaviour of marginal and conditional AIC in linear mixed models. Biometrika, 97(4):773–789.
Guolo (2017) Guolo, A. (2017). A double SIMEX approach for bivariate random-effects meta-analysis of diagnostic accuracy studies. BMC Med Res Methodol, 17(1):6.
Harbord et al. (2006) Harbord, R. M., Deeks, J. J., Egger, M., Whiting, P., and Sterne, J. A. (2006). A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics, 8(2):239–251.
Higgins and Green (2011) Higgins, J. P. and Green, S. (2011). Cochrane handbook for systematic reviews of interventions, volume 4. John Wiley & Sons.
Holling et al. (2012) Holling, H., Böhning, W., and Böhning, D. (2012). Likelihood-based clustering of meta-analytic SROC curves. Psychometrika, 77(1):106–126.
Honest and Khan (2002) Honest, H. and Khan, K. S. (2002). Reporting of measures of accuracy in systematic reviews of diagnostic literature. BMC Health Serv Res, 2(1):4.
Irwig et al. (1995) Irwig, L., Macaskill, P., Glasziou, P., and Fahey, M. (1995). Meta-analytic methods for diagnostic test accuracy. J Clin Epidemiol, 48(1):119–130.
Kester and Buntinx (2000) Kester, A. D. and Buntinx, F. (2000). Meta-analysis of ROC curves. Med Decis Making, 20(4):430–439.
Liang et al. (2008) Liang, H., Wu, H., and Zou, G. (2008). A note on conditional AIC for linear mixed-effects models. Biometrika, 95(3):773–778.
Moses et al. (1993) Moses, L. E., Shapiro, D., and Littenberg, B. (1993). Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med, 12(14):1293–1316.
Owen (1990) Owen, A. (1990). Empirical likelihood ratio confidence regions. Ann Stat, 18(1):90–120.
Reitsma et al. (2005) Reitsma, J. B., Glas, A. S., Rutjes, A. W., Scholten, R. J., Bossuyt, P. M., and Zwinderman, A. H. (2005). Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol, 58(10):982–990.
Ren et al. (2004) Ren, H., Zhou, X.-H., and Liang, H. (2004). A flexible method for estimating the ROC curve. J Appl Stat, 31(7):773–784.
Rufibach (2012) Rufibach, K. (2012). A smooth ROC curve estimator based on log-concave density estimates. Int J Biostat, 8(1).
Rutter and Gatsonis (2001) Rutter, C. M. and Gatsonis, C. A. (2001). A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med, 20(19):2865–2884.
Schlattmann et al. (2015) Schlattmann, P., Verba, M., Dewey, M., and Walther, M. (2015). Mixture models in diagnostic meta-analyses—clustering summary receiver operating characteristic curves accounted for heterogeneity and correlation. J Clin Epidemiol, 68(1):61–72.
Shamseer et al. (2015) Shamseer, L., Moher, D., Clarke, M., Ghersi, D., Liberati, A., Petticrew, M., Shekelle, P., and Stewart, L. A. (2015). Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015: elaboration and explanation. BMJ, 349:g7647.
Sotiriadis et al. (2016) Sotiriadis, A., Papatheodorou, S., and Martins, W. (2016). Synthesizing evidence from diagnostic accuracy tests: the SEDATE guideline. Ultrasound Obstet Gynecol, 47(3):386–395.
Trikalinos et al. (2012) Trikalinos, T. A., Balion, C. M., Coleman, C. I., Griffith, L., Santaguida, P. L., Vandermeer, B., and Fu, R. (2012). Meta-analysis of test performance when there is a “gold standard”. J Gen Intern Med, 27(1):56–66.
Tripepi et al. (2009) Tripepi, G., Jager, K. J., Dekker, F. W., and Zoccali, C. (2009). Diagnostic methods 2: receiver operating characteristic (ROC) curves. Kidney Int., 76(3):252–256.
Vaida and Blanchard (2005) Vaida, F. and Blanchard, S. (2005). Conditional Akaike information for mixed-effects models. Biometrika, 92(2):351–370.
Walter and Jadad (1999) Walter, S. and Jadad, A. (1999). Meta-analysis of screening data: a survey of the literature. Stat Med, 18(24):3409–3424.
Youden (1950) Youden, W. J. (1950). Index for rating diagnostic tests. Cancer, 3(1):32–35.
Zhou et al. (2013) Zhou, X.-J., Dong, Z.-G., Yang, Y.-M., Du, L.-T., Zhang, X., and Wang, C.-X. (2013). Limited diagnostic value of microRNAs for detecting colorectal cancer: a meta-analysis. Asian Pac J Cancer Prev, 14(8):4699–4704.