# Confidence intervals for the area under the receiver operating characteristic curve in the presence of ignorable missing data

Receiver operating characteristic (ROC) curves are widely used as a measure of the accuracy of diagnostic tests and can be summarized using the area under the ROC curve (AUC). Often it is useful to construct a confidence interval for the AUC; however, since a number of different methods have been proposed to estimate the variance of the AUC, there are many different resulting methods for constructing these intervals. In this manuscript, we compare different methods of constructing Wald-type confidence intervals in the presence of missing data where the missingness mechanism is ignorable. We find that constructing confidence intervals using multiple imputation (MI) based on logistic regression (LR) gives the most robust coverage probability, and that the choice of CI method is less important. However, when the missingness rate is less severe (e.g. less than 70%), Newcombe's Wald method also performs well when constructing confidence intervals along with multiple imputation using predictive mean matching (PMM).

## 1 Introduction

Diagnostic tests are often used to classify subjects as either diseased or non-diseased. Assuming that large values of test results are indicative of diseased status, a subject will be classified as diseased or non-diseased if their test result is above or below a certain threshold, respectively. Then, for a specific threshold value, the performance of the test can be evaluated using measures such as sensitivity and specificity, where sensitivity is the probability of a true positive, or true positive rate (TPR) (i.e. the probability that a subject who actually has the disease tests positive), and specificity is the probability of a true negative, or true negative rate (TNR) (i.e. the probability that a subject who is actually non-diseased tests negative). A more comprehensive way to measure the accuracy of a diagnostic test is to use the receiver operating characteristic (ROC) curve (Pepe, 2003), which graphically compares specificity and sensitivity levels; specifically, the ROC curve is a plot of sensitivity against 1 - specificity across all possible cutoff values. When comparing two diagnostic tests, if the ROC curve for one test is located uniformly higher than that of the other test across all possible thresholds, the first test is said to be strictly more accurate and is therefore preferred. It is, however, not straightforward to compare two ROC curves that intersect at some location other than the end points.

One common way to summarize the ROC curve is to compute the area under the ROC curve (AUC). The AUC ranges in value from 0.5 (essentially a meaningless test) to 1 (a perfect test) and can be interpreted as the probability that a randomly chosen diseased subject has a higher test value than a randomly chosen healthy subject (given that higher test values are indicative of the disease) (Hanley and McNeil, 1982). The AUC is, in most cases, calculated empirically as the area under the sample ROC curve, and it is often of interest to present an interval estimate of the AUC rather than simply a point estimate.

There exist a number of methods that have been proposed to compute an interval estimator for the AUC. Many of these have been derived based on the equivalence relation between the AUC and Wilcoxon's rank-sum test statistic (Hollander et al., 2014), such as Bamber (1975), Hanley and McNeil (1982), DeLong et al. (1988), Mee (1990), and Newcombe (2006). Other proposals, such as Reiser and Guttman (1986) and Newcombe (2006), developed confidence interval estimators of the AUC by assuming a parametric distribution for the test scores and deriving the interval in that manner. Cortes and Mohri (2005) suggested a confidence interval method using a combinatoric approach, and Obuchowski and Lieber (2002) introduced an exact method, which can be used when the estimated AUC is close to 1.

These methods can be easily applied when the true disease status and the test values for all subjects in the sample are known. However, in practice, data may be incomplete. This often happens when the true disease status of a subject is unknown, which commonly occurs when a patient does not go through a so-called "gold standard" test that, in many cases, is invasive and/or expensive. Missing data can also occur in patients' test scores for many different reasons, including simply not having had the test performed. When observations with missing values are removed from the analysis (i.e. complete case analysis), the estimator is potentially subject to bias; when the bias is caused specifically by missing disease status, it is referred to as verification bias.

Many corrected estimators of the AUC have been proposed to adjust for biases caused by missing values. Begg and Greenes (1983), Alonzo and Pepe (2005), and He et al. (2009), along with others, proposed AUC estimators correcting verification bias directly (i.e. the "direct approach"). To construct confidence intervals based on those methods, bootstrap techniques are used. Alternatively, confidence interval methods for complete data settings can be used after applying multiple imputation (MI) techniques to the incomplete dataset (i.e. the "MI approach"), as in Harel and Zhou (2006), which sought to correct the verification bias when measuring sensitivity and specificity, and Harel and Zhou (2007a). To address the problem of missing biomarker values, Long et al. (2011a) and Long et al. (2011b) proposed a robust approach and an approach using nonparametric MI, respectively. All the bias-correcting estimators introduced here are based on the assumption that the data are missing at random (MAR) (Rubin, 1976) and exploit the covariates to predict the missing values.

Although quite a few estimators have been proposed, relatively little work has been done reviewing the performance of the confidence interval methods for the AUC in the presence of missing data. In this article, we compare the performance of several Wald-type interval estimators for the AUC in datasets where missing data exist. In our example, the incomplete data will be true disease statuses, and it will be assumed that the missingness mechanism is MAR.

## 2 Wald-type confidence interval methods for the AUC

Consider a test such that higher test scores are associated with a higher probability that the subject is diseased, and vice versa. By definition, the AUC ($\theta$) is the integral of the TPR of the theoretical ROC curve with respect to its false positive rate (FPR = 1 - TNR), that is, $\theta = \int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(u))\,du$, where TPR($t$) and FPR($t$) are the TPR and FPR for a threshold value $t$. As the AUC can be interpreted as the probability that a randomly chosen diseased subject has a higher biomarker value than a randomly chosen healthy subject's biomarker value (plus half of the probability of ties, if any), it can also be expressed as $\theta = P(Y > X) + \frac{1}{2}P(Y = X)$, where $X$ and $Y$ are the biomarkers of a randomly chosen non-diseased and diseased subject, respectively. So it is straightforward to obtain an unbiased estimator of the AUC as:

$$\hat{\theta}=\frac{\sum_{i=1}^{n_Y}\sum_{j=1}^{n_X}\left\{I(y_i>x_j)+\frac{1}{2}I(y_i=x_j)\right\}}{n_Y n_X},$$

where $x_j$ and $y_i$ are the test scores for the $j$-th and $i$-th individuals in the non-diseased and diseased groups, respectively, with $j = 1, \dots, n_X$ and $i = 1, \dots, n_Y$.
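The empirical estimator above can be sketched in a few lines. The following is an illustrative Python implementation (not the paper's supplementary R code), with `y` holding the diseased scores and `x` the non-diseased scores:

```python
import numpy as np

def auc_hat(y, x):
    """Empirical AUC: compare every diseased score y_i with every
    non-diseased score x_j, giving ties weight 1/2."""
    y = np.asarray(y, dtype=float)[:, None]   # shape (n_Y, 1)
    x = np.asarray(x, dtype=float)[None, :]   # shape (1, n_X)
    wins = (y > x).sum()
    ties = (y == x).sum()
    return (wins + 0.5 * ties) / (y.size * x.size)
```

The vectorized comparison matrix makes the double sum explicit at O($n_X n_Y$) cost, which is acceptable for the sample sizes considered in this paper.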

As the variance estimation of the AUC is not obvious, various variance estimators of the AUC have been proposed. Given a variance estimator, confidence intervals (CI’s) can be built in several ways. The simplest and the most commonly used form is the Wald-type interval. This type of CI relies on large sample theory and evaluates the standard error at a point estimate to have upper and lower bounds. Another popular method for CI construction is the score-type (Wilson and Hilferty, 1929) interval. This method is also an asymptotic approach, but the standard errors are evaluated at the candidates of the confidence bounds, not at the point estimate only. Other types of confidence interval constructions include likelihood-ratio method and exact methods.

Despite some drawbacks of Wald-type intervals, such as being inappropriate when the sample is small or when the population parameter for proportion-type data is extreme, this type of interval is computationally simple and applicable to almost all types of variance estimators. Additional assumptions must be imposed to obtain types of intervals other than Wald-type intervals under missing data settings. In this article, we consider Wald-type CI's only, presenting some of them below.

### 2.1 Bamber’s method

Bamber (1975) presented a variance estimator of $\hat{\theta}$ using the equivalence relation between the Mann-Whitney statistic and the AUC, together with the variance of the Mann-Whitney statistic (Noether, 1967):

$$\hat{V}(\hat{\theta})=\frac{1}{4(n_X-1)(n_Y-1)}\left[p(X\neq Y)+(n_X-1)b_{XXY}+(n_Y-1)b_{YYX}-4(n_X+n_Y-1)\left(\hat{\theta}-\frac{1}{2}\right)^2\right],$$

where $b_{XXY}$ and $b_{YYX}$ are ordering probabilities involving $X'$ and $Y'$, random variables sampled independently without replacement from the $X$ and $Y$ groups, respectively; the sample quantities plugged in for $p(X\neq Y)$, $b_{XXY}$, and $b_{YYX}$ are unbiased estimators of their population counterparts.

Using this variance estimator, a CI for the AUC can be constructed as $\hat{\theta} \pm z_{\alpha/2}\sqrt{\hat{V}(\hat{\theta})}$, where $z_{\alpha/2}$ is the upper $\alpha/2$ percentile of the standard normal distribution.
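All of the interval methods in this section share this final step. A minimal helper for the Wald construction (the function name is ours; truncation to [0, 1] follows the convention used later in the simulation study):

```python
from statistics import NormalDist

def wald_ci(theta_hat, var_hat, alpha=0.05):
    """theta_hat +/- z_{alpha/2} * sqrt(var_hat), truncated to [0, 1]."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 percentile
    half = z * var_hat ** 0.5
    return max(0.0, theta_hat - half), min(1.0, theta_hat + half)
```

Any of the variance estimators in Sections 2.1-2.5 can be passed in as `var_hat`.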

### 2.2 Hanley-McNeil’s method I

Hanley and McNeil (1982) extended the ideas in Bamber (1975) to simpler variance estimators. Unlike Bamber's variance estimator, those of Hanley and McNeil (and of Newcombe, discussed in a later subsection) are not unbiased, as they use $n_X$ and $n_Y$ in the denominator. As Bamber (1975) did for his method, we will use $n_X - 1$ and $n_Y - 1$ in the denominator instead, to make the estimators unbiased for this method and for the later methods (2.2 - 2.4). The under-coverage of the methods in Hanley and McNeil (1982), discussed in Newcombe (2006), is partly due to this underestimation of the variance.

Another shortcoming of the variance formula in Hanley and McNeil (1982) is that it is appropriate only when the biomarker is sufficiently continuous so that there are no ties between the diseased and the non-diseased. In many cases, however, the biomarkers are discrete, or even when they are continuous, ties often happen by rounding values to significant digits. In our simulation study, we apply a generalized formula that incorporates the tie situations.

Then the revised variance estimator is as follows:

$$\hat{V}(\hat{\theta})=\frac{1}{(n_X-1)(n_Y-1)}\left[\hat{\theta}(1-\hat{\theta})-\frac{1}{4}p(Y=X)+(n_Y-1)(\hat{Q}_1-\hat{\theta}^2)+(n_X-1)(\hat{Q}_2-\hat{\theta}^2)\right],$$

where $\hat{Q}_1$ is an estimator of $Q_1$, the probability that the biomarkers of two randomly chosen diseased subjects (possibly the same subject) are both greater than or equal to that of a randomly chosen non-diseased subject, with half weight given to equality. Similarly, $\hat{Q}_2$ is an estimator of $Q_2$, the probability that the biomarker of a randomly chosen diseased subject is greater than or equal to those of two randomly chosen non-diseased subjects (possibly the same subject), again with half weight given to equality. (The mathematical proof that the revised estimator is unbiased is provided in the Appendix.)

The Wald-type interval is given as $\hat{\theta} \pm z_{\alpha/2}\sqrt{\hat{V}(\hat{\theta})}$.
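This estimator can be sketched with plug-in $\hat{Q}_1$ and $\hat{Q}_2$. The empirical forms below (means of squared column/row means of the comparison matrix) are our reading of the definitions above, not code from the paper:

```python
import numpy as np

def hm1_var(y, x):
    """Hanley-McNeil method I variance with tie correction and the
    unbiased (n_X - 1)(n_Y - 1) denominator used in this paper."""
    y = np.asarray(y, float)
    x = np.asarray(x, float)
    nY, nX = y.size, x.size
    # Kernel psi(y_i, x_j) = 1 if y > x, 1/2 if tied, 0 otherwise
    P = (y[:, None] > x[None, :]) + 0.5 * (y[:, None] == x[None, :])
    theta = P.mean()
    p_eq = (y[:, None] == x[None, :]).mean()  # p(Y = X)
    Q1 = (P.mean(axis=0) ** 2).mean()  # two diseased vs. one non-diseased
    Q2 = (P.mean(axis=1) ** 2).mean()  # one diseased vs. two non-diseased
    return (theta * (1 - theta) - 0.25 * p_eq
            + (nY - 1) * (Q1 - theta ** 2)
            + (nX - 1) * (Q2 - theta ** 2)) / ((nX - 1) * (nY - 1))
```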

### 2.3 Hanley-McNeil’s method II

Hanley and McNeil (1982) further simplified the variance estimator by assuming that X and Y are exponentially distributed:

$$\hat{V}(\hat{\theta})=\frac{1}{(n_X-1)(n_Y-1)}\left[\hat{\theta}(1-\hat{\theta})-\frac{1}{4}p(Y=X)+(n_Y-1)(\hat{Q}_1-\hat{\theta}^2)+(n_X-1)(\hat{Q}_2-\hat{\theta}^2)\right],$$

where $\hat{Q}_1 = \hat{\theta}/(2-\hat{\theta})$ and $\hat{Q}_2 = 2\hat{\theta}^2/(1+\hat{\theta})$.

### 2.4 Newcombe’s Wald method

Newcombe (2006) suggested a modification to Hanley-McNeil's method II, replacing both group sizes in the numerator with a common value $N$. The variance estimator then becomes:

$$\hat{V}(\hat{\theta})=\frac{\hat{\theta}(1-\hat{\theta})}{(n_X-1)(n_Y-1)}\left[2N-1-\frac{3N-3}{(2-\hat{\theta})(1+\hat{\theta})}\right].$$

### 2.5 DeLong’s method

Based on the method in Sen (1960), DeLong et al. (1988) derived a covariance matrix estimator for the AUC's of correlated ROC curves, which makes nonparametric comparisons of two or more ROC curves possible. The same idea can be applied to construct a Wald-type CI for the area under a single ROC curve:

$$\hat{V}(\hat{\theta})=\frac{\hat{s}_{10}}{n_X}+\frac{\hat{s}_{01}}{n_Y},$$

where $\hat{s}_{10}$ and $\hat{s}_{01}$ are the sample variances of the per-subject placement components of $\hat{\theta}$ in the non-diseased and diseased groups, respectively. Using this variance estimator, a Wald-type CI can be constructed as $\hat{\theta} \pm z_{\alpha/2}\sqrt{\hat{V}(\hat{\theta})}$.
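DeLong's variance can be computed from per-subject placement components. The sketch below is our Python illustration (with the same tie-handling kernel as before); each group's component variance is divided by its own group size, consistent with the formula above:

```python
import numpy as np

def delong_var(y, x):
    """Variance of the empirical AUC via DeLong's structural components."""
    y = np.asarray(y, float)
    x = np.asarray(x, float)
    P = (y[:, None] > x[None, :]) + 0.5 * (y[:, None] == x[None, :])
    v_dis = P.mean(axis=1)   # placement component of each diseased subject
    v_non = P.mean(axis=0)   # placement component of each non-diseased subject
    return v_non.var(ddof=1) / x.size + v_dis.var(ddof=1) / y.size
```

Note that the mean of either component vector equals $\hat{\theta}$ itself, which is a useful sanity check in practice.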

## 3 Missing data and multiple imputation

A common problem in applied statistics is incomplete data. Incomplete data occur in many medical, social, and demographic analyses and lead to difficulties in performing inference. One way of handling the missing data problem is multiple imputation (Rubin, 2004, Harel and Zhou, 2007b). MI is a Monte Carlo technique that involves three steps: imputation, analysis, and combining of results. First, the missing values are filled in by simulated values to create $m$ completed datasets, where $m$ is commonly 5 or larger. The imputation model consists of a plausible probability model on both the observed and missing values, used to simulate values for the missing observations. The imputation model should be selected to be compatible with the analysis to be performed on the imputed datasets, where the analysis model is determined by the statistical problem in question. Following this, each of the imputed datasets is analyzed separately using some complete-data statistical method of interest. Finally, the results of the analyses are combined (Rubin, 2004) to produce point and/or interval estimates that incorporate both between- and within-imputation variability.

### 3.1 Missingness mechanism assumptions

Define the missingness mechanism to be $R$, a random variable that is 1 if the value is missing and 0 otherwise. In order to perform MI, assumptions about the missingness mechanism must be made. The simplest missingness assumption is that the data are missing completely at random (MCAR). This type of missingness implies that the probability of an observation being missing does not depend on the observed or unobserved values in the data. Data are said to be missing at random (MAR) if the probability of missingness does not depend on the unobserved data (Little and Rubin, 2002). If the probability of missingness is related to unobserved values in the dataset, the data are said to be missing not at random (MNAR). In this case, to perform MI, the missingness mechanism must be explicitly modeled. In the case of MAR and MCAR, if the parameters of the data model and the parameters of the missing data model are distinct, then the missingness mechanism, $R$, does not need to be modeled, and the missing data mechanism is referred to as ignorable. In this manuscript, we focus solely on data whose missingness mechanism is ignorable.

### 3.2 Multiple imputation techniques

One common but simple method for performing MI is to model the data as if they come from a multivariate normal distribution. Once this model is fit, imputed values are drawn from the posterior predictive distribution using data augmentation (DA), a kind of MCMC algorithm (Schafer, 1997). A stand-alone statistical package named "NORM" (NORM, 1999) is available for performing this MI method, and there is also a package implementing it (Novo, 2013) in R (R, 2017).

While the previous method is based on the multivariate normal distribution, in many applied statistical problems the adequate imputation model is not multivariate normal. For example, the Alzheimer's disease (AD) dataset that we will analyze in Section 6 contains an outcome variable with a value of 1 if the patient has AD and 0 if the patient does not have the disease. Bernaards et al. (2007) explored using a multivariate normal model when the data clearly violated that assumption, namely in the presence of a binary variable. In that manuscript, they imputed under the multivariate normal model and used a variety of rounding techniques to convert the continuous imputed values back to binary variables. Here, we implement one of these methodologies in our analysis, specifically the adaptive rounding approach, a rounding procedure in which the threshold value for rounding to zero or one is based on a normal approximation to the binomial distribution. The adaptive rounding threshold is given by

$c = \bar{\omega} - \Phi^{-1}(\bar{\omega})\sqrt{\bar{\omega}(1-\bar{\omega})}$, where $\omega$ is the imputed variable obtained from the multivariate normal imputation procedure, $\bar{\omega}$ is its mean, and $\Phi$ is the cumulative distribution function of the standard normal distribution. For more information on rounding in multiple imputation, see Demirtas (2009).

Multivariate imputation by chained equations (MICE) (White et al., 2011, van Buuren, 2012, van Buuren et al., 2017) is another widely used MI method. In MICE's sequential process, a joint distribution for the imputation models does not need to be explicitly specified, which makes the method very flexible (Allison, 2009). Despite the lack of theoretical justification as to whether MICE's sequential model converges to the distributions of the missing variables, it has been demonstrated to perform well in many simulation settings (van Buuren et al., 2006). Zhu and Raghunathan (2016) argue that under certain conditions, such as valid model specification or good fit of the distribution to the data, the sequential regression approach may work well even with models that are incompatible with the conditional distributions.

While there are numerous ways to implement MICE, in this paper we focus on predictive mean matching (PMM) (Little, 1988) and the logistic regression (LR) approach (Rubin, 2004), since the variable with missingness in our simulation study is dichotomous. Here, the MICE framework is implemented in R via the "mice" package (van Buuren and Groothuis-Oudshoorn, 2011).
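As a rough illustration of the PMM idea (not the `mice` implementation, which additionally draws the regression coefficients from their posterior before matching), consider this single-variable Python sketch with hypothetical names:

```python
import numpy as np

def pmm_impute(y, X, rng, k=5):
    """Simplified predictive mean matching: fit OLS on the observed
    cases, then each missing y borrows the observed value of a donor
    drawn from the k closest predicted means."""
    y = np.asarray(y, float)
    obs = ~np.isnan(y)
    Xd = np.column_stack([np.ones(len(y)), X])        # add intercept
    beta, *_ = np.linalg.lstsq(Xd[obs], y[obs], rcond=None)
    pred = Xd @ beta
    y_imp = y.copy()
    for i in np.flatnonzero(~obs):
        donors = np.argsort(np.abs(pred[obs] - pred[i]))[:k]
        y_imp[i] = y[obs][rng.choice(donors)]         # imputed value is an observed one
    return y_imp
```

Because imputed values are always drawn from the observed values, PMM automatically respects the support of the variable (e.g. 0/1 for a binary disease status), which is one reason it is attractive here.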

### 3.3 Multiple imputation combining rules

Once MI is performed, each of the $m$ imputed datasets is analyzed to produce estimates $\hat{Q}_l$ of the quantity of interest $Q$ and estimates $U_l$ of the associated variance, for $l = 1, \dots, m$. Assuming that the sampling distribution of $\hat{Q}_l$ is normal, Rubin's combining rules give the pooled estimate $\bar{Q} = \frac{1}{m}\sum_{l=1}^m \hat{Q}_l$ with total variance $T = \bar{U} + (1 + \frac{1}{m})B$, where $\bar{U} = \frac{1}{m}\sum_{l=1}^m U_l$ is the within-imputation variance and $B = \frac{1}{m-1}\sum_{l=1}^m (\hat{Q}_l - \bar{Q})^2$ is the between-imputation variance; interval estimates are then based on a $t$ reference distribution.
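The combining rules can be written compactly. A small sketch (illustrative function name; it assumes the between-imputation variance is positive):

```python
def rubin_combine(estimates, variances):
    """Pool m completed-data estimates and variances via Rubin's rules.
    Returns pooled estimate, total variance, and degrees of freedom."""
    m = len(estimates)
    qbar = sum(estimates) / m                              # pooled point estimate
    ubar = sum(variances) / m                              # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = ubar + (1 + 1 / m) * b                             # total variance
    df = (m - 1) * (1 + ubar / ((1 + 1 / m) * b)) ** 2     # reference t degrees of freedom
    return qbar, t, df
```

A 95% interval is then $\bar{Q} \pm t_{df,\,0.975}\sqrt{T}$.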

### 3.4 Making inference about the AUC with MI

MI techniques can be employed to fill in the missing values in order to make inference on the AUC. From the $m$ sets of imputed data, two vectors of statistics are obtained: the AUC estimates and their variance estimates, one pair per imputed dataset. The variance estimates of the AUC can be obtained by applying one of the methods mentioned in Section 2. Given that the sample size is large, the sample AUC's are asymptotically normal. These vectors are then combined, using the rules in Section 3.3, to form a 95% CI.

## 4 Simulation Study

Simulation-based methods are quite often used to evaluate performance in incomplete data settings. For example, Mitra and Reiter (2016), Demirtas and Hedeker (2008), Bernaards et. al. (2007), Demirtas (2007), Demirtas and Hedeker (2007), and Demirtas et. al. (2007) are a few of the numerous examples of such research. Here we follow these examples and use simulation-based methods to study the performance of different Wald-type confidence intervals for the AUC in the presence of missing data. Specific details of the simulation follow.

We generated simulated data with 8 variables: disease status (D), biomarker (T), 5 covariates (Z), and a missingness indicator (R). Given a set of random covariates (Z), the variables D, T, and R were randomly drawn sequentially according to three parameters: the prevalence rate, the AUC, and the rate of missing observations (the missing coverage). The generating scheme for D, T, and R largely follows Alonzo and Pepe (2005). Three different sample sizes, 50, 100, and 200, were considered for each case. For each combination of parameters and sample size, a simulation with 10,000 replicates was performed. This simulation study was done using the software R 3.4.2 (R Core Team, 2017). The code is provided as supplementary material.

After creating the random data, CI's were constructed using 3 different MI techniques (PMM, LR, and NORM) and 5 different CI methods. The complete data were also analyzed for comparison with the MI. The resulting intervals were evaluated by measuring the coverage probability, the left and right non-coverage probabilities, and the average confidence interval length.

### 4.1 Distributional assumptions

The covariates are assumed to be multivariate normal, $Z \sim N_5(\mu_Z, \Sigma_Z)$, where $\mu_Z$ is a vector containing the means of the 5 covariates and $\Sigma_Z$ is the 5-by-5 covariance matrix of $Z$.

The disease status, D, is 1 for diseased subjects and 0 for non-diseased subjects. It was generated by a random Bernoulli draw with a prevalence rate that is a function of a linear combination of the covariates.

The biomarker, T, is normally distributed with its mean conditional on the values of the disease status and the covariates, with two vectors of regression coefficients each containing 5 elements.

The missingness indicator on disease status, R, is 1 for non-verified subjects and 0 for verified subjects. If a subject has either a biomarker value within its upper $q_1$-th quantile or at least one covariate within its upper $q_2$-th quantile, then he/she always goes through the gold standard test to verify his/her true disease status, where the quantiles refer to the distributions of T and of each covariate, respectively. Otherwise, the probability of non-verification was set to a constant.

### 4.2 Parameter specification

We set the mean of $Z$ to a constant column vector, and the covariance matrix of $Z$ as

$$\Sigma_Z=\begin{pmatrix}1&0&0.3&0.4&-0.4\\0&1&0.2&0.2&0\\0.3&0.2&1&0.7&-0.5\\0.4&0.2&0.7&1&-0.2\\-0.4&0&-0.5&-0.2&1\end{pmatrix}.$$

The intercept of the disease model is set as 0 to give a prevalence rate of 0.5, and as 1.6111 to give the second prevalence rate considered; the remaining coefficients were fixed across settings.

By setting the biomarker's standard deviation as a constant rather than a function of the other variables, the biomarker values are assumed to be almost homoscedastic between the diseased and non-diseased groups. Homoscedasticity is addressed because heteroscedasticity inflates the variance of the sampling distribution of the AUC and thus lowers the coverage probabilities of the CI's. In our simulation study, the variability of the biomarker is slightly higher for the diseased group than for the non-diseased group; however, the disparity in variance between the groups is minor, and we therefore did not consider the two groups of biomarkers to be heteroscedastic.

Given those parameters, under the first prevalence setting the disease-effect values 0.8089, 1.4486, 1.9767 and 2.9670 give the desired AUC values 0.8, 0.9, 0.95 and 0.99; under the second prevalence setting, the values 0.8319, 1.4729, 2.0019 and 2.9939 yield the same four AUC levels.

Three settings of the non-verification probability and the quantile thresholds $q_1$ and $q_2$ were used to produce three levels of missing coverage; for example, setting the non-verification probability to 0.95 with $q_1$ = 0.99 and $q_2$ = 0.99 makes the missing coverage roughly 90%.

To analyze the complete data, the missingness indicator (R) was ignored. To get CI's for incomplete data, MI techniques were applied to the same data with the missingness indicator (R) considered. Imputations were performed m = 10 times with 5 iterations each for all cases. For NORM imputation, a non-informative prior for the multivariate normal distribution was assumed and the iterations started from estimates given by the expectation-maximization (EM) algorithm.

For each of 72 different settings (i.e. four AUC values, two prevalence rates, three missingness rates, and three sample sizes), 10,000 simulation replicates were performed for each of three imputation methods (PMM (MICE), LR (MICE), and NORM).

### 4.3 Evaluation

We evaluate the performance of the CI's by measuring the coverage probability (CP), the left and right non-coverage probabilities (LNCP and RNCP), and the confidence interval length (CIL). The coverage probability is defined as the proportion of CI's that capture the population AUC, and the proportion of CI's whose upper (or lower) limit lies below (or above) the population AUC is the LNCP (or RNCP). For calculating the CIL, the confidence intervals were truncated above at 1 and below at 0. Each evaluation measure is presented averaged across some parameters and/or across simulation replicates where necessary.

For CP, the smaller the error (i.e. the difference between the actual and nominal CP), the better the CI's perform. Given the same confidence level, shorter CIL's are preferred. LNCP and RNCP are considered good if they are balanced. To additionally assess the stability of the coverage statistic, that is, to check whether a seemingly good average CP results from averaging extremely small and large CP's, we also measured the mean absolute error of the CP: $\mathrm{MAE} = \frac{1}{S}\sum_{s=1}^{S}\left|\widehat{\mathrm{CP}}_s - \mathrm{CP}\right|$, where CP is the nominal coverage probability, $\widehat{\mathrm{CP}}_s$ is the actual coverage probability of the $s$-th setting, and $S$ is the number of settings.
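These evaluation measures are simple to compute from a batch of simulated intervals; a sketch with hypothetical function names:

```python
def coverage_summary(intervals, theta):
    """CP, LNCP, RNCP, and average length (truncated to [0, 1])
    for a list of (lower, upper) confidence intervals."""
    n = len(intervals)
    cp = sum(lo <= theta <= hi for lo, hi in intervals) / n
    lncp = sum(hi < theta for _, hi in intervals) / n   # interval entirely below theta
    rncp = sum(lo > theta for lo, _ in intervals) / n   # interval entirely above theta
    cil = sum(min(hi, 1.0) - max(lo, 0.0) for lo, hi in intervals) / n
    return {"CP": cp, "LNCP": lncp, "RNCP": rncp, "CIL": cil}

def mae_of_cp(actual_cps, nominal=0.95):
    """Mean absolute error of the coverage probability across settings."""
    return sum(abs(cp - nominal) for cp in actual_cps) / len(actual_cps)
```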

## 5 Results

For each imputation technique, CI method, and missing coverage, the average CP, the MAE of CP, and the average CIL were calculated. Table 1 shows the results for the moderate missing coverage. (Results for the other two levels of missing coverage can be seen in Tables 4 and 5, respectively, in the appendix.) Table 2 presents the results for the average LNCP and RNCP at the same missingness level. (LNCP and RNCP results for the other two levels can be found in Tables 6 and 7, respectively, in the appendix.) In each table, the first five rows correspond to the full dataset that has no missing values, the second five rows are for complete case analysis (i.e. naive analysis), and the remaining rows are for incomplete datasets after applying multiple imputation by PMM, LR, and NORM, respectively.

The average CP and the average CIL are also shown in Figure 1 by AUC level, MI technique (compared with the analysis of data that have no missing values), prevalence rate, and missing coverage. The performances across the different sample sizes are presented in Figure 2.

We first look at the results for the complete datasets and see how the performance changes when we do complete case analysis on incomplete data. Then we move on to the different imputation techniques for the incomplete data with a moderate level of missingness.

### 5.1 Performance of CI’s for complete data

Focusing on the top section of Table 1 labeled "complete", one can see that for each CI method, the CP decreases away from the nominal value (0.95) as the AUC approaches 1, with its mean absolute error increasing. Among the CI methods, Newcombe's Wald (NW) method has a CP noticeably closer to the nominal CP and a mean absolute error of CP smaller than any other method for all AUC levels. However, since the CP significantly deviates from the nominal value even for the NW method when the true AUC is very close to 1, Wald-type intervals are worth using only when the AUC is at a more moderate level.

Further, as we see in Table 2, the LNCP's are larger than the RNCP's, with the imbalance being all the more evident as the AUC approaches 1. This is because the symmetry of Wald-type intervals makes the upper and lower limits over- and underestimated, respectively, as the standard error decreases when the AUC moves from 0.5 to 1.

### 5.2 Performance of CI’s for incomplete data

Again focusing on Table 1, the CP’s for complete case analysis are, as expected, substantially farther away from the nominal CP (0.95) than those for the complete data. The CIL’s for complete case analysis are on average longer than those for complete data mostly due to loss of information related to the missing data (Figure 1).

Next, focusing on the results where MI techniques were applied to incomplete data, it is easy to see that the MI methods vastly outperform the naive complete case analysis, particularly when the value of the AUC approaches 1. For instance, when the AUC is 0.99, the CP's for the naive analysis range from 0.287 to 0.355, whereas the worst MI procedure (NORM) has CP's ranging from 0.610 to 0.785.

While the NORM imputation technique certainly outperforms the naive analysis, it is inferior to both PMM and LR in terms of CP. PMM has coverage probabilities that range from 0.896 up to 0.953, much better than NORM and the naive analysis, but still underperforming LR. LR slightly overestimates the coverage probability, with all CP's between 0.96 and 0.97, and shows remarkable accuracy in CP across different values of the AUC. Counterintuitively, the CI's constructed with LR and PMM outperform even the complete data in terms of CP when the AUC is near 1, as seen in Figure 1. For example, the CP of PMM ranges from 0.896 to 0.953 at the highest AUC level, while that of the complete data ranges from 0.767 to 0.892.

The performance of NORM is the worst in terms of both CP and CIL. The longest average CIL combined with the lowest average CP makes NORM the most inefficient approach. Horton et al. (2003) pointed out that rounding may cause bias in MI and that it might be better to leave the imputed values as they are. The large bias of NORM is responsible for its high imbalance between LNCP and RNCP.

CI’s for the complete data have better CP’s for a larger sample; however, the opposite is true for NORM imputation, which has worse CP’s for a larger sample as Figure 2 shows. When a point estimate is biased, as is the case here, large sample sizes worsen the under-coverage by reducing the standard error estimates.

Figure 3 illustrates the mean squared error (MSE) of the point estimates. We observe that the MSE of the naive estimator gets smaller as the AUC gets larger. Because the naive estimator overestimates the AUC ($\theta$) under a missingness mechanism in which diseased subjects are more likely to have their disease status verified, the bias of the naive estimator is bounded by $1-\theta$, which goes to 0 as $\theta$ gets larger. While, on the contrary, the MI estimators do not necessarily overestimate the AUC, the magnitude of their bias can be larger than that of the naive estimator. Since the variance of the MI estimators is also expected to be larger than that of the naive estimator, under certain settings the naive estimator outperforms the MI estimators in terms of MSE. However, under moderate levels of the AUC with a moderate level of missingness, MI, especially LR, performs reasonably well in terms of MSE.

In terms of CIL performance between PMM and LR, PMM tends to yield shorter intervals than LR as the population AUC gets larger. For example, at the highest AUC level, the CIL's for PMM range from 0.153 to 0.175, whereas the range of CIL's for LR is from 0.274 to 0.275. This difference is much less dramatic for smaller values of the AUC. For instance, at the lowest AUC level, the CIL of PMM ranges from 0.341 to 0.355, whereas with LR the CIL ranges from 0.336 to 0.340.

Finally, from Figure 1, we observe that LR gives the CP most robust to the different settings and closest to the nominal CP. That is, for almost every combination of parameter values considered, the CP's of LR stay close to the nominal CP. While the CIL's of LR are longer than those of PMM, under moderate settings its CIL's are comparable to those of PMM. However, at the lowest missing coverage (Table 4 in the appendix), the NW method with PMM gives the best CP and CIL.

## 6 Real data example

To explore the pattern of different CI methods for the AUC when using MI on a real dataset, we examined a group of patients collected by 30 past and present national Alzheimer’s Disease Centers (ADCs) and managed by the National Alzheimer’s Coordinating Center (NACC) from September 2005 to May 2016.

This longitudinal dataset contains a total of 33,900 subjects, with demographic, clinical (the Uniform Data Set, UDS) and neuropathologic (the Neuropathology Data Set, NP) data collected on each subject over up to 11 visits. We considered only observations from the initial visits at one of the ADCs. Among these 1014 subjects,  are not known to have AD and thus their disease status is considered missing; we defined subjects as diseased if their Alzheimer’s disease neuropathologic change (ADNC) score is 1, 2, or 3 (low, intermediate, and high ADNC, respectively), and as non-diseased if their ADNC score is 0. Among the subjects whose AD status is known,  have AD.

As a diagnostic test for AD, the Clinical Dementia Rating (CDR) can be used, which measures patients’ cognitive status with respect to memory, orientation, judgement and problem-solving, community affairs, home and hobbies, and personal care. Each attribute takes on values from 0 to 3 in increments of 0.5 or 1.0, and their sum (CDRSUM) ranges from 0.0 to 18.0. Supplementary variables were also considered for MI of the AD status: age (AGE), sex (SEX), race according to National Institutes of Health (NIH) definitions (RACE), body mass index (BMI), systolic blood pressure (BPSYS), resting heart rate (HRATE), total number of medications reported (NUMMED), years of smoking (SMOKYRS), family history (FAMHIST, 1 if a first-degree family member has cognitive impairment, 0 otherwise), years of education (EDUCYRS), the total score of the Mini-Mental State Examination (MMSE), the total score of the Geriatric Depression Scale (GDS), and the Unified Parkinson’s Disease Rating Scale (PDNORMAL). Some of the supplementary variables also have missing values. The descriptive statistics of the variables are summarized in Table 3.

The naive estimate () of the AUC for the subset is 0.5893, when ignoring the subjects whose AD status is unknown. To correct for the verification bias of this estimate, imputations using MICE (PMM and LR) and NORM were performed m = 10 times each. All the supplementary variables mentioned above were included in the imputation models. RACE was collapsed into a binary variable (White versus Other), as multi-categorical variables cannot be handled by NORM. Each imputed dataset was analyzed using five variance estimators (Bamber’s method (Bm), Hanley-McNeil’s method I (HM1), Hanley-McNeil’s method II (HM2), Newcombe’s Wald method (NW), and DeLong’s method (DL)), and the CI’s were constructed by applying Rubin’s combining rule.
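As a sketch of how each interval is pooled across the m = 10 imputed datasets, the helper below (a minimal illustration with names of our own choosing, not the code used in the paper) applies Rubin's rules to completed-data AUC and variance estimates. For simplicity it uses a normal quantile; Rubin's rule proper uses a t quantile with degrees of freedom based on the relative increase in variance.

```python
from statistics import NormalDist

def rubin_combine(estimates, variances, alpha=0.05):
    """Pool m completed-data estimates via Rubin's combining rule.

    estimates, variances: length-m sequences of per-imputation AUC
    estimates and their variance estimates (any of Bm/HM1/HM2/NW/DL).
    Returns the pooled point estimate and a Wald-type CI.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                   # pooled point estimate
    w = sum(variances) / m                       # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    total = w + (1 + 1 / m) * b                  # Rubin's total variance
    # Normal reference for simplicity; a t reference would widen this slightly.
    half = NormalDist().inv_cdf(1 - alpha / 2) * total ** 0.5
    return q_bar, (q_bar - half, q_bar + half)

est, (lo, hi) = rubin_combine([0.60, 0.62, 0.58], [0.01, 0.01, 0.01])
```

Note that the between-imputation component widens the interval relative to a single completed-data analysis, which is how MI propagates the uncertainty about the missing disease statuses.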

The point estimates of the AUC after MI are 0.5473, 0.5926, and 0.5247 for PMM, LR, and NORM respectively. The interval estimates are presented in Figure 4.

Under the assumption that the probability of AD verification does not depend on the AD status itself, the point estimate from each MI technique is lower than or similar to the naive estimate. We can also see that the CI’s, and more specifically their variance estimates, are affected more by the choice of imputation technique than by the choice of CI method.

## 7 Discussion

In this paper, we considered several different methods for constructing Wald-type confidence intervals for the area under the ROC curve in the presence of missingness on disease status indicators, with the missingness mechanism assumed to be ignorable. Several different methods of multiple imputation were considered for handling the missing data, including MICE (PMM and LR) and NORM, as a way to deal with the verification bias problem that arises in some biomedical research where the true disease status is often missing. We demonstrated through a simulation study that Wald-type CI’s, especially the NW method, work reasonably well when the true AUC is moderate (), and that using multiple imputation to handle the missing data greatly outperforms a naive complete-case-analysis confidence interval. Based on these results, we recommend using MI with LR, in which case the choice of CI method is less important. However, when the missingness rate is less severe (), we recommend using NW together with MI with PMM.

These findings are based on simulation assumptions that may differ considerably from reality. However, it is not difficult to modify the formulas and search for an optimal combination under more realistic settings. For example, we assumed the missingness mechanism to be ignorable and the biomarker values to be nearly homoscedastic. As a solution to non-ignorable missingness, Harel and Zhou (2006) note that an appropriate missingness model can be set up and applied in the imputation step, keeping the analysis and combination steps unchanged. Further, Demirtas and Schafer (2003) and Demirtas (2005) discuss pattern-mixture models for imputation when the missingness mechanism is determined to be non-ignorable. Secondly, the simulation parameters could be set up more comprehensively to find the best method. While our simulation study assumed near homoscedasticity, there may be biomarkers with significantly different dispersion between groups. In such situations, one can easily design a new simulation study that varies the variance of the biomarker between groups.

## Acknowledgements

The NACC database is funded by NIA/NIH Grant U01 AG016976. NACC data are contributed by the NIA-funded ADCs: P30 AG019610 (PI Eric Reiman, MD), P30 AG013846 (PI Neil Kowall, MD), P50 AG008702 (PI Scott Small, MD), P50 AG025688 (PI Allan Levey, MD, PhD), P50 AG047266 (PI Todd Golde, MD, PhD), P30 AG010133 (PI Andrew Saykin, PsyD), P50 AG005146 (PI Marilyn Albert, PhD), P50 AG005134 (PI Bradley Hyman, MD, PhD), P50 AG016574 (PI Ronald Petersen, MD, PhD), P50 AG005138 (PI Mary Sano, PhD), P30 AG008051 (PI Steven Ferris, PhD), P30 AG013854 (PI M. Marsel Mesulam, MD), P30 AG008017 (PI Jeffrey Kaye, MD), P30 AG010161 (PI David Bennett, MD), P50 AG047366 (PI Victor Henderson, MD, MS), P30 AG010129 (PI Charles DeCarli, MD), P50 AG016573 (PI Frank LaFerla, PhD), P50 AG016570 (PI Marie-Francoise Chesselet, MD, PhD), P50 AG005131 (PI Douglas Galasko, MD), P50 AG023501 (PI Bruce Miller, MD), P30 AG035982 (PI Russell Swerdlow, MD), P30 AG028383 (PI Linda Van Eldik, PhD), P30 AG010124 (PI John Trojanowski, MD, PhD), P50 AG005133 (PI Oscar Lopez, MD), P50 AG005142 (PI Helena Chui, MD), P30 AG012300 (PI Roger Rosenberg, MD), P50 AG005136 (PI Thomas Montine, MD, PhD), P50 AG033514 (PI Sanjay Asthana, MD, FRCP), P50 AG005681 (PI John Morris, MD), and P50 AG047270 (PI Stephen Strittmatter, MD, PhD).

## Appendix. Proofs

We show that modifications to Hanley-McNeil’s Wald method yield unbiased estimates. First we show how the Hanley-McNeil variance formula can be derived. Then we prove that the variance estimator is unbiased.

## Appendix A Derivation of Hanley-McNeil’s variance formula

Recall that $\theta$, $Q_1$, and $Q_2$ are defined as follows:

\[
\begin{aligned}
\theta &= P(Y>X) + \tfrac{1}{2}P(Y=X),\\
Q_1 &= P(Y_1, Y_2 > X) + \tfrac{1}{2}P(Y_1 > Y_2 = X) + \tfrac{1}{2}P(Y_2 > Y_1 = X) + \tfrac{1}{4}P(Y_1 = Y_2 = X),\\
Q_2 &= P(Y > X_1, X_2) + \tfrac{1}{2}P(Y = X_1 > X_2) + \tfrac{1}{2}P(Y = X_2 > X_1) + \tfrac{1}{4}P(Y = X_1 = X_2).
\end{aligned}
\]

We also denote that

\[
H_{i,j} = H(Y_i, X_j) =
\begin{cases}
1, & \text{if } Y_i > X_j,\\
\tfrac{1}{2}, & \text{if } Y_i = X_j,\\
0, & \text{otherwise.}
\end{cases}
\]

- Unbiased estimator of the AUC

Let the AUC estimator be defined as:

\[
\hat\theta = \frac{1}{n_X n_Y}\sum_{i,j} H_{i,j}. \tag{1}
\]

Then, $E(\hat\theta) = \frac{1}{n_X n_Y}\sum_{i,j} E(H_{i,j}) = \theta$. Thus $\hat\theta$ is an unbiased estimator of $\theta$.
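For concreteness, the estimator in (1) can be computed directly from the definition of $H_{i,j}$; a minimal sketch (the function name is our own):

```python
import numpy as np

def auc_hat(x, y):
    """AUC estimate (1): the average of H(Y_i, X_j) over all pairs,
    scoring 1 when Y_i > X_j, 1/2 on a tie, and 0 otherwise."""
    x = np.asarray(x, dtype=float)[:, None]   # non-diseased scores, as a column
    y = np.asarray(y, dtype=float)[None, :]   # diseased scores, as a row
    h = (y > x).astype(float) + 0.5 * (y == x)  # the n_X-by-n_Y matrix H
    return float(h.mean())
```

For example, `auc_hat([0, 1], [1, 2])` averages the four pairwise scores $1, 1, \tfrac12, 1$ and returns 0.875.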

- Variance of the AUC estimator

From equation (1), the variance can be derived as:

\[
V(\hat\theta) = \frac{1}{(n_X n_Y)^2}\Big[\sum_{i,j} V(H_{i,j}) + \sum_{i\neq k}\sum_{j} \mathrm{Cov}(H_{i,j}, H_{k,j}) + \sum_{i}\sum_{j\neq h} \mathrm{Cov}(H_{i,j}, H_{i,h}) + \sum_{i\neq k}\sum_{j\neq h} \mathrm{Cov}(H_{i,j}, H_{k,h})\Big],
\]

where

\[
\begin{aligned}
V(H_{i,j}) &= E(H_{i,j}^2) - E(H_{i,j})^2 = \Big\{1^2\cdot P(Y>X) + \big(\tfrac{1}{2}\big)^2 P(Y=X)\Big\} - \theta^2 = \theta - \tfrac{1}{4}P(Y=X) - \theta^2,\\
\mathrm{Cov}(H_{i,j}, H_{k,j}) &= E(H_{i,j}H_{k,j}) - E(H_{i,j})E(H_{k,j})\\
&= \Big\{1^2\cdot P(Y_1,Y_2>X) + 1\cdot\tfrac{1}{2}\cdot P(Y_2>Y_1=X) + \tfrac{1}{2}\cdot 1\cdot P(Y_1>Y_2=X) + \big(\tfrac{1}{2}\big)^2 P(Y_1=Y_2=X)\Big\} - \theta^2 = Q_1 - \theta^2,\\
\mathrm{Cov}(H_{i,j}, H_{i,h}) &= E(H_{i,j}H_{i,h}) - E(H_{i,j})E(H_{i,h}) = Q_2 - \theta^2, \text{ and}\\
\mathrm{Cov}(H_{i,j}, H_{k,h}) &= E(H_{i,j}H_{k,h}) - E(H_{i,j})E(H_{k,h}) = \theta^2 - \theta^2 = 0.
\end{aligned}
\]

Then we have:

\[
V(\hat\theta) = \frac{1}{n_X n_Y}\Big[\theta(1-\theta) - \tfrac{1}{4}P(Y=X) + (n_Y-1)(Q_1-\theta^2) + (n_X-1)(Q_2-\theta^2)\Big]. \tag{2}
\]

When the biomarkers are measured on a fine enough continuous scale that there are no ties, the variance formula (2) reduces to the form suggested by Hanley and McNeil:

\[
V(\hat\theta) = \frac{1}{n_X n_Y}\Big[\theta(1-\theta) + (n_Y-1)(Q_1-\theta^2) + (n_X-1)(Q_2-\theta^2)\Big], \tag{3}
\]

where, in the absence of ties, $Q_1 = P(Y_1, Y_2 > X)$ and $Q_2 = P(Y > X_1, X_2)$.
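A direct transcription of formula (2) may help fix the notation; the helper below is our own illustration, with the population quantities passed in as arguments:

```python
def hm_true_variance(theta, q1, q2, p_tie, n_x, n_y):
    """Variance of the AUC estimator from formula (2); setting
    p_tie = P(Y = X) to 0 recovers the tie-free Hanley-McNeil form (3)."""
    return (theta * (1.0 - theta) - p_tie / 4.0
            + (n_y - 1) * (q1 - theta ** 2)
            + (n_x - 1) * (q2 - theta ** 2)) / (n_x * n_y)
```

As a check against a known case: for continuous biomarkers with $\theta = 0.5$ one has $Q_1 = Q_2 = \tfrac13$, and with $n_X = n_Y = 10$ the formula gives $0.0175$.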

## Appendix B Unbiasedness of the modified Hanley-McNeil’s variance estimator

Let the variance estimator be:

\[
\hat V(\hat\theta) = \frac{1}{(n_X-1)(n_Y-1)}\Big[\hat\theta(1-\hat\theta) - \tfrac{1}{4}p(Y=X) + (n_Y-1)(\hat Q_1-\hat\theta^2) + (n_X-1)(\hat Q_2-\hat\theta^2)\Big], \tag{4}
\]

where

\[
\begin{aligned}
\hat Q_1 &= \frac{1}{n_X n_Y^2}\sum_{j=1}^{n_X}\Big[\sum_{i=1}^{n_Y} H_{i,j}\Big]^2,\\
\hat Q_2 &= \frac{1}{n_X^2 n_Y}\sum_{i=1}^{n_Y}\Big[\sum_{j=1}^{n_X} H_{i,j}\Big]^2, \text{ and}\\
p(Y=X) &= \frac{1}{n_X n_Y}\sum_{j=1}^{n_X}\sum_{i=1}^{n_Y} I(y_i = x_j).
\end{aligned}
\]
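The plug-in quantities above can all be read off the same indicator matrix $H$; a sketch (the function name is our own):

```python
import numpy as np

def hm_components(x, y):
    """theta-hat, Q1-hat, Q2-hat and p(Y=X) from the H_{i,j} matrix,
    following the formulas above (rows index X_j, columns index Y_i)."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    h = (y > x).astype(float) + 0.5 * (y == x)
    n_x, n_y = h.shape
    theta = float(h.mean())
    # Q1-hat: squared row sums (sum over i for each fixed j).
    q1 = float((h.sum(axis=1) ** 2).sum()) / (n_x * n_y ** 2)
    # Q2-hat: squared column sums (sum over j for each fixed i).
    q2 = float((h.sum(axis=0) ** 2).sum()) / (n_x ** 2 * n_y)
    p_tie = float((y == x).mean())
    return theta, q1, q2, p_tie
```

For instance, with $x = (0, 1)$ and $y = (1, 2)$ the matrix $H$ has rows $(1, 1)$ and $(\tfrac12, 1)$, giving $\hat\theta = 0.875$, $\hat Q_1 = \hat Q_2 = 0.78125$, and $p(Y=X) = 0.25$.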

- Unbiased estimators of $Q_1$, $Q_2$, and $P(Y=X)$

$\hat Q_1$, $\hat Q_2$, and $p(Y=X)$ are unbiased estimators of $Q_1$, $Q_2$, and $P(Y=X)$, respectively.

- Proof

For each $j$, let the set of index pairs $\{(i,k)\}$ be partitioned as:

\[
\begin{aligned}
A_j &= \{(i,k)\mid Y_i, Y_k > X_j\},\\
B_j &= \{(i,k)\mid Y_i > Y_k = X_j\},\\
C_j &= \{(i,k)\mid Y_k > Y_i = X_j\},\\
D_j &= \{(i,k)\mid Y_i = Y_k = X_j\}, \text{ and}\\
E_j &= \{(i,k)\mid Y_i < X_j \text{ or } Y_k < X_j\}.
\end{aligned}
\]

Then, writing $\big[\sum_i H_{i,j}\big]^2 = \sum_{(i,k)} H_{i,j}H_{k,j}$ and evaluating the product on each block of the partition, $E[\hat Q_1] = Q_1$. $\hat Q_2$ can be shown to be unbiased in a similar way. Also, $E[p(Y=X)] = P(Y=X)$.

- Unbiased estimator of $V(\hat\theta)$

$\hat V(\hat\theta)$ defined in (4) is an unbiased estimator of $V(\hat\theta)$.

- Proof

\[
\begin{aligned}
E[\hat V(\hat\theta)] &= \frac{1}{(n_X-1)(n_Y-1)} E\Big[\hat\theta(1-\hat\theta) - \tfrac{1}{4}p(Y=X) + (n_Y-1)(\hat Q_1 - \hat\theta^2) + (n_X-1)(\hat Q_2 - \hat\theta^2)\Big]\\
&= \frac{1}{(n_X-1)(n_Y-1)}\Big[E[\hat\theta(1-\hat\theta)] - E[\tfrac{1}{4}p(Y=X)] + (n_Y-1)E[\hat Q_1 - \hat\theta^2] + (n_X-1)E[\hat Q_2 - \hat\theta^2]\Big]\\
&= \frac{1}{(n_X-1)(n_Y-1)}\Big[\theta - \theta^2 - \tfrac{1}{4}P(Y=X) + (n_Y-1)(Q_1-\theta^2) + (n_X-1)(Q_2-\theta^2) - (n_X+n_Y-1)V(\hat\theta)\Big]\\
&= \frac{n_X n_Y V(\hat\theta) - (n_X+n_Y-1)V(\hat\theta)}{(n_X-1)(n_Y-1)} = \frac{(n_X-1)(n_Y-1)V(\hat\theta)}{(n_X-1)(n_Y-1)} = V(\hat\theta).
\end{aligned}
\]
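As an informal numerical check of this unbiasedness claim (our own sketch, not part of the paper), one can compare the Monte Carlo average of the estimator (4) with the empirical variance of $\hat\theta$ over repeated samples; with normal biomarkers (an arbitrary choice) the two quantities should be close:

```python
import numpy as np

def auc_and_var_hat(x, y):
    """AUC estimate (1) and the modified Hanley-McNeil variance estimator (4)."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    h = (y > x).astype(float) + 0.5 * (y == x)
    n_x, n_y = h.shape
    th = h.mean()
    q1 = (h.sum(axis=1) ** 2).sum() / (n_x * n_y ** 2)
    q2 = (h.sum(axis=0) ** 2).sum() / (n_x ** 2 * n_y)
    p_tie = (y == x).mean()
    v = (th * (1.0 - th) - p_tie / 4.0
         + (n_y - 1) * (q1 - th ** 2)
         + (n_x - 1) * (q2 - th ** 2)) / ((n_x - 1) * (n_y - 1))
    return float(th), float(v)

# Monte Carlo: the average of v-hat should track the empirical variance
# of theta-hat across repeated samples from the same population.
rng = np.random.default_rng(2024)
thetas, v_hats = [], []
for _ in range(2000):
    x = rng.normal(0.0, 1.0, 40)   # non-diseased biomarker values
    y = rng.normal(1.0, 1.0, 40)   # diseased biomarker values
    th, v = auc_and_var_hat(x, y)
    thetas.append(th)
    v_hats.append(v)
empirical = float(np.var(thetas, ddof=1))
average_v_hat = float(np.mean(v_hats))
```

Any remaining gap between `empirical` and `average_v_hat` reflects Monte Carlo noise and the finite number of replications.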