Developing Biomarker Combinations in Multicenter Studies via Direct Maximization and Penalization

10/04/2019
by Allison Meisner, et al.
Johns Hopkins University

Motivated by a study of acute kidney injury, we consider the setting of biomarker studies involving patients at multiple centers where the goal is to develop a biomarker combination for diagnosis, prognosis, or screening. As biomarker studies become larger, this type of data structure will be encountered more frequently. In the presence of multiple centers, one way to assess the predictive capacity of a given combination is to consider the center-adjusted AUC (aAUC), a summary of the ability of the combination to discriminate between cases and controls in each center. Rather than using a general method, such as logistic regression, to construct the biomarker combination, we propose directly maximizing the aAUC. Furthermore, it may be desirable to have a biomarker combination with similar performance across centers. To that end, we allow for penalization of the variability in the center-specific AUCs. We demonstrate desirable asymptotic properties of the resulting combinations. Simulations provide small-sample evidence that maximizing the aAUC can lead to combinations with improved performance. We also use simulated data to illustrate the utility of constructing combinations by maximizing the aAUC while penalizing variability. Finally, we apply these methods to data from the study of acute kidney injury.


1 Introduction

Multicenter studies have long been used in therapeutic settings as a way to increase power and improve generalizability, and they are increasingly common in biomarker studies (e.g., 1, 2, 3). In addition, it is now feasible to measure many biomarkers on each participant. As the performance of individual biomarkers is often modest, there is interest in developing combinations of biomarkers for prognosis, diagnosis, and screening. When studies of multiple biomarkers also involve multiple centers, the central question becomes how such biomarker combinations should be constructed. In a given multicenter study, only a sample of centers is observed. Without making strong distributional assumptions, it is not possible to provide predicted probabilities for individuals in a new center. Acknowledging this inherent limitation of the study design, we aim to identify promising biomarker combinations for further development.

The Translational Research Investigating Biomarker Endpoints in Acute Kidney Injury (TRIBE-AKI) study is an example of a multicenter biomarker study. The TRIBE-AKI study collected data from 1219 cardiac surgery patients at six centers in North America 4. Study participants were followed for diagnosis of acute kidney injury (AKI) during hospitalization; in other words, the participants were hospitalized in order to undergo cardiac surgery and were followed postoperatively for signs of AKI. For each patient, blood and urine were collected at multiple time points pre- and postoperatively, and several biomarkers were measured at each time point. AKI is typically diagnosed via changes in serum creatinine, but these changes often do not happen until several days after the injury occurs. The goal of the study was to identify a combination of biomarkers that can provide an earlier diagnosis of AKI.

Methods to construct biomarker combinations by maximizing the area under the receiver operating characteristic (ROC) curve (AUC) have been proposed. However, in the multicenter setting, there is interest in conditional performance. One measure of conditional performance is the center-adjusted AUC (aAUC). We propose a method to construct linear biomarker combinations by targeting the aAUC. We then extend our method to allow for penalization of the variability in center-specific performance, yielding combinations with good overall performance and more similar performance across centers. We provide theoretical and empirical justification for the proposed methods and apply these methods to data from the TRIBE-AKI study.

2 Background

Let D be a binary outcome, where cases are denoted by D = 1 or the subscript D, and controls are denoted by D = 0 or the subscript D̄.

2.1 Center-adjusted AUC

Without loss of generality, suppose that for a given predictor Z, higher values of Z are more indicative of D = 1. Thus, for a particular threshold δ, the true and false positive rates are P(Z > δ | D = 1) and P(Z > δ | D = 0), respectively. The ROC curve for Z plots the true positive rate versus the false positive rate over the range of possible thresholds δ 5. The area under the ROC curve (AUC) is a measure of the ability of Z to discriminate between cases and controls, one aspect of the predictive capacity of Z. The ROC curve for a useless predictor lies on the 45-degree line, and the corresponding AUC is 0.5 5. The ROC curve for a perfect predictor reaches the upper left-hand corner of the unit square, and its AUC is 1 5. The AUC can also be interpreted as the probability that Z for a randomly chosen case is larger than Z for a randomly chosen control 5.
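This pairwise-comparison interpretation can be checked directly: the empirical AUC is the fraction of case-control pairs in which the case's value exceeds the control's, with ties counted as half. A minimal sketch; the helper name `empirical_auc` and the toy data are our own, not from the paper:

```python
import numpy as np

def empirical_auc(cases, controls):
    """Proportion of case-control pairs in which the case value
    exceeds the control value (ties count half)."""
    cases = np.asarray(cases, dtype=float)
    controls = np.asarray(controls, dtype=float)
    diff = cases[:, None] - controls[None, :]  # all pairwise differences
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# A perfect predictor gives AUC 1; identical distributions give 0.5.
assert empirical_auc([3, 4, 5], [0, 1, 2]) == 1.0
assert empirical_auc([1, 2], [1, 2]) == 0.5
```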

In the multicenter setting, Z can be evaluated marginally, by considering the AUC for Z pooled across centers, or conditionally, by summarizing center-specific AUCs. If a measure of marginal performance (i.e., the AUC for Z pooled across centers) is used, center is allowed to potentially influence the assessment of the discriminatory ability of Z, restricting interpretability and generalizability 6. Instead, performance should be assessed within each center and then summarized across centers; this is analogous to the center-adjusted odds ratio in the etiologic setting and the center-adjusted treatment effect in the therapeutic setting 7, 6. One such summary measure is the center-adjusted AUC (aAUC).

The center-adjusted ROC (aROC) and corresponding center-adjusted AUC (aAUC) of Z, proposed by 8, can be written as

aROC(t) = Σ_c w_c ROC_c(t) and aAUC = Σ_c w_c AUC_c,

where c indicates center, t denotes the false positive rate, ROC_c and AUC_c denote the center-specific ROC and AUC, respectively, in center c, and w_c is the proportion of cases in center c. The aAUC is a weighted average of the center-specific AUCs 9. Thus, it is a summary of the ability of Z to discriminate between cases and controls in each center 6. When a sample of centers is used to estimate the aAUC of Z, the estimate provides insight into the performance of Z in new centers, to the extent that the new centers are similar to those used to estimate the aAUC. This is analogous to applying an estimate of a center-adjusted treatment effect from a multicenter randomized trial to a new center.
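The case-weighted averaging of center-specific AUCs can be made concrete with a short sketch; the helper names (`center_auc`, `adjusted_auc`) and the toy data are our own:

```python
import numpy as np

def center_auc(y, z):
    """Empirical AUC within one center (ties count half)."""
    cases, controls = z[y == 1], z[y == 0]
    diff = cases[:, None] - controls[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def adjusted_auc(y, z, center):
    """aAUC: weighted average of center-specific AUCs, with weights
    proportional to each center's number of cases."""
    ids = np.unique(center)
    aucs = np.array([center_auc(y[center == c], z[center == c]) for c in ids])
    w = np.array([(y[center == c] == 1).sum() for c in ids], dtype=float)
    w /= w.sum()
    return float(w @ aucs)

# Center 1: AUC 1.0 with two cases; center 2: AUC 0.0 with one case.
# Weights are 2/3 and 1/3, so the aAUC here equals 2/3.
y = np.array([1, 1, 0, 0, 1, 0])
z = np.array([3.0, 4.0, 1.0, 2.0, 0.0, 1.0])
cen = np.array([1, 1, 1, 1, 2, 2])
```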

2.2 Biomarker combinations

For a collection of biomarkers X, the combination P(D = 1 | X) (and monotone increasing functions thereof) is optimal in terms of maximizing the true positive rate at each false positive rate 10. Thus, to the extent that the linear logistic model holds, that is,

P(D = 1 | X) = expit(θ₀ + θᵀX)

for some coefficient vector θ, the combination θᵀX is optimal. As the linear logistic model may not hold, methods have been developed to construct biomarker combinations by maximizing the AUC without relying on this model 11.

Methods have also been developed to identify combinations of biomarkers that maximize the AUC while accommodating covariates 12, 13. However, implementation of the method proposed by 12 is computationally challenging or prohibitive for more than two biomarkers. The method proposed by 13 assumes that the biomarkers have multivariate normal distributions and requires specification of the relationship between the covariates (e.g., center) and the biomarkers. Consequently, these methods may not be broadly applicable.

If the same data are used to construct a biomarker combination and evaluate its performance (by estimating the aAUC, for example), the resulting estimate of performance is optimistically biased 14. This bias, which we refer to as “resubstitution bias” 15, can be addressed by using a bootstrapping procedure to estimate the bias and correct the apparent estimate of performance 14, 16. Bootstrapping assumes the observations are exchangeable, but in a multicenter study, observations from the same center may be more similar than observations from different centers, leading to clustering of the data by center; thus, bootstrap resampling by center has been suggested 17, 18, 19, 9. However, similar results for the center-specific AUC have been found whether resampling was done on centers or individual observations 17.
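Resampling by center, rather than by individual observation, keeps each center's observations together and so respects the clustering. A sketch of the resampling step; `bootstrap_by_center` is our own illustrative helper, not code from the paper:

```python
import numpy as np

def bootstrap_by_center(center_ids, rng):
    """Cluster bootstrap: sample centers with replacement and keep every
    observation from each sampled center, so within-center correlation
    is preserved in the bootstrap dataset."""
    centers = np.unique(center_ids)
    sampled = rng.choice(centers, size=centers.size, replace=True)
    # Row indices of one bootstrap dataset (centers may repeat).
    return np.concatenate([np.flatnonzero(center_ids == c) for c in sampled])

# Five observations in two centers (sizes 2 and 3).
ids = np.array([0, 0, 1, 1, 1])
idx = bootstrap_by_center(ids, np.random.default_rng(0))
```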

2.3 Smooth AUC approximations

When logistic regression is used to construct a combination, the fitted combination is obtained by maximizing the logistic likelihood. However, we are interested in using combinations for diagnosis, prognosis, or screening. This motivates maximizing measures of performance, e.g., the AUC, to construct a combination. Such direct maximization is compelling as it matches the objective function to the intended use of the combination 20, 11. One benefit of directly maximizing the AUC is that the resulting combination is optimal, in terms of the AUC, within the class of combinations considered 11. Furthermore, the AUC of a linear combination constructed by targeting the AUC will be at least as large as the AUC for the individual biomarkers 20. This simple, desirable property might not hold when a combination is constructed by maximizing a likelihood.

In practice, for a set of biomarkers X, the true AUC for the linear combination θᵀX is unknown. Instead, we can consider the empirical AUC,

AUC(θ) = (1 / (n_D n_D̄)) Σ_i Σ_j 1(θᵀX_Di > θᵀX_D̄j),

where n_D and n_D̄ are the number of cases and controls, respectively, 1(·) is the indicator function, X_Di denotes the biomarker vector for the ith case, and X_D̄j denotes the biomarker vector for the jth control. Since AUC(θ) involves indicator functions, direct maximization is challenging. However, smooth approximations to the empirical AUC have been proposed. 21 used a smooth approximation based on the probit function:

SAUC(θ) = (1 / (n_D n_D̄)) Σ_i Σ_j Φ((θᵀX_Di − θᵀX_D̄j) / h),

where Φ is the standard normal distribution function and h is a tuning parameter. The function Φ(·/h) serves as an approximation to the indicator function; h represents the trade-off between approximation accuracy and estimation feasibility and tends to zero as the sample size grows 21. Note that SAUC(θ) is not convex. Other approximations have been proposed, including the logistic function 22 and the convex ramp function 23. The probit function approximation tends to be more accurate and stable than the logistic function approximation 21 and is more straightforward to implement than the ramp function approximation.
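The probit-smoothed AUC can be sketched directly from its definition: each indicator over a case-control pair is replaced by the standard normal CDF of the scaled difference. A minimal implementation of our own (not the authors' code); `h` is the tuning parameter:

```python
import math

def smooth_auc(theta, cases, controls, h):
    """Probit approximation to the empirical AUC of theta'X: the
    indicator 1(theta'x_i > theta'x_j) is replaced by Phi(diff / h),
    where Phi is the standard normal CDF."""
    Phi = lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    total = 0.0
    for xi in cases:
        for xj in controls:
            diff = sum(t * (a - b) for t, a, b in zip(theta, xi, xj))
            total += Phi(diff / h)
    return total / (len(cases) * len(controls))
```

As h shrinks, Φ(diff/h) approaches the indicator function, so `smooth_auc` approaches the empirical AUC while remaining differentiable in θ.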

3 Methods

We consider a population of centers, where each center has a fixed number of observations. We observe data from C centers, with n_c observations from center c, giving n = Σ_c n_c total observations. We consider a p-dimensional vector of biomarkers X. Let (X, D) denote the biomarkers and outcome for an arbitrary observation. We use the subscript i on X and D to denote the biomarkers and outcome, respectively, for the ith observation, and the superscript c to denote an observation from center c; n_Dc indicates the number of cases in center c. We assume the center-specific disease prevalence is non-trivial, i.e., 0 < P(D = 1 | center c) < 1 for each c. We consider linear biomarker combinations, θᵀX, as they are often a reasonable choice and have intuitive appeal for clinical collaborators.

3.1 Direct maximization

We are interested in the aAUC for a linear combination of biomarkers defined by θᵀX, where θ = (θ₁, …, θ_p)ᵀ. As with the unadjusted AUC, in practice the aAUC is unknown and we instead consider the empirical aAUC, which is based on empirical estimates of the weights, ŵ_c, and the center-specific AUCs, AUC_c(θ). Again, the empirical aAUC involves indicator functions, making direct maximization challenging. However, we can use a smooth approximation to each AUC_c(θ), which in turn provides a smooth approximation to the empirical aAUC. In particular, we propose the SaAUC estimate

θ̂ = argmax_θ SaAUC(θ) = argmax_θ Σ_c ŵ_c SAUC_c(θ),   (1)

where SAUC_c(θ) = (1 / (n_Dc n_D̄c)) Σ_i Σ_j Φ((θᵀX^c_Di − θᵀX^c_D̄j) / h_c). In this approximation, h_c is a tuning parameter that tends to zero as n_c grows. In order to retain identifiability, constraints must be imposed on θ; in particular, for any given scalar k > 0, the combinations kθᵀX and θᵀX have the same AUC. Thus, we constrain ||θ|| = 1 as in 23.

The objective function (1) is a sum of smooth functions and is therefore also smooth. Consequently, gradient-based methods (with Lagrange multipliers to incorporate the constraint) can be used to obtain θ̂, though this procedure may not yield a global maximum since SaAUC(θ) is not convex. Gradient-based methods require starting values θ⁽⁰⁾; one possibility is to use estimates from logistic regression. Additionally, in the above formulation, each center has its own tuning parameter h_c; we set each h_c as a function of the sample standard error of θ⁽⁰⁾ᵀX in center c. This mimics the approach used by others in similar work (e.g., 21) and seems to work well. Since these tuning parameters are estimated based on the starting value θ⁽⁰⁾, the presence of C tuning parameters does not present a large computational burden. Bootstrapping could be used to obtain confidence intervals for the estimated aAUC of the fitted combination θ̂ᵀX.
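As a concrete (if simplified) illustration of the maximization, the sketch below uses a numerical central-difference gradient and re-normalization to the unit sphere in place of analytic derivatives with Lagrange multipliers. All names (`saauc`, `fit_saauc`), the data layout, and the toy data are our own assumptions, not the authors' implementation:

```python
import math
import numpy as np

# Elementwise standard normal CDF.
Phi = np.vectorize(lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0))))

def saauc(theta, centers, h=0.5):
    """Smooth aAUC: case-weighted average of per-center smooth AUCs.
    `centers` is a list of (X_cases, X_controls) array pairs."""
    w = np.array([xc.shape[0] for xc, _ in centers], dtype=float)
    w /= w.sum()
    aucs = [Phi(((xc @ theta)[:, None] - (xn @ theta)[None, :]) / h).mean()
            for xc, xn in centers]
    return float(w @ np.array(aucs))

def fit_saauc(theta0, centers, h=0.5, lr=0.5, steps=100, eps=1e-5):
    """Projected gradient ascent on the smooth aAUC, keeping the
    identifiability constraint ||theta|| = 1 by re-normalizing."""
    theta = np.asarray(theta0, dtype=float)
    theta /= np.linalg.norm(theta)
    for _ in range(steps):
        grad = np.array([(saauc(theta + eps * e, centers, h)
                          - saauc(theta - eps * e, centers, h)) / (2 * eps)
                         for e in np.eye(theta.size)])
        theta = theta + lr * grad
        theta /= np.linalg.norm(theta)  # project back to the unit sphere
    return theta

# Toy data: in both centers only the first biomarker separates cases
# from controls, so the fitted theta should load mainly on it.
rng = np.random.default_rng(1)
centers = [(rng.normal([2.0, 0.0], 1.0, (30, 2)),
            rng.normal(0.0, 1.0, (30, 2))) for _ in range(2)]
theta_hat = fit_saauc([1.0, 1.0], centers)
```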

3.2 Penalization

In practice, it is unlikely that a given combination will have the same AUC in each center. This could be due to, for example, differences in the populations of patients at different centers. In this situation, it may be desirable to construct a biomarker combination that has relatively similar performance across centers. In particular, it may be worth sacrificing a small amount of the overall performance (in terms of the aAUC) for less variability in the center-specific AUCs. For instance, one combination may have an aAUC of 0.7 with center-specific AUCs ranging between 0.55 and 0.85; that is, in some centers the performance is very good, while in others it is quite poor. Another combination may have an aAUC of 0.68 with center-specific AUCs ranging between 0.64 and 0.72. This latter combination has slightly worse overall performance, but its discriminatory ability across centers is much less variable. In the typical multicenter setting, where only a fraction of the centers are sampled, this reduced variability may be desirable as it could lead to a fitted combination that has an AUC in new centers within a narrower range.

To accomplish this, we propose maximizing the penalized objective

f_λ(θ) = SaAUC(θ) − λ Σ_c ŵ_c {SAUC_c(θ) − SaAUC(θ)}²,

where λ ≥ 0 is a fixed penalty parameter. Since f_λ(θ) is the difference of two smooth functions, it can be maximized using gradient-based methods. The goal of this penalized method is to construct a combination such that its performance in a new center is similar to what has been observed in previous centers. The notion of “similar” depends upon the degree of underlying variability across the population of centers as well as the centers that have been sampled and are used to obtain the fitted combination.
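One plausible form of the penalty, the weighted squared deviation of the center-specific smooth AUCs about the smooth aAUC, can be sketched as follows. This exact form, the helper name `penalized_saauc`, and the toy data are our assumptions for illustration:

```python
import math
import numpy as np

Phi = np.vectorize(lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0))))

def penalized_saauc(theta, centers, lam, h=0.5):
    """Smooth aAUC minus lam times the case-weighted variance of the
    center-specific smooth AUCs about the aAUC (an assumed penalty form).
    `centers` is a list of (X_cases, X_controls) array pairs."""
    theta = np.asarray(theta, dtype=float)
    w = np.array([xc.shape[0] for xc, _ in centers], dtype=float)
    w /= w.sum()
    aucs = np.array([Phi(((xc @ theta)[:, None] - (xn @ theta)[None, :]) / h).mean()
                     for xc, xn in centers])
    aauc = float(w @ aucs)
    return aauc - lam * float(w @ (aucs - aauc) ** 2)

# One well-separated center and one uninformative center: penalization
# lowers the objective because the center-specific AUCs disagree.
centers = [(np.array([[3.0]]), np.array([[0.0]])),
           (np.array([[0.0]]), np.array([[0.0]]))]
```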

3.3 Asymptotic results

In the theorem below, we demonstrate good operating characteristics for the combination θ̂ᵀX in large samples. By setting λ = 0, we obtain asymptotic results for the maximization of SaAUC(θ) without penalization. We first present several conditions necessary for the theorem:

  1. The sampled centers are randomly sampled from the population of centers, and observations are randomly sampled within each sampled center.

  2. as , , and such that .

  3. The centers are independent and, within each center, the observations are independent and identically distributed p-dimensional random vectors such that at least one component of X has a distribution with everywhere positive Lebesgue density, conditional on the other components.

  4. The support of X is not contained in any proper linear subspace of R^p.

  5. For fixed λ, both the maximum of the smooth aAUC and the maximum of the penalized objective over the unit sphere {θ : ||θ|| = 1} are attained.

Theorem 1.

Fix and suppose conditions (A1)–(A5) hold. Then
as , , and such that .

The proof of the theorem is given in Appendix A. We have previously demonstrated that, under certain conditions, converges uniformly in probability to and converges uniformly in probability to , and we use these results in the proof of the theorem 24. Briefly, the proof first demonstrates uniform convergence in probability of the difference between and the empirical analogue of (that is, with in place of ) to zero. The proof then uses previous results for to demonstrate uniform convergence in probability of the difference between and the empirical analogue of to zero. Combining these results gives the desired conclusion.

This theorem does not consider the convergence of θ̂ to a particular quantity; instead, the theorem relates to the operating characteristics of the combination θ̂ᵀX. By focusing on the performance of the fitted combination, rather than the combination itself, we remove the need to make additional (restrictive) assumptions. More importantly, since our goal is to develop a combination with high discriminatory ability, demonstrating good operating characteristics of the fitted combination is paramount.

3.4 Choosing the penalization parameter

In other penalized estimation procedures, such as ridge regression or the lasso, the penalty parameter λ is typically chosen via cross-validation, where the value of λ that gives the best cross-validated performance is selected. The motivation for cross-validation is that apparent measures of performance for a given model (that is, estimates of performance based on the data used to fit the model) tend to be optimistic 25. Cross-validation is one way of addressing this problem.

For our penalized estimation method, we can extend the ideas behind cross-validation to the multicenter setting. As just described, the goal of cross-validation is typically to get an idea of the performance in new observations. In the case of data from multiple centers, we would like to get an idea of the performance in new centers. To that end, we propose the following procedure, which we call “leave one center out cross-validation” (LOCOCV):

  1. Choose a sequence of λ values: λ₁, …, λ_L.

  2. For each value of λ and each center k = 1, …, C:

    (a) Fit the biomarker combination using the data from all but the kth center.

    (b) Estimate the AUC of the fitted combination from (a) in the kth center.

  3. Plot the center-specific AUCs from (2b), the corresponding aAUC, and the variability in the center-specific AUCs around the aAUC in (i) the cross-validation “training” centers and (ii) the cross-validation “test” centers as a function of λ.

  4. Choose an appropriate value of λ, and use this value to fit the biomarker combination using the data from all centers.

In some situations, it may be acceptable to sacrifice a small amount of overall performance in return for a substantial decrease in the variability of the center-specific AUCs. In other situations, any decline in overall performance may be undesirable. Consequently, we recommend using the cross-validation plot described above to choose λ, rather than proposing an automated procedure, as the relative costs and benefits of using a larger or smaller value of λ depend on the specific context. In other words, individual users must determine the appropriate degree of penalization.
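The LOCOCV loop itself is generic bookkeeping; a sketch of its structure, where the `fit` and `evaluate` callbacks stand in for the SaAUC fitting and held-out AUC estimation steps (the function names and interface are our own abstraction):

```python
def lococv(centers, lambdas, fit, evaluate):
    """Leave-one-center-out cross-validation: for each lambda, fit the
    combination on all but one center, then estimate its performance in
    the held-out center; returns {lambda: [held-out estimates]}."""
    results = {}
    for lam in lambdas:
        held_out = []
        for k in range(len(centers)):
            train = centers[:k] + centers[k + 1:]  # drop the kth center
            combination = fit(train, lam)
            held_out.append(evaluate(combination, centers[k]))
        results[lam] = held_out
    return results

# Dummy callbacks just to show the bookkeeping: each held-out "AUC"
# is the penalty value plus the held-out center's label.
res = lococv([10, 20, 30], [0.0, 1.0],
             fit=lambda train, lam: lam,
             evaluate=lambda comb, center: comb + center)
```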

4 Results

4.1 Direct maximization without penalization (λ = 0)

We used simulations to investigate the performance of the proposed direct maximization method. Most of these simulations were based on the set-up used by 23.

In each simulation, we generated a population of centers and individuals, and obtained training data by sampling from this population. In particular, we first sampled centers from the population of centers. Then, within each of the sampled centers, we sampled observations from the observations available in each center (where and did not vary across centers). These observations formed the training data, in which the combinations were constructed. The fitted combinations were then evaluated in independent test data, which consisted of the observations in each of the centers not used in the training data. We considered the following settings: (i) , (ii) , and (iii) .

23 noted that the presence of outliers may lead to diminished performance of logistic regression, while methods based on maximizing the AUC may be less affected since the AUC is a rank-based measure. Thus, we considered simulations with and without outliers in the data-generating model. We focused on the setting of two biomarkers and generated the biomarker vector as a mixture of two independent bivariate normal random variables with mean zero and different covariance matrices, where mixture membership was determined by an independent Bernoulli random variable whose success probability was positive when outliers were simulated and zero otherwise. Other simulations with more complex center effects were considered, including those where the distribution of the biomarkers and/or the biomarker combination varied by center; the results were largely similar to those from the scenario described above.

Estimates from robust logistic regression were used as starting values for the proposed SaAUC method, and the SaAUC method was compared to robust logistic regression and standard unconditional logistic regression, both with fixed center-specific intercepts. We used the robust logistic regression method proposed by 26. This method uses a deviance function that limits the influence of individual observations on the model fit, making it more robust to outliers than standard logistic regression. In addition, in two of the three settings, we used conditional logistic regression both to provide starting values and to serve as a comparator for the SaAUC method. All methods fit a linear combination. The simulations were repeated 1000 times.

The results are presented in Table 1. Clearly, when outliers are present, the proposed method outperforms standard and robust logistic regression, both in terms of the center-adjusted AUC and the center-specific AUCs. There also appears to be less variability in performance across simulations when the SaAUC approach is used to construct combinations. As was observed by 23 for the AUC, when outliers were not present, the three methods produced combinations with comparable performance. In general, we found the results were very similar when conditional logistic regression was used to provide starting values and as a comparison with the SaAUC method (Table 1 of Appendix B).

                       GLM                        rGLM                       SaAUC
Setting  Outliers   aAUC     Min      Max      aAUC     Min      Max      aAUC     Min      Max
(i)      Yes        0.6244   0.6065   0.6424   0.6492   0.6315   0.6666   0.6856   0.6684   0.7025
                   (0.012)  (0.013)  (0.013)  (0.030)  (0.031)  (0.030)  (0.007)  (0.008)  (0.007)
         No         0.7032   0.6866   0.7197   0.7032   0.6866   0.7196   0.7030   0.6864   0.7195
                   (0.002)  (0.004)  (0.004)  (0.002)  (0.004)  (0.004)  (0.002)  (0.004)  (0.004)
(ii)     Yes        0.6233   0.5444   0.6992   0.6473   0.5692   0.7215   0.6843   0.6082   0.7564
                   (0.008)  (0.014)  (0.012)  (0.027)  (0.030)  (0.026)  (0.004)  (0.011)  (0.009)
         No         0.7036   0.6301   0.7731   0.7036   0.6301   0.7731   0.7035   0.6299   0.7730
                   (0.001)  (0.009)  (0.008)  (0.001)  (0.009)  (0.008)  (0.001)  (0.010)  (0.008)
(iii)    Yes        0.6221   0.4683   0.7659   0.6333   0.4798   0.7756   0.6796   0.5287   0.8154
                   (0.004)  (0.015)  (0.013)  (0.013)  (0.020)  (0.017)  (0.004)  (0.015)  (0.012)
         No         0.7038   0.5574   0.8330   0.7038   0.5574   0.8330   0.7037   0.5573   0.8329
                   (0.001)  (0.014)  (0.010)  (0.001)  (0.014)  (0.010)  (0.001)  (0.014)  (0.010)

Table 1: Mean (standard deviation) of the aAUC in test data and mean (standard deviation) of the minimum and maximum center-specific AUCs across the centers in the test data, based on combinations fitted by logistic regression (GLM), robust logistic regression (rGLM), and the proposed method without penalization (λ = 0; SaAUC). Rows correspond to the three simulation settings described in the text. Robust logistic regression estimates were used as the starting values for the SaAUC method.

The proposed SaAUC method had excellent convergence rates, failing in only a small fraction of simulations. Robust logistic regression occasionally failed to converge; when this happened, standard unconditional logistic regression was used to obtain starting values. In addition, when simulating data with outliers, in some instances the value of the true biomarker combination was so large that the success probability could not be evaluated numerically; these observations were removed from the simulated dataset, though this affected only a small fraction of observations. Finally, in one of the settings, some of the training data centers were concordant (that is, all cases or all controls) and were removed from the analysis. Up to 11% of simulations had one or two concordant centers in the training data.

Beyond the setting of outliers in the data, we considered a scenario where the true biomarker combination involved ten biomarkers, but only two were available for fitting. As above, we generated a population of centers and individuals and obtained training data by sampling from this population, using the remaining centers as test data in which the fitted combinations were evaluated. The ten biomarkers were independent normal random variables, and the true combination placed coefficients of varying magnitude on the ten biomarkers. Center-specific intercepts were drawn from a normal distribution, and the outcome was generated as a Bernoulli random variable whose success probability followed a linear logistic model. Although the true combination is a function of all ten biomarkers, only two biomarkers were available for data analysis. We used estimates from robust logistic regression as starting values for the proposed SaAUC method, and compared the SaAUC method to robust logistic regression and standard unconditional logistic regression. All methods fit a linear combination and the simulations were repeated 1000 times. The mean (standard deviation) aAUC in the test data for logistic regression, robust logistic regression, and the proposed SaAUC method were 0.6330 (0.018), 0.6284 (0.020), and 0.6453 (0.016), respectively. Thus, the proposed method offers a small gain in the aAUC over the logistic regression approaches.

4.2 Direct maximization with penalization (λ > 0)

We explored our proposed penalized estimation procedure via simulated datasets. In particular, we used individual datasets generated under a variety of models to explore how the method may perform in practice. As was done in the earlier simulations, we first generated a population of centers and individuals and obtained training data by sampling from this population. Specifically, we considered a population of centers with observations in each and sampled observations from each of centers to form the training data, with the observations in the remaining centers serving as test data.

We considered nearly 400 individual datasets; different data-generating mechanisms were used, including variations on the link function relating the biomarkers to the outcome, the distribution of the biomarkers across centers, and the degree of heterogeneity in the true biomarker combination across centers. We simulated four independent normally distributed biomarkers with equal variance and, throughout, the true biomarker combination in each center was linear. The details of the data-generating models for the examples presented here are given in the captions of the corresponding figures. Estimates from robust logistic regression were used as starting values for the penalized estimation procedure. For each simulation, we applied the LOCOCV procedure described in Section 3.4. We considered 50 values of λ equally spaced (on the log scale) between 0.1 and 200. This range of values is somewhat arbitrary. In other penalized estimation procedures, it is common to choose the maximum value of λ to be the smallest value that returns coefficient estimates of 0. The analogous requirement in the current setting would be the value of λ that gives center-specific AUCs of 0.5 in all centers. This is only expected to occur when all of the biomarker coefficients are 0, which cannot happen due to the constraint ||θ|| = 1 in the penalized estimation procedure. The key point is that the range of λ values used here is meant to be illustrative, not prescriptive.

We present a handful of examples here, and include several more in Figures 1–16 of Appendix C. All of the plots we present have the same layout: the left plot gives the training data results, the middle plot gives the results of the LOCOCV procedure, and the right plot gives the test data results. In each plot, the horizontal axis shows λ. Throughout, the solid lines represent the results of the penalized estimation procedure, the dashed lines represent the standard logistic regression results, and the dot-dashed lines represent the robust logistic regression results. In each plot, the left vertical axis displays the AUC and corresponds to the gray lines (center-specific AUCs for the penalized estimation procedure) and the black lines (center-adjusted AUCs for the penalized estimation procedure, robust logistic regression (“rGLM”), and standard logistic regression (“GLM”)). The right vertical axis shows the variability in center-specific performance on the standard deviation scale and corresponds to the red lines (variability relative to the adjusted AUC in the training centers) and the blue lines (variability relative to the adjusted AUC in the test centers).

In the test data, the centers are so large that the AUCs calculated in these centers are presumed to be equal to the population values. In the training data and cross-validation procedure, on the other hand, the AUCs are empirical estimates. Thus, in the test data, for a combination fitted in the training data, the variability relative to the training centers is [Σ_c w_c* (AUC_c* − aAUC)²]^{1/2} and the variability relative to the test centers is [Σ_c w_c* (AUC_c* − aAUC*)²]^{1/2}, where the w_c* are the weights for the centers in the test data, aAUC is the adjusted AUC for the centers in the training data, and aAUC* and AUC_c* are the adjusted AUC and the center-specific AUCs, respectively, for the centers in the test data.

Figures 1 and 2 present examples where the LOCOCV procedure does a particularly nice job of mimicking the patterns in the test data. We encountered some datasets where the penalized estimation procedure did not work as well. For instance, in a small number of datasets, the variability increased with increasing λ in the test data, despite the patterns seen in the training data and the LOCOCV results; Figure 3 presents one such example. In this situation, a value of λ may be chosen that produces a fitted combination with worse overall performance and more variability in center-specific performance than would be obtained without penalization. However, in this example, the drop in overall performance is not large and the variability is fairly small even when λ is large. Our simulations indicate that when the centers in the training data are not representative of the population of centers, the results from the cross-validation procedure may not reflect the patterns in the test data; such discrepancies would be expected in general when a resampling procedure is applied to a non-representative sample.

Figure 1: Penalized estimation example 1. The four biomarkers were independently normally distributed with mean 0 and center-specific variances, i.e., . The relationships between the biomarkers and were allowed to vary by center and these variations (defined by ) differed across the four biomarkers. The outcome was generated as a Bernoulli random variable with success probability ), where is a center-specific intercept and for and otherwise. In this example, the penalized estimation procedure produces combinations with reduced variability across centers with minimal loss in overall performance (for modest values). Importantly, the LOCOCV results mimic what is seen in the test data.
Figure 2: Penalized estimation example 2. The four biomarkers were independently normally distributed with mean 0 and center-specific variances. The relationships between the biomarkers and the outcome were allowed to vary by center, and these variations differed across the four biomarkers. The outcome was generated as a Bernoulli random variable whose success probability followed a logistic model with a center-specific intercept and center-specific biomarker coefficients. This is an example where the LOCOCV results closely mimic the patterns seen in the test data, illustrating the value of performing cross-validation.
Figure 3: Penalized estimation example 3. The four biomarkers were independently normally distributed with mean 0 and center-specific variances. The outcome was generated as a Bernoulli random variable whose success probability followed a logistic model with a center-specific intercept. This is an example where the penalization procedure does not work as well: the variability in performance across centers increases with increasing $\lambda$, despite the patterns seen in the training data and the LOCOCV results.

Problems with convergence were encountered in fewer than 6% of the examples we considered. Such issues generally arose in the more extreme scenarios and primarily occurred during cross-validation. In practice, this may require modifying the range of $\lambda$ values considered. None of the results presented here involved any convergence failures.

4.3 TRIBE-AKI study data

To illustrate the methods we have developed, we used data from the TRIBE-AKI study and constructed combinations of three biomarkers measured no more than six hours after surgery: urine neutrophil gelatinase-associated lipocalin (NGAL), plasma heart-type fatty acid binding protein (h-FABP), and plasma troponin I (TNI). These data are used for illustration, not to report new findings of the TRIBE-AKI study. As described above, the TRIBE-AKI study involved individuals at six medical centers who were undergoing cardiac surgery and had no evidence of AKI prior to surgery. AKI is a serious complication of cardiac surgery, often caused by interruptions in blood flow to the kidneys during the procedure. However, AKI is currently diagnosed by assessing changes in serum creatinine, which frequently do not occur until several days after surgery. The aim of the TRIBE-AKI study was to use biomarkers measured soon after surgery to provide an earlier diagnosis of AKI.

We removed observations with missing values for any of the three biomarkers (leaving 962 observations), log-transformed the biomarker values, and scaled the biomarkers to have equal variance to improve convergence of the gradient-based algorithm used by our methods. The prevalence of AKI in each center was between 7.8% and 22.9%, and the sizes of the centers ranged from 53 to 483 patients.
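The preprocessing steps just described can be sketched as follows. The column names are hypothetical, and scaling each biomarker to unit variance is one way to achieve the equal variances mentioned above.

```python
import numpy as np
import pandas as pd

def preprocess(df, biomarkers):
    """Drop rows missing any biomarker, log-transform each biomarker, and
    scale it to unit variance; putting the biomarkers on a common scale
    can improve convergence of gradient-based fitting algorithms."""
    out = df.dropna(subset=biomarkers).copy()
    for b in biomarkers:
        logged = np.log(out[b])
        out[b] = logged / logged.std(ddof=1)
    return out

# Hypothetical column names, for illustration only:
# clean = preprocess(raw, ["ngal", "hfabp", "tni"])
```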

We applied standard logistic regression (“GLM”), robust logistic regression (“rGLM”), and the proposed SaAUC method to the TRIBE-AKI study data. The fitted combinations (with normalized coefficients) for GLM, rGLM, and SaAUC were

respectively. The apparent (resubstitution) estimates of the center-adjusted AUC for these combinations were 0.6878, 0.6878, and 0.6918, respectively. After correcting for resubstitution bias, the center-adjusted AUC estimates were 0.6819, 0.6820, and 0.6825. Thus, in these data, there was little difference in the performance of the combinations (though there were clear differences in the fitted combinations themselves). Furthermore, there was more optimistic bias in the apparent aAUC estimate for the combination fitted by the SaAUC method, which might be expected in general since the SaAUC method optimizes a smooth approximation to the estimated aAUC.
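The optimistic bias noted above stems from the nature of the SaAUC objective: the indicator function in the empirical AUC is replaced by a smooth sigmoid so that gradient-based optimization can be applied, and the fitted coefficients maximize that surrogate on the training data. The sketch below uses a logistic kernel with bandwidth h and case-count weights; these are assumptions of the illustration, not necessarily the choices of the SaAUC implementation.

```python
import numpy as np

def smooth_aauc(theta, X, y, center, h=0.1):
    """Smooth surrogate for the center-adjusted AUC of the combination
    X @ theta: the indicator I(case score > control score) is replaced by
    a logistic sigmoid with bandwidth h (an assumed smoothing choice)."""
    theta = theta / np.linalg.norm(theta)  # the AUC is scale-invariant
    score = X @ theta
    total, wsum = 0.0, 0.0
    for c in np.unique(center):
        sc, yc = score[center == c], y[center == c]
        d = sc[yc == 1][:, None] - sc[yc == 0][None, :]
        if d.size == 0:  # skip centers lacking cases or controls
            continue
        w = np.sum(yc == 1)  # assumed weight: number of cases in center
        total += w * np.mean(1.0 / (1.0 + np.exp(-d / h)))
        wsum += w
    return total / wsum

# Maximizing smooth_aauc (e.g., minimizing its negative with
# scipy.optimize.minimize) yields a fitted combination; because the
# coefficients are tuned to this surrogate, the apparent aAUC of the fit
# tends to be optimistically biased.
```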

Finally, we applied the proposed penalized estimation method to the TRIBE-AKI study data (Figure 4). The results from the LOCOCV procedure support choosing a modest value of $\lambda$, which is expected to give a reduction in the variability of center-specific performance of about 25–30%, with essentially no loss in overall (center-adjusted) performance. In particular, the LOCOCV results indicate that at the smallest value of $\lambda$ considered, the center-specific AUC estimates ranged from 0.6042 to 0.7250, whereas at the chosen value of $\lambda$, the center-specific AUC estimates were between 0.6270 and 0.6986. Using the chosen value of $\lambda$ in the full TRIBE-AKI study dataset yielded the combination

Figure 4: Penalized estimation method applied to the TRIBE-AKI study data. The results from the LOCOCV procedure support choosing a modest value of $\lambda$, which is expected to give a reduction in the variability of center-specific performance of about 25–30%.
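The LOCOCV procedure used to select the penalty parameter can be sketched generically. Here `fit_penalized` stands in for the penalized SaAUC fitting routine and `auc_fn` for an AUC estimator; both names and signatures are hypothetical. Each center is held out in turn, the combination is fitted on the remaining centers, and performance is summarized across the held-out centers.

```python
import numpy as np

def lococv(X, y, center, lambdas, fit_penalized, auc_fn):
    """Leave-one-center-out cross-validation over a grid of penalty values.
    fit_penalized(X, y, center, lam) -> coefficient vector (hypothetical
    routine); auc_fn(y, score) -> AUC estimate. Returns, for each lambda,
    the mean and variance of the held-out center-specific AUCs."""
    results = {}
    for lam in lambdas:
        held_out = []
        for c in np.unique(center):
            train = center != c
            theta = fit_penalized(X[train], y[train], center[train], lam)
            held_out.append(auc_fn(y[~train], X[~train] @ theta))
        held_out = np.asarray(held_out)
        results[lam] = (held_out.mean(), held_out.var())
    return results
```

A penalty value can then be chosen where the variance of the held-out AUCs drops appreciably while their mean is essentially unchanged, mirroring the trade-off described for the TRIBE-AKI analysis.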

5 Discussion

We have proposed a method to construct biomarker combinations by maximizing a smooth approximation to the center-adjusted AUC. In addition, we have incorporated a penalty term that can be used to encourage similarity in performance across centers. Our method could be useful for other discrete nuisance covariates, such as batch, analysis laboratory, or measurement protocol. We used data on biomarkers measured after cardiac surgery to construct combinations for the diagnosis of acute kidney injury, demonstrating the feasibility of our methods. An R package including code to implement aAUC maximization without penalization, maxadjAUC, is publicly available via CRAN. In addition, R code to implement maximization with penalization is available on GitHub (https://github.com/allisonmeisner/maxadjAUCpen).

Multicenter studies include only a subset of centers, so predicted probabilities (risks) for individuals in new centers cannot be generated without relying on methods that require strong assumptions, e.g., random intercept logistic regression [24]. The methods proposed here do not provide predicted probabilities for individuals in either new centers or the observed centers. Consequently, it is not possible to directly evaluate the calibration of a combination fitted by our method. Additionally, while we have provided a theoretical result related to the large-sample performance of combinations fitted by our method, we cannot offer guarantees on the performance of a fitted combination in a new center. We view this as a fundamental limitation of the data structure, rather than of the methods. Moreover, when biomarkers exhibit “center effects,” further development of the biomarkers would be required prior to clinical application; after such development, predicted probabilities would change. A strength of the proposed methodology is that it does not rely on assumptions about the nature of “center effects” in constructing biomarker combinations; thus, it allows for the identification of promising combinations for further development while avoiding such assumptions.

In addition, in multicenter studies, different sampling schemes could be used (e.g., case-control or stratified case-control sampling). The estimated weights could be affected by the sampling procedure and may no longer estimate the intended population quantities. This would in turn affect the interpretation of the center-adjusted AUC, though it would still be a measure of conditional performance, and our methods would then be optimizing this measure of conditional performance. The sampling scheme could also affect the validity of the asymptotic results we have provided. Furthermore, if a study involves matching, our methods would need to be modified to adjust the AUC for the matching in addition to center [6].

The adjusted AUC is a reasonable estimand even when the center-specific AUCs of a given combination are not the same across centers, though it is helpful to consider these center-specific AUCs, as they provide some insight into how the combination might be expected to perform in a new center. In addition, differences in performance across centers may be scientifically meaningful and merit further investigation. When assessing center-specific AUCs, it is important to also consider the sizes of the centers, as estimates of center-specific AUCs from small centers may be highly variable. One feature of our penalization approach is the use of the weights in the penalty function, which reflect the proportion of cases in each center and so will tend to give less weight to small centers. Furthermore, the optimal combination (in terms of the center-specific AUC) may be different for each center. Importantly, however, our aim is not to identify the optimal combination in every center; instead, we are interested in constructing a single combination that performs well across centers.

Since our smooth approximation function is not convex, further research is needed on the choice of starting values for the proposed method. It may also be possible to extend the method proposed by [23], which optimizes a convex ramp-function approximation to the AUC, to the center-adjusted AUC; this may lead to further improvements in performance over logistic regression, as was seen in [23] for the unadjusted AUC. In addition, when the centers are very small, the empirical center-specific AUC will be unreliable, and research is needed into other (possibly parametric) methods that estimate the center-specific AUC by borrowing information across centers. Furthermore, it may be possible to apply the ideas in this paper to survival data using the method proposed by [27] for the covariate-adjusted AUC for censored data. Extending the proposed method to other center-adjusted measures of performance, such as the partial AUC or the true positive rate at a fixed false positive rate, is another avenue for future research.

Acknowledgments

This work was supported by the National Institutes of Health (F31 DK108356, R01 HL085757, and K24 DK090203). The opinions, results, and conclusions reported in this article are those of the authors and are independent of the funding sources.

The authors wish to acknowledge the TRIBE-AKI study investigators: Steven G. Coca (Department of Internal Medicine, Icahn School of Medicine at Mount Sinai, New York, New York), Amit X. Garg (Institute for Clinical Evaluative Sciences Western, London, Ontario, Canada; Division of Nephrology, Department of Medicine, and Department of Epidemiology and Biostatistics, University of Western Ontario, London, Ontario, Canada), Jay Koyner (Section of Nephrology, Department of Medicine, University of Chicago Pritzker School of Medicine, Chicago, Illinois), and Michael Shlipak (Kidney Health Research Collaborative, San Francisco Veterans Affairs Medical Center, University of California, San Francisco, San Francisco, California). Urine NGAL, plasma h-FABP, and plasma cardiac troponin I assays used in the TRIBE-AKI study were donated by Abbott Diagnostics, Randox Laboratories, and Beckman Coulter, respectively.

Data Availability Statement

An R package containing code to implement aAUC maximization without penalization, maxadjAUC, is publicly available via CRAN. R code to implement maximization with penalization is available on GitHub (https://github.com/allisonmeisner/maxadjAUCpen).

Appendix A


Proof of Theorem 1. First we will show Let

We can write

Under conditions (A1)–(A4), we have shown (Lemma 1 and Theorem 1 in Meisner et al. [24]) that and
We can write

where as . Then, by Theorem 1 in Meisner et al. [24],

where , , and ; note that , and . Then

We have (by Lemma 1 and Theorem 1 in Meisner et al. [24]) and This gives

Since for every ,

Furthermore, we have previously shown (Theorem 1 in Meisner et al. [24]) that
as , , and such that . Thus,

Now consider . We will first show Ma and Huang [22] demonstrated that as . We can write

since for every .

Now consider . We can write

where ,