Combining Biomarkers by Maximizing the True Positive Rate for a Fixed False Positive Rate

10/04/2019
by Allison Meisner, et al.
Johns Hopkins University

Biomarkers abound in many areas of clinical research, and often investigators are interested in combining them for diagnosis, prognosis, or screening. In many applications, the true positive rate for a biomarker combination at a prespecified, clinically acceptable false positive rate is the most relevant measure of predictive capacity. We propose a distribution-free method for constructing biomarker combinations by maximizing the true positive rate while constraining the false positive rate. Theoretical results demonstrate desirable properties of biomarker combinations produced by the new method. In simulations, the biomarker combination provided by our method demonstrated improved operating characteristics in a variety of scenarios when compared with alternative methods for constructing biomarker combinations.


1 Introduction

As the number of available biomarkers has grown, so has the interest in combining them for the purposes of diagnosis, prognosis, or screening. Over the past decade, much work has been done to develop methods for constructing biomarker combinations by targeting measures of performance, including those related to the receiver operating characteristic, or ROC, curve. This is in contrast to more traditional methods that construct biomarker combinations by optimizing general global fit criteria, such as the maximum likelihood approach. While methods to construct both linear and nonlinear combinations have been proposed, linear biomarker combinations are more common than nonlinear combinations, due to their greater interpretability and ease of construction (Wang and Chang, 2011; Hsu and Hsueh, 2013).

Although the area under the ROC curve, the AUC, is arguably the most popular way to summarize the ROC curve, there is often interest in identifying a biomarker combination with a high true positive rate (TPR), the proportion of correctly classified diseased individuals, while setting the false positive rate (FPR), the proportion of incorrectly classified nondiseased individuals, at some clinically acceptable level. A common practice among applied researchers is to construct linear biomarker combinations using logistic regression, and then calculate the TPR for the prespecified FPR; see, e.g., Moore et al. (2008). While methods for constructing biomarker combinations by maximizing the AUC or the partial AUC have been developed, these methods do not directly target the TPR for a specified FPR.

We propose a distribution-free method for constructing linear biomarker combinations by maximizing the TPR while constraining the FPR. We demonstrate desirable theoretical properties of the resulting combination, and provide empirical evidence of good small-sample performance through simulations. To illustrate our method, we consider data from a prospective study of diabetes mellitus in 532 adult women with Pima Indian heritage (Smith et al., 1988). Several variables were measured for each participant, and criteria from the World Health Organization were used to identify women who developed diabetes. A primary goal of the study was to predict the onset of diabetes within five years.

2 Background

2.1 ROC Curve and Related Measures

The ROC curve provides a means to evaluate the ability of a biomarker (or, equivalently, a biomarker combination) X to identify individuals who have or will experience a binary outcome D. For example, in a diagnostic setting, D denotes the presence or absence of disease and X may be used to identify individuals with the disease. The ROC curve provides information about how well the biomarker discriminates between individuals who have or will experience the outcome, that is, the cases, and individuals who do not have or will not experience the outcome, that is, the controls (Pepe, 2003). Mathematically, if larger values of X are more indicative of having or experiencing the outcome, for each threshold δ we can define the TPR as P(X > δ | D = 1) and the FPR as P(X > δ | D = 0) (Pepe, 2003). For a given δ, the TPR is also referred to as the sensitivity, and one minus the specificity equals the FPR (Pepe, 2003). The ROC curve is a plot of the TPR versus the FPR as δ ranges over all possible values; as such, it is non-decreasing and takes values in the unit square (Pepe, 2003). A perfect biomarker has an ROC curve that reaches the upper left corner of the unit square, and a useless biomarker has an ROC curve on the 45-degree line (Pepe, 2003).

The most common summary of the ROC curve is the AUC, the area under the ROC curve. The AUC ranges between 0.5 for a useless biomarker and 1 for a perfect biomarker (Pepe, 2003). The AUC has a probabilistic interpretation: it is the probability that the biomarker value for a randomly chosen case is larger than that for a randomly chosen control, assuming that higher biomarker values are more indicative of having or experiencing the outcome (Pepe, 2003). Both the ROC curve and the AUC are invariant to monotone increasing transformations of the biomarker (Pepe, 2003).

The AUC summarizes the entire ROC curve, but in many situations it is more appropriate to only consider certain FPR values. For example, screening tests require a very low FPR, while diagnostic tests for fatal diseases may allow for a slightly higher FPR if the corresponding TPR is very high (Hsu and Hsueh, 2013). Such considerations led to the development of the partial AUC, the area under the ROC curve over some range of FPR values (Pepe, 2003). Rather than considering a range of FPR values, there may be interest in fixing the FPR at a single value, determining the corresponding threshold δ, and evaluating the TPR at that threshold. As opposed to the AUC and the partial AUC, this approach yields a single classifier, or decision rule, which may appeal to researchers seeking a tool for clinical decision-making.
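As a concrete illustration of these quantities, the empirical TPR at a fixed FPR can be computed by thresholding at a quantile of the biomarker among controls. The following minimal R sketch uses simulated data; all variable names and distributions in it are illustrative only.

```r
## Empirical TPR and FPR for a single biomarker at a threshold delta,
## and the TPR at a fixed FPR of 0.20 (threshold = quantile among controls).
set.seed(1)
d <- rbinom(500, 1, 0.3)                      # binary outcome: 1 = case, 0 = control
x <- rnorm(500, mean = ifelse(d == 1, 1, 0))  # cases shifted upward (illustrative)

emp_tpr <- function(x, d, delta) mean(x[d == 1] > delta)  # estimates P(X > delta | D = 1)
emp_fpr <- function(x, d, delta) mean(x[d == 0] > delta)  # estimates P(X > delta | D = 0)

delta_20 <- quantile(x[d == 0], probs = 0.80)  # threshold giving an empirical FPR of 0.20
emp_tpr(x, d, delta_20)                        # TPR at the fixed FPR of 0.20
```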

2.2 Biomarker Combinations

Many methods to combine biomarkers have been proposed, and they can be divided into two categories. The first includes indirect methods that seek to optimize a measure other than the performance measure of interest, while the second category includes direct methods that optimize the target performance measure. We focus on the latter.

Targeting the entire ROC curve (that is, constructing a combination that produces an ROC curve that dominates the ROC curve for all other linear combinations at all points) is very challenging and is only possible under special circumstances. Su and Liu (1993) demonstrated that, when the vector X of biomarkers has a multivariate normal distribution conditional on D with proportional covariance matrices, it is possible to identify the linear combination that maximizes the TPR uniformly over the entire range of FPRs; this linear combination is Fisher’s linear discriminant function.

McIntosh and Pepe (2002) used the Neyman–Pearson lemma to demonstrate optimality (in terms of the ROC curve) of the likelihood ratio function and, consequently, of the risk score P(D = 1 | X) and monotone transformations thereof. Thus, if the biomarkers are conditionally multivariate normal and the covariance matrices in the two groups are equal, the optimal linear combination dominates not just every other linear combination, but also every nonlinear combination. This results from the fact that, in this case, the linear logistic model holds: the log odds of D = 1 given X is a linear function of the p biomarkers, where p is the dimension of X. If the covariance matrices are proportional but not equal, the likelihood ratio is a nonlinear function of the biomarkers, as shown in Appendix A, and the optimal biomarker combination with respect to the ROC curve is nonlinear.

In general, there is no linear combination that dominates all others in terms of the TPR over the entire range of FPR values (Su and Liu, 1993; Anderson and Bahadur, 1962). Thus, methods to optimize the AUC have been proposed. When the biomarkers are conditionally multivariate normal with nonproportional covariance matrices, Su and Liu (1993) gave an explicit form for the best linear combination with respect to the AUC. Others have targeted the AUC without any assumption on the distribution of the biomarkers; many of these methods rely on smooth approximations to the empirical AUC, which involves indicator functions (Ma and Huang, 2007; Fong, Yin, and Huang, 2016; Lin et al., 2011).

Acknowledging that often only a range of FPR values is of interest clinically, methods have been proposed to target the partial AUC over some FPR range. Some methods make parametric assumptions about the joint distribution of the biomarkers (Yu and Park, 2015; Hsu and Hsueh, 2013; Yan et al., 2018) while others do not (Wang and Chang, 2011; Komori and Eguchi, 2010; Yan et al., 2018). The latter group of methods generally uses a smooth approximation to the partial AUC, similar to some of the methods that aim to maximize the AUC (Wang and Chang, 2011; Komori and Eguchi, 2010; Yan et al., 2018). One challenge faced in partial AUC maximization is that for narrow FPR intervals the partial AUC is often very close to 0, which can make optimization difficult (Hsu and Hsueh, 2013).

Some work in constructing biomarker combinations by maximizing the TPR has been done for conditionally multivariate normal biomarkers. In this setting, procedures for constructing a linear combination that maximizes the TPR for a fixed FPR (Anderson and Bahadur, 1962; Gao et al., 2008) as well as methods for constructing a linear combination by maximizing the TPR for a range of FPR values (Liu, Schisterman, and Zhu, 2005) have been proposed. Importantly, in the method proposed by Liu, Schisterman, and Zhu (2005), the range of FPR values over which the fitted combination is optimal may depend on the combination itself; that is, the range of FPR values may be determined by the combination and so may not be fixed in advance. Thus, this method does not optimize the TPR for a prespecified FPR. Baker (2000) proposed a flexible nonparametric method for combining biomarkers by optimizing the ROC curve over a narrow target region of FPR values. However, this method is not well-suited to situations in which more than a few biomarkers are to be combined.

An important benefit of constructing linear biomarker combinations by targeting the performance measure of interest is that the performance of the combination will be at least as good as the performance of the individual biomarkers (Pepe, Cai, and Longton, 2006). Indeed, several authors have recommended matching the objective function to the performance measure, i.e., constructing biomarker combinations by optimizing the relevant measure of performance (Hwang et al., 2013; Liu, Schisterman, and Zhu, 2005; Wang and Chang, 2011; Ricamato and Tortorella, 2011). To that end, we propose a distribution-free method to construct biomarker combinations by maximizing the TPR for a given FPR.

Figure 1 illustrates the importance of targeting the measure of interest in constructing biomarker combinations. In this example, combinations of three biomarkers are constructed by (i) maximizing the logistic likelihood, (ii) maximizing the AUC via the optAUC package in R (i.e., the method of Huang, Qin, and Fang (2011)), and (iii) maximizing the TPR for an FPR of 20% using the proposed method. The ROC curves for the three combinations differ markedly near the prespecified FPR of 20%. In particular, the TPRs at an FPR of 20% for the three combinations are 18.0%, 24.0%, and 34.0% for maximum likelihood, AUC optimization, and maximization of the TPR for a given FPR, respectively. This example highlights the utility of methods that target the TPR for a specific FPR as opposed to methods that target other measures.

Figure 1: Biomarker combinations obtained by targeting different measures. In this illustrative example, there is interest in the TPR for an FPR of 20%. Combinations of three biomarkers were constructed in training data (400 cases and 400 controls) and evaluated in a large test dataset (10,000 cases and 10,000 controls). Two of the biomarkers, X1 and X2, were distributed as conditional bivariate lognormal random variables with different parameters among cases and controls. The third biomarker, X3, was distributed as an independent lognormal random variable with the same distribution among cases and controls. The combinations were constructed by maximizing the TPR for an FPR of 20% (“Optimal TPR”; green), by maximizing the logistic likelihood (“Logistic Regression”; blue), and by maximizing the AUC (“Optimal AUC”; red). The gray line indicates the ROC curve for a useless marker (FPR and TPR are equal), the black dashed line denotes an FPR of 20%, and the dot-dashed lines indicate the TPRs for an FPR of 20%. This figure appears in color in the electronic version of this article.

3 Methodology

3.1 Description

Cases are denoted by the subscript 1 and controls are denoted by the subscript 0. Let X_{1i} denote the vector of biomarkers for the ith case, i = 1, …, n1, and let X_{0j} denote the vector of biomarkers for the jth control, j = 1, …, n0.

We propose constructing a linear biomarker combination of the form θᵀX for a p-dimensional biomarker vector X by maximizing the TPR while keeping the FPR below some prespecified, clinically acceptable value t. We define the true and false positive rates for a given X as a function of θ and a threshold δ:

TPR(θ, δ) = P(θᵀX > δ | D = 1)  and  FPR(θ, δ) = P(θᵀX > δ | D = 0).

Since the true and false positive rates for a given combination and threshold are invariant to scaling of the parameters (θ, δ), we must restrict θ to ensure identifiability. Specifically, we constrain ‖θ‖ = 1, as in Fong, Yin, and Huang (2016). For any fixed t, we can consider

(θ*, δ*) ∈ argmax_{(θ, δ) ∈ Ω_t} TPR(θ, δ),

where Ω_t = {(θ, δ): ‖θ‖ = 1, FPR(θ, δ) ≤ t}. This provides the optimal combination and threshold (θ*, δ*). We define (θ*, δ*) to be an element of the argmax, which may be a set.

Of course, in practice, the true and false positive rates are unknown, so TPR(θ, δ) and FPR(θ, δ) cannot be computed. We can replace these unknowns by their empirical estimates,

TPR̂(θ, δ) = (1/n1) Σ_{i=1}^{n1} 1(θᵀX_{1i} > δ)  and  FPR̂(θ, δ) = (1/n0) Σ_{j=1}^{n0} 1(θᵀX_{0j} > δ),

where n1 is the number of cases and n0 is the number of controls, giving the total sample size n = n0 + n1. We can then define

(θ̂, δ̂) ∈ argmax_{(θ, δ) ∈ Ω̂_t} TPR̂(θ, δ),

where Ω̂_t = {(θ, δ): ‖θ‖ = 1, FPR̂(θ, δ) ≤ t}. It is possible to conduct a grid search over (θ, δ) to perform this constrained optimization, though this becomes computationally demanding when combining more than two or three biomarkers.
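For two biomarkers, such a grid search can be sketched as follows; the angle parameterization of the unit-norm constraint and the grid resolution are choices made for this illustration rather than part of the proposal.

```r
## Brute-force grid search over (theta, delta) with ||theta|| = 1 for p = 2:
## theta = (cos(a), sin(a)), delta taken over quantiles of the control scores.
grid_search_tpr <- function(x_case, x_ctrl, t, n_angle = 360, n_delta = 200) {
  best <- list(tpr = -Inf, theta = NULL, delta = NULL)
  for (a in seq(0, 2 * pi, length.out = n_angle)) {
    theta    <- c(cos(a), sin(a))
    lin_case <- as.vector(x_case %*% theta)
    lin_ctrl <- as.vector(x_ctrl %*% theta)
    for (delta in unname(quantile(lin_ctrl, probs = seq(0, 1, length.out = n_delta)))) {
      if (mean(lin_ctrl > delta) <= t) {        # empirical FPR constraint
        tpr <- mean(lin_case > delta)           # empirical TPR
        if (tpr > best$tpr) best <- list(tpr = tpr, theta = theta, delta = delta)
      }
    }
  }
  best
}
```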

Since the objective function involves indicator functions, it is not a smooth function of the parameters and thus is not amenable to derivative-based methods. However, smooth approximations to indicator functions have been used for AUC maximization (Ma and Huang, 2007; Fong, Yin, and Huang, 2016; Lin et al., 2011). One such smooth approximation is 1(v > 0) ≈ Φ(v/h), where Φ is the standard normal distribution function and h is a tuning parameter representing the trade-off between approximation accuracy and estimation feasibility, chosen such that h tends to zero as the sample size grows (Lin et al., 2011). We can use this smooth approximation to implement the method described above, writing the smooth approximations to the empirical true and false positive rates as

TPR̃(θ, δ) = (1/n1) Σ_{i=1}^{n1} Φ((θᵀX_{1i} − δ)/h)  and  FPR̃(θ, δ) = (1/n0) Σ_{j=1}^{n0} Φ((θᵀX_{0j} − δ)/h).

Thus, we propose to compute

(θ̃, δ̃) ∈ argmax_{(θ, δ) ∈ Ω̃_t} TPR̃(θ, δ),     (1)

where Ω̃_t = {(θ, δ): ‖θ‖ = 1, FPR̃(θ, δ) ≤ t}. Since both TPR̃ and FPR̃ are smooth functions, we can use gradient-based methods that incorporate the necessary constraints, e.g., Lagrange multipliers. In particular, (θ̃, δ̃) can be obtained using existing software for constrained optimization of smooth functions, such as the Rsolnp package in R. An R package implementing our method on the basis of Rsolnp, maxTPR, is available on CRAN. Other details related to implementation, including the choice of the tuning parameter h, are discussed below.
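A minimal sketch of how a problem of the form (1) can be handed to Rsolnp::solnp is shown below. It is an illustration of the structure of the constrained problem, not the maxTPR implementation; the helper names, the starting values, and the value of h are assumptions made for the example.

```r
## Maximize the smoothed empirical TPR subject to ||theta|| = 1 and a smoothed
## empirical FPR of at most t, using the constrained solver in Rsolnp.
library(Rsolnp)

smoothed_rates <- function(par, x_case, x_ctrl, h) {
  p <- ncol(x_case)
  theta <- par[1:p]; delta <- par[p + 1]
  list(tpr = mean(pnorm((x_case %*% theta - delta) / h)),
       fpr = mean(pnorm((x_ctrl %*% theta - delta) / h)))
}

fit_smooth_tpr <- function(x_case, x_ctrl, t, h, start) {
  p <- ncol(x_case)
  fit <- solnp(
    pars    = start,                              # c(starting theta, starting delta)
    fun     = function(par, x_case, x_ctrl, h)    # minimize the negative smoothed TPR
      -smoothed_rates(par, x_case, x_ctrl, h)$tpr,
    eqfun   = function(par, x_case, x_ctrl, h)    # ||theta||^2 = 1 (identifiability)
      sum(par[1:p]^2),
    eqB     = 1,
    ineqfun = function(par, x_case, x_ctrl, h)    # smoothed FPR kept at or below t
      smoothed_rates(par, x_case, x_ctrl, h)$fpr,
    ineqLB  = 0, ineqUB = t,
    x_case = x_case, x_ctrl = x_ctrl, h = h,
    control = list(trace = 0))
  list(theta = fit$pars[1:p], delta = fit$pars[p + 1], convergence = fit$convergence)
}

## Example with simulated data (illustrative values only)
set.seed(1)
x_case <- matrix(rnorm(400 * 2, mean = 1), ncol = 2)
x_ctrl <- matrix(rnorm(400 * 2, mean = 0), ncol = 2)
start  <- c(1 / sqrt(2), 1 / sqrt(2), 0.5)        # theta on the unit sphere, rough delta
fit    <- fit_smooth_tpr(x_case, x_ctrl, t = 0.20, h = 0.5, start = start)
```

In practice the maxTPR package should be used; the sketch only makes explicit how the smoothed TPR enters as the objective while the smoothed FPR and the norm of θ enter as constraints.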

3.2 Asymptotic Properties

We present a theorem establishing that, under certain conditions, the combination and threshold (θ̃, δ̃) obtained by optimizing the smooth approximation to the empirical TPR while constraining the smooth approximation to the empirical FPR have desirable operating characteristics. In particular, the FPR of (θ̃, δ̃) is bounded almost surely by the acceptable level t in large samples. In addition, its TPR converges almost surely to the supremum of the TPR over the set where the FPR is constrained. We focus on the operating characteristics of (θ̃, δ̃), since these are of primary interest to clinicians.

Rather than enforcing (θ̃, δ̃) to be a strict maximizer, in the theoretical study below we allow it to be a near-maximizer of TPR̃ within Ω̃_t, in the sense that

TPR̃(θ̃, δ̃) ≥ sup_{(θ, δ) ∈ Ω̃_t} TPR̃(θ, δ) − c_n,

where {c_n} is a decreasing sequence of positive real numbers tending to zero. This provides some flexibility to accommodate situations in which a strict maximizer either does not exist or is numerically difficult to identify.

Before stating our key theorem, we give the following conditions.

  1. Observations are randomly sampled conditional on disease status D, and the group sizes tend to infinity proportionally, in the sense that n1/n → λ and n0/n → 1 − λ for some λ ∈ (0, 1).

  2. For each d ∈ {0, 1}, the observations X_{di}, i = 1, …, n_d, are independent and identically distributed p-dimensional random vectors with distribution function F_d.

  3. For each d ∈ {0, 1}, there is no proper linear subspace S of R^p such that P(X ∈ S | D = d) = 1.

  4. For each d ∈ {0, 1}, the distribution and quantile functions of θᵀX given D = d are globally Lipschitz continuous, uniformly over θ such that ‖θ‖ = 1.

  5. The map (θ, δ) ↦ TPR(θ, δ) is globally Lipschitz continuous over {(θ, δ): ‖θ‖ = 1}.

Theorem 1.

Under conditions (1)–(5), for every fixed t, we have that

  • lim sup_n FPR(θ̃, δ̃) ≤ t almost surely; and

  • TPR(θ̃, δ̃) − sup_{(θ, δ) ∈ Ω_t} TPR(θ, δ) tends to zero almost surely.

The proof of Theorem 1 is given in Appendix B. The proof relies on two lemmas, also in Appendix B. Lemma 1 demonstrates almost sure convergence to zero of the difference between the supremum of a function over a fixed set and the supremum of the function over a stochastic set that converges to the fixed set in an appropriate sense. Lemma 2 establishes the almost sure uniform convergence to zero of the difference between the FPR and the smooth approximation to the empirical FPR and the difference between the TPR and the smooth approximation to the empirical TPR. The proof of Theorem 1 demonstrates that Lemma 1 holds for the relevant function and sets, relying in part on the conclusions of Lemma 2. The conclusions of Lemmas 1 and 2 then demonstrate the claims of Theorem 1.

3.3 Implementation Details

Certain considerations must be addressed to implement the proposed method, including the choice of the tuning parameter h and the starting values for the optimization routine. In using similar methods to maximize the AUC, Lin et al. (2011) proposed choosing h as a sample-based standard error scaled by a negative power of the sample size. In simulations, we considered two such rates of decrease for h and found similar behavior for the convergence of the optimization routine. We must also identify initial values for our procedure. As done in Fong, Yin, and Huang (2016), we use normalized estimates from robust logistic regression, which is described in greater detail below. Based on this initial value for θ, we choose the initial value of δ such that the smooth approximation to the empirical FPR for the starting combination equals t. In addition, we have found that when the smooth approximation to the empirical FPR is required to be bounded by t exactly, the performance of the optimization routine can be poor. Thus, we introduce another tuning parameter, which allows a small amount of relaxation in the constraint on the smooth approximation to the empirical FPR, so that the constraint is imposed at t plus this small relaxation term. Since the effective sample size for the smooth approximation to the empirical FPR is the number of controls, n0, we chose to scale the relaxation term with n0 and have found this to work well.
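A hedged sketch of this starting-value construction is given below; it assumes the Bianco–Yohai estimator as implemented in robustbase::glmrob, and the helper name is illustrative rather than part of maxTPR.

```r
## Starting values: normalized robust-logistic coefficients for theta and the
## threshold that gives an empirical FPR of t for that starting combination.
library(robustbase)

starting_values <- function(x, d, t) {
  rfit   <- glmrob(d ~ x, family = binomial, method = "BY")  # Bianco-Yohai estimator
  theta0 <- coef(rfit)[-1]                      # drop the intercept
  theta0 <- theta0 / sqrt(sum(theta0^2))        # normalize to the unit sphere
  lin    <- as.vector(x %*% theta0)
  delta0 <- unname(quantile(lin[d == 0], probs = 1 - t))  # empirical FPR = t at the start
  c(theta0, delta0)
}
```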

Our method involves computing the gradient of the smooth approximations to the true and false positive rates defined above, which is fast regardless of the number of biomarkers involved. This is in contrast with methods that rely on brute force (e.g., grid search), which typically become computationally infeasible for combinations of more than two or three biomarkers. However, we note that for any method, the risk of overfitting is expected to grow as the number of biomarkers increases relative to the sample size. We emphasize that our method does not impose constraints on the distribution of the biomarkers that can be included, except for weak conditions that allow us to establish its large-sample properties.
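To make this concrete, the gradient of the smoothed empirical TPR has a closed form obtained by the chain rule (with φ the standard normal density); the helper below is illustrative and not taken from maxTPR. The smoothed empirical FPR is handled analogously using the controls.

```r
## Gradient of the smoothed empirical TPR with respect to (theta, delta).
smooth_tpr_grad <- function(theta, delta, x_case, h) {
  z <- (as.vector(x_case %*% theta) - delta) / h
  w <- dnorm(z) / (h * nrow(x_case))        # phi(z_i) / (n_case * h) for each case i
  c(colSums(w * x_case),                    # d/d theta: sum_i w_i * x_i
    -sum(w))                                # d/d delta
}
```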

4 Simulations

4.1 Bivariate Normal Biomarkers with Contamination

First, we considered bivariate normal biomarkers with contamination, similar to a scenario described by Croux and Haesbroeck (2003). In particular, we considered a setting where two biomarkers, X1 and X2, were independently normally distributed with mean zero and variance one. The outcome D was then defined by thresholding a linear function of the biomarkers plus an error term distributed as a logistic random variable with location parameter zero and scale parameter one. Next, the sample was contaminated by a set of outlying points with a common, fixed biomarker value and outcome label. We considered simulations where the training set consisted of 800 or 1600 “typical” observations and 50 or 100, respectively, contaminating observations (this yielded a disease prevalence of approximately 47%). The test set consisted of a large number of “typical” observations and 62,500 contaminating observations. The maximum acceptable FPR, t, was 0.2 or 0.3. We performed 1000 simulations.
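A hedged sketch of a data-generating mechanism in this spirit is given below; the linear predictor defining D and the location and label of the contaminating points are assumptions made for illustration, since the exact values are not reproduced here.

```r
## Two N(0,1) biomarkers, an outcome from a latent logistic model, and a block
## of identical contaminating points (assumed location and label).
sim_contaminated <- function(n_typical = 800, n_contam = 50) {
  x <- matrix(rnorm(n_typical * 2), ncol = 2)
  d <- as.integer(x[, 1] + x[, 2] + rlogis(n_typical) > 0)   # assumed linear predictor
  x_out <- matrix(rep(c(6, 6), each = n_contam), ncol = 2)   # assumed outlier location
  d_out <- rep(0L, n_contam)                                 # assumed outlier label
  list(x = rbind(x, x_out), d = c(d, d_out))
}
```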

We considered five approaches: (1) logistic regression, (2) the robust logistic regression method proposed by Bianco and Yohai (1996), (3) grid search, (4) the method proposed by Su and Liu (1993), and (5) the proposed method. As discussed above, the method proposed by Su and Liu (1993) yields a combination with maximum AUC when the biomarkers have a conditionally multivariate normal distribution. We did not consider the optimal AUC method proposed by Huang, Qin, and Fang (2011), as the implementation provided in R is too slow for use in simulations (and, as illustrated in Figure 1, may not yield a combination with optimal TPR). While the methods recently proposed by Yan et al. (2018) to optimize the partial AUC are compelling and may yield a combination with high TPR at the specified FPR value, implementation of their method, particularly the nonparametric kernel-based method, is non-trivial, so it is not included here. Finally, the method of Liu, Schisterman, and Zhu (2005), discussed above, may also yield a combination with high TPR at a particular FPR. However, given the shortcomings of this method described above (namely, that the range of FPRs over which the combination is optimal cannot be fixed in advance and that the biomarkers are assumed to have a conditionally multivariate normal distribution), we do not include it as a comparison method. Above all, none of these methods specifically targets the TPR at a specified FPR, which, as indicated by Figure 1, may lead to combinations with reduced TPR at that FPR.

We focused on evaluating the operating characteristics of the fitted combination rather than the biomarker coefficients, as the former are typically of primary interest. In particular, we evaluated the TPR in the test data at FPR = t in the test data. In other words, for each combination, the threshold used to calculate the TPR in the test data was chosen such that the FPR in the test data was equal to t. Evaluating the TPR in this way puts the combinations on equal footing in terms of the FPR, and so allows a fair comparison of the TPR. We evaluated the FPR of the fitted combinations in the test data using the thresholds estimated in the training data, i.e., the (1 − t)th quantile of the fitted biomarker combination among controls in the training data. While we could have used the estimate of δ provided by our method in the evaluation, we found improved performance (that is, better control of the FPR) when re-estimating the threshold based on the fitted combination in the training data.
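This evaluation scheme can be sketched as follows (function and argument names are illustrative).

```r
## TPR at the threshold giving FPR = t in the test data, and FPR at the
## threshold estimated from the training controls, for a fitted combination.
eval_combination <- function(theta, t, x_train, d_train, x_test, d_test) {
  lin_train <- as.vector(x_train %*% theta)
  lin_test  <- as.vector(x_test  %*% theta)
  delta_train <- quantile(lin_train[d_train == 0], probs = 1 - t)  # training-based threshold
  delta_test  <- quantile(lin_test[d_test == 0],   probs = 1 - t)  # test-based threshold
  c(tpr = mean(lin_test[d_test == 1] > delta_test),
    fpr = mean(lin_test[d_test == 0] > delta_train))
}
```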

Table 1 summarizes the results. For both sample sizes and FPR thresholds, all methods adequately controlled the FPR, while for the TPR, the proposed method outperformed logistic regression, robust logistic regression, and the method of Su and Liu (1993). Furthermore, the results from the proposed method were comparable to those from the grid search, which may be regarded as a performance reference but is infeasible for more than two or three biomarkers.

t     n     Measure  GLM          rGLM         Grid Search  Su & Liu     sTPR
0.20  800   TPR      52.7 (15.9)  55.8 (15.1)  72.5 (0.9)   53.6 (15.6)  72.0 (4.5)
            FPR      20.4 (1.4)   20.4 (1.4)   20.6 (1.3)   20.4 (1.4)   20.4 (1.3)
0.20  1600  TPR      57.5 (13.2)  60.0 (12.1)  72.7 (0.5)   58.3 (12.8)  72.8 (0.5)
            FPR      20.3 (1.0)   20.3 (1.0)   20.4 (1.0)   20.3 (1.0)   20.3 (1.0)
0.30  800   TPR      68.5 (14.3)  71.4 (13.6)  86.0 (0.5)   69.3 (13.9)  86.0 (1.2)
            FPR      30.6 (1.8)   30.6 (1.8)   30.6 (1.8)   30.6 (1.8)   30.4 (1.8)
0.30  1600  TPR      74.0 (11.5)  76.1 (10.3)  86.1 (0.3)   74.7 (11.1)  86.1 (0.3)
            FPR      30.4 (1.3)   30.3 (1.3)   30.4 (1.3)   30.4 (1.3)   30.2 (1.3)

Table 1: Mean TPR and FPR and corresponding standard deviation (in parentheses) in the test data across 1000 simulations for contaminated data with two biomarkers. The TPR is based on the threshold corresponding to an FPR of t in the test data, whereas the FPR is based on the thresholds estimated in the training data. n, size of the training dataset; t, acceptable FPR; TPR, true positive rate; FPR, false positive rate; GLM, standard logistic regression; rGLM, robust logistic regression; sTPR, proposed method. All numbers are percentages.

4.2 Conditionally Multivariate Lognormal Biomarkers

We also considered simulations with conditionally multivariate lognormal biomarkers (Mishra, 2019). In particular, we considered three biomarkers, X1, X2, and X3. Among controls, (X1, X2) had a bivariate lognormal distribution, and among cases, (X1, X2) had a bivariate lognormal distribution with different parameters. The third biomarker, X3, was simulated from an independent lognormal distribution with the same parameters among both cases and controls. Given the performance of the method of Su and Liu (1993) and the performance of the proposed method relative to grid search observed above (and the computational challenges of implementing grid search for three biomarkers), we considered three methods here: (1) logistic regression, (2) robust logistic regression, and (3) the proposed method. Although neither logistic regression nor robust logistic regression performed particularly well in the simulations in Section 4.1, these methods represent, respectively, the most commonly used approach for constructing biomarker combinations and the method used to provide starting values for the proposed method. Thus, it was important to include them here.

The maximum acceptable FPR, t, was 0.2, and 1000 simulations were performed. The training data consisted of either 400 cases and 400 controls, or 800 cases and 800 controls. The test data consisted of a large number of observations. The TPR and FPR were evaluated as described above. We present the results in Table 2. All three methods did well in controlling the FPR at the specified value. Furthermore, the proposed method substantially outperformed logistic regression and robust logistic regression: the mean TPR based on the proposed method was at least 20% larger than the mean TPRs from logistic regression and robust logistic regression.

n     Measure  GLM         rGLM        sTPR
800   TPR      34.1 (6.0)  34.1 (6.0)  41.5 (5.7)
      FPR      30.3 (2.3)  30.3 (2.3)  31.2 (2.4)
1600  TPR      34.7 (4.2)  34.7 (4.2)  41.9 (4.9)
      FPR      30.2 (1.6)  30.2 (1.6)  30.7 (1.7)

Table 2: Mean TPR and FPR and corresponding standard deviation (in parentheses) in the test data across 1000 simulations for three conditionally lognormal biomarkers X1, X2, and X3. The TPR is based on the threshold corresponding to an FPR of t in the test data, whereas the FPR is based on the thresholds estimated in the training data. n, size of the training dataset; t, acceptable FPR; TPR, true positive rate; FPR, false positive rate; GLM, standard logistic regression; rGLM, robust logistic regression; sTPR, proposed method. All numbers are percentages.

4.3 Bivariate Normal Biomarkers and Bivariate Normal Mixture Biomarkers

The above simulations demonstrate superiority of our approach relative to alternative methods in particular scenarios. We conducted further simulations to demonstrate the feasibility of our approach in other settings (for instance, small sample size, small and large prevalence, and low FPR cutoffs) relative to logistic regression and robust logistic regression.

We considered simulations with and without outliers in the data-generating distribution, and simulated data under a model similar to that used by Fong, Yin, and Huang (2016). We considered two biomarkers constructed as a mixture of two independent bivariate normal random variables with mean zero and different covariance matrices, where the mixing indicator was a Bernoulli random variable whose success probability was positive when outliers were simulated and zero otherwise. The outcome D was then simulated as a Bernoulli random variable with success probability given by a function of the biomarkers; we considered two such functions, a standard logistic function and a piecewise logistic function. We varied the intercept of this model to reflect varying prevalences of approximately 50–60%, 16–18%, and 77–82%. We considered t = 0.05, 0.1, and 0.2. A plot illustrating the data-generating distribution, with and without outliers, is given in Appendix D.

The training data consisted of 200, 400, or 800 observations, while the test set included a large number of observations. The TPR and FPR were evaluated as described above. The results are presented in Appendix C. When no outliers were present, the proposed method was comparable to logistic regression and robust logistic regression in terms of both the TPR and FPR. In the presence of outliers, robust logistic regression tended to provide combinations with higher TPRs than did logistic regression, and the TPRs of the combinations provided by the proposed method tended to be comparable to or somewhat better than the results from robust logistic regression. In all scenarios, all three methods controlled the FPR, particularly as the sample size increased. In addition to demonstrating the feasibility of our approach, these simulations highlight the fact that logistic regression is relatively robust to violations of the linear-logistic model (e.g., nonlinear biomarker combinations and deviations from the logit link).

4.4 Convergence

In most simulation settings, convergence of the proposed method was achieved in the vast majority of simulations. For some of the more extreme outlier scenarios considered in Section 4.3, convergence failed in a larger fraction of simulations.

5 Application to Diabetes Data

We applied the method we have developed to a study of diabetes in women with Pima Indian heritage (Smith et al., 1988). We considered seven predictors measured in this study: number of pregnancies, plasma glucose concentration, diastolic blood pressure, triceps skin fold thickness, body mass index, age, and diabetes pedigree function (a measure of family history of diabetes (Smith et al., 1988)). We used 332 observations as training data and reserved the remaining 200 observations for testing. The training and test datasets included 109 and 68 diabetes cases, respectively. We scaled the variables to have equal variance. The distribution of the predictors is depicted in Appendix E. The combinations were fitted using the training data and evaluated using the test data. We fixed the acceptable FPR at t = 0.10. We used logistic regression, robust logistic regression, and the proposed method to construct the combinations, giving the results in Table 3, where the fitted combinations from logistic regression and robust logistic regression have been normalized to aid in comparison.

Predictor               GLM     rGLM    sTPR
Number of pregnancies   0.321   0.320   0.403
Plasma glucose          0.793   0.792   0.627
Blood pressure          0.077   0.073   0.026
Skin fold thickness     0.089   0.090   0.146
Body mass index         0.399   0.400   0.609
Diabetes pedigree       0.280   0.281   0.191
Age                     0.133   0.134   0.123

Table 3: Fitted combinations of the scaled predictors in the diabetes study. GLM, standard logistic regression; rGLM, robust logistic regression; sTPR, proposed method with t = 0.10.

Using thresholds based on an FPR of 10% in the test data, the estimated TPR in the test data was 54.4% for both logistic regression and robust logistic regression, and 55.9% for the proposed method. The estimated FPR in the test data using thresholds corresponding to an FPR of 10% in the training data was 18.2% for both logistic regression and robust logistic regression and 26.5% for the proposed method. The fact that these FPRs exceeded the target value for all three methods indicates potentially important differences in the controls between the training and test data.
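For readers who wish to experiment with a comparable analysis, the Pima datasets distributed with the MASS package (Pima.tr and Pima.te, 532 women in total) include the same seven predictors; the sketch below uses an arbitrary random 332/200 split rather than the split used in the paper.

```r
## Combine the MASS Pima datasets, scale the predictors, and create a
## 332/200 training/test split (the split itself is illustrative).
library(MASS)

pima <- rbind(Pima.tr, Pima.te)            # npreg, glu, bp, skin, bmi, ped, age, type
x <- scale(as.matrix(pima[, 1:7]))         # center and scale predictors to unit variance
d <- as.integer(pima$type == "Yes")        # 1 = diabetes, 0 = no diabetes

set.seed(1)
train_id <- sample(nrow(pima), 332)
x_train <- x[train_id, ];  d_train <- d[train_id]
x_test  <- x[-train_id, ]; d_test  <- d[-train_id]
```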

6 Discussion

We have proposed a distribution-free method for constructing linear biomarker combinations by maximizing a smooth approximation to the TPR while constraining a smooth approximation to the FPR. Ours is the first distribution-free approach targeting the TPR for a specified FPR that can be used with more than two or three biomarkers. While we do not expect our method to outperform every other approach in every dataset, we have demonstrated broad feasibility of our method and, importantly, we have identified scenarios where the performance of our method is superior to alternative approaches.

The proposed method could be adapted to minimize the FPR while controlling the TPR to be above some acceptable level. Since the TPR and FPR condition on disease status, the proposed method can be used with case-control data. For case-control data matched on a covariate, however, it becomes necessary to consider the covariate-adjusted ROC curve and corresponding covariate-adjusted summaries, and thus the methods presented here are not immediately applicable (Janes and Pepe, 2008).

As our smooth approximation of the objective function is non-convex, the choice of starting values should be considered further. Extensions of convex methods, such as the ramp function method proposed by Fong, Yin, and Huang (2016) for the AUC, could also be considered. The idea of partitioning the search space, proposed by Yan et al. (2018), may also be useful. Further research could investigate methods for evaluating the true and false positive rates of biomarker combinations after estimation, for example, sample-splitting, bootstrapping, or k-fold cross-validation.
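As one example of such an evaluation strategy, a simple sample-splitting scheme could look like the following; fit_smooth_tpr refers to the illustrative helper sketched in Section 3.1, not to a function exported by maxTPR.

```r
## Fit the combination on a random half of the data and estimate the TPR and
## FPR on the held-out half, using the threshold returned by the fit.
split_eval <- function(x, d, t, h, start) {
  idx <- sample(nrow(x), floor(nrow(x) / 2))                 # training half
  fit <- fit_smooth_tpr(x[idx, , drop = FALSE][d[idx] == 1, , drop = FALSE],
                        x[idx, , drop = FALSE][d[idx] == 0, , drop = FALSE],
                        t, h, start)
  lin <- as.vector(x[-idx, , drop = FALSE] %*% fit$theta)    # held-out scores
  c(tpr = mean(lin[d[-idx] == 1] > fit$delta),
    fpr = mean(lin[d[-idx] == 0] > fit$delta))
}
```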

7 Software

An R package containing code to implement the proposed method, maxTPR, is publicly available via CRAN.

Funding

This work was supported by the National Institutes of Health [F31 DK108356, R01 CA152089, and R01 HL085757]; and the University of Washington Department of Biostatistics Career Development Fund [to M.C.]. The opinions, results, and conclusions reported in this article are those of the authors and are independent of the funding sources.

Appendix A

Proposition 1.

If the biomarkers are conditionally multivariate normal with proportional covariance matrices given D, that is,

X | D = 0 ~ N_p(μ0, Σ)  and  X | D = 1 ~ N_p(μ1, σ²Σ)  for some σ² ≠ 1,

then the optimal biomarker combination in the sense of the ROC curve is of the form

c xᵀΣ⁻¹x + γᵀx

for some scalar c ≠ 0 and some vector γ; in particular, it is nonlinear in x.

Proof.

It is known that the optimal combination of X in terms of the ROC curve is the likelihood ratio, LR(x) = f(x | D = 1)/f(x | D = 0), or any monotone increasing function thereof (McIntosh and Pepe, 2002). Without loss of generality, let μ0 = 0 and Σ = I_p, so that X | D = 0 ~ N_p(0, I_p) and X | D = 1 ~ N_p(μ1, σ²I_p). Writing out the two normal densities and expanding the logarithm of their ratio then yields a combination of the claimed form.
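In more detail, a sketch of this expansion under the normalization above (standard algebra, included for completeness) is:

```latex
\begin{align*}
\log \mathrm{LR}(x)
  &= \log f(x \mid D = 1) - \log f(x \mid D = 0) \\
  &= -\frac{1}{2\sigma^{2}}\,(x - \mu_{1})^{\top}(x - \mu_{1})
     + \frac{1}{2}\,x^{\top}x + \text{constant} \\
  &= \frac{1}{2}\Bigl(1 - \frac{1}{\sigma^{2}}\Bigr)\,x^{\top}x
     + \frac{1}{\sigma^{2}}\,\mu_{1}^{\top}x + \text{constant},
\end{align*}
```

so that c = (1 − 1/σ²)/2, which is nonzero when σ² ≠ 1, and γ = μ1/σ²; when σ² = 1, the quadratic term vanishes and the optimal combination is linear.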

Appendix B

The proof of Theorem 1 relies on Lemmas 1 and 2, which are stated and proved below.

Lemma 1.

Say that a bounded function f on R^k and possibly random sets A_n and A in R^k are given, and let {c_n} be a decreasing sequence of positive real numbers tending to zero. For each n, suppose that x_n and x_0 are near-maximizers of f over A_n and A, respectively, in the sense that f(x_n) ≥ sup_{a ∈ A_n} f(a) − c_n and f(x_0) ≥ sup_{a ∈ A} f(a) − c_n. Further, define

e_n = sup_{a ∈ A_n} inf_{b ∈ A} d(a, b)  and  e′_n = sup_{b ∈ A} inf_{a ∈ A_n} d(a, b),

where d is the Euclidean distance in R^k. If e_n and e′_n tend to zero almost surely, and f is globally Lipschitz continuous, then f(x_n) − f(x_0) tends to zero almost surely. In particular, this implies that

sup_{a ∈ A_n} f(a) − sup_{a ∈ A} f(a) → 0

almost surely.

Proof.

Say that both and tend to zero almost surely, and denote by the Lipschitz constant of . Suppose that for some we have that

We will show that this leads to a contradiction, and thus that it must be true that

for each , thus establishing the desired result.

On a set of probability one, there exists an such that, for each , there exists and satisfying and . Then, on this same set, for , and , so that and in particular. Since and , it must also be true that and . This then implies that for all on a set of probability one. Since tends to zero deterministically, this yields the sought contradiction.

To establish the last portion of the Lemma, we simply use the first part along with the fact that

Lemma 2.

Under conditions (1)–(5), we have that

sup_{(θ, δ) ∈ Ω} |FPR̃(θ, δ) − FPR(θ, δ)| → 0  and  sup_{(θ, δ) ∈ Ω} |TPR̃(θ, δ) − TPR(θ, δ)| → 0

almost surely as n tends to infinity, where Ω = {(θ, δ): ‖θ‖ = 1, δ ∈ R}.

Proof.

We prove the claim for the false positive rate (FPR); the proof for the true positive rate (TPR) is analogous. We can write

First, we consider . We can write this as

The class of indicator functions {x ↦ 1(θᵀx > δ): ‖θ‖ = 1, δ ∈ R} is a Vapnik–Chervonenkis (VC) class. Since Φ is monotone, for each h the corresponding class of smoothed functions {x ↦ Φ((θᵀx − δ)/h): ‖θ‖ = 1, δ ∈ R} is also VC (Kosorok, 2008; van der Vaart, 1998; van der Vaart and Wellner, 2000). Since the constant 1 is an applicable envelope function for this class, the class is P-Glivenko–Cantelli, giving that (Kosorok, 2008; van der Vaart and Wellner, 2000)

almost surely.

Next, we consider . We can write this as

For a general random variable with distribution function that is Lipschitz continuous, say with constant , we can write

with . Using integration by parts and Lemma 2.1 from Winter (1979), this becomes

and so, we find that

Since tends to zero as tends to infinity, this implies that

We now return to and consider the case , so that Let and . Then, we have that , where and . We find that

for any . Since , we can write