There is a torrent of new markers and risk-prediction models, but unfortunately, historically used metrics for markers and risk models provide at best indirect information about utility, or even predictiveness, for clinical/public-health use. The odds ratio is well known to be a poor measure of predictiveness (Kraemer, 2004; Pepe et al., 2004). When comparing two tests, it is uncommon for one test to have both higher sensitivity and specificity, or both higher positive predictive value (PPV) and lower complement of the negative predictive value (cNPV). Two commonly used summary statistics, Youden’s Index (Youden, 1950) and the Area Under the Receiver Operating Characteristic Curve (AUC) (Hanley and McNeil, 1982), have been correctly criticized for not taking predictive values (i.e. absolute risks) into account, and for not permitting differential weighting of false-positive versus false-negative errors (Greenhouse et al., 1950; Hilden, 1991). In spite of these well-known criticisms, the AUC remains by far the most popular metric in scientific practice.
There has been much research on improved metrics, such as risk-reclassification metrics (Cook, 2007; Pencina et al., 2008), and especially metrics with decision-theoretic justification (Gail and Pfeiffer, 2005; Baker et al., 2009). Decision-theoretic metrics often rely only on specifying a risk-threshold (Pauker and Kassirer, 1980) that implicitly accounts for benefits, harms, and costs. The most popular metric is the Net Benefit (Vickers and Elkin, 2006), which provides a range of risk-thresholds, below which action should always be taken, above which action should never be taken, and within which using the marker/model provides the most utility. Net Benefit is an important advance, but we feel that its meaning as an abstract “utility” measure is hard for scientists to interpret.
A parallel line of research focuses on “risk stratification”, a ubiquitous term in medicine (20,684 papers in PubMed on August 18 2017). Although “risk stratification” is a broadly used term, we define it as the ability of a marker/model to separate those at high risk of disease from those at low risk (Wentzensen and Wacholder, 2013). “Risk stratification”, as a term, is much less common in the statistical literature (75 papers in Scopus on August 18 2017). Important advances in risk-stratification include the predictiveness curve (Copas, 1999; Huang et al., 2007) and its summary metric Total Gain (Bura and Gastwirth, 2001).
We introduce two new broadly applicable, linked metrics that connect the decision-theoretic and risk-stratification approaches to evaluating markers/models. We first define Mean Risk Stratification (MRS) as the average change in risk of disease (posttest minus pretest) revealed by the marker/model. MRS is twice the cross-product difference of the joint probabilities in a 2x2 table, which is easily remembered by analogy with odds ratios. We then define a decision-theoretic metric, the Net Benefit of Information (NBI): the increase in expected utility from using the marker/model to select people for intervention versus randomly selecting people for intervention. Our 4 key results are:
Youden’s index and AUC reflect (1) the fraction of the maximum possible risk-stratification (i.e. MRS) that is attained by the marker/model, and (2) the fraction of the maximum possible utility gain over random selection (i.e. NBI) that is attained by the marker/model. MRS and NBI provide Youden’s index and AUC with risk-stratification and decision-theoretic justification, the lack of which has long been a criticism of Youden’s index and AUC.
For rare diseases, high AUC does not imply high risk-stratification or utility gain over random selection. AUC must be considered in light of disease prevalence, which is automatically done by MRS and NBI.
NBI is a function of only MRS and the risk-threshold for action, connecting decision-theory to risk-stratification and providing a decision-theoretic rationale for MRS.
MRS and NBI provide a range of risk thresholds for which the marker/model is “optimally informative”, in the sense that the risk-thresholds maximize both risk-stratification and the utility gain over random selection.
We apply MRS and NBI to the controversy over who should get tested for mutations in the BRCA1/2 genes, which cause high risks of breast and ovarian cancer (Kuchenbaecker et al., 2017). The mutations are rare in the general population (about 0.25%), but are 10 times more common among Ashkenazi-Jews (Struewing et al., 1997). Currently, according to the UK National Institute for Health and Care Excellence (NICE) and the US Preventive Services Task Force, women are referred for mutation testing only if they have a strong family history of breast or ovarian cancer (Moyer, 2014) as quantified by a risk model calculating that their risk of carrying a mutation exceeds 10% (NICE, 2017). However, as mutation-testing costs fall, prominent voices have called for testing all women (King et al., 2014; Rizk, 2017). BRCA1/2 testing is already being offered to a large unselected Canadian population as a demonstration project (Staff Reporter, 2017). Testing all women would strain clinical resources by testing millions of women, 99.75% of whom will test negative. At US $500-$1,000 per test, testing millions of women has clear commercial implications.
In contrast, we recently demonstrated that 80% of Ashkenazi-Jewish mutation-carriers could be identified by testing only 44% of Ashkenazi-Jewish women (Best et al., 2017). This is achieved by a low mutation-risk threshold of 0.78%, far lower than the current 10% UK NICE and US recommendation. However, we could not formally justify any choice of risk-threshold. We seek to better understand the implications of different choices of risk threshold. Our eclectic approach considers multiple metrics, including AUC, Net Benefit, and MRS/NBI. The ranges of useful risk-thresholds, as determined by Youden’s index, AUC, MRS, NBI, and Net Benefit, always overlap at risk threshold equaling disease prevalence. The value of MRS/NBI is to interpret AUC in the context of prevalence and to provide a range of risk thresholds for which the risk model is optimally informative. Our MRS webtool is available.
2 Mean Risk Stratification (MRS)
Because continuous markers or risk models require dichotomization to determine action, we refer to a marker or model $M$, dichotomized at a cutpoint $m$, as a test: $M \ge m$ is positive ($M+$) and $M < m$ is negative ($M-$). In the absence of test results or other pretest information, each individual can only be assigned, as a best guess, the same population-average risk $\pi = P(D+)$. Upon taking the test, 2 outcomes are possible:
With probability $p = P(M+)$, the test is positive. The person’s risk increases from $\pi$ to the Positive Predictive Value ($PPV = P(D+ \mid M+)$), an increase of $PPV - \pi$.
With probability $1 - p$, the test is negative. The person’s risk decreases from $\pi$ to the complement of the Negative Predictive Value: $cNPV = 1 - NPV = P(D+ \mid M-)$. The person’s risk decreases by $\pi - cNPV$.
Mean Risk Stratification (MRS) is a weighted average of the increase in risk among those who test positive and the decrease in risk among those who test negative:
$$MRS = (PPV - \pi)\,p + (\pi - cNPV)(1 - p). \tag{1}$$
MRS is the average difference between predicted post-test individual risk (either PPV or cNPV) and pretest population-average risk $\pi$. Stated simply, MRS is the average change in risk that a test reveals. For instance, an MRS of 6% means that a person who uses this test learns that their disease risk will increase or decrease by an average of 6 disease cases per 100 people.
Importantly, via the definition of test-positive ($M+$) and test-negative ($M-$), MRS is a function of the cutpoint $m$ that defines risk thresholds. We will plot MRS for all possible $m$ to gain insight on the value of possible risk thresholds. The cutpoint that maximizes MRS is always where the risk threshold equals disease prevalence (see Appendix). Thus, the range of risk thresholds where MRS is highest will always contain disease prevalence.
Total Gain is the special case of MRS where cutpoint $m$ represents the risk threshold equaling disease prevalence (see Webappendix). MRS is not the Net Reclassification Index (NRI) (Pencina et al., 2008), because under dichotomization, NRI equals Youden’s index.
The Appendix shows that MRS measures association by equaling twice the covariance of disease and marker, and is related to other association measures. MRS also equals twice the cross-product difference of the joint probabilities inside a 2x2 table, which is easy to remember by analogy with odds ratios (see Appendix).
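The equivalent forms of MRS described in this section can be checked numerically. The following is a minimal sketch, not the authors' webtool: the function name `mrs_three_ways` and the joint probabilities are hypothetical and chosen only for illustration.

```python
def mrs_three_ways(p_dpos_mpos, p_dpos_mneg, p_dneg_mpos, p_dneg_mneg):
    """Compute MRS from the four joint probabilities of a 2x2 table, three ways.

    Arguments are P(D+,M+), P(D+,M-), P(D-,M+), P(D-,M-) and must sum to 1.
    """
    prevalence = p_dpos_mpos + p_dpos_mneg   # P(D+), pretest risk
    positivity = p_dpos_mpos + p_dneg_mpos   # P(M+)
    ppv = p_dpos_mpos / positivity           # P(D+ | M+)
    cnpv = p_dpos_mneg / (1 - positivity)    # P(D+ | M-)

    # Weighted average of the risk increase (positives) and decrease (negatives).
    weighted = (ppv - prevalence) * positivity + (prevalence - cnpv) * (1 - positivity)
    # Twice the covariance of the disease and test indicators.
    covariance_form = 2 * (p_dpos_mpos - prevalence * positivity)
    # Twice the cross-product difference of the joint probabilities.
    cross_product_form = 2 * (p_dpos_mpos * p_dneg_mneg - p_dpos_mneg * p_dneg_mpos)
    return weighted, covariance_form, cross_product_form


# Hypothetical table: 10% prevalence, 20% test positivity.
weighted, covariance_form, cross_product_form = mrs_three_ways(0.06, 0.04, 0.14, 0.76)
print(weighted, covariance_form, cross_product_form)
```

All three values agree (0.08 for this hypothetical table, up to floating-point error), illustrating why the cross-product form is a convenient shortcut.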
Next we introduce Net Benefit of Information and its relationship to MRS.
3 Net Benefit of Information and Mean Risk Stratification
Decision-theoretic metrics of test performance are based on the expected utility from using the test (see Baker et al. (2009) for a comprehensive review). Calculating expected utility requires specification of the utility for the 4 possible outcomes: the utility of true positive prediction $U_{TP}$, the utility of true negative prediction $U_{TN}$, the utility of false positive prediction $U_{FP}$, and the utility of false negative prediction $U_{FN}$. Furthermore, the cost of the marker is $U_T$. These 5 utilities require considering the benefits, harms, and costs of the test and all subsequent interventions, which may be personal and difficult to quantify.
Instead, rational utility theory requires specification of only a risk-threshold that encapsulates the utilities (Pauker and Kassirer, 1980). In this approach, the marker/model is dichotomized at risk threshold $R$: the test is positive when the predicted risk is at least $R$. The $R$ that maximizes expected utility is determined by the ratio of benefit ($B = U_{TP} - U_{FN}$) to costs ($C = U_{TN} - U_{FP}$), which is $R = C/(C+B) = 1/(1 + B/C)$ (Pauker and Kassirer, 1980). The risk-threshold weighs the utility of true-positives versus false-positives, e.g. a 10% threshold means that a rational person is willing to accept 9 false-positives for every true-positive, because $B/C = (1-R)/R = 9$. Decision-theoretic metrics of test performance are plotted versus all possible risk thresholds (as defined by cutpoints $m$) to gain insight on the value of possible risk thresholds.
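The trade-off encoded by a risk threshold can be made concrete with a small calculation. This is an illustrative sketch only: the function names are hypothetical, and it assumes the standard threshold relation in which the threshold equals cost divided by (cost plus benefit).

```python
def risk_threshold(benefit, cost):
    """Risk threshold implied by benefit B (treating a true positive)
    and cost C (treating a false positive): R = C / (C + B)."""
    return cost / (cost + benefit)


def false_positives_per_true_positive(threshold):
    """Number of false positives a rational person accepts per true positive: (1 - R) / R."""
    return (1 - threshold) / threshold


threshold_from_utilities = risk_threshold(benefit=9.0, cost=1.0)   # B/C = 9
tradeoff_at_10_percent = false_positives_per_true_positive(0.10)
tradeoff_at_0_78_percent = false_positives_per_true_positive(0.0078)
print(threshold_from_utilities, tradeoff_at_10_percent, round(tradeoff_at_0_78_percent))
```

A benefit-to-cost ratio of 9 gives a 10% threshold, and a 0.78% threshold corresponds to accepting roughly 127 false positives per true positive.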
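To derive Net Benefit of Information, we first calculate the expected utility of the test, which averages utilities, weighted by the joint probabilities for the outcomes, plus test cost:
$$U = U_{TP}\,P(D+,M+) + U_{FP}\,P(D-,M+) + U_{FN}\,P(D+,M-) + U_{TN}\,P(D-,M-) + U_T.$$
In particular, note that randomly selecting people for intervention, with the same positivity and cost as the test of interest, has utility (Kraemer, 1992, Ch.9)
$$U_{rand} = U_{TP}\,\pi p + U_{FP}\,(1-\pi)p + U_{FN}\,\pi(1-p) + U_{TN}\,(1-\pi)(1-p) + U_T,$$
where $\pi = P(D+)$ is disease prevalence and $p = P(M+)$ is test positivity. Random selection is the minimum utility possible for a test (which need not be zero) and provides a baseline utility that must be substantially exceeded. A test is more informative the more its utility exceeds that of randomly selecting people for intervention.
Next, plugging in the following identities (see Webappendix section 2)
$$P(D+,M+) = \pi p + \tfrac{MRS}{2}, \quad P(D-,M+) = (1-\pi)p - \tfrac{MRS}{2}, \quad P(D+,M-) = \pi(1-p) - \tfrac{MRS}{2}, \quad P(D-,M-) = (1-\pi)(1-p) + \tfrac{MRS}{2}$$
into the expected-utility equation yields
$$U = U_{rand} + \frac{MRS}{2}\,(B + C),$$
where $B = U_{TP} - U_{FN}$ is the benefit and $C = U_{TN} - U_{FP}$ is the cost, so that the risk threshold is $R = C/(C+B)$.
Finally, we scale utility in units of benefit to define Net Benefit of Information (NBI) as the increase in (scaled) utility from using the test to select people for intervention versus randomly selecting people for intervention:
$$NBI = \frac{U - U_{rand}}{B} = \frac{MRS}{2}\left(1 + \frac{C}{B}\right) = \frac{MRS}{2(1-R)}. \tag{2}$$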
NBI is a function of test characteristics only via the MRS, connecting the decision-theoretic approach to risk-stratification. For small risk thresholds, NBI is close to half the MRS. This approximate equivalence provides NBI a concrete risk-stratification interpretation, and vice versa, provides a decision-theory justification for MRS.
Because MRS is a function of the cutpoint $m$ that defines the risk threshold (equation 1), so is NBI. We will plot NBI over the range of risk thresholds defined by cutpoints $m$. The risk thresholds where NBI is near its peak are where the most utility is gained over random selection. This range of risk-thresholds is where the marker/risk-model is “optimally informative”. For small risk thresholds, this range will also maximize MRS.
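The link between NBI and MRS can be verified numerically. The sketch below is illustrative only: it assumes that NBI is the utility gain over random selection divided by the benefit, and that the risk threshold equals cost over (cost plus benefit). The utilities and joint probabilities are hypothetical, and the test cost is omitted because it cancels in the difference.

```python
def nbi_from_utilities(joint, u_tp, u_fp, u_fn, u_tn):
    """NBI computed directly from expected utilities.

    joint = (P(D+,M+), P(D-,M+), P(D+,M-), P(D-,M-)).
    """
    tp, fp, fn, tn = joint
    prevalence = tp + fn
    positivity = tp + fp
    utility_test = u_tp * tp + u_fp * fp + u_fn * fn + u_tn * tn
    # Random selection with the same positivity: test result independent of disease.
    utility_random = (u_tp * prevalence * positivity
                      + u_fp * (1 - prevalence) * positivity
                      + u_fn * prevalence * (1 - positivity)
                      + u_tn * (1 - prevalence) * (1 - positivity))
    benefit = u_tp - u_fn
    return (utility_test - utility_random) / benefit


def nbi_from_mrs(mrs, threshold):
    """NBI as a function of MRS and the risk threshold R only."""
    return mrs / (2 * (1 - threshold))


# Hypothetical inputs: benefit B = 9, cost C = 1, so R = 0.1.
joint = (0.06, 0.14, 0.04, 0.76)
u_tp, u_fp, u_fn, u_tn = 9.0, -1.0, 0.0, 0.0
benefit, cost = u_tp - u_fn, u_tn - u_fp
threshold = cost / (cost + benefit)
mrs = 2 * (joint[0] * joint[3] - joint[1] * joint[2])

nbi_direct = nbi_from_utilities(joint, u_tp, u_fp, u_fn, u_tn)
nbi_via_mrs = nbi_from_mrs(mrs, threshold)
print(nbi_direct, nbi_via_mrs)
```

Both routes give the same NBI, and at this small threshold the value is close to half the MRS.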
4 Relationship of MRS and NBI to Youden’s Index, AUC, and the risk difference
4.1 Relationship of MRS and NBI to Youden’s Index and AUC
MRS and NBI can be calculated by combining prevalence with Youden’s index or AUC for a dichotomized marker. Equation 5 in the Appendix shows that MRS can be written as
$$MRS = 2\pi\,(Sens - p),$$
where sensitivity $Sens = P(M+ \mid D+)$, $\pi = P(D+)$, and $p = P(M+)$. Denote specificity $Spec = P(M- \mid D-)$. Because $p = \pi\,Sens + (1-\pi)(1 - Spec)$,
$$MRS = 2J\pi(1-\pi). \tag{3}$$
MRS is Youden’s index, $J = Sens + Spec - 1$, rescaled by disease prevalence. Note that Youden’s index is a function of cutpoint $m$. MRS can be calculated by combining an estimate of Youden’s index with an external estimate of disease prevalence $\pi$, which we will do in section 5.3.
Although AUC is usually calculated for continuous markers, for a dichotomized marker, $AUC = (Sens + Spec)/2 = (J+1)/2$ (Cantor and Kattan, 2000). Thus
$$MRS = 2(2\,AUC - 1)\,\pi(1-\pi). \tag{4}$$
Similarly, NBI is a function of $J$ or AUC via the above MRS expressions. In fact, MRS, NBI, Youden’s index, and AUC are maximized when cutpoint $m$ implies that risk threshold equals prevalence (see Appendix).
The key point is that MRS/NBI interpret Youden’s index and AUC in light of prevalence. This is especially important for rare diseases because, writing $J$ for Youden’s index and $\pi$ for disease prevalence, when $\pi$ is small
$$MRS = 2J\pi(1-\pi) \approx 2J\pi.$$
A high Youden’s index or AUC might not imply much risk stratification or NBI for rare diseases. MRS and NBI naturally temper overenthusiasm for markers that have high AUC but target rare diseases.
Importantly, disease prevalence bounds MRS and NBI. For perfect tests ($J = 1$),
$$MRS = 2\pi(1-\pi)$$
and $NBI = \pi(1-\pi)/(1-R)$, where $R$ is the risk threshold. Thus if disease is rare, there may be little risk stratification or NBI even for perfect tests. Figure 1 plots the relationship of MRS to AUC for 3 uncommon disease prevalences. The importance of disease prevalence is illustrated by noting that the maximum MRS (achieved if AUC=1) is also obtained if AUC=0.55 for diseases 10 times more prevalent; AUC=0.6 suffices if disease is 5 times more prevalent. Thus a perfect marker for a rare disease provides the same risk-stratification as a weakly-associated marker for a disease that is 5-10 times as prevalent.
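The role of prevalence can be illustrated by computing MRS from sensitivity, specificity, and an external prevalence estimate. This is a sketch under stated assumptions: it takes MRS to be twice Youden's index times prevalence times (1 minus prevalence), as follows from MRS being twice the covariance of disease and test. The function names and the numerical inputs are hypothetical.

```python
def youden_index(sensitivity, specificity):
    """Youden's index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1


def auc_dichotomized(sensitivity, specificity):
    """AUC of a dichotomized marker: (sensitivity + specificity) / 2."""
    return (sensitivity + specificity) / 2


def mrs_from_youden(youden, prevalence):
    """MRS = 2 * J * prevalence * (1 - prevalence)."""
    return 2 * youden * prevalence * (1 - prevalence)


# Hypothetical rare disease (0.1% prevalence) with a perfect test (J = 1).
rare_prevalence = 0.001
max_mrs_rare = mrs_from_youden(youden_index(1.0, 1.0), rare_prevalence)

# Hypothetical weak test (AUC = 0.55) for a disease 10 times as prevalent.
weak_sens, weak_spec = 0.60, 0.50
weak_auc = auc_dichotomized(weak_sens, weak_spec)
mrs_weak_common = mrs_from_youden(youden_index(weak_sens, weak_spec), 10 * rare_prevalence)

print(max_mrs_rare, weak_auc, mrs_weak_common)
```

The two MRS values are nearly equal (about 0.2% each), matching the observation that a perfect marker for a rare disease stratifies risk about as well as a weak marker for a disease 10 times as prevalent.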
4.2 Simple and useful decision-theoretic interpretation of Youden’s index and AUC
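The fraction of the maximum NBI and MRS achieved by the test is Youden’s index $J$:
$$\frac{MRS}{2\pi(1-\pi)} = \frac{NBI}{\pi(1-\pi)/(1-R)} = J,$$
where the denominators are the MRS and NBI of a perfect test at prevalence $\pi$ and risk threshold $R$.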
Youden’s index is the fraction of the maximum possible risk-stratification attained by the test. Youden’s index is also the fraction of the maximum possible utility gain over random selection, that is attained by the test. Thus MRS/NBI indeed provide Youden’s index (and thus AUC) with simple and useful decision-theoretic and risk-stratification interpretations.
But this also illustrates the pitfalls of Youden’s index and AUC for rare diseases. Since Youden’s index and AUC reflect multiplicative gains in MRS/NBI, for rare diseases, a high Youden’s index or AUC can mask small additive increases in MRS and NBI (Fig. 1, left plot). Since Youden’s index equals $2\,AUC - 1$ for a dichotomized marker, a 1% absolute increase in AUC implies a 2% absolute increase in the fraction of maximal MRS or NBI achieved. Thus MRS/NBI double from AUC=0.6 to 0.7. An AUC=0.6 is widely considered to be “modest”, and indeed, only 20% of maximal MRS/NBI is achieved. An AUC=0.7 is widely considered “good”, but only 40% of maximal MRS/NBI is achieved. An AUC=0.95 is required to achieve 90% of maximal MRS/NBI.
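The percentages quoted above follow from a one-line conversion between AUC and the fraction of maximal MRS/NBI. The sketch below assumes a dichotomized marker, for which Youden's index equals twice the AUC minus one; the function names are hypothetical.

```python
def fraction_of_maximum(auc):
    """Fraction of maximal MRS/NBI attained, i.e. Youden's index J = 2*AUC - 1."""
    return 2 * auc - 1


def auc_required(fraction):
    """AUC needed to attain a given fraction of maximal MRS/NBI."""
    return (fraction + 1) / 2


# Fractions for the AUC values discussed in the text, rounded for display.
fractions = {auc: round(fraction_of_maximum(auc), 2) for auc in (0.6, 0.7, 0.95)}
auc_for_90_percent = auc_required(0.9)
print(fractions, auc_for_90_percent)
```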
4.3 MRS and NBI for a rarely-positive test: relationship to the risk difference
The risk difference is $PPV - cNPV$ (recall $cNPV = 1 - NPV$). Risk stratification is sometimes (mis)measured by the risk difference: a large spread in risks is considered evidence of good risk stratification. Starting from equation (5) in the appendix,
$$MRS = 2p(1-p)\,(PPV - cNPV),$$
where $p = P(M+)$ is the test positivity.
A large risk difference does not imply much risk stratification, and hence NBI, if the test is rarely positive. Figure 1 (right panel) plots the relationship of MRS to the risk-difference for 3 test positivity rates. When risk-difference is 1, the maximum MRS is achieved. The importance of test positivity is illustrated by noting that, the MRS achieved for risk-difference of 1 is also obtained for a risk-difference of approximately only 0.1 when the test is 10 times as positive (dashed line). Thus a perfect marker for a rarely positive test provides the same risk-stratification, and hence NBI, as a weakly associated marker 10-times as positive.
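For readers who want the intermediate steps, the relationship between MRS and the risk difference can be derived directly from the definition of MRS as a weighted average of risk changes. The following is a sketch of one possible route (the Appendix may proceed differently), writing $\pi$ for prevalence and $p$ for test positivity:

```latex
\begin{align*}
\pi &= p\,PPV + (1-p)\,cNPV
  && \text{(law of total probability)} \\
PPV - \pi &= (1-p)\,(PPV - cNPV) \\
\pi - cNPV &= p\,(PPV - cNPV) \\
MRS &= (PPV - \pi)\,p + (\pi - cNPV)(1-p) \\
    &= 2p(1-p)\,(PPV - cNPV)
\end{align*}
```

The factor $p(1-p)$ is at most 1/4 and shrinks toward zero as the test becomes rarely positive, which is why a large risk difference alone does not guarantee much risk stratification.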
5 Informativeness of risk models to select who might get BRCA1/2 testing
As detailed in the Introduction, mutations in the BRCA1/2 genes cause high risk of breast and ovarian cancers. The mutations are rare in the general population (about 0.25%), but 10 times more common among Ashkenazi-Jews. Currently, a woman is asked to provide her family history of cancer (e.g. Webappendix Figure 1), and she is offered mutation-testing in the UK and US if a risk model calculates that her risk of carrying a mutation exceeds 10% (NICE, 2017). Popular risk models are BRCAPRO (Parmigiani et al., 1998) and BOADICEA (Antoniou et al., 2004). We will focus on BRCAPRO.
However, as mutation-testing costs fall, prominent voices have called for testing all women, which would strain clinical resources by testing millions of women, 99.75% of whom will test negative. Instead, a lower risk threshold, below 10%, might identify nearly all mutation-carriers, yet avoid unnecessary testing for most women. We recently showed that a low 0.78% risk-threshold would identify 80% of Ashkenazi-Jewish mutation-carriers yet test only 44% of Ashkenazi-Jewish women (Best et al., 2017). We use MRS/NBI, AUC and Net Benefit to throw light on the value of the risk model BRCAPRO to select women for BRCA1/2 testing at risk thresholds between 0%-10%.
We use data on 4,589 volunteers (102 BRCA1/2 mutation carriers) from the Washington Ashkenazi Study (WAS) (Struewing et al., 1997). We calculated each volunteer’s risk of carrying a mutation, based on their self-reported family-history of breast/ovarian cancers, using BRCAPRO. Here $M$ is the BRCAPRO risk score, and because BRCAPRO is a well-calibrated risk model (Best et al., 2017), $P(D+ \mid M = m) = m$, i.e. the cutpoint $m$ equals the risk threshold $R$. Disease ($D+$) indicates the presence of a BRCA1/2 mutation.
5.1 MRS and NBI for BRCAPRO at different risk-thresholds for Ashkenazi-Jews
Recall that MRS is a function of the cutpoint $m$ used to decide test positivity (equation 1). The left panel of Figure 2 plots MRS for BRCAPRO over a range of risk thresholds ($R$) (top axis) for Ashkenazi-Jews. The x-axis is the fraction of women testing positive by each risk threshold. The right axis is the AUC implied by each MRS, according to equation (4), given the observed mutation prevalence for Ashkenazi-Jews.
BRCAPRO has its best MRS, about 1.7%, for risk thresholds in a “sweetspot” of 0.78% to 5%. An MRS=1.7% means that a woman who uses BRCAPRO, dichotomized at any threshold between 0.78% and 5% to refer for mutation-testing, will learn that her risk of carrying a mutation will increase or decrease by 1.7% on average. An average change in risk of 1.7% seems meaningful because it is similar to pretest risk (i.e. mutation prevalence) of 2.3%.
In contrast, the current 10% threshold yields a substantially lower MRS (p=0.039 versus MRS=1.7% at the 0.78% threshold). The Webappendix shows how to calculate the variance of MRS and conduct hypothesis testing for two MRSs.
The 10% threshold yields a much higher risk-difference than the 0.78% threshold: 12.58% vs. 3.39% respectively. Thus MRS was lower at the 10% threshold, in spite of a higher risk-difference, because the 10% threshold has only 4.5% test-positivity (vs. 44% at the 0.78% threshold). Rarely positive tests have low risk-stratification (see section 4.3).
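The interplay between risk difference and test positivity can be checked with the figures quoted above. This sketch assumes the section 4.3 relationship, in which MRS equals twice the positivity times (1 minus positivity) times the risk difference. The function name is hypothetical, and because the inputs are the rounded values reported in the text, the outputs are only approximate.

```python
def mrs_from_risk_difference(risk_difference, positivity):
    """MRS = 2 * p * (1 - p) * (PPV - cNPV), with p the test positivity."""
    return 2 * positivity * (1 - positivity) * risk_difference


# 0.78% threshold: risk difference 3.39%, test positivity 44% (values from the text).
mrs_low_threshold = mrs_from_risk_difference(0.0339, 0.44)

# 10% threshold: risk difference 12.58%, test positivity 4.5% (values from the text).
mrs_current_threshold = mrs_from_risk_difference(0.1258, 0.045)

print(f"{mrs_low_threshold:.4f} {mrs_current_threshold:.4f}")
```

Despite a risk difference almost four times larger, the 10% threshold gives the smaller MRS because so few women test positive.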
The right-axis of figure 2 (left panel) shows the AUC implied by each risk threshold. The 0.78%-5% risk-threshold sweetspot has AUC=0.69, indicating that only 38% of the maximum MRS of 4.4% is achieved by BRCAPRO. The MRS=1.7% reveals the risk-stratification implications of AUC=0.69.
MRS is maximized when the risk threshold equals disease prevalence, i.e. $R = \pi$ (see Appendix). Thus the sweetspot of risk thresholds that maximizes MRS/NBI, and Youden’s index and AUC, always includes prevalence, in this case, 2.3%.
Recall that, just as MRS is a function of the risk-threshold cutpoint $m$, so is NBI (equation 2). Figure 2 (right panel) examines MRS and NBI versus the sensitivity (% of BRCA1/2 mutations detected) of BRCAPRO as the risk threshold varies. This plot trades off MRS/NBI, which informs a woman of the informativeness of BRCAPRO for herself, versus sensitivity, which is the public-health perspective of identifying as many mutation-carriers as possible. The MRS/NBI risk-threshold sweetspot of 0.78%-5% identifies 45%-80% of BRCA1/2 mutation-carriers, while the traditional 10% threshold identifies only 28% of BRCA1/2 mutation-carriers. In the 0.78%-5% risk-threshold range, the NBI of around 0.85% means that BRCAPRO additively increases utility by 0.85% over random selection. More importantly, MRS/2 and NBI are very similar for the low risk thresholds of interest (below 10%). Thus the ranges of risk thresholds chosen by MRS and NBI will coincide and maximize both risk stratification and utility gain over random selection. Hence, BRCAPRO is optimally informative, when dichotomized at risk-thresholds between 0.78%-5%, to identify which Ashkenazi-Jewish women to refer for BRCA1/2 testing.
5.2 MRS/NBI and Net Benefit: Complementary perspectives
Net Benefit is an important modern approach to identify risk-thresholds where a marker/model is useful for clinical actions (Vickers and Elkin, 2006). We summarize Net Benefit, demonstrate its relationship to NBI, and compare insights from Net Benefit and MRS/NBI on risk thresholds for BRCA1/2 testing.
Recall that NBI subtracts the utility of random selection $U_{rand}$ from the utility of the test $U$, standardized by benefit $B$ (see section 3). In contrast, Net Benefit (NB) subtracts the utility of calling everyone negative ($U_{neg}$) from the utility of the test, standardized by benefit:
$$NB = \frac{U - U_{neg}}{B} = P(D+,M+) - \frac{R}{1-R}\,P(D-,M+),$$
where $R$ is the risk threshold.
In particular, the Net Benefit of calling everyone positive subtracts the utility of calling everyone negative from the utility of calling everyone positive $U_{pos}$:
$$NB_{pos} = \frac{U_{pos} - U_{neg}}{B} = \pi - \frac{R}{1-R}\,(1-\pi),$$
where $\pi$ is disease prevalence.
The goal of Net Benefit is to identify the range of risk thresholds where the marker/model provides more utility than all-or-nothing actions. These thresholds will be those where the Net Benefit is positive (where the test provides more utility than calling everyone negative) and greater than $NB_{pos}$ (where the test provides more utility than calling everyone positive).
Note that the Net Benefit of random selection (i.e. a test with $MRS = 0$ and positivity $p$) is not zero:
$$NB_{rand} = p\left\{\pi - \frac{R}{1-R}\,(1-\pi)\right\} = p\,NB_{pos}.$$
NBI equals the Net Benefit of the test minus the Net Benefit of random selection:
$$NBI = NB - NB_{rand}.$$
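The relationship between Net Benefit and NBI can be sketched numerically. The code below assumes the usual Net Benefit definition (true-positive probability minus the threshold odds times the false-positive probability) and treats random selection as a test whose result is independent of disease. The function names and the joint probabilities are hypothetical.

```python
def net_benefit(p_true_pos, p_false_pos, threshold):
    """Net Benefit: P(D+,M+) - R/(1-R) * P(D-,M+)."""
    return p_true_pos - threshold / (1 - threshold) * p_false_pos


def net_benefit_everyone(prevalence, threshold):
    """Net Benefit of calling everyone positive."""
    return net_benefit(prevalence, 1 - prevalence, threshold)


def net_benefit_random(prevalence, positivity, threshold):
    """Net Benefit of random selection with the same positivity as the test."""
    return positivity * net_benefit_everyone(prevalence, threshold)


# Hypothetical test: P(D+,M+) = 0.06, P(D-,M+) = 0.14, prevalence 10%, positivity 20%.
p_true_pos, p_false_pos = 0.06, 0.14
prevalence, positivity = 0.10, 0.20
mrs = 2 * (p_true_pos - prevalence * positivity)   # twice the covariance

threshold = 0.20
nbi = (net_benefit(p_true_pos, p_false_pos, threshold)
       - net_benefit_random(prevalence, positivity, threshold))
nbi_via_mrs = mrs / (2 * (1 - threshold))

# At threshold equal to prevalence, testing everyone and random selection both have zero Net Benefit.
nb_everyone_at_prevalence = net_benefit_everyone(prevalence, prevalence)
nb_random_at_prevalence = net_benefit_random(prevalence, positivity, prevalence)
print(nbi, nbi_via_mrs, nb_everyone_at_prevalence, nb_random_at_prevalence)
```

The difference of Net Benefits reproduces NBI as computed from MRS and the threshold alone, and both all-or-random baselines vanish when the threshold equals prevalence.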
The left panel of Figure 3 compares the Net Benefit for using the BRCAPRO risk model (solid line) to the Net Benefit for doing BRCA1/2 testing on everyone (dotted) and the Net Benefit for BRCA1/2 testing for no one (zero by definition). The Net Benefit of randomly selecting women for BRCA1/2 testing (dashed line) is always between the Net Benefits for testing everyone or no one (all 3 are zero when the risk threshold equals prevalence).
Note that the Net Benefit of the model is greater than that of testing everyone starting at risk thresholds around 1.7%, and remains positive until around 30%. According to Net Benefit, the BRCAPRO risk model appears useful for risk thresholds between 1.7%-30%. Below 1.7%, all Ashkenazi-Jews should be referred for mutation testing, and above 30%, none should be referred. Although 1.7% is within the 0.78%-5% MRS “sweetspot”, the BRCAPRO model is rather uninformative at risk thresholds above 10%. For example, the 30% threshold has MRS=0.79% (p=0.0005 vs. 1.7%) and AUC=0.59. For a 2.3% prevalence, an average change of 0.79% seems small.
Figure 3 (right panel) compares NBI to Net Benefit Gain: the increase in Net Benefit versus testing everyone or no one. NBI and Net Benefit coincide when the risk threshold equals prevalence (see Webappendix), where they and MRS achieve their maximum (see Appendix). Thus the ranges of useful risk-thresholds, as determined by Youden’s index, AUC, MRS/NBI and Net Benefit, will all overlap at disease prevalence.
Note that Net Benefit Gain is nearly zero for risk thresholds 0.5%-1%, where NBI remains near its peak. MRS/NBI and Net Benefit answer different questions and thus present different perspectives. At the 0.78% risk-threshold, MRS/NBI are at their peak and the BRCAPRO risk-model is optimally informative, but Net Benefit implies that the model should not be used and all Ashkenazi-Jews should undergo BRCA1/2 testing. MRS/NBI emphasize that the model is optimally informative at 0.78%, identifying 80% of BRCA1/2 mutation-carriers while testing only 44% of Ashkenazi-Jews. In contrast, Net Benefit notes that a 0.78% threshold implies that a rational person trades-off 127 false-positives for 1 true-positive. False-positives are unimportant, and one should not use the model (even though it is optimally informative) but rather refer all Ashkenazi-Jews for BRCA1/2 testing.
In summary, Net Benefit emphasizes that some risk thresholds are so low that false-positives are unimportant and thus everyone should be tested. MRS/NBI emphasize that the risk model is optimally informative, refers only a minority for BRCA1/2 testing yet identifies the big majority of mutation-carriers. We find both perspectives illuminating.
5.3 Comparing MRS/NBI to other risk prediction metrics: Importance of prevalence
BRCA1/2 mutations induce the same cancer risk for Ashkenazi-Jews and the general-population (Kuchenbaecker et al., 2017). Only mutation prevalence varies between populations. This situation allows us to isolate the effect of prevalence on risk-prediction metrics.
Because we do not have comparable data on BRCA1/2 mutations in the general-population, we will approximate MRS/NBI for the general-population by combining the general-population mutation-prevalence with sensitivity/specificity from the WAS (see section 4.1). Because pathogenic BRCA1/2 mutations have the same risk for cancer regardless of population, the Bayes Factor used by BRCAPRO is the same regardless of population; only the prior distribution (i.e. mutation prevalence) differs by population (Katki, 2006). Thus we use BRCAPRO to calculate mutation carrier-risks on WAS data, but substitute the mutation prevalence in the general-population (0.26%) as the prior. To calculate specificity and prevalence, we weight non-carriers by a factor of 9 to ensure that mutation-prevalence is 10-times smaller than in the WAS. Figure 4 shows that Ashkenazi-Jews and the general population indeed have similar ROC curves and Lorenz curves. This reinforces that ROC, AUC, and Lorenz curves cannot distinguish between populations when only disease prevalence differs between them.
Figure 5 examines MRS and Net Benefit among Ashkenazi-Jews and in the general population. Because NBI and MRS/2 in the general population are almost identical for risk thresholds below 10% (not shown), we focus on MRS. For the general population, the MRS/NBI sweetspot (left panel) of 0.07%-0.56% yields MRS of only around 0.20%. In this sweetspot, the AUC=0.69, the same as for Ashkenazi-Jews in their sweetspot of 0.78%-5%. However, the MRSs are very different because prevalences are very different. MRS and NBI reveal the very different risk-stratification, and decision-theoretic, implications of equal AUC in populations with different prevalences.
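The effect of prevalence at a fixed AUC can be reproduced approximately from the rounded figures in the text. This sketch assumes the dichotomized-marker relationship in which MRS equals 2 times (2 AUC minus 1) times prevalence times (1 minus prevalence); the function name is hypothetical and the results are approximate because the inputs are rounded.

```python
def mrs_from_auc(auc, prevalence):
    """MRS implied by the AUC of a dichotomized marker and the disease prevalence."""
    youden = 2 * auc - 1
    return 2 * youden * prevalence * (1 - prevalence)


auc = 0.69   # AUC in each population's sweetspot, as reported in the text

# Ashkenazi-Jewish prevalence (2.3%) versus general-population prevalence (0.26%).
mrs_ashkenazi = mrs_from_auc(auc, 0.023)
mrs_general = mrs_from_auc(auc, 0.0026)

print(f"{mrs_ashkenazi:.4f} {mrs_general:.4f}")
```

The same AUC of 0.69 corresponds to an MRS of about 1.7% at 2.3% prevalence but only about 0.20% at 0.26% prevalence, in line with the values reported for the two populations.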
Both MRS/NBI and Net Benefit (Fig. 5, right panel) concur that BRCAPRO is generally more useful for Ashkenazi-Jews than the general-population. However, MRS notes that, if the risk threshold were below 0.12%, then BRCAPRO is actually more informative in the general population. Below 0.12% there are hardly any mutation-carriers among Ashkenazi-Jews, but 33% of mutation-carriers remain in the general population.
According to Net Benefit, the BRCAPRO model in the general-population is best between risk thresholds of 0.12% to 2.5%. In this range, AUC drops from 0.69 to 0.61, and MRS is nearly halved from 0.20% to 0.11%. Given that mutation-prevalence is 0.26%, reductions in MRS much below 0.20% seem like substantial losses of informativeness.
At certain low risk thresholds, such as 0.26%, Net Benefit suggests mutation-testing for all Ashkenazi-Jews but using the BRCAPRO model to select general-population women for mutation testing. MRS/NBI can be seen to agree with Net Benefit. For Ashkenazi-Jews, the model has rather low MRS at 0.26% compared to other risk thresholds, suggesting that the BRCAPRO model is not useful for Ashkenazi-Jews at a 0.26% threshold. For the general-population, the model is optimally informative at 0.26%, although both the MRS and Net Benefit are low.
Dismayingly, at the current 10% risk-threshold in the general-population, Net Benefit finds that no one should undergo BRCA1/2 testing (i.e. the dashed line is below zero at 10%). In contrast, MRS/NBI note only that the risk-model is uninformative at the 10% threshold (MRS=0.05%), and makes no judgment about the value of BRCA1/2 testing.
To better understand the diagnostic performance of markers and risk-models at various risk thresholds, we introduced two new broadly applicable, linked metrics: Mean Risk Stratification (MRS) and Net Benefit of Information (NBI). We presented 4 key results. First, Youden’s index and AUC reflect (1) the fraction of the maximum possible utility gain over random selection (i.e. NBI) that is attained by the test and (2) the fraction of the maximum possible risk-stratification (i.e. MRS) that is attained by the test. MRS and NBI provide Youden’s index and AUC with decision-theoretic and risk-stratification rationale, the lack of which has long been a criticism of Youden’s index and AUC. Second, for rare diseases, high AUC does not imply high risk-stratification or utility gain over random selection. AUC must be considered in light of disease prevalence, which is automatically done by MRS and NBI. Third, NBI is a function of only MRS and the risk-threshold for action, connecting decision-theory to risk-stratification and providing a decision-theoretic rationale for MRS. Last, MRS and NBI provide a range of risk thresholds for which the risk model is “optimally informative”: risk-thresholds that maximize both risk-stratification and the utility gain over random selection.
We proposed an eclectic approach, using AUC, MRS/NBI, and Net Benefit, to evaluate risk-thresholds for the BRCAPRO risk model to refer women for BRCA1/2 mutation-testing. Although the AUC is essentially the same for both the general population and for Ashkenazi-Jews, both MRS/NBI and Net Benefit note that mutation testing is more valuable for Ashkenazi-Jews, for whom mutations are 10-times more prevalent. Interestingly, MRS/NBI notes that at extremely low risk thresholds, BRCAPRO is actually more informative in the general population, because these risk-thresholds are so low that nearly all Ashkenazi-Jews would be tested anyway.
MRS/NBI and Net Benefit address complementary questions: MRS/NBI quantify the utility of the information in a test, while Net Benefit quantifies the utility versus all-or-nothing actions. In the BRCA1/2 example, both perspectives were valuable. Net Benefit emphasizes that risk thresholds below 1.7% are so low that false-positives are unimportant and thus all Ashkenazi-Jews should be tested; instead, risk-thresholds up to 30% could be considered for using the BRCAPRO model. MRS/NBI emphasize that for risk thresholds in 0.78% to 5%, the BRCAPRO model is optimally informative, referring only a minority of Ashkenazi-Jews for BRCA1/2 testing yet identifying the big majority of mutation-carriers. At these risk thresholds, the MRS=1.7%, meaning that a woman who uses BRCAPRO will learn that her risk of carrying a mutation will increase or decrease by 1.7% on average. An average change in risk of 1.7% seems meaningful because it is similar to pretest risk (i.e. mutation prevalence) of 2.3%. MRS/NBI note a substantial loss of BRCAPRO model informativeness at risk thresholds above 5%, and especially 30%. The ranges of useful risk-thresholds, as determined by Youden’s index, AUC, MRS, NBI, and Net Benefit, will always overlap at risk threshold equaling disease prevalence.
Dismayingly, at the current 10% risk-threshold in the general-population, Net Benefit finds that no one should undergo BRCA1/2 testing. According to Net Benefit, BRCA1/2 testing is never worthwhile at a 10% threshold, because it implies that one is willing to trade-off only 9 false-positives for 1 true-positive. However, current BRCA1/2 testing at 10% is widely agreed to be a major success (Rizk, 2017) and is recommended in the US (Moyer, 2014) and by the UK NICE (NICE, 2017). Stopping all BRCA1/2 testing would be considered absurd. This example suggests limits for applying rational risk-threshold theory to develop medical guidelines. In contrast, MRS/NBI make no judgment about the value of the actual intervention (i.e. genetic testing), only noting that the BRCAPRO model at the 10% threshold is rather uninformative.
However, we caution that neither MRS/NBI nor Net Benefit can determine the risk threshold for action. Prespecified utilities determine risk-thresholds. If a threshold is not optimally informative, that fact should be noted but does not disqualify the threshold.
MRS/NBI reinforce that disease prevalence and test-positivity are crucial for evaluating risk-stratification and interpreting AUC. An AUC=0.6 achieves only 20% of maximum risk-stratification/NBI and AUC=0.95 is required to achieve 90%. Furthermore, there is little risk-stratification or NBI possible for rare diseases or for rarely positive tests. MRS/NBI readily interpret AUC in light of disease prevalence, making it immediately clear why BRCA1/2 testing is less valuable in the general population, where mutations are rare, but might be more valuable for Ashkenazi-Jews, where mutations are more common. Although experts are aware of the importance of disease prevalence (Baker et al., 2009, Sec. 10), or can fix the ROC curve to account for prevalence (Hilden, 1991) (see Webappendix), MRS is a simple metric, intuitive to scientists, that automatically does the job. MRS/NBI could be routinely calculated to better interpret AUC.
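The percentages quoted for AUC can be reproduced with a few lines. This is a minimal sketch, assuming a binary test so that AUC = (1 + J)/2 (Cantor and Kattan, 2000) and using MRS = 2Jπ(1−π); the function names and the prevalences other than the 2.3% pretest risk quoted earlier are hypothetical.

```python
def youden_from_auc(auc):
    """Youden's index J of a binary test with the given AUC, using AUC = (1 + J) / 2."""
    return 2.0 * auc - 1.0


def max_mrs(prevalence):
    """Largest possible MRS at a given prevalence, attained by a perfect test (J = 1)."""
    return 2.0 * prevalence * (1.0 - prevalence)


def mrs_from_auc(auc, prevalence):
    """MRS = 2 * J * pi * (1 - pi) for a binary test."""
    return youden_from_auc(auc) * max_mrs(prevalence)


if __name__ == "__main__":
    for auc in (0.6, 0.95):
        print(f"AUC={auc}: fraction of maximum MRS = {youden_from_auc(auc):.2f}")
        for prevalence in (0.001, 0.023, 0.5):   # 0.001 and 0.5 are illustrative
            print(f"  prevalence={prevalence}: MRS = {mrs_from_auc(auc, prevalence):.5f}")
```

The fraction of the maximum depends only on the AUC, while the attainable MRS itself shrinks with prevalence, which is the sense in which little risk-stratification is possible for rare diseases.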
Although we introduced both MRS and NBI, we prefer to focus on MRS when possible. MRS has a concrete risk-stratification interpretation, which we find to be more appealing to scientists than the abstract ”utility” interpretation of decision-theoretic metrics. MRS measures association via its simple formula as twice the cross-product difference of joint probabilities in a 2x2 table, which is easy to grasp in analogy with odds ratios. At low risk thresholds, as would be reasonable for rare conditions such as cancer or BRCA1/2 mutations, the NBI is close to MRS/2 and thus either NBI or MRS could be used. The main value of NBI is as a decision-theoretic rationale for using MRS in practice.
MRS/NBI are valuable to include in an eclectic approach to evaluating tests that includes AUC and Net Benefit. MRS/NBI reveal the risk-stratification meaning of AUC, and provide a complementary perspective to Net Benefit for considering risk thresholds for action. Much work remains to be done to make MRS/NBI usable in practice, especially for longitudinal markers and models necessary for disease screening (Sweeting and Thompson, 2012). Our MRS webtool is available.
This research was supported by the Intramural Research Program of the NIH/NCI. We thank Mark Schiffman and Anil Chaturvedi for their long-standing support and discussions. We thank Ionut Bebu and Holly Janes for valuable comments on prior drafts. We are indebted to our late mentor, collaborator, and friend Sholom Wacholder for his support. We thank Christine Fermo and Sue Pan for helping develop the MRS Webtool.
MRS is maximized when dichotomizing at disease prevalence
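Equation 3 notes that $MRS = 2J_c\pi(1-\pi)$, where $J_c$ is Youden’s index calculated at cutpoint $c$. MRS is maximized as a function of cutpoint $c$ when Youden’s index is maximized, which occurs when dichotomizing at disease prevalence: $P(D+|M=c)=\pi$. To prove this we differentiate Youden’s index as a function of $c$
with respect to the cutpoint using the Leibniz Integral Rule, writing $f(m|D+)$ and $f(m|D-)$ for the marker densities among the diseased and non-diseased and calling $M>c$ test-positive:
$$\frac{dJ_c}{dc} = \frac{d}{dc}\left\{\int_c^{\infty} f(m|D+)\,dm + \int_{-\infty}^{c} f(m|D-)\,dm - 1\right\} = -f(c|D+) + f(c|D-)$$
Setting the derivative equal to zero, and using Bayes’ rule:
$$f(c|D+) = f(c|D-) \iff \frac{P(D+|M=c)}{\pi} = \frac{1-P(D+|M=c)}{1-\pi} \iff P(D+|M=c) = \pi$$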
Thus dichotomizing the marker/model at disease prevalence maximizes Youden’s index and thus MRS. Thus the “sweet spot” of risk-thresholds that maximize MRS will always include disease prevalence. At this cutpoint, MRS equals Total Gain (Bura and Gastwirth, 2001) (see Webappendix).
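A numerical check of this result is straightforward. The sketch below assumes a binormal marker (M ~ N(0,1) among the non-diseased and M ~ N(shift,1) among the diseased), which is an illustrative choice and not a model used in the paper; all function names are hypothetical. It searches a grid for the cutpoint maximizing Youden's index and evaluates the risk at that cutpoint.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def youden(cut, shift):
    """J at cutpoint `cut` when M|D- ~ N(0,1), M|D+ ~ N(shift,1); test-positive is M > cut."""
    sensitivity = 1.0 - normal_cdf(cut - shift)
    specificity = normal_cdf(cut)
    return sensitivity + specificity - 1.0


def risk_at(cut, shift, prevalence):
    """P(D+ | M = cut) by Bayes' rule."""
    diseased = prevalence * normal_pdf(cut - shift)
    healthy = (1.0 - prevalence) * normal_pdf(cut)
    return diseased / (diseased + healthy)


def best_cutpoint(shift, lo=-5.0, hi=5.0, steps=10001):
    """Grid search for the cutpoint maximizing Youden's index."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda c: youden(c, shift))


if __name__ == "__main__":
    shift, prevalence = 1.5, 0.023        # illustrative values
    cut = best_cutpoint(shift)
    print("cutpoint maximizing J:", cut)
    print("risk at that cutpoint:", risk_at(cut, shift, prevalence))
    print("prevalence:", prevalence)
```

In this setup the maximizing cutpoint does not depend on prevalence, yet the risk at that cutpoint tracks whatever prevalence is supplied, which is what the derivation predicts.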
MRS measures association: MRS is twice the covariance of disease and marker
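Recall that $\pi=P(D+)$ and $p=P(M+)$. Rewriting MRS equation (1), with $D$ and $M$ also denoting the 0/1 indicators of disease and of a positive test:
$$MRS = \{P(D+|M+)-\pi\}\,p + \{\pi-P(D+|M-)\}(1-p) = 2\{P(D+,M+)-\pi p\} = 2\{E(DM)-E(D)E(M)\} = 2\,Cov(D,M)$$
MRS is simply twice the covariance of $D$ and $M$. MRS is zero if and only if disease and marker are independent. Negative MRS means that $M+$ is inversely associated with disease. When test positive/negative are interchanged, MRS changes sign.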
Other association measures, such as Pearson’s correlation, the Phi coefficient, Yule’s Q, and Cohen’s Kappa, use MRS as a numerator but standardize it with different denominators. MRS is the numerator of the Mantel-Haenszel and Cochran’s tests (Lachin, 2000, Ch 2.6).
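The Webappendix demonstrates that MRS also equals twice the departure of any of the 4 joint probabilities of $D$ and $M$ from the product of their margins, up to sign:
$$MRS = 2\{P(D+,M+)-\pi p\} = 2\{P(D-,M-)-(1-\pi)(1-p)\} = -2\{P(D+,M-)-\pi(1-p)\} = -2\{P(D-,M+)-(1-\pi)p\}$$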
MRS is twice the cross-product difference of joint probabilities inside 2x2 tables
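Denote $a=P(D+,M+)$, $b=P(D-,M+)$, $c=P(D+,M-)$, and $d=P(D-,M-)$, so that $\pi=a+c$, $p=a+b$, and $a+b+c+d=1$. Substituting into MRS equation (5) (see above) yields
$$MRS = 2\{a-(a+c)(a+b)\} = 2\{a(1-a-b-c)-bc\} = 2(ad-bc)$$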
MRS is simply twice the cross-product difference of the joint probabilities in the interior of the 2x2 table. The cross-product difference is also the determinant of the 2x2 table as a matrix. In contrast, the odds ratio (OR) is the cross-product ratio. Being a ratio, the OR is dimensionless, while the MRS is on the scale of risk differences. MRS as a cross-product difference is easy for scientists to remember.
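The equivalent expressions for MRS can be checked numerically on any 2x2 table. The sketch below is illustrative: the cell labels (a = P(D+,M+), b = P(D−,M+), c = P(D+,M−), d = P(D−,M−)), the function name, and the example table are assumptions made here rather than values from the paper.

```python
def mrs_forms(a, b, c, d):
    """Return MRS computed four equivalent ways from the joint probabilities of a 2x2 table.

    Assumed labels: a = P(D+, M+), b = P(D-, M+), c = P(D+, M-), d = P(D-, M-).
    """
    prevalence = a + c                      # pi = P(D+)
    positivity = a + b                      # p = P(M+)
    ppv = a / positivity                    # P(D+ | M+)
    cnpv = c / (c + d)                      # P(D+ | M-)
    sensitivity = a / prevalence
    specificity = d / (b + d)
    youden = sensitivity + specificity - 1.0
    return {
        "average risk change": (ppv - prevalence) * positivity
                               + (prevalence - cnpv) * (1.0 - positivity),
        "2 * covariance": 2.0 * (a - prevalence * positivity),
        "2 * cross-product difference": 2.0 * (a * d - b * c),
        "2 * J * pi * (1 - pi)": 2.0 * youden * prevalence * (1.0 - prevalence),
    }


if __name__ == "__main__":
    # Hypothetical joint probabilities that sum to 1.
    for name, value in mrs_forms(0.02, 0.10, 0.01, 0.87).items():
        print(f"{name}: {value:.6f}")
```

Interchanging test-positive and test-negative (swapping a with c and b with d) flips the sign of every form, matching the sign behavior described for the covariance.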
Additional materials can be found in the Webappendix at the end.
- Antoniou et al. (2004) Antoniou, A., P. P. D. Pharoah, P. Smith, and D. Easton (2004, Oct). The BOADICEA model of genetic susceptibility to breast and ovarian cancer. Br J Cancer 91(8), 1580–90.
- Baker et al. (2009) Baker, S. G., N. R. Cook, A. Vickers, and B. S. Kramer (2009, October). Using relative utility curves to evaluate risk prediction. Journal of the Royal Statistical Society. Series A, (Statistics in Society) 172, 729–748.
- Best et al. (2017) Best, A. F., M. A. Tucker, M. N. Frone, M. H. Greene, J. A. Peters, and H. A. Katki (2017). To test or not to test: Selection criteria for population-based BRCA1/2 mutation screening. Submitted.
- Bura and Gastwirth (2001) Bura, E. and J. L. Gastwirth (2001). The binary regression quantile plot: Assessing the importance of predictors in binary regression visually. Biometrical Journal 43(1), 5–21.
- Cantor and Kattan (2000) Cantor, S. B. and M. W. Kattan (2000). Determining the area under the ROC curve for a binary diagnostic test. Med Decis Making 20(4), 468–470.
- Cook (2007) Cook, N. R. (2007, Feb). Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation 115(7), 928–935.
- Copas (1999) Copas, J. (1999). The effectiveness of risk scores: The logit rank plot. Journal of the Royal Statistical Society. Series C (Applied Statistics) 48(2), 165–183.
- Gail and Pfeiffer (2005) Gail, M. H. and R. M. Pfeiffer (2005, April). On criteria for evaluating models of absolute risk. Biostatistics (Oxford, England) 6, 227–239.
- Greenhouse et al. (1950) Greenhouse, S. W., J. Cornfield, and F. Homburger (1950, Nov). The Youden index: letters to the editor. Cancer 3(6), 1097–1101.
- Hanley and McNeil (1982) Hanley, J. A. and B. J. McNeil (1982, April). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36.
- Hilden (1991) Hilden, J. (1991). The area under the ROC curve and its competitors. Medical Decis Making 11, 95–101.
- Huang et al. (2007) Huang, Y., M. Sullivan Pepe, and Z. Feng (2007). Evaluating the predictiveness of a continuous marker. Biometrics 63(4), 1181–1188.
- Katki (2006) Katki, H. A. (2006, June). Effect of misreported family history on Mendelian mutation prediction models. Biometrics 62(2), 478–487.
- King et al. (2014) King, M.-C., E. Levy-Lahad, and A. Lahad (2014, September). Population-based screening for BRCA1 and BRCA2: 2014 Lasker Award. JAMA 312, 1091–1092.
- Kraemer (1992) Kraemer, H. C. (1992). Evaluating Medical Tests: Objective and Quantitative Guidelines. Newbury Park, CA: Sage Publications Inc.
- Kraemer (2004) Kraemer, H. C. (2004, Jan). Reconsidering the odds ratio as a measure of 2x2 association in a population. Stat Med 23(2), 257–270.
- Kuchenbaecker et al. (2017) Kuchenbaecker, K. B., J. L. Hopper, D. R. Barnes, et al. (2017, June). Risks of breast, ovarian, and contralateral breast cancer for BRCA1 and BRCA2 mutation carriers. JAMA 317, 2402–2416.
- Lachin (2000) Lachin, J. M. (2000). Biostatistical Methods: The Assessment of Relative Risks. New York: Wiley-Interscience.
- Moyer (2014) Moyer, V. A. (2014, February). Risk assessment, genetic counseling, and genetic testing for BRCA-related cancer in women: U.S. Preventive Services Task Force recommendation statement. Ann Int Med 160, 271–281.
- NICE (2017) NICE (2017). Familial breast cancer: classification, care and managing breast cancer and related risks in people with a family history of breast cancer, recommendation 1.5.11. Technical report, National Institute for Health and Care Excellence Clinical Guidance https://www.nice.org.uk/guidance/cg164/chapter/Recommendations#genetic-testing.
- Parmigiani et al. (1998) Parmigiani, G., D. Berry, and O. Aguilar (1998, Jan). Determining carrier probabilities for breast cancer-susceptibility genes BRCA1 and BRCA2. Am J Hum Genet 62(1), 145–158.
- Pauker and Kassirer (1980) Pauker, S. G. and J. P. Kassirer (1980, May). The threshold approach to clinical decision making. N Engl J Med 302(20), 1109–1117.
- Pencina et al. (2008) Pencina, M. J., R. B. D’Agostino Sr., R. B. D’Agostino Jr., and R. S. Vasan (2008, Jan). Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med 27(2), 157–72; discussion 207–12.
- Pepe et al. (2004) Pepe, M. S., H. Janes, G. Longton, W. Leisenring, and P. Newcomb (2004, May). Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol 159(9), 882–890.
- Rizk (2017) Rizk, C. (2017). Researchers debate merits of population-wide genetic testing at AACR. Technical report, GenomeWeb https://www.genomeweb.com/genetic-research/researchers-debate-merits-population-wide-genetic-testing-aacr.
- Staff Reporter (2017) Staff Reporter (2017). Veritas Genetics to provide BRCA testing for Canadian hereditary cancer screening effort. Technical report, GenomeWeb https://www.genomeweb.com/clinical-sequencing/veritas-genetics-provide-brca-testing-canadian-hereditary-cancer-screening.
- Struewing et al. (1997) Struewing, J. P., P. Hartge, S. Wacholder, S. M. Baker, M. Berlin, M. McAdams, M. M. Timmerman, L. C. Brody, and M. A. Tucker (1997). The risk of cancer associated with specific mutations of BRCA1 and BRCA2 among Ashkenazi Jews. N. Engl. J. Med. 336, 1401–1408.
- Sweeting and Thompson (2012) Sweeting, M. J. and S. G. Thompson (2012, Apr). Making predictions from complex longitudinal data, with application to planning monitoring intervals in a national screening programme. J R Stat Soc Ser A Stat Soc 175(2), 569–586.
- Vickers and Elkin (2006) Vickers, A. J. and E. B. Elkin (2006). Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 26, 565–574.
- Wentzensen and Wacholder (2013) Wentzensen, N. and S. Wacholder (2013, Feb). From differences in means between cases and controls to risk stratification: a business plan for biomarker development. Cancer Discov 3(2), 148–157.
- Youden (1950) Youden, W. J. (1950, Jan). Index for rating diagnostic tests. Cancer 3(1), 32–35.
Appendix A Total Gain is a special case of Mean Risk Stratification
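Total Gain (TG) measures the explanatory power of a continuous covariate $M$ in a binary regression model $P(D+|M=m)=G(m)$, where
$G$ is typically a logistic regression (Bura and Gastwirth, 2001). Denote overall disease prevalence as $\pi$. Choose cutpoint $c$ such that $M$ is cut at disease prevalence: $G(c)=\pi$. Then, taking $G$ to be increasing so that $M>c$ is test-positive,
$$TG = E|G(M)-\pi| = E[\{G(M)-\pi\}1(M>c)] + E[\{\pi-G(M)\}1(M\le c)] = \{P(D+|M>c)-\pi\}P(M>c) + \{\pi-P(D+|M\le c)\}P(M\le c) = MRS$$
Thus TG is MRS for continuous $M$ cut at disease prevalence. MRS is valid for discrete/continuous $M$ and allows any cutpoint of $M$. Unlike MRS, TG is non-negative.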
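The relationship between Total Gain and MRS can be illustrated on a small discrete risk distribution. This sketch assumes Total Gain is the mean absolute deviation of individual risk from prevalence, E|risk − π|, and computes MRS as 2{P(D+,M+) − πp}; the strata, their weights, and the function names are hypothetical.

```python
def prevalence_of(weights, risks):
    """Overall disease prevalence: the weighted average of stratum risks."""
    return sum(w * r for w, r in zip(weights, risks))


def total_gain(weights, risks):
    """TG = E|risk - prevalence| over strata with probabilities `weights` (assumed definition)."""
    prevalence = prevalence_of(weights, risks)
    return sum(w * abs(r - prevalence) for w, r in zip(weights, risks))


def mrs_at_cut(weights, risks, cut):
    """MRS when strata with risk above `cut` are called test-positive."""
    prevalence = prevalence_of(weights, risks)
    positivity = sum(w for w, r in zip(weights, risks) if r > cut)      # P(M+)
    joint_pos = sum(w * r for w, r in zip(weights, risks) if r > cut)   # P(D+, M+)
    return 2.0 * (joint_pos - prevalence * positivity)


if __name__ == "__main__":
    weights = [0.50, 0.30, 0.15, 0.05]      # hypothetical stratum probabilities
    risks = [0.005, 0.02, 0.05, 0.30]       # hypothetical stratum risks
    prevalence = prevalence_of(weights, risks)
    print("Total Gain:", total_gain(weights, risks))
    print("MRS cut at prevalence:", mrs_at_cut(weights, risks, prevalence))
    print("MRS cut at 10% risk:", mrs_at_cut(weights, risks, 0.10))
```

Cutting at prevalence reproduces Total Gain, while another cutpoint gives a smaller MRS in this example.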
Appendix B MRS equals twice the deviation of any joint probability from the product of its marginals
This section proves that MRS can be written as twice the deviation of any of the 4 joint probabilities inside a 2x2 table from the product of its corresponding marginals. Recall that and . Thus
Denote and . First, substituting yields MRS as a deviation from :
Second, substituting yields MRS as a deviation from :
Third, substituting yields MRS as a deviation from :
Fourth and final, substituting yields MRS as a deviation from :
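The four identities can also be verified in a few lines. The following is a sketch in notation assumed here, with $\pi=P(D+)$, $p=P(M+)$, and $\delta=P(D+,M+)-\pi p$; it is not necessarily the sequence of substitutions used in the original derivation.

```latex
\begin{align*}
P(D+,M+) - \pi p &= \delta, \\
P(D+,M-) - \pi(1-p) &= \{\pi - P(D+,M+)\} - \pi(1-p) = -\delta, \\
P(D-,M+) - (1-\pi)p &= \{p - P(D+,M+)\} - (1-\pi)p = -\delta, \\
P(D-,M-) - (1-\pi)(1-p) &= \{1 - \pi - p + P(D+,M+)\} - (1-\pi)(1-p) = \delta.
\end{align*}
```

Because MRS is twice the covariance of disease and marker, $MRS = 2\delta$, so each joint probability departs from the product of its margins by $MRS/2$ in absolute value, with a positive sign in the two concordant cells and a negative sign in the two discordant cells.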
Appendix C Relationship of Net Benefit to Net Benefit of Information
Denote and . Net Benefit (NB), ignoring test costs, is
by substituting . Then
Now starting from NBI, and substituting MRS equation (7) from the main paper
Thus if , if , and at ( is Youden’s index).
Appendix D Variance of MRS, and hypothesis testing for two MRSs
Table 1: Simulation of MRS, Youden’s index, their standard errors (se), and 95% confidence interval (CI) coverage, at the 0.78%, 10%, and 30% thresholds.
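Asymptotic variances for MRS and Youden’s index follow from applying the delta method to the quadrinomial variance matrix from a 2x2 table with $n$ as the sample size and cell probabilities $\theta=(a,b,c,d)^T$. Each variance is $g^T\Sigma g$ where $g$ is the usual gradient of the quantity. $\Sigma$ is the usual quadrinomial variance matrix of the cell probabilities, for total sample size $n$:
$$\Sigma = \frac{1}{n}\left\{\mathrm{diag}(\theta) - \theta\theta^T\right\}$$
Recall that MRS is twice the cross-product difference of joint probabilities inside a 2x2 table: $MRS=2(ad-bc)$ (see main paper Appendix). The variance of MRS is based on the gradient $g=2(d,-c,-b,a)^T$. Calculating $g^T\Sigma g$ yields the variance
$$Var(MRS) = \frac{4}{n}\left\{ad(a+d) + bc(b+c) - 4(ad-bc)^2\right\}$$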
This variance requires only the sample proportions of the joint probabilities. It does not require fixed or a priori known test positivity or disease prevalence.
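A delta-method variance of this kind can be checked against simulation at small scale. The sketch below assumes MRS = 2(ad − bc) with cell probabilities ordered (a, b, c, d) and the quadrinomial covariance {diag(θ) − θθᵀ}/n. The cell probabilities, sample size, and replicate count are hypothetical and far smaller than the 1 million replicates reported for Table 1.

```python
import random


def mrs(a, b, c, d):
    """MRS as twice the cross-product difference of the joint probabilities."""
    return 2.0 * (a * d - b * c)


def mrs_variance(a, b, c, d, n):
    """Delta-method variance of the estimated MRS from n quadrinomial observations."""
    theta = (a, b, c, d)
    grad = (2.0 * d, -2.0 * c, -2.0 * b, 2.0 * a)       # gradient of 2(ad - bc)
    mean_grad = sum(t * g for t, g in zip(theta, grad))
    second_moment = sum(t * g * g for t, g in zip(theta, grad))
    # g' Sigma g with Sigma = (diag(theta) - theta theta') / n
    return (second_moment - mean_grad ** 2) / n


def simulate_mrs_variance(a, b, c, d, n, reps, seed=1):
    """Empirical variance of the MRS estimate across simulated 2x2 tables."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        draws = rng.choices(range(4), weights=(a, b, c, d), k=n)
        props = [draws.count(cell) / n for cell in range(4)]
        estimates.append(mrs(*props))
    mean = sum(estimates) / reps
    return sum((e - mean) ** 2 for e in estimates) / (reps - 1)


if __name__ == "__main__":
    cells = (0.02, 0.10, 0.01, 0.87)     # hypothetical cell probabilities
    print("delta-method variance:", mrs_variance(*cells, n=1000))
    print("simulated variance:   ", simulate_mrs_variance(*cells, n=1000, reps=1000))
```

Only the four sample proportions enter the calculation, so neither test positivity nor prevalence needs to be fixed in advance.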
Table 1 examines the properties of the MRS variance and MRS confidence interval coverage by simulation. We did 1 million simulations for each of 3 quadrinomial 2x2 tables based on the 3 cutpoints we considered in the Washington Ashkenazi Study (WAS) : 0.78%, 10%, and 30%. The quadrinomials were based on sample size of 4589 (as in WAS), with expectations for the cell counts as:
In all cases, MRS and its variance were estimated with little bias, and 95% confidence intervals performed nominally (Table 1).
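To ensure proper MRS confidence intervals, note that $-0.5 \le MRS \le 0.5$. This is easy to see based on the MRS expression $MRS=2J\pi(1-\pi)$, where $J$ is Youden’s index and $\pi$ is disease prevalence. The maximum/minimum MRS of $\pm 0.5$ occurs when $J=\pm 1$ and $\pi=0.5$. Thus $0 \le MRS+0.5 \le 1$ and a logit transformation of $MRS+0.5$ will ensure that confidence intervals are proper. Applying the delta method yields
$$Var\{\mathrm{logit}(MRS+0.5)\} = \frac{Var(MRS)}{\{(0.5+MRS)(0.5-MRS)\}^2}$$
This variance is used to calculate confidence intervals on the logit scale. Then, convert back to the MRS scale by applying to each endpoint the inverse function
$$x \mapsto \frac{e^{x}}{1+e^{x}} - 0.5$$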
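The back-transformed interval can be sketched as follows, assuming the transformation is logit(MRS + 0.5) so that endpoints stay inside (−0.5, 0.5). The function name and the numeric inputs are hypothetical.

```python
import math


def mrs_confidence_interval(mrs, variance, z=1.96):
    """Approximate 95% CI for MRS, built on the logit(MRS + 0.5) scale and mapped back."""
    shifted = mrs + 0.5                                   # lies strictly between 0 and 1
    logit = math.log(shifted / (1.0 - shifted))
    # Delta method: d/dx logit(x + 0.5) = 1 / {(0.5 + x)(0.5 - x)}
    se_logit = math.sqrt(variance) / (shifted * (1.0 - shifted))
    lower, upper = logit - z * se_logit, logit + z * se_logit

    def expit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return expit(lower) - 0.5, expit(upper) - 0.5


if __name__ == "__main__":
    # Hypothetical estimate and variance, for illustration only.
    print(mrs_confidence_interval(0.017, 1e-5))
```

Unlike a symmetric interval on the MRS scale, the endpoints cannot fall outside the attainable range even when the estimate is near a boundary.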
D.1 Hypothesis testing if two MRSs are equal
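In general, testing if two independent MRSs differ can be based on the difference of the two MRSs, whose variance would be $Var(MRS_1)+Var(MRS_2)$. But if the MRSs are calculated within the same population, and hence the same prevalence, the ratio of MRSs is a better statistic. This is because the nuisance parameter of prevalence cancels out, leaving the ratio of Youden’s indices $J_1/J_2$:
$$\frac{MRS_1}{MRS_2} = \frac{2J_1\pi(1-\pi)}{2J_2\pi(1-\pi)} = \frac{J_1}{J_2}$$
We will calculate the variance of the log of the ratio of two independent Youden’s indices, based on a quadrinomial likelihood for 2x2 tables.
Recall that Youden’s index can be written in terms of the joint probabilities in a 2x2 table, with $a=P(D+,M+)$, $b=P(D-,M+)$, $c=P(D+,M-)$, and $d=P(D-,M-)$:
$$J = \frac{a}{a+c} + \frac{d}{b+d} - 1$$
Then, to first calculate the variance of a single Youden’s index, the gradient with respect to $(a,b,c,d)$ is
$$g_J = \left(\frac{c}{(a+c)^2},\; -\frac{d}{(b+d)^2},\; -\frac{a}{(a+c)^2},\; \frac{b}{(b+d)^2}\right)^T$$
The variance of a single Youden’s index is $g_J^T\Sigma g_J = \frac{1}{n}\left\{\frac{ac}{(a+c)^3} + \frac{bd}{(b+d)^3}\right\}$, where $\Sigma$ is the quadrinomial variance matrix of the cell probabilities. Table 1 shows that the variance for Youden’s index is unbiased and 95% confidence intervals performed nominally.
For two independent Youden’s indices, the variance of the log of their ratio is asymptotically
$$Var\{\log(\hat{J}_1/\hat{J}_2)\} = \frac{Var(\hat{J}_1)}{J_1^2} + \frac{Var(\hat{J}_2)}{J_2^2}$$
and asymptotically $\log(\hat{J}_1/\hat{J}_2) \sim N\left(\log(J_1/J_2),\, Var\{\log(\hat{J}_1/\hat{J}_2)\}\right)$.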
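A test based on the log-ratio of two Youden's indices can be sketched as below. It assumes two independent 2x2 tables with cell probabilities ordered (a, b, c, d) = (P(D+,M+), P(D−,M+), P(D+,M−), P(D−,M−)), positive Youden's indices, and a two-sided normal reference distribution. The tables are hypothetical; only the sample size of 4589 is taken from the text.

```python
import math


def youden(a, b, c, d):
    """Youden's index: sensitivity + specificity - 1, from joint probabilities."""
    return a / (a + c) + d / (b + d) - 1.0


def youden_variance(a, b, c, d, n):
    """Delta-method variance of the estimated Youden's index from n observations."""
    return (a * c / (a + c) ** 3 + b * d / (b + d) ** 3) / n


def ratio_test(table1, n1, table2, n2):
    """Two-sided z-test of J1 = J2 on the log-ratio scale (requires J1, J2 > 0)."""
    j1, j2 = youden(*table1), youden(*table2)
    var_log_ratio = (youden_variance(*table1, n1) / j1 ** 2
                     + youden_variance(*table2, n2) / j2 ** 2)
    z = math.log(j1 / j2) / math.sqrt(var_log_ratio)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal p-value
    return z, p_value


if __name__ == "__main__":
    # Hypothetical tables sharing a prevalence of 2.3%.
    low_cut = (0.020, 0.300, 0.003, 0.677)
    high_cut = (0.012, 0.050, 0.011, 0.927)
    print(ratio_test(low_cut, 4589, high_cut, 4589))
```

Because prevalence cancels in the ratio, the statistic depends only on sensitivity and specificity, which is the source of the power gain described for comparisons within one population.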
The p-values comparing MRSs in sections 5.1 and 5.2 of the main paper are based on the ratio of Youden’s indices. For Ashkenazi-Jews, comparing the MRSs at a 0.78% threshold vs. 10%, the p-value based on the difference of MRSs is 0.0703, but that based on the ratio of Youden’s indices is 0.0392 (as reported in the main paper section 5.1). For comparing the 0.78% threshold vs. 30%, the p-value based on the difference of MRSs is 0.0035, but that based on the ratio of Youden’s indices is 0.00054 (as reported in the main paper section 5.2). In each situation, the smaller p-value from using the ratio of Youden’s indices reflects the gain in statistical power from removing the nuisance parameter, disease prevalence $\pi$.
Appendix E Figure 1: Example cancer family history and family tree
Appendix F Fixing the ROC curve to account for disease prevalence: The frequency-scaled ROC curve and its relationship to AUC and MRS
The ROC plots vs. . The frequency-scaled ROC (fROC) plots vs. (Hilden, 1991). Unlike the square ROC, fROC accounts for prevalence and is a by rectangle. The diagonal is the uninformative fROC curve, the area under which is . The area under the fROC curve for a test, which is a single point with lines extending to the bottom-left and top-right corners, can be shown to be . The ratio of the area under the fROC to the chance area for random selection, equals the AUC:
The difference between the area under the fROC and the chance area equals :
A high ratio (AUC) might conceal a small difference (MRS), which is apt to be the case for uncommon diseases.
- Bura and Gastwirth (2001) Bura, E. and J. L. Gastwirth (2001). The binary regression quantile plot: Assessing the importance of predictors in binary regression visually. Biometrical Journal 43(1), 5–21.
- Hilden (1991) Hilden, J. (1991). The area under the ROC curve and its competitors. Medical Decis Making 11, 95–101.