Connecting population-level AUC and latent scale-invariant R^2 via Semiparametric Gaussian Copula and rank correlations

10/31/2019
by Debangan Dey, et al.

Area Under the Curve (AUC) is arguably the most popular measure of classification accuracy. We use a semiparametric framework to introduce a latent scale-invariant R^2, a novel measure of variation explained for an observed binary outcome and an observed continuous predictor, and then directly link the latent R^2 to AUC. This enables mutually consistent, simultaneous use of AUC as a measure of classification accuracy and of the latent R^2 as a scale-invariant measure of explained variation. Specifically, we employ a Semiparametric Gaussian Copula (SGC) to model the joint dependence between the observed binary outcome and the observed continuous predictor via the correlation of latent standard normal random variables. Under SGC, we show how both the population-level AUC and the latent scale-invariant R^2, defined as the squared latent correlation, can be estimated using any of the four rank statistics calculated on binary-continuous pairs: the Wilcoxon rank-sum statistic and the Kendall's Tau, Spearman's Rho, and Quadrant rank correlations. We then focus on three implications and applications: i) we explicitly show that, under SGC, the population-level AUC and the population-level latent R^2 are related via a monotone function that depends on the population-level prevalence rate; ii) we propose the Quadrant rank correlation as a robust semiparametric version of AUC; iii) we demonstrate how, under complex-survey designs, the Wilcoxon rank-sum statistic and the Spearman and Quadrant rank correlations provide asymptotically consistent estimators of the population-level AUC using only single-participant survey weights. We illustrate these applications using the binary outcome of five-year mortality and continuous predictors including Albumin, Systolic Blood Pressure, and accelerometry-derived measures of total volume of physical activity collected in the 2003-2006 National Health and Nutrition Examination Survey (NHANES) cohorts.
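
To make the setup in the abstract concrete, the following is a minimal simulation sketch, not the authors' estimator. It assumes a latent bivariate standard normal pair with correlation `latent_r`, thresholds the first component to obtain the binary outcome, and passes the second component through an arbitrary monotone transform (here `exp`) to obtain the observed predictor. The function names `simulate_sgc` and `auc_rank_sum`, the threshold, and all parameter values are hypothetical choices for illustration. The empirical AUC is computed as the normalized Wilcoxon rank-sum (Mann-Whitney U) statistic, so it depends on the predictor only through its ranks.

```python
import math
import random


def auc_rank_sum(x, y):
    """Empirical AUC as the normalized Wilcoxon rank-sum (Mann-Whitney U)."""
    n1 = sum(y)              # number of cases (y == 1)
    n0 = len(y) - n1         # number of controls (y == 0)
    # Assign 1-based midranks to x, averaging ranks within tied groups.
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    rank_sum_cases = sum(r for r, yi in zip(ranks, y) if yi == 1)
    u = rank_sum_cases - n1 * (n1 + 1) / 2
    return u / (n1 * n0)


def simulate_sgc(n, latent_r, threshold, seed=0):
    """Draw (x, y) from an assumed latent Gaussian model (illustrative only)."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # z2 is standard normal with corr(z1, z2) = latent_r.
        z2 = latent_r * z1 + math.sqrt(1.0 - latent_r ** 2) * noise
        y.append(1 if z1 > threshold else 0)   # observed binary outcome
        x.append(math.exp(z2))                 # observed predictor (monotone transform)
    return x, y


latent_r = 0.5                      # assumed latent correlation
x, y = simulate_sgc(n=5000, latent_r=latent_r, threshold=1.0)
auc = auc_rank_sum(x, y)
auc_log = auc_rank_sum([math.log(v) for v in x], y)  # undo the transform
print("latent R^2:", latent_r ** 2)
print("empirical AUC:", auc)
print("AUC after monotone re-transform:", auc_log)
```

Because the statistic is rank-based, re-transforming the predictor with another increasing function leaves the empirical AUC essentially unchanged, which is the scale invariance that motivates pairing AUC with a latent, scale-invariant R^2.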


