Predicting with Proxies

12/28/2018
by Hamsa Bastani, et al.

Predictive analytics is increasingly used to guide decision-making in many applications. However, in practice, we often have limited data on the true predictive task of interest, but copious data on a closely related proxy predictive task. Practitioners often train predictive models on proxies because doing so yields more accurate predictions. For example, e-commerce platforms use abundant customer click data (proxy) to make product recommendations rather than the relatively sparse customer purchase data (true outcome of interest); alternatively, hospitals often rely on medical risk scores trained on a different patient population (proxy) rather than their own patient population (true cohort of interest) to assign interventions. However, not accounting for the bias in the proxy can lead to sub-optimal decisions. Using real datasets, we find that this bias can often be captured by a sparse function of the features. Thus, we propose a novel two-step estimator that uses techniques from high-dimensional statistics to efficiently combine a large amount of proxy data and a small amount of true data. We prove upper bounds on the error of our proposed estimator and lower bounds on the error of several heuristics commonly used by data scientists; in particular, our proposed estimator can achieve the same accuracy with exponentially less true data (in the number of features d). Our proof relies on a new tail inequality on the convergence of LASSO for approximately sparse vectors. Finally, we demonstrate the effectiveness of our approach on e-commerce and healthcare datasets; in both cases, we achieve significantly better predictive accuracy and gain managerial insights into the nature of the bias in the proxy data.
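The abstract does not spell out the two steps, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a linear model, fits ordinary least squares on the large proxy sample, and then fits an L1-penalized regression on the small true sample to estimate a sparse bias term. The synthetic data, the function names (`lasso_ista`, `two_step_estimate`), the simple proximal-gradient (ISTA) solver, and the penalty value `lam` are all illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    # Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by proximal gradient (ISTA).
    n, d = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    b = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b

def two_step_estimate(X_proxy, y_proxy, X_true, y_true, lam):
    # Step 1 (assumed): least squares on the abundant proxy data.
    beta_proxy, *_ = np.linalg.lstsq(X_proxy, y_proxy, rcond=None)
    # Step 2 (assumed): LASSO on the true-data residuals to estimate a sparse bias.
    resid = y_true - X_true @ beta_proxy
    delta = lasso_ista(X_true, resid, lam)
    return beta_proxy + delta, beta_proxy, delta

# Synthetic illustration (all values hypothetical).
rng = np.random.default_rng(0)
d, n_proxy, n_true, sigma = 50, 5000, 40, 0.5

beta_proxy_star = rng.normal(size=d)
delta_star = np.zeros(d)
delta_star[:2] = 2.0                      # bias affects only two features
beta_true_star = beta_proxy_star + delta_star

X_proxy = rng.normal(size=(n_proxy, d))
y_proxy = X_proxy @ beta_proxy_star + sigma * rng.normal(size=n_proxy)
X_true = rng.normal(size=(n_true, d))
y_true = X_true @ beta_true_star + sigma * rng.normal(size=n_true)

lam = sigma * np.sqrt(2 * np.log(d) / n_true)  # a common rule-of-thumb scale
beta_hat, beta_proxy_hat, delta_hat = two_step_estimate(
    X_proxy, y_proxy, X_true, y_true, lam
)

err_proxy = np.linalg.norm(beta_proxy_hat - beta_true_star)  # proxy-only estimate
err_joint = np.linalg.norm(beta_hat - beta_true_star)        # two-step estimate
print(f"proxy-only error: {err_proxy:.3f}, two-step error: {err_joint:.3f}")
```

In this toy setup the true sample has fewer observations than features, so the sparse correction is the only information the true data needs to supply. In practice `lam` would be tuned, for example by cross-validation on the true data.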


