Removing the influence of a group variable in high-dimensional predictive modelling

10/18/2018
by Emanuele Aliverti, et al.

Predictive modelling relies on the assumption that observations used for training are representative of the data that will be encountered in future samples. In a variety of applications, this assumption is severely violated, since observational training data are often collected under sampling processes which are systematically biased with respect to group membership. Without explicit adjustment, machine learning algorithms can produce predictions that generalize poorly, with performance that varies widely by group. We propose a method to pre-process the training data, producing an adjusted dataset that is independent of the group variable with minimum information loss. We develop a conceptually simple approach for creating such a set of features in high-dimensional settings, based on a constrained form of principal components analysis. The resulting dataset can then be used in any predictive algorithm with the guarantee that predictions will be independent of the group variable. We develop a scalable algorithm for implementing the method, along with theoretical support in the form of independence guarantees and optimality. The method is illustrated on simulated examples and applied to two real examples: removing machine-specific correlations from brain scan data, and removing race and ethnicity information from a dataset used to predict recidivism.
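To make the idea of a group-adjusted, low-rank dataset concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the authors' algorithm: it regresses the group variable out of every feature and then keeps the leading principal components of the residuals, which gives a low-rank matrix whose columns are linearly orthogonal to the group variable. The function name `remove_group_influence`, its parameters, and the simulated data are all hypothetical. Linear orthogonality is also a weaker property than the statistical independence described in the abstract.

```python
import numpy as np

def remove_group_influence(X, z, rank):
    """Illustrative sketch (not the paper's algorithm): return a rank-`rank`
    approximation of X whose columns are orthogonal to the group variable z."""
    X = np.asarray(X, dtype=float)
    # Design matrix with an intercept and the group variable.
    Z = np.column_stack([np.ones(len(z)), np.asarray(z, dtype=float)])
    # Regress every feature on Z and keep the residuals, which are
    # orthogonal to both the intercept and z (so columns are also centered).
    coef, *_ = np.linalg.lstsq(Z, X, rcond=None)
    X_res = X - Z @ coef
    # Truncated SVD of the residuals: the best rank-`rank` approximation
    # among matrices whose columns are orthogonal to Z.
    U, s, Vt = np.linalg.svd(X_res, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

# Simulated data (hypothetical): a binary group label shifts every feature.
rng = np.random.default_rng(0)
n, p = 200, 50
z = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p)) + 2.0 * z[:, None]

X_adj = remove_group_influence(X, z, rank=5)

# Largest absolute cross-product between centered z and any adjusted feature;
# this should be zero up to floating-point error.
max_abs_cov = np.abs((z - z.mean()) @ X_adj).max()
```

The adjusted matrix `X_adj` can then be passed to any downstream predictor in place of `X`. In this sketch only linear dependence on the group variable is removed, so a nonlinear model could in principle still recover group information from higher-order structure.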

research
10/25/2016

A statistical framework for fair predictive algorithms

Predictive modeling is increasingly being employed to assist human decis...
research
02/06/2023

Bitrate-Constrained DRO: Beyond Worst Case Robustness To Unknown Group Shifts

Training machine learning models robust to distribution shifts is critic...
research
01/10/2022

Towards Group Robustness in the presence of Partial Group Labels

Learning invariant representations is an important requirement when trai...
research
08/26/2021

Machine Unlearning of Features and Labels

Removing information from a machine learning model is a non-trivial task...
research
05/19/2022

Dataset Pruning: Reducing Training Data by Examining Generalization Influence

The great success of deep learning heavily relies on increasingly larger...
research
09/11/2020

DART: Data Addition and Removal Trees

How can we update data for a machine learning model after it has already...
research
12/02/2021

Learning Optimal Predictive Checklists

Checklists are simple decision aids that are often used to promote safet...
