Representative Fair Synthetic Data

04/07/2021
by Paul Tiwald et al.

Algorithms learn rules and associations from the training data they are exposed to. Yet the very same data that teaches machines to understand and predict the world contains societal and historic biases, resulting in biased algorithms that risk further amplifying those biases once deployed for decision support. Synthetic data, on the other hand, comes with the promise of providing an unlimited amount of representative, realistic training samples that can be shared further without disclosing the privacy of individual subjects. We present a framework for incorporating fairness constraints into the self-supervised learning process, which then allows simulating an unlimited amount of representative as well as fair synthetic data. This framework provides a handle to govern and control for privacy as well as for bias within AI at its very source: the training data. We demonstrate the proposed approach by amending an existing generative model architecture and generating a representative as well as fair version of the UCI Adult census data set. While the relationships between attributes are faithfully retained, the gender and racial biases inherent in the original data are controlled for. This is further validated by comparing propensity scores of downstream predictive models trained on the original data versus the fair synthetic data. We consider representative fair synthetic data a promising future building block for teaching algorithms not about the historic world, but about the world we strive to live in.
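
The abstract does not spell out how the fairness constraint enters the training objective or how the downstream comparison is set up, so the following is only a minimal sketch of the two ingredients it mentions: a demographic-parity-style quantity that a generator could be penalized on, and the propensity-score check on downstream models. The column names follow the UCI Adult data set; the function names, the logistic-regression downstream model, and the variables in the usage comment (adult_original, adult_fair_synthetic, holdout, features) are illustrative assumptions, not the authors' implementation.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def demographic_parity_gap(df: pd.DataFrame, target: str = "income",
                           group: str = "sex") -> float:
    """Absolute gap in positive-outcome rates between groups.

    One plausible fairness quantity a generator could be penalized on
    during training; `target` is assumed to be binary-encoded
    (e.g. income > 50K mapped to 1).
    """
    rates = df.groupby(group)[target].mean()
    return float(rates.max() - rates.min())


def propensity_scores(train_df: pd.DataFrame, eval_df: pd.DataFrame,
                      features: list, target: str = "income"):
    """Train a downstream classifier on `train_df` (original or fair
    synthetic data) and return its predicted probabilities on a shared
    evaluation set, so the two propensity distributions can be compared."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_df[features], train_df[target])
    return clf.predict_proba(eval_df[features])[:, 1]


# Hypothetical usage: compare per-group propensities of downstream models
# trained on the original vs. the fair synthetic Adult data.
# p_orig = propensity_scores(adult_original, holdout, features)
# p_fair = propensity_scores(adult_fair_synthetic, holdout, features)
```
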

Related research

DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks (10/25/2021)
Machine learning models have been criticized for reflecting unfair biase...

Rule-adhering synthetic data – the lingua franca of learning (09/12/2022)
AI-generated synthetic data allows to distill the general patterns of ex...

PreFair: Privately Generating Justifiably Fair Synthetic Data (12/20/2022)
When a database is protected by Differential Privacy (DP), its usability...

FFPDG: Fast, Fair and Private Data Generation (06/30/2023)
Generative modeling has been used frequently in synthetic data generatio...

FairGen: Fair Synthetic Data Generation (10/24/2022)
With the rising adoption of Machine Learning across the domains like ban...

Looking Beyond Appearances: Synthetic Training Data for Deep CNNs in Re-identification (01/11/2017)
Re-identification is generally carried out by encoding the appearance of...

Fair GANs through model rebalancing with synthetic data (08/16/2023)
Deep generative models require large amounts of training data. This ofte...
