Fair Generative Modeling via Weak Supervision

10/26/2019 · Aditya Grover, et al.

Real-world datasets are often biased with respect to key demographic factors such as race and gender. Due to the latent nature of the underlying factors, detecting and mitigating bias is especially challenging for unsupervised machine learning. We present a weakly supervised algorithm for overcoming dataset bias for deep generative models. Our approach requires access to an additional small, unlabeled but unbiased dataset as the supervision signal, thus sidestepping the need for explicit labels on the underlying bias factors. Using this supplementary dataset, we detect the bias in existing datasets via a density ratio technique and learn generative models which achieve the twin goals of: 1) data efficiency, by using training examples from both the biased and unbiased datasets for learning; and 2) unbiased data generation at test time. Empirically, we demonstrate the efficacy of our approach, which reduces bias w.r.t. latent factors by 57.1% on average over baselines for comparable image generation using generative adversarial networks.


1 Introduction

Increasingly, many applications of machine learning (ML) involve data generation. Examples of such production-level systems include WaveNet for text-to-speech synthesis [1], generative adversarial networks (GANs) for dental restoration [2], and a large number of creative applications such as Coconet, used for designing the “first AI-powered Google Doodle” [3]. As these generative applications become more prevalent, it becomes increasingly important to consider questions regarding the potentially discriminatory nature of such systems and ways to mitigate it [4].

A variety of socio-technical factors contribute to the discriminatory nature of ML systems [5]. A major factor is the existence of biases in the training data itself [6, 7]. Since data is the fuel of ML, any existing bias in the dataset can be propagated to the learned model [8]. This is a particularly pressing concern for generative models, which can easily amplify the bias by generating more of the biased data at test time. Further, learning a generative model is fundamentally an unsupervised learning problem and hence the bias factors of interest are typically latent. For example, while learning a generative model of human faces, we often do not have access to attributes such as gender, race, and age. Any existing bias in the dataset with respect to these attributes is easily picked up by deep generative models. See Figure 1 for an illustration.

Figure 1: Samples from a baseline WGAN that reflect the gender bias underlying the true data distribution in CelebA. All faces above the orange line (68%) are classified as female, while the rest are labeled as male (32%).

In this work, we present a weakly-supervised approach to learning fair generative models in the presence of dataset bias. Our source of weak supervision is motivated by the observation that obtaining relatively small unbiased datasets is feasible in practice; e.g., survey data collected by organizations such as the World Bank typically follows several good dataset collection practices [9] to ensure representativeness before release. However, collecting such unbiased datasets can be expensive and hard to scale. In contrast, obtaining large unlabelled (but biased) datasets is relatively cheap for many domains in the big data era. As another example, biotech firms invest monetary and infrastructure resources to gather representative data from different demographics for personal genomics, but admit limitations in scaling, as the majority of their existing data belongs to a highly biased sample of the world population [10, 11]. Note that neither of our datasets needs to be labeled w.r.t. the latent bias attributes, and the size of the unbiased dataset can be much smaller than that of the biased dataset. Hence, the supervision we require is weak.

Using an unbiased dataset to augment a biased dataset, our goal is to learn a generative model that best approximates the desired, unbiased data distribution. Simply using the unbiased dataset alone for learning is an option, but this may not suffice since this dataset can be too small to learn an expressive model that accurately captures the underlying unbiased data distribution. Our approach to learning a fair generative model that is robust to biases in the larger training set is based on importance reweighting. In particular, we learn a generative model which reweighs the data points in the biased dataset based on the ratio of densities assigned by the biased data distribution as compared to the target unbiased data distribution. Since we do not have access to the explicit densities assigned by either of the two distributions, we estimate the importance weights using a probabilistic classifier [12, 13].

We test our weakly-supervised approach by learning generative adversarial networks on the CelebA dataset [14]. The dataset consists of attributes such as gender and hair color, which we use for designing biased and unbiased data splits and for subsequent evaluation. Our empirical results successfully demonstrate how the reweighting approach can offset dataset bias across a wide range of settings. In particular, we obtain improvements of 74.9% and 22.6% on average over baselines in reducing the bias with respect to the latent factors, in the single-attribute and multi-attribute dataset bias settings respectively, for comparable sample quality.

2 Problem Setup

2.1 Background

We assume there exists a true (unknown) data distribution p_data over a set of observed variables x. In generative modeling, our goal is to learn the parameters θ of a distribution p_θ over the observed variables x, such that the model distribution p_θ is close to p_data. Depending on the choice of learning algorithm, different approaches have been previously considered. Broadly, these include adversarial training (e.g., GANs [15]) and maximum likelihood estimation (MLE) (e.g., variational autoencoders [16, 17] and normalizing flows [18]). Our bias mitigation framework is agnostic to the above training approaches. For generality, we consider expectation-based learning objectives:

min_θ E_{x ∼ p_data}[ℓ(x, θ)]    (1)

where ℓ(x, θ) is a suitable per-example loss that depends on both the examples x drawn from a dataset and the model parameters θ. The above expression encompasses a broad class of MLE and adversarial objectives. For example, if ℓ(x, θ) = −log p_θ(x) denotes the negative log-likelihood assigned to the point x as per p_θ, then we recover the MLE training objective.

2.2 Dataset Bias

The standard assumption for learning a generative model is that we have access to a sufficiently large dataset D_ref of training examples, where each x ∈ D_ref is assumed to be sampled independently from a target unbiased distribution p_data. In practice, however, collecting large datasets that are i.i.d. w.r.t. p_data is difficult due to a variety of socio-technical factors. Even more, the sample complexity for learning high-dimensional distributions can be doubly exponential in the dimensions in many cases [19], surpassing the size of the largest available datasets.

We can partially offset this difficulty by considering data from alternate sources related to the target distribution, e.g., images scraped from the Internet. However, these additional datapoints are not expected to be i.i.d. w.r.t. p_data. We characterize this phenomenon as dataset bias, where we assume the availability of a dataset D_bias, such that the examples x ∈ D_bias are sampled independently from a biased (unknown) distribution p_bias that is different from p_data, but shares the same support.

2.3 Evaluation

Evaluating generative models and fairness in machine learning are both open areas of research in themselves. Our work lies at the intersection of these two fields, and we propose the following metrics for measuring bias mitigation for data generation.

  1. Sample Quality: We employ sample quality metrics, e.g., Fréchet Inception Distance (FID) [20] and Kernel Inception Distance (KID) [21]. These metrics match empirical expectations w.r.t. a reference data distribution p_data and a model distribution p_θ in a predefined feature space, e.g., the prefinal layer of activations of the Inception Network [22]. The lower the score, the better the learned model's ability to approximate p_data. For the fairness context in particular, we are interested in measuring the discrepancy w.r.t. p_data even if the model has been trained using both D_ref and D_bias.

  2. Fairness: Alternatively, we can evaluate the bias of generative models specifically in the context of some sensitive latent variables, say u. For example, u may correspond to the age and gender of an individual depicted via an image x. We emphasize that such attributes are unknown during training and used only for evaluation at test time.

    If we have access to a highly accurate predictor p(u|x) for the distribution of the sensitive attributes u conditioned on the observed variables x, we can evaluate the extent of bias mitigation via the discrepancies in the expected marginal likelihoods of u as per p_data and p_θ. Formally, we define the fairness discrepancy f for a generative model p_θ w.r.t. p_data and sensitive attributes u as:

    f(p_data, p_θ) = ‖ E_{x∼p_data}[p(u|x)] − E_{x∼p_θ}[p(u|x)] ‖_2    (2)

    In practice, the expectations in Eq. (2) can be computed via Monte Carlo averaging (see the sketch below). Again, the lower the discrepancy between the above two expectations, the better the learned model's ability to mitigate dataset bias w.r.t. the sensitive attributes u.
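As an illustration of Eq. (2), the sketch below estimates the fairness discrepancy by Monte Carlo averaging. The attribute classifier interface (attr_clf returning p(u|x) as a probability vector) and the sample tensors are hypothetical placeholders, not part of the authors' released code.

import torch

def fairness_discrepancy(attr_clf, ref_samples, model_samples):
    """Monte Carlo estimate of f(p_data, p_theta) = ||E_p_data[p(u|x)] - E_p_theta[p(u|x)]||_2."""
    with torch.no_grad():
        p_ref = attr_clf(ref_samples).mean(dim=0)      # expected attribute marginal under p_data
        p_model = attr_clf(model_samples).mean(dim=0)  # expected attribute marginal under p_theta
    return torch.norm(p_ref - p_model, p=2).item()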

3 Bias Mitigation

Recall that we assume a learning setting where we are given access to a data source D_bias in addition to a dataset of training examples D_ref. Our goal in this work is to capitalize on both data sources, D_ref and D_bias, for learning a model p_θ that best approximates the unbiased target distribution p_data.

3.1 Baselines

We begin by discussing two baseline approaches at the extreme ends of the spectrum. First, one could completely ignore D_bias and consider learning based on D_ref alone. Since we only consider proper losses w.r.t. p_data, global optimization of the objective in Eq. (1) in a well-specified model family will recover the true data distribution as |D_ref| → ∞. However, since D_ref is finite in practice, this is likely to give poor sample quality even though the fairness discrepancy would be low.

On the other extreme, we can consider learning based on the full dataset consisting of both D_ref and D_bias. This procedure will be data efficient and could lead to high sample quality, but it comes at the cost of fairness since the learned distribution will inherit the bias of p_bias.

3.2 Solution 1: Conditional Modeling

Our first proposal is to learn a generative model conditioned on the identity of the dataset used during training. Formally, we learn a generative model p_θ(x|y), where y is a binary random variable indicating whether the model distribution was learned to approximate the data distribution corresponding to D_ref (i.e., y = 0) or D_bias (i.e., y = 1). By sharing model parameters θ across the two values of y, we hope to leverage both data sources. At test time, conditioning on y = 0 should result in fair generations.
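As a concrete (and purely illustrative) sketch of this baseline, one way to condition a generator on the dataset identity y is to embed the binary label and concatenate it with the noise vector; the embedding scheme and module names below are our assumptions, not details prescribed by the paper.

import torch
import torch.nn as nn

class DatasetConditionedGenerator(nn.Module):
    """Wraps a base generator so that samples are conditioned on the dataset identity y."""

    def __init__(self, base_generator, noise_dim, embed_dim=16):
        super().__init__()
        self.embed = nn.Embedding(2, embed_dim)  # y = 0: D_ref, y = 1: D_bias
        self.base = base_generator               # maps a (noise_dim + embed_dim) vector to an image
        self.noise_dim = noise_dim

    def forward(self, z, y):
        # concatenate noise with the dataset-identity embedding
        return self.base(torch.cat([z, self.embed(y)], dim=1))

# At test time, condition on y = 0 (the unbiased dataset) to draw samples:
#   z = torch.randn(batch_size, noise_dim)
#   x = gen(z, torch.zeros(batch_size, dtype=torch.long))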

As we demonstrate in Section 4, however, this simple approach does not achieve the intended effect in practice. The likely cause is that the conditioning information is too weak for the model to infer the bias factors and effectively distinguish between the two distributions. Next, we present an alternate two-phased approach based on density ratio estimation which effectively overcomes the dataset bias in a data-efficient manner.

3.3 Solution 2: Importance Reweighting

Recall one of the trivial baselines in Section 3.1, which learns a generative model on the union of D_ref and D_bias. This method is problematic because, in Eq. (1), it assigns equal weight to the loss contribution from each individual datapoint regardless of whether the datapoint comes from D_ref or D_bias. For example, in situations where the dataset bias causes a minority group to be underrepresented, this objective will encourage the model to focus on the majority group such that the overall loss is minimized on average with respect to a biased empirical distribution, i.e., a weighted mixture of the empirical distributions over D_ref and D_bias with weights proportional to |D_ref| and |D_bias|.

Our key idea is to reweight the datapoints from D_bias during training such that the model learns to downweight the over-represented datapoints from D_bias while simultaneously upweighting the under-represented ones. The challenge in the unsupervised context is that we do not have direct supervision on which points are over- or under-represented and by how much. To resolve this issue, we consider importance sampling [23]. Whenever we are given data from two distributions, w.l.o.g. say p and q, and wish to evaluate a sample average w.r.t. p given samples from q, we can do so by reweighting the samples from q by the ratio of densities assigned to the sampled points by p and q. In our setting, the distributions of interest are p_data and p_bias respectively. Hence, an importance weighted objective for learning from D_bias is:

E_{x∼p_data}[ℓ(x, θ)] = E_{x∼p_bias}[ (p_data(x) / p_bias(x)) ℓ(x, θ) ]    (3)
                      = E_{x∼p_bias}[ w(x) ℓ(x, θ) ]    (4)

where w(x) := p_data(x) / p_bias(x) is defined to be the importance weight for x ∼ p_bias.

Estimating density ratios via binary classification.

To estimate the importance weights, we use a binary classifier as described below [12].

Consider a binary classification problem with classes Y ∈ {0, 1} and training data generated as follows. First, we fix a prior probability p(Y = 1). Then, we repeatedly sample y ∼ p(Y). If y = 1, we independently sample a datapoint x ∼ p_data; else we sample x ∼ p_bias. Then, as shown in [24], the ratio of densities assigned by p_data and p_bias to an arbitrary point x can be recovered via a Bayes optimal (probabilistic) classifier c*(Y | x) as:

w(x) = p_data(x) / p_bias(x) = γ · c*(Y = 1 | x) / (1 − c*(Y = 1 | x))    (5)

where c*(Y = 1 | x) is the probability assigned by the classifier to the point x belonging to class Y = 1. Here, γ = p(Y = 0) / p(Y = 1) is the ratio of the label marginals for the two classes. In practice, we do not have direct access to either p_data or p_bias, and hence our training data consists of points sampled from the empirical data distributions defined uniformly over D_ref and D_bias. Further, we may not be able to learn a Bayes optimal classifier, so we denote the importance weights estimated by the learned classifier c for a point x as ŵ(x).
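The following sketch converts the probabilities of a learned (and ideally calibrated) binary classifier into importance weight estimates ŵ(x) via Eq. (5); the classifier interface, the clamping range, and the choice of γ are assumptions made here for illustration.

import torch

def estimate_importance_weights(clf, x_bias, gamma):
    """Estimate w_hat(x) = gamma * c(Y=1|x) / (1 - c(Y=1|x)) for points from D_bias.

    clf(x) returns c(Y=1|x), the probability that x came from D_ref.
    gamma is the label-marginal ratio p(Y=0)/p(Y=1); e.g. gamma = 1 if the
    classifier was trained on class-balanced minibatches.
    """
    with torch.no_grad():
        p_ref = clf(x_bias).clamp(1e-6, 1 - 1e-6)  # avoid division by zero
    return gamma * p_ref / (1.0 - p_ref)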

Input: D_ref, D_bias, classifier and generative model architectures & hyperparameters
Output: Generative model parameters θ

1: Phase 1: Estimate importance weights
2: Learn binary classifier c for distinguishing D_ref vs. D_bias
3: Estimate importance weight ŵ(x) for all x ∈ D_bias (using Eq. (5))
4: Set importance weight ŵ(x) = 1 for all x ∈ D_ref
5: Phase 2: Minibatch gradient descent on θ based on weighted loss
6: Initialize model parameters θ at random
7: Set full dataset D = D_ref ∪ D_bias
8: while training do
9:     Sample a batch of points B from D at random
10:    Set minibatch loss L(θ) = Σ_{x ∈ B} ŵ(x) ℓ(x, θ)
11:    Estimate gradients ∇_θ L(θ) and update parameters θ based on the optimizer update rule
12: end while
13: return θ

Algorithm 1: Learning Fair Generative Models

Our overall procedure is summarized in Algorithm 1. We use deep neural networks for parameterizing the binary classifier and the generative model. Given a biased and an unbiased dataset along with the network architectures and other standard hyperparameters (e.g., learning rate, optimizer, etc.), we first learn a probabilistic binary classifier c (Line 2). The learned classifier provides importance weights for the datapoints from D_bias via estimates of the density ratios (Line 3). For the datapoints from D_ref, we do not need to perform any reweighting and simply set the importance weights to 1 (Line 4). Using the combined dataset D_ref ∪ D_bias, we then learn the generative model p_θ, where the minibatch loss for every gradient update weights the contributions from each datapoint (Lines 6-12).

For a practical implementation, it is best to account for some diagnostics and best practices while executing Algorithm 1. For density ratio estimation, we test that the classifier is calibrated on a held-out set. This is a necessary (but insufficient) check for the estimated density ratios to be meaningful. If the classifier is miscalibrated, we can apply standard recalibration techniques such as Platt scaling before estimating the importance weights. Furthermore, while optimizing the model using a weighted objective, there can be increased variance across the loss contributions from individual examples in a minibatch due to importance weighting. We did not observe this in our experiments, but techniques such as normalization of weights within a batch can potentially help control the unintended variance introduced within a batch [12].
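A minimal sketch of Phase 2 of Algorithm 1 is shown below, using a generic per-example loss (here a negative log-likelihood, as one instantiation of Eq. (1)) together with the optional per-batch weight normalization mentioned above; the model interface (model.nll) is hypothetical.

import torch

def train_weighted(model, optimizer, data, weights, batch_size=64,
                   num_steps=10000, normalize_weights=False):
    """Minibatch gradient descent on the importance-weighted loss (Algorithm 1, Phase 2).

    data:    tensor of examples from D = D_ref ∪ D_bias
    weights: w_hat(x) for each example (1 for x in D_ref, Eq. (5) estimates for x in D_bias)
    model.nll(x) is assumed to return a per-example negative log-likelihood of shape (batch,).
    """
    n = data.shape[0]
    for _ in range(num_steps):
        idx = torch.randint(0, n, (batch_size,))
        x, w = data[idx], weights[idx]
        if normalize_weights:
            w = w * batch_size / w.sum()        # optional variance-control heuristic
        loss = (w * model.nll(x)).mean()        # weighted per-example loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()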

Figure 2: Distribution of importance weights for different latent subgroups, for (a) single attribute, bias=0.9; (b) single attribute, bias=0.8; (c) multi-attribute. On average, the underrepresented subgroups are upweighted while the overrepresented subgroups are downweighted.

Theoretical Analysis. The performance of Algorithm 1 critically depends on the quality of the estimated density ratios, which in turn is dictated by the training of the binary classifier. We define the expected negative cross-entropy (NCE) objective for a classifier c as:

NCE(c) := E_{y∼p(Y)} E_{x∼p(x|y)}[ log c(y | x) ]
        = (1 / (1 + γ)) E_{x∼p_data}[ log c(Y = 1 | x) ] + (γ / (1 + γ)) E_{x∼p_bias}[ log c(Y = 0 | x) ]    (6)

In the following result, we characterize the NCE loss for the Bayes optimal classifier.

Theorem 1.

Let u denote a set of unobserved bias variables. Suppose there exist two joint distributions p_data(x, u) and p_bias(x, u) over x and u. Let p_data(x) and p_data(u) denote the marginals over x and u for the joint p_data(x, u), with similar notation for the joint p_bias(x, u), such that

p_data(x | u) = p_bias(x | u) for all u    (7)

and the conditionals p(x | u), p(x | u') have disjoint supports for u ≠ u'. Then, the negative cross-entropy of the Bayes optimal classifier c* is given as:

NCE(c*) = (1 / (1 + γ)) Σ_u p_data(u) log( β_u / (β_u + γ) ) + (γ / (1 + γ)) Σ_u p_bias(u) log( γ / (β_u + γ) )    (8)

where β_u := p_data(u) / p_bias(u).

For example, as we shall see in our experiments in the following section, the inputs x can correspond to face images, whereas the unobserved u represents sensitive bias factors for a subgroup, such as gender, race, etc. The proportion of examples belonging to a subgroup u can differ across the biased and unbiased datasets, with the relative proportions given by β_u. Note that the above result only requires knowing these relative proportions and not the true u for each x. The practical implication is that, under the assumptions of Theorem 1, we can check the quality of the density ratios estimated by an arbitrary learned classifier by comparing its empirical NCE with the theoretical NCE of the Bayes optimal classifier in Eq. (8) (see Section 4.1).
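Since the Bayes optimal value in Eq. (8) depends only on γ and the subgroup proportions, it can be computed in a few lines. The sketch below returns the corresponding cross-entropy loss (the negative of Eq. (8)); the example proportions assume a classifier trained with class-balanced minibatches (γ = 1) and a 50/50 vs. 90/10 gender split.

import math

def bayes_optimal_ce(p_ref_u, p_bias_u, gamma=1.0):
    """Cross-entropy loss of the Bayes optimal classifier (negative of Eq. (8)).

    p_ref_u, p_bias_u: dicts mapping each subgroup u to its proportion in D_ref and D_bias.
    gamma: label-marginal ratio p(Y=0)/p(Y=1).
    """
    nce = 0.0
    for u in p_ref_u:
        beta = p_ref_u[u] / p_bias_u[u]  # beta_u = p_data(u) / p_bias(u)
        nce += p_ref_u[u] * math.log(beta / (beta + gamma)) / (1 + gamma)
        nce += gamma * p_bias_u[u] * math.log(gamma / (beta + gamma)) / (1 + gamma)
    return -nce

# Example: a single attribute with bias = 0.9 gives a loss of roughly 0.59.
# bayes_optimal_ce({"female": 0.5, "male": 0.5}, {"female": 0.9, "male": 0.1})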

4 Empirical Evaluation

In this section, we are interested in empirically investigating two broad questions:

  1. How well can we estimate density ratios for the proposed weak supervision setting?

  2. How effective is the reweighting technique for learning fair generative models on the fairness discrepancy metric proposed in Section 2.3?

We further demonstrate the usefulness of our generated data in downstream applications such as data augmentation for learning a fair classifier in Supplement D.1.

Dataset. We consider the CelebA [14] dataset, which is commonly used for benchmarking deep generative models and comprises images of faces with 40 labeled binary attributes. We use this attribute information to construct three different settings for partitioning the full dataset into D_ref and D_bias.

  • Setting 1 (single, bias=0.9): We set u to be a single bias variable corresponding to “gender” with values 0 (female) and 1 (male), and set the bias level to 0.9. Specifically, this means that D_ref contains the same fraction of male and female images, whereas D_bias contains 90% female and 10% male images.

  • Setting 2 (single, bias=0.8): We use the same bias variable (gender) as in Setting 1 with a bias level of 0.8.

  • Setting 3 (multi): We set u as two bias variables corresponding to “gender” and “black hair”. In total, we have 4 subgroups: females without black hair (00), males without black hair (01), females with black hair (10), and males with black hair (11). The subgroup proportions in D_bias are skewed relative to those in D_ref.

We emphasize that the attribute information is used only for designing controlled biased and unbiased datasets and faithful evaluation. Our algorithm does not explicitly require such labeled information.

Figure 3: Single-attribute dataset bias mitigation for bias=0.9. (a) Samples generated via importance reweighting; faces above the orange line are classified as female (49/100) and the rest as male. (b) Fairness discrepancy. (c) FID. Standard error in (b) and (c) is over 10 independent evaluation sets of 10,000 samples each drawn from the models. Lower fairness discrepancy and FID are better.

Models. We train two classifiers for our experiments: (1) the attribute (e.g., gender) classifier, which we use to assess the level of bias present in our final samples; and (2) the density ratio classifier. For both models, we use a variant of ResNet-18 [25] on the standard train and validation splits of CelebA. For the generative model, we use a Progressive GAN [26] trained to minimize the Wasserstein GAN [27, 28] objective with gradient penalty. Additional details regarding the architectural design and hyperparameters are provided in Supplement B.

4.1 Density Ratio Estimation via Classifier

For each of the three experimental settings, we can evaluate the quality of the estimated density ratios by comparing empirical estimates of the cross-entropy loss of the density ratio classifier with the cross-entropy loss of the Bayes optimal classifier derived in Eq. (8). We show the results in Table 1, where we find that the two losses are very close, suggesting that we obtain high-quality density ratio estimates that we can use for subsequently training fair generative models. In Supplement C, we show a more fine-grained analysis of the 0-1 accuracies and calibration of the learned models.

Model Bayes optimal Empirical
single, bias=0.9 0.591 0.606
single, bias=0.8 0.604 0.652
multi 0.611 0.666
Table 1: Comparison between the cross-entropy loss of the Bayes classifier and learned density ratio classifier.

In Figure 2, we show the distribution of the estimated importance weights for the various latent subgroups. We find that across all the considered settings, the underrepresented subgroups (e.g., males in Figures 2(a) and 2(b), females with black hair in Figure 2(c)) are upweighted on average (mean density ratio estimate > 1), while the overrepresented subgroups are downweighted on average (mean density ratio estimate < 1). Also, as expected, the density ratio estimates are closer to 1 when the bias is lower (compare Figures 2(a) and 2(b)).

Figure 4: Single-attribute dataset bias mitigation for bias=0.8. (a) Samples generated via importance reweighting; faces above the orange line are classified as female (50/100) and the rest as male. (b) Fairness discrepancy. (c) FID. Standard error in (b) and (c) is over 10 independent evaluation sets of 10,000 samples each drawn from the models. Lower discrepancy and FID are better.

4.2 Fair Data Generation

We compare our importance weighted approach against three baselines: (1) equi-weight: a WGAN-GP trained on the full dataset D_ref ∪ D_bias that weighs every point equally; (2) unbiased-only: a WGAN-GP trained only on the unbiased dataset D_ref; and (3) conditional: a conditional WGAN-GP where the conditioning label indicates whether a datapoint is from D_ref or D_bias. In all our experiments, the unbiased-only variant, which uses only the unbiased dataset for learning, failed to produce any recognizable samples. For a cleaner presentation of the results for the other methods, we therefore omit this baseline below and refer the reader to the supplementary material for further results.

To evaluate performance under a variety of settings, we also vary the size of the balanced dataset D_ref relative to the unbalanced dataset size |D_bias|: perc ∈ {0.1, 0.25, 0.5, 1.0}. Here, perc = 0.1 denotes |D_ref| = 10% of |D_bias|, and perc = 1.0 denotes |D_ref| = |D_bias|.

4.2.1 Single Attribute Splits

We train our gender classifier for evaluation on the entire CelebA training set, and achieve a level of 98% accuracy on the held-out set. For each experimental setting, we evaluate bias mitigation based on the fairness discrepancy metric (Eq. (2)) and also report sample quality based on FID [20].

For the bias = 0.9 split, we show the samples generated via imp-weight in Figure 3(a) and the resulting fairness discrepancies in Figure 3(b). Our framework generates samples that are of slightly lower quality than the equi-weight baseline samples shown in Figure 1, but it produces an almost identical proportion of samples across the two genders. Similar observations hold for bias = 0.8, as shown in Figure 4. While importance weighting outperforms all baselines on the fairness discrepancy metric (Figure 4(b)), the baselines do better in a couple of cases for FID at some values of perc (Figure 4(c)).

4.2.2 Multi-Attribute Split

We conduct a similar experiment with a multi-attribute split based on gender and the presence of black hair. The attribute classifier for the purpose of evaluation is now trained with a 4-way classification task instead of 2, and achieves an accuracy of roughly 88% on the test set. We refer the reader to Supplement D.2 for corresponding results and analysis.

5 Related Work

Fairness & generative modeling.

There is a rich body of work in fair ML, which focus on different notions of fairness (e.g. demographic parity, equality of odds and opportunity) and study methods by which models can perform tasks such as classification in a non-discriminatory way 

[5, 29, 30, 31]. Our focus is in the context of fair generative modeling. The vast majority of related work in this area is centered around fair and/or privacy preserving representation learning, which exploit tools from adversarial learning and information theory among others [32, 33, 34, 35, 36, 37]. A unifying principle among these methods is such that a discriminator is trained to perform poorly in predicting an outcome based on a protected attribute.  [38]

considers transfer learning of race and gender identities as a form of weak supervision for predicting other attributes on datasets of faces. While the end goal for the above works is classification, our focus is on data generation in the presence of dataset bias and we do not require explicit supervision for the protected attributes.

The most relevant prior works in the data generation setting are FairGAN [39] and FairnessGAN [40]. The goal of both methods is to generate fair datapoints and their corresponding labels as a preprocessing technique. This allows for learning a useful downstream classifier and obscures information about protected attributes. Again, these works are not directly comparable to ours as we do not assume explicit supervision regarding the protected attributes during training, and our goal is fair generation given unlabelled biased datasets where the bias factors are latent.

Reweighting. Reweighting datapoints is a common algorithmic technique for problems such as dataset bias and class imbalance [41]. It has often been used in the context of fair classification [42]; for example, [43] details reweighting as a way to remove discrimination without relabeling instances. For reinforcement learning, [44] used an importance sampling approach for selecting fair policies. There is also a body of work on fair clustering [45, 46, 47, 48], which ensures that clustering assignments are balanced with respect to some sensitive attribute.

Density ratio estimation using classifiers. The use of classifiers for estimating density ratios has a rich history across ML [12]. For deep generative modeling, density ratios estimated by classifiers have been used for expanding the class of various learning objectives [49, 13, 50], for evaluation metrics based on two-sample tests [51, 52, 53, 54, 55, 56, 57], and for improved Monte Carlo inference with these models [58, 59, 60, 61]. [58] use importance reweighting for alleviating model bias between p_data and p_θ. Closest to our work is the proposal of [62] to use importance reweighting for learning generative models where the training and test distributions differ, but where explicit importance weights are provided for at least a subset of the training examples. We consider a more realistic, weakly-supervised setting where we estimate the importance weights using a small unbiased dataset.

6 Discussion & Future Work

We considered the task of fair data generation given access to a (potentially small) unbiased dataset and a large biased dataset. For data efficient learning, we proposed an importance weighted objective that corrects bias by reweighting the biased datapoints. These weights are estimated by a binary classifier. Empirically, we showed that our technique outperforms baselines by 57.1% on average in reducing dataset bias on CelebA without incurring a significant reduction in sample quality.

We stress the need for caution in using our techniques and interpreting the empirical findings. For scaling our evaluation, we relied on a pretrained attribute classifier for inferring the bias in the generated data samples. The classifiers we considered are highly accurate w.r.t. the train/test splits, but can have blind spots especially when evaluated on generated data. For future work, we would like to investigate human evaluations on different bias-mitigation approaches as well [63].

Our work presents an initial foray into the field of fair data generation with weak supervision, highlighting several challenges which also serve as opportunities for future work. As a case in point, our work calls for rethinking sample quality metrics for generative models in the presence of dataset bias. On one hand, our approach increases the diversity of generated samples in the sense that the different subgroups are more balanced; at the same time, however, variation across other image features decreases because the newly generated underrepresented samples are learned from a smaller dataset of underrepresented subgroups.

We proposed evaluation metrics in this work based on absolute measures of sample quality w.r.t. p_data, as well as relative discrepancies in sample quality and protected attributes respectively. We duly acknowledge the imperfections of these metrics. Besides requiring additional supervision for evaluation (e.g., an accurate gender classifier), it is possible for simple tricks to ‘game’ these metrics when looked at in isolation. For example, current metrics such as FID are of limited use in evaluating bias mitigation, as they prefer models trained on larger datasets without any bias correction in order to avoid even slight compromises on perceptual sample quality. This suggests that such metrics may be unsuitable for model selection in high-stakes scenarios where fairness across subgroups is crucial.

In summary, the need for better unsupervised metrics for evaluating fairness is an open and critical direction for future work. Finally, it would be interesting to explore whether even weaker forms of supervision would be possible for this task, e.g., when the biased dataset has a somewhat disjoint but related support from the small, unbiased dataset – this would be highly reflective of the diverse data sources used for training many current and upcoming large-scale ML systems [64].

References

  • [1] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. Parallel wavenet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
  • [2] Jyh-Jing Hwang, Sergei Azernikov, Alexei A Efros, and Stella X Yu. Learning beyond human expertise with generative models for dental restorations. arXiv preprint arXiv:1804.00064, 2018.
  • [3] Cheng-Zhi Anna Huang, Tim Cooijmans, Adam Roberts, Aaron Courville, and Douglas Eck. Counterpoint by convolution. ISMIR, 2017.
  • [4] J Podesta, P Pritzker, EJ Moniz, J Holdren, and J Zients. Big data: seizing opportunities, preserving values. Executive Office of the President, The White House, 2014.
  • [5] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning. fairmlbook.org, 2018. http://www.fairmlbook.org.
  • [6] Antonio Torralba, Alexei A Efros, et al. Unbiased look at dataset bias. In CVPR, volume 1, page 7. Citeseer, 2011.
  • [7] Tatiana Tommasi, Novi Patricia, Barbara Caputo, and Tinne Tuytelaars. A deeper look at dataset bias. In Domain Adaptation in Computer Vision Applications, pages 37–55. Springer, 2017.
  • [8] Solon Barocas and Andrew D Selbst. Big data’s disparate impact. Calif. L. Rev., 104:671, 2016.
  • [9] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumeé III, and Kate Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.
  • [10] 23&me. The real issue: Diversity in genetics research. Retrieved from https://blog.23andme.com/ancestry/the-real-issue-diversity-in-genetics-research/, 2016.
  • [11] Euny Hong. 23andme has a problem when it comes to ancestry reports for people of color. Quartz. Retrieved from https://qz.com/765879/23andme-has-a-race-problem-when-it-comes-to-ancestryreports-for-non-whites.
  • [12] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
  • [13] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
  • [14] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
  • [15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [16] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [17] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
  • [18] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
  • [19] Sanjeev Arora, Andrej Risteski, and Yi Zhang. Do gans learn the distribution? some theory and empirics. In International Conference on Learning Representations, 2018.
  • [20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
  • [21] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. Mmd gan: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.
  • [22] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • [23] Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 1952.
  • [24] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, NY, USA:, 2001.
  • [25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [26] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • [27] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
  • [28] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
  • [29] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226. ACM, 2012.
  • [30] Hoda Heidari, Claudio Ferrari, Krishna Gummadi, and Andreas Krause. Fairness behind a veil of ignorance: A welfare analysis for automated decision making. In Advances in Neural Information Processing Systems, pages 1265–1276, 2018.
  • [31] Flavio du Pin Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. Data pre-processing for discrimination prevention: Information-theoretic optimization and analysis. IEEE Journal of Selected Topics in Signal Processing, 12(5):1106–1119, 2018.
  • [32] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325–333, 2013.
  • [33] Harrison Edwards and Amos Storkey. Censoring representations with an adversary. arXiv preprint arXiv:1511.05897, 2015.
  • [34] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.
  • [35] Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H Chi. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075, 2017.
  • [36] Jiaming Song, Pratyusha Kalluri, Aditya Grover, Shengjia Zhao, and Stefano Ermon. Learning controllable fair representations. arXiv preprint arXiv:1812.04218, 2018.
  • [37] Tameem Adel, Isabel Valera, Zoubin Ghahramani, and Adrian Weller. One-network adversarial fairness. 2019.
  • [38] Hee Jung Ryu, Hartwig Adam, and Margaret Mitchell. Inclusivefacenet: Improving face attribute detection with race and gender diversity. arXiv preprint arXiv:1712.00193, 2017.
  • [39] Depeng Xu, Shuhan Yuan, Lu Zhang, and Xintao Wu. Fairgan: Fairness-aware generative adversarial networks. In 2018 IEEE International Conference on Big Data (Big Data), pages 570–575. IEEE, 2018.
  • [40] Prasanna Sattigeri, Samuel C Hoffman, Vijil Chenthamarakshan, and Kush R Varshney. Fairness gan: Generating datasets with fairness properties using a generative adversarial network. In Proc. ICLR Workshop Safe Mach. Learn, volume 2, 2019.
  • [41] Jonathon Byrd and Zachary C Lipton. What is the effect of importance weighting in deep learning? arXiv preprint arXiv:1812.03372, 2018.
  • [42] Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pages 13–18. IEEE, 2009.
  • [43] Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, 2012.
  • [44] Shayan Doroudi, Philip S Thomas, and Emma Brunskill. Importance sampling for fair policy selection. Grantee Submission, 2017.
  • [45] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In Advances in Neural Information Processing Systems, pages 5029–5037, 2017.
  • [46] Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. Scalable fair clustering. arXiv preprint arXiv:1902.03519, 2019.
  • [47] Suman K Bera, Deeparnab Chakrabarty, and Maryam Negahbani. Fair algorithms for clustering. arXiv preprint arXiv:1901.02393, 2019.
  • [48] Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. Fair coresets and streaming algorithms for fair k-means clustering. arXiv preprint arXiv:1812.10854, 2018.
  • [49] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
  • [50] Aditya Grover and Stefano Ermon. Boosted generative models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [51] Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J Smola. A kernel method for the two-sample-problem. In Advances in neural information processing systems, pages 513–520, 2007.
  • [52] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
  • [53] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.
  • [54] Ivo Danihelka, Balaji Lakshminarayanan, Benigno Uria, Daan Wierstra, and Peter Dayan. Comparison of maximum likelihood and gan-based training of real nvps. arXiv preprint arXiv:1705.05263, 2017.
  • [55] Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
  • [56] Daniel Jiwoong Im, He Ma, Graham Taylor, and Kristin Branson. Quantitatively evaluating gans with divergences proposed for training. arXiv preprint arXiv:1803.01045, 2018.
  • [57] Ishaan Gulrajani, Colin Raffel, and Luke Metz. Towards gan benchmarks which require generalization. 2018.
  • [58] Aditya Grover, Jiaming Song, Alekh Agarwal, Kenneth Tran, Ashish Kapoor, Eric Horvitz, and Stefano Ermon. Bias correction of learned generative models using likelihood-free importance weighting. In NeurIPS, 2019.
  • [59] Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian Goodfellow, and Augustus Odena. Discriminator rejection sampling. arXiv preprint arXiv:1810.06758, 2018.
  • [60] Ryan Turner, Jane Hung, Yunus Saatci, and Jason Yosinski. Metropolis-hastings generative adversarial networks. arXiv preprint arXiv:1811.11357, 2018.
  • [61] Chenyang Tao, Liqun Chen, Ricardo Henao, Jianfeng Feng, and Lawrence Carin Duke. Chi-square generative adversarial network. In International Conference on Machine Learning, pages 4894–4903, 2018.
  • [62] Maurice Diesendruck, Ethan R Elenberg, Rajat Sen, Guy W Cole, Sanjay Shakkottai, and Sinead A Williamson. Importance weighted generative networks. arXiv preprint arXiv:1806.02512, 2018.
  • [63] Nina Grgic-Hlaca, Elissa M Redmiles, Krishna P Gummadi, and Adrian Weller. Human perceptions of fairness in algorithmic decision making: A case study of criminal risk prediction. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 903–912, 2018.
  • [64] Alexander Ratner, Stephen H Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3):269–282, 2017.
  • [65] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • [66] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2642–2651. JMLR. org, 2017.


Supplementary Material

A Proof of Theorem 1

Proof.

Since p_data(x | u) and p_bias(x | u) have disjoint supports for different values of u, we know that for every x there exists a deterministic mapping u(x) such that p(x | u') = 0 for all u' ≠ u(x). Further, for all x:

p_data(x) = Σ_u p_data(x | u) p_data(u) = p_data(x | u(x)) p_data(u(x))    (9)
p_bias(x) = Σ_u p_bias(x | u) p_bias(u) = p_bias(x | u(x)) p_bias(u(x))    (10)

Combining Eqs. 9, 10 above with the assumption in Eq. 7, we can simplify the density ratio as:

p_data(x) / p_bias(x) = [p_data(x | u(x)) p_data(u(x))] / [p_bias(x | u(x)) p_bias(u(x))]    (11)
                      = [p_bias(x | u(x)) p_data(u(x))] / [p_bias(x | u(x)) p_bias(u(x))]    (12)
                      = p_data(u(x)) / p_bias(u(x))    (13)
                      = β_{u(x)}    (14)

From Eq. 5 and Eqs. 11-14, the Bayes optimal classifier can hence be expressed as:

c*(Y = 1 | x) = w(x) / (w(x) + γ) = β_{u(x)} / (β_{u(x)} + γ)    (15)

The optimal negative cross-entropy of a binary classifier for density ratio estimation (DRE) can then be expressed as:

NCE(c*) = (1 / (1 + γ)) E_{x∼p_data}[ log c*(Y = 1 | x) ] + (γ / (1 + γ)) E_{x∼p_bias}[ log c*(Y = 0 | x) ]    (16)
        = (1 / (1 + γ)) E_{x∼p_data}[ log ( β_{u(x)} / (β_{u(x)} + γ) ) ] + (γ / (1 + γ)) E_{x∼p_bias}[ log ( γ / (β_{u(x)} + γ) ) ]    (17)
        = (1 / (1 + γ)) Σ_u p_data(u) log ( β_u / (β_u + γ) ) + (γ / (1 + γ)) Σ_u p_bias(u) log ( γ / (β_u + γ) )    (18)

which matches Eq. (8) and completes the proof.    (19)

B Architecture and Hyperparameter Configurations

We used PyTorch [65] for all our experiments. Our overall experimental framework involved three different kinds of models which we describe below.

B.1 Attribute Classifier

We use the same architecture and hyperparameters for both the single- and multi-attribute classifiers. Both are variants of ResNet-18, where the number of output classes corresponds to the dataset split (e.g., 2 classes for the single-attribute experiment and 4 classes for the multi-attribute experiment).

Architecture.

We provide the architectural details in Table 2 below:

Name Component
conv1

conv, 64 filters. stride 2

Residual Block 1 max pool, stride 2
Residual Block 2
Residual Block 3
Residual Block 4
Output Layer average pool stride 1, fully-connected, softmax
Table 2: ResNet-18 architecture adapted for attribute classifier.
Hyperparameters.

During training, we use a batch size of 64 and the Adam optimizer with learning rate = 0.001. The classifiers learn relatively quickly in both scenarios, so we only needed to train for 15 epochs. We used early stopping on the CelebA validation set to determine the best model to use for downstream evaluation.
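For concreteness, this training setup can be reproduced with code along the following lines (a minimal sketch, not the original implementation: the torchvision ResNet-18 backbone and the `train_loader`/`val_loader` objects over CelebA are assumptions on our part):

```python
import copy

import torch
import torch.nn as nn
from torchvision.models import resnet18

def build_attribute_classifier(num_classes):
    # ResNet-18 backbone with the output layer sized to the number of
    # attribute classes (2 for single-attribute, 4 for multi-attribute).
    return resnet18(num_classes=num_classes)

def train(model, train_loader, val_loader, device, epochs=15):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    best_val, best_state = float("inf"), None
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:  # batch size 64
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        # Early stopping: keep the checkpoint with the lowest validation loss.
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for images, labels in val_loader:
                val_loss += criterion(model(images.to(device)),
                                      labels.to(device)).item()
        if val_loss < best_val:
            best_val, best_state = val_loss, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```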

B.2 Density Ratio Classifier

Architecture.

We provide the architectural details in Table 3 below:

Name Component
conv1 conv, 64 filters, stride 2
Residual Block 1 max pool, stride 2
Residual Block 2
Residual Block 3
Residual Block 4
Output Layer average pool (stride 1), fully-connected, softmax
Table 3: ResNet-18 architecture adapted for the density ratio classifier.
Hyperparameters.

We also use a batch size of 64, the Adam optimizer with learning rate = 0.0001, and a total of 15 epochs to train the density ratio classifier.
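Once trained, the classifier's predicted probability that an example comes from the balanced dataset is inverted into a density ratio (importance weight) estimate. A minimal sketch, assuming the classifier outputs P(Y = 1 | x) for the balanced dataset and is trained on balanced minibatches as described below, so that no dataset-size correction factor is needed (the function name is ours):

```python
import torch

def importance_weights(dre_classifier, x, eps=1e-8):
    # c(x) = P(Y = 1 | x): probability that x is from the balanced dataset.
    with torch.no_grad():
        c = dre_classifier(x)
    # Estimated density ratio of the balanced over the unbalanced dataset.
    return c / (1.0 - c).clamp_min(eps)
```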

Experimental Details.

We note a few steps we had to take during the training and validation procedure. Because of the imbalance in both (a) the unbalanced/balanced dataset sizes and (b) the gender ratios, we found that a naive training procedure encouraged the classifier to predict all data points as belonging to the biased, unbalanced dataset. To prevent this phenomenon from occurring, two minor modifications were necessary:

  1. We balance the distribution between the two datasets in each minibatch: that is, we ensure that the classifier sees an equal number of data points from the balanced () and unbalanced () datasets in each batch. This provides enough signal for the classifier to learn meaningful density ratios, as opposed to a trivial mapping of all points to the larger dataset.

  2. We apply a similar balancing technique when testing against the validation set. However, instead of balancing each minibatch, we weight the contributions of the losses from the balanced and unbalanced datasets equally. Specifically, the loss is computed as the equally weighted average of the per-dataset mean losses,

    $$\mathcal{L} = \frac{1}{2}\left(\frac{1}{n_{\text{pos}}}\sum_{i \in \text{pos}} \ell_i + \frac{1}{n_{\text{neg}}}\sum_{j \in \text{neg}} \ell_j\right),$$

    where the subscript pos denotes examples from the balanced dataset () and neg denotes examples from the unbalanced dataset (). Both modifications are sketched in code after this list.
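A minimal sketch of both modifications (our own illustration, not the original training code; `D_ref` and `D_bias` denote the balanced and unbalanced datasets, and the helper names are assumptions):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, RandomSampler

def balanced_loaders(D_ref, D_bias, batch_size=64):
    # Draw equal numbers of balanced ("pos") and unbalanced ("neg") examples
    # per minibatch so the classifier cannot trivially favor the larger set.
    half = batch_size // 2
    ref_loader = DataLoader(
        D_ref, batch_size=half,
        sampler=RandomSampler(D_ref, replacement=True, num_samples=len(D_bias)))
    bias_loader = DataLoader(D_bias, batch_size=half, shuffle=True)
    return ref_loader, bias_loader

def validation_loss(model, ref_val_loader, bias_val_loader, device):
    # Weight the two datasets equally by averaging their per-dataset mean
    # losses instead of averaging over the (imbalanced) pooled validation set.
    criterion = nn.BCEWithLogitsLoss()
    model.eval()
    mean_losses = []
    for loader, label in [(ref_val_loader, 1.0), (bias_val_loader, 0.0)]:
        total, n = 0.0, 0
        with torch.no_grad():
            for x, _ in loader:
                x = x.to(device)
                y = torch.full((x.size(0),), label, device=device)
                total += criterion(model(x).squeeze(1), y).item() * x.size(0)
                n += x.size(0)
        mean_losses.append(total / n)
    return 0.5 * sum(mean_losses)
```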

B.3 Progressive GAN

Architecture.

The architectural details for the Progressive GAN are provided in Table 4 below:

Generator Discriminator
Noise Image
- conv, filters. LReLU
conv, filters. LReLU conv, filters. LReLU
conv, filters. LReLU conv, filters. LReLU
Upsample Downsample
conv, filters. LReLU conv, filters. LReLU
conv, filters. LReLU conv, filters. LReLU
Upsample Downsample
conv, filters. LReLU conv, filters. LReLU
conv, filters. LReLU conv, filters. LReLU
Upsample Downsample
conv, filters. LReLU conv, filters. LReLU
conv, filters. LReLU conv, filters. LReLU
Upsample Downsample
conv, filters. LReLU Minibatch Stddev
conv, filters. LReLU conv, filters. LReLU
conv, filters. linear.
Table 4: Architecture for the generator and discriminator. Notation: "conv, filters, LReLU" indicates a convolutional filter with the given number of output channels, followed by a leaky ReLU activation. Pixel normalization is applied before each leaky ReLU activation in the generator. Upsampling is performed via nearest-neighbor interpolation, whereas downsampling is performed via mean pooling.

Hyperparameters.

We use a batch size of , and the Adam optimizer with learning rate , and . We train the model using a progressively-grown WGAN-GP with gradient penalty weight . Noise is distributed uniformly on the surface of a hypersphere of radius .
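One easily misread detail is the noise distribution: sampling uniformly on the surface of a hypersphere can be done by normalizing Gaussian noise. A minimal sketch (the latent dimension of 512 and unit radius are our assumptions, since the exact values are not reproduced above):

```python
import torch

def sample_hypersphere_noise(batch_size, latent_dim=512, radius=1.0):
    # A standard Gaussian is rotationally symmetric, so rescaling each sample
    # to a fixed norm gives points distributed uniformly on the sphere.
    z = torch.randn(batch_size, latent_dim)
    return radius * z / z.norm(dim=1, keepdim=True)
```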

B.4 Dataset construction procedure

We construct such dataset splits from the full CelebA training set using the following procedure. We initially fix our dataset size to be roughly 135K out of the total 162K, based on the total number of females present in the data. Then, for each level of bias, we partition 1/4 of the males and 1/4 of the females into to achieve the 50-50 ratio. The remaining examples are used for , where the numbers of males and females are adjusted to match the desired level of bias (e.g., bias = 90). Finally, at each level of unbiased dataset size perc, we discard the appropriate fraction of data points from both the male and female categories in . For example, for perc = 0.5, we discard half of the females and half of the males from .
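A simplified reconstruction of this split procedure is sketched below (our own illustration; the attribute array, rounding, and handling of leftovers are assumptions, and the original code may differ):

```python
import numpy as np

def construct_splits(is_female, bias=0.9, perc=1.0, total=135_000, seed=0):
    # is_female: boolean array over the full CelebA training set.
    rng = np.random.default_rng(seed)
    female = rng.permutation(np.where(is_female)[0])
    male = rng.permutation(np.where(~is_female)[0])

    # Balanced (50-50) set: roughly 1/4 of each gender.
    n_bal = min(len(female), len(male)) // 4
    balanced = np.concatenate([female[:n_bal], male[:n_bal]])

    # Biased set: fill the remaining budget at the desired gender ratio.
    n_bias = total - len(balanced)
    n_f = min(int(bias * n_bias), len(female) - n_bal)
    n_m = min(n_bias - n_f, len(male) - n_bal)
    biased = np.concatenate([female[n_bal:n_bal + n_f],
                             male[n_bal:n_bal + n_m]])

    # Shrink the balanced set to the requested fraction, keeping it 50-50.
    keep = int(perc * n_bal)
    balanced = np.concatenate([female[:keep], male[:keep]])
    return balanced, biased
```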

C Density Ratio Classifier Analysis

Figure 5: Calibration curves for the density ratio classifiers, for bias = 90, bias = 80, and the multi-attribute split, at perc = 1.0, 0.5, 0.25, and 0.1.

In Figure 5, we show the calibration curves for the density ratio classifiers for each of the dataset sizes across all levels of bias. As evident from the plots, the classifiers are already well-calibrated and did not require any post-hoc recalibration.
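For reference, calibration curves of this kind can be computed by binning the predicted probabilities and comparing them to empirical frequencies; a minimal sketch using scikit-learn (our choice of library, not necessarily the one used for the paper's plots):

```python
from sklearn.calibration import calibration_curve

def reliability_curve(y_true, y_prob, n_bins=10):
    # y_true: 1 if the example came from the balanced dataset, else 0.
    # y_prob: classifier's predicted probability of the balanced dataset.
    frac_positive, mean_predicted = calibration_curve(y_true, y_prob, n_bins=n_bins)
    # A well-calibrated classifier satisfies frac_positive ≈ mean_predicted.
    return mean_predicted, frac_positive
```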

Model Optimal Empirical
single, bias=0.9 70% 68.6%
single, bias=0.8 65% 63.7%
multi 67.6% 60.4%
Table 5: Comparison between the accuracy of the optimal 0-1 classifier and the learned density ratio classifier after thresholding its predicted probabilities.

Table 5 shows that our density ratio classifier achieves an accuracy close to the optimal 0-1 accuracy on this dataset.
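As a sanity check on the Optimal column, consider the single-attribute split with bias = 0.9 and assume (this reasoning is our own reconstruction) that the optimal classifier only observes the latent gender attribute and sees the two datasets in equal proportion. The optimal rule predicts "unbalanced" for female faces (which make up 90% of the biased set but only 50% of the balanced set) and "balanced" for male faces, giving

$$\text{Acc}^{\star} = \tfrac{1}{2}(0.9) + \tfrac{1}{2}(0.5) = 0.70,$$

and the analogous computation for bias = 0.8 gives $\tfrac{1}{2}(0.8) + \tfrac{1}{2}(0.5) = 0.65$, matching the first two rows of Table 5.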

D Additional Results

D.1 Downstream Classification Task

We note that although it is difficult to directly compare our model to supervised baselines such as FairGAN [39] and FairnessGAN [40] due to the unsupervised nature of our work, we conduct further evaluations on a relevant downstream classification task adapted to a fairness setting.

In this task, we augment a biased dataset (165K examples) with a "fair" dataset (135K examples) generated by a pre-trained GAN, use the combination to train a classifier, and then evaluate the classifier's performance on a held-out dataset of true examples. We train a conditional GAN using the AC-GAN objective [66], where the conditioning is on an arbitrary downstream attribute of interest (e.g., we consider the "attractiveness" attribute of CelebA as in [40]). Our goal is to learn a classifier that predicts the attribute of interest in a way that is fair with respect to gender, the sensitive attribute.

As an evaluation metric, we use the demographic parity distance (), defined as the absolute difference in demographic parity between two classifiers.
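Under the common definition of demographic parity as a classifier's positive prediction rate per sensitive group, this metric can be sketched as follows (our interpretation; the paper's exact formula is not reproduced here, and the function names are ours):

```python
import numpy as np

def demographic_parity_gap(y_pred, sensitive):
    # Absolute difference in positive prediction rates across the two
    # sensitive groups (e.g., gender), for a single classifier.
    g0, g1 = np.unique(sensitive)[:2]
    return abs(y_pred[sensitive == g0].mean() - y_pred[sensitive == g1].mean())

def demographic_parity_distance(y_pred_f, y_pred_g, sensitive):
    # Absolute difference in demographic parity between two classifiers
    # f and g evaluated on the same held-out examples.
    return abs(demographic_parity_gap(y_pred_f, sensitive)
               - demographic_parity_gap(y_pred_g, sensitive))
```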

We consider two AC-GAN variants: (1) equi-weight, which is trained on ; and (2) imp-weight, which reweights the loss by the density ratio estimates (as sketched below). For both variants, the classifier is trained on both real and generated images, with labels for the generated images given by the attractiveness values they were conditioned on. The classifier is then asked to predict attractiveness for the CelebA test set.
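A sketch of how the imp-weight variant folds the density ratio estimates into the training objective via per-example loss reweighting (our own illustration of the general recipe, not the original implementation):

```python
import torch

def importance_weighted_loss(per_example_loss, weights):
    # per_example_loss: unreduced loss for each real training example.
    # weights: estimated density ratios w(x) ~ p_balanced(x) / p_biased(x).
    weights = weights / weights.mean()  # normalize to keep the loss scale
    return (weights * per_example_loss).mean()
```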

As shown in Table 6, the classifier trained on both real data and synthetic data generated by our imp-weight AC-GAN achieves a much lower demographic parity distance than the equi-weight baseline, demonstrating that our method achieves better demographic parity with respect to the sensitive attribute, despite the fact that we did not explicitly use gender labels during training.

Model Accuracy NLL Demographic parity distance
Baseline classifier, no data augmentation 79% 0.7964 0.038
equi-weight 79% 0.7902 0.032
imp-weight (ours) 75% 0.7564 0.002
Table 6: For the CelebA dataset, classifier accuracy, negative log-likelihood, and demographic parity distance across bias = and perc = on the downstream classification task. Our importance-weighting method learns a fairer classifier that achieves a lower demographic parity distance, as desired, albeit with a slight reduction in accuracy.

D.2 Multi-Attribute Experiment

The results for the multi-attribute split based on gender and the presence of black hair are shown in Figure 6. As a reminder to the reader, there may be pitfalls in evaluating data generation due to imperfections of the multi-attribute classifier: for instance, in Figure 6a, some samples classified as females without black hair appear to have darker hair shades upon visual inspection. Even in this challenging setup involving two latent bias factors, we find that the importance-weighted approach again outperforms the baselines in almost all cases in mitigating bias in the generated data, while incurring a small deterioration in overall image quality.
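For completeness, the subgroup tallies reported in captions such as Figure 6a can be obtained by classifying generated samples with the multi-attribute classifier; a minimal sketch (the index-to-subgroup mapping below is an assumption on our part):

```python
import torch
from collections import Counter

SUBGROUPS = ["female, no black hair", "male, no black hair",
             "female, black hair", "male, black hair"]

def subgroup_counts(attribute_classifier, samples):
    # samples: a batch of generated images; the classifier outputs logits
    # over the four (gender x black hair) subgroups.
    with torch.no_grad():
        preds = attribute_classifier(samples).argmax(dim=1)
    counts = Counter(preds.tolist())
    return {name: counts.get(i, 0) for i, name in enumerate(SUBGROUPS)}
```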

Figure 6: Multi-Attribute Dataset Bias Mitigation. (a) Samples generated via importance reweighting, with subgroups separated by orange, red, and green lines; for the 100 samples shown, the classifier identifies 38 females and 26 males without black hair, and 20 females and 16 males with black hair. (b) Fairness discrepancy. (c) FID. Lower discrepancy and FID are better. Standard errors in (b) and (c) are computed over 10 independent evaluation sets of 10,000 samples each drawn from the models.

E Additional generated samples

Additional samples for the other experimental configurations are displayed in the following pages.

Figure 7: Additional samples for bias=90 across different methods: (a) equi-weight, (b) conditional, (c) imp-weight, (d) unbiased-only. All samples shown are from the scenario where .
Figure 8: Additional samples for bias=80 across different methods: (a) equi-weight, (b) conditional, (c) imp-weight, (d) unbiased-only. All samples shown are from the scenario where .
Figure 9: Additional samples for the multi-attribute experiment across different methods: (a) equi-weight, (b) conditional, (c) imp-weight, (d) unbiased-only. All samples shown are from the scenario where .