Fair Generative Modeling via Weak Supervision
Real-world datasets are often biased with respect to key demographic factors such as race and gender. Due to the latent nature of the underlying factors, detecting and mitigating bias is especially challenging for unsupervised machine learning. We present a weakly supervised algorithm for overcoming dataset bias for deep generative models. Our approach requires access to an additional small, unlabeled but unbiased dataset as the supervision signal, thus sidestepping the need for explicit labels on the underlying bias factors. Using this supplementary dataset, we detect the bias in existing datasets via a density ratio technique and learn generative models which efficiently achieve the twin goals of (1) data efficiency, by using training examples from both biased and unbiased datasets for learning, and (2) unbiased data generation at test time. Empirically, we demonstrate the efficacy of our approach, which reduces bias w.r.t. latent factors by 57.1% on average over baselines for comparable image generation using generative adversarial networks.
Increasingly, many applications of machine learning (ML) involve data generation. Examples of such production-level systems include WaveNet for text-to-speech synthesis [1], generative adversarial networks (GANs) for dental restoration [2], and a large number of creative applications such as Coconet, used for designing the "first AI-powered Google Doodle" [3]. As these generative applications become more prevalent, it becomes increasingly important to consider questions regarding the potential discriminatory nature of such systems and ways to mitigate it [4].
A variety of socio-technical factors contribute to the discriminatory nature of ML systems [5]. A major factor is the existence of biases in the training data itself [6, 7]. Since data is the fuel of ML, any existing bias in the dataset can be propagated to the learned model [8]. This is a particularly pressing concern for generative models, which can easily amplify the bias by generating more of the biased data at test time. Further, learning a generative model is fundamentally an unsupervised learning problem and hence the bias factors of interest are typically latent. For example, while learning a generative model of human faces, we often do not have access to attributes such as gender, race, and age. Any existing bias in the dataset with respect to these attributes is easily picked up by deep generative models. See Figure 1 for an illustration.

In this work, we present a weakly-supervised approach to learning fair generative models in the presence of dataset bias. Our source of weak supervision is motivated by the observation that obtaining relatively small unbiased datasets is feasible in practice: e.g., survey data collected by organizations such as the World Bank typically follows several good dataset collection practices [9] to ensure representativeness before release. However, collecting such unbiased datasets can be expensive and hard to scale. In contrast, obtaining large unlabelled (but biased) datasets is relatively cheap for many domains in the big data era. As another example, biotech firms invest monetary and infrastructure resources to gather representative data from different demographics for personal genomics, but admit limitations in scaling, as the majority of their existing data belongs to a highly biased sample of the world population [10, 11]. Note that neither of our datasets needs to be labeled w.r.t. the latent bias attributes, and the size of the unbiased dataset can be much smaller than that of the biased dataset. Hence, the supervision we require is weak.
Using an unbiased dataset to augment a biased dataset, our goal is to learn a generative model that best approximates the desired, unbiased data distribution. Simply using the unbiased dataset alone for learning is an option, but this may not suffice since this dataset can be too small to learn an expressive model that accurately captures the underlying unbiased data distribution. Our approach to learning a fair generative model that is robust to biases in the larger training set is based on importance reweighting. In particular, we learn a generative model which reweighs the data points in the biased dataset based on the ratio of densities assigned by the target unbiased data distribution as compared to the biased data distribution. Since we do not have access to explicit densities under either of the two distributions, we estimate the importance weights using a probabilistic classifier [12, 13].

We test our weakly-supervised approach by learning generative adversarial networks on the CelebA dataset [14]. The dataset contains labels for attributes such as gender and hair color, which we use for designing biased and unbiased data splits and for subsequent evaluation. Our empirical results successfully demonstrate how the reweighting approach can offset dataset bias across a wide range of settings. In particular, we obtain improvements of 74.9% and 22.6% on average over baselines in reducing the bias with respect to the latent factors in the single-attribute and multi-attribute dataset bias settings respectively, for comparable sample quality.
We assume there exists a true (unknown) data distribution $p_{\rm data}$ over a set of observed variables $\mathbf{x}$. In generative modeling, our goal is to learn the parameters $\theta$ of a model distribution $p_\theta$ over the observed variables, such that $p_\theta$ is close to $p_{\rm data}$. Depending on the choice of learning algorithm, different approaches have been previously considered. Broadly, these include adversarial training (e.g., GANs [15]) and maximum likelihood estimation (MLE) (e.g., variational autoencoders [16, 17] and normalizing flows [18]). Our bias mitigation framework is agnostic to the above training approaches. For generality, we consider expectation-based learning objectives:

$$\min_\theta \; \mathbb{E}_{\mathbf{x} \sim p_{\rm data}}\left[\ell(\mathbf{x}, \theta)\right] \tag{1}$$

where $\ell(\mathbf{x}, \theta)$ is a suitable per-example loss that depends on both the example $\mathbf{x}$ drawn from the dataset and the model parameters $\theta$. The above expression encompasses a broad class of MLE and adversarial objectives. For example, if $\ell(\mathbf{x}, \theta)$ denotes the negative log-likelihood $-\log p_\theta(\mathbf{x})$ assigned to the point $\mathbf{x}$, then we recover the MLE training objective.
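As a concrete instance of Eq. (1), the sketch below estimates the expected MLE loss by Monte Carlo for a hypothetical one-dimensional Gaussian model; the model family, parameters, and function names are illustrative assumptions, not from the paper:

```python
import numpy as np

# Monte Carlo estimate of the expected loss in Eq. (1) for the MLE objective,
# where the per-example loss is the negative log-likelihood under p_theta.
# p_theta is a hypothetical 1-D Gaussian, used purely for illustration.
def nll(x, mu, sigma):
    return 0.5 * np.log(2 * np.pi * sigma**2) + (x - mu)**2 / (2 * sigma**2)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=100_000)  # draws from p_data = N(0, 1)
expected_loss = nll(samples, mu=0.0, sigma=1.0).mean()
# For a well-specified model at its optimum, this approaches the entropy of
# p_data, i.e. 0.5 * log(2 * pi * e) ~ 1.419.
```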
The standard assumption for learning a generative model is that we have access to a sufficiently large dataset $\mathcal{D}_{\rm ref}$ of training examples, where each $\mathbf{x} \in \mathcal{D}_{\rm ref}$ is assumed to be sampled independently from the target unbiased distribution $p_{\rm data}$. In practice however, collecting large datasets that are i.i.d. w.r.t. $p_{\rm data}$ is difficult due to a variety of socio-technical factors. Moreover, the sample complexity for learning high-dimensional distributions can be doubly exponential in the dimension in many cases [19], surpassing the size of the largest available datasets.

We can partially offset this difficulty by considering data from alternate sources related to the target distribution, e.g., images scraped from the Internet. However, these additional datapoints are not expected to be i.i.d. w.r.t. $p_{\rm data}$. We characterize this phenomenon as dataset bias: we assume the availability of a dataset $\mathcal{D}_{\rm bias}$, whose examples are sampled independently from a biased (unknown) distribution $p_{\rm bias}$ that is different from $p_{\rm data}$ but shares the same support.
Evaluating generative models and fairness in machine learning are both open areas of research in their own right. Our work sits at the intersection of these two fields, and we propose the following metrics for measuring bias mitigation in data generation.
Sample Quality: We employ sample quality metrics, e.g., Fréchet Inception Distance (FID) [20] and Kernel Inception Distance (KID) [21]. These metrics match empirical expectations w.r.t. a reference data distribution and a model distribution in a predefined feature space, e.g., the prefinal layer of activations of the Inception Network [22]. The lower the score, the better the learned model's ability to approximate the reference distribution. For the fairness context in particular, we are interested in measuring the discrepancy w.r.t. $p_{\rm data}$, even if the model has been trained using both $\mathcal{D}_{\rm ref}$ and $\mathcal{D}_{\rm bias}$.
Fairness: Alternatively, we can evaluate the bias of generative models specifically in the context of some sensitive latent variables, say $\mathbf{u}$. For example, $\mathbf{u}$ may correspond to the age and gender of an individual depicted via an image $\mathbf{x}$. We emphasize that such attributes are unknown during training, and used only for evaluation at test time.

If we have access to a highly accurate predictor $p(\mathbf{u} \mid \mathbf{x})$ for the distribution of the sensitive attributes $\mathbf{u}$ conditioned on the observed $\mathbf{x}$, we can evaluate the extent of bias mitigation via the discrepancies in the expected marginal likelihoods of $\mathbf{u}$ as per $p_{\rm data}$ and $p_\theta$. Formally, we define the fairness discrepancy $f$ for a generative model $p_\theta$ w.r.t. $p_{\rm data}$ and sensitive attributes $\mathbf{u}$:

$$f(p_{\rm data}, p_\theta) = \left\| \mathbb{E}_{p_{\rm data}}\left[p(\mathbf{u} \mid \mathbf{x})\right] - \mathbb{E}_{p_\theta}\left[p(\mathbf{u} \mid \mathbf{x})\right] \right\|_2 \tag{2}$$

In practice, the expectations in Eq. (2) can be computed via Monte Carlo averaging. The lower the discrepancy between the two expectations, the better the learned model's ability to mitigate dataset bias w.r.t. the sensitive attributes $\mathbf{u}$.
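A minimal Monte Carlo sketch of this fairness discrepancy, assuming access to a pretrained attribute classifier returning $p(\mathbf{u} \mid \mathbf{x})$; the classifier, inputs, and function names below are illustrative stand-ins, not the paper's code:

```python
import numpy as np

# Monte Carlo estimate of the fairness discrepancy in Eq. (2). `p_u_given_x`
# stands in for the pretrained attribute classifier p(u | x); here it is a toy
# function so that the example is self-contained.
def fairness_discrepancy(p_u_given_x, x_ref, x_model):
    e_ref = p_u_given_x(x_ref).mean(axis=0)      # E_{p_data}[p(u | x)]
    e_model = p_u_given_x(x_model).mean(axis=0)  # E_{p_theta}[p(u | x)]
    return np.linalg.norm(e_ref - e_model)       # L2 discrepancy; lower is better

# Toy "classifier": each input already encodes p(u = 0 | x).
p_u = lambda x: np.stack([x, 1.0 - x], axis=1)
x_ref = np.full(1000, 0.5)    # balanced reference samples
x_model = np.full(1000, 0.9)  # model over-generates the majority subgroup
disc = fairness_discrepancy(p_u, x_ref, x_model)
```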
Recall that we assume a learning setting where, in addition to a dataset $\mathcal{D}_{\rm ref}$ of training examples from $p_{\rm data}$, we are given access to a (much larger) data source $\mathcal{D}_{\rm bias}$. Our goal in this work is to capitalize on both data sources for learning a model $p_\theta$ that best approximates the unbiased target distribution $p_{\rm data}$.
We begin by discussing two baseline approaches at the extreme ends of the spectrum. First, one could completely ignore $\mathcal{D}_{\rm bias}$ and learn based on $\mathcal{D}_{\rm ref}$ alone. Since we only consider proper losses w.r.t. $p_{\rm data}$, global optimization of the objective in Eq. (1) in a well-specified model family will recover the true data distribution as $|\mathcal{D}_{\rm ref}| \to \infty$. However, since $\mathcal{D}_{\rm ref}$ is finite in practice, this is likely to give poor sample quality even though the fairness discrepancy would be low.

On the other extreme, we can learn based on the full dataset consisting of both $\mathcal{D}_{\rm ref}$ and $\mathcal{D}_{\rm bias}$. This procedure is data efficient and could lead to high sample quality, but it comes at the cost of fairness, since the learned distribution will be heavily biased w.r.t. $p_{\rm data}$.
Our first proposal is to learn a generative model conditioned on the identity of the dataset used during training. Formally, we learn a generative model $p_\theta(\mathbf{x} \mid y)$, where $y$ is a binary random variable indicating whether the model distribution should approximate the data distribution corresponding to $\mathcal{D}_{\rm ref}$ (i.e., $y = 1$) or $\mathcal{D}_{\rm bias}$ (i.e., $y = 0$). By sharing model parameters $\theta$ across the two values of $y$, we hope to leverage both data sources. At test time, conditioning on $y = 1$ should then result in fair generations.

As we demonstrate in Section 4, however, this simple approach does not achieve the intended effect in practice. The likely cause is that the conditioning information is too weak for the model to infer the bias factors and effectively distinguish between the two distributions. Next, we present an alternate two-phased approach based on density ratio estimation, which effectively overcomes the dataset bias in a data-efficient manner.
Recall one of the trivial baselines in Section 3.1, which learns a generative model on the union of $\mathcal{D}_{\rm ref}$ and $\mathcal{D}_{\rm bias}$. This method is problematic because it assigns equal weight to the loss contribution from each individual datapoint in Eq. (1), regardless of whether the datapoint comes from $\mathcal{D}_{\rm ref}$ or $\mathcal{D}_{\rm bias}$. For example, in situations where the dataset bias causes a minority group to be underrepresented, this objective will encourage the model to focus on the majority group so that the overall loss is minimized on average with respect to a biased empirical distribution, i.e., a weighted mixture of the empirical distributions of $\mathcal{D}_{\rm ref}$ and $\mathcal{D}_{\rm bias}$ with weights proportional to $|\mathcal{D}_{\rm ref}|$ and $|\mathcal{D}_{\rm bias}|$.
Our key idea is to reweight the datapoints from $\mathcal{D}_{\rm bias}$ during training such that the model learns to downweight over-represented data points from $\mathcal{D}_{\rm bias}$ while simultaneously upweighting the under-represented points. The challenge in the unsupervised context is that we do not have direct supervision on which points are over- or under-represented and by how much. To resolve this issue, we consider importance sampling [23]. Whenever we are given data from two distributions, w.l.o.g. say $p$ and $q$, and wish to evaluate a sample average w.r.t. $p$ given samples from $q$, we can do so by reweighting the samples from $q$ by the ratio of densities assigned to the sampled points by $p$ and $q$. In our setting, the distributions of interest are $p_{\rm data}$ and $p_{\rm bias}$ respectively. Hence, an importance-weighted objective for learning from $\mathcal{D}_{\rm bias}$ is:

$$\mathbb{E}_{\mathbf{x} \sim p_{\rm data}}\left[\ell(\mathbf{x}, \theta)\right] = \mathbb{E}_{\mathbf{x} \sim p_{\rm bias}}\left[w(\mathbf{x})\,\ell(\mathbf{x}, \theta)\right] \tag{3}$$

$$\approx \frac{1}{|\mathcal{D}_{\rm bias}|} \sum_{\mathbf{x}_i \in \mathcal{D}_{\rm bias}} w(\mathbf{x}_i)\,\ell(\mathbf{x}_i, \theta) \tag{4}$$

where $w(\mathbf{x}) := \frac{p_{\rm data}(\mathbf{x})}{p_{\rm bias}(\mathbf{x})}$ is defined to be the importance weight for $\mathbf{x}$.
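The importance-sampling identity above can be checked numerically. In this sketch, the two distributions are hypothetical 1-D Gaussians so that the true importance weights are available in closed form; all names are illustrative:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x_bias = rng.normal(1.0, 1.0, size=200_000)  # samples from p_bias = N(1, 1)
# w(x) = p_data(x) / p_bias(x), with p_data = N(0, 1)
w = gauss_pdf(x_bias, 0.0, 1.0) / gauss_pdf(x_bias, 1.0, 1.0)

loss = lambda x: x**2                    # stand-in per-example loss
iw_estimate = (w * loss(x_bias)).mean()  # weighted average: ~ E_{p_data}[x^2] = 1
naive_estimate = loss(x_bias).mean()     # unweighted: ~ E_{p_bias}[x^2] = 2
```

The unweighted average converges to the biased expectation, while the reweighted average recovers the expectation under the target distribution.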
To estimate the importance weights, we use a binary classifier as described below [12].

Consider a binary classification problem with classes $Y \in \{0, 1\}$ and training data generated as follows. First, we fix a prior probability $p(Y = 1)$. Then, we repeatedly sample $y \sim p(Y)$. If $y = 1$, we independently sample a datapoint $\mathbf{x} \sim p_{\rm data}$; else we sample $\mathbf{x} \sim p_{\rm bias}$. Then, as shown in [24], the ratio of densities assigned to an arbitrary point $\mathbf{x}$ by $p_{\rm data}$ and $p_{\rm bias}$ can be recovered via a Bayes optimal (probabilistic) classifier $c^\ast(Y \mid \mathbf{x})$ as:

$$w(\mathbf{x}) = \frac{p_{\rm data}(\mathbf{x})}{p_{\rm bias}(\mathbf{x})} = \gamma\,\frac{c^\ast(Y = 1 \mid \mathbf{x})}{c^\ast(Y = 0 \mid \mathbf{x})} \tag{5}$$

where $c^\ast(Y = 1 \mid \mathbf{x})$ is the probability assigned by the classifier to the point $\mathbf{x}$ belonging to class $Y = 1$, and $\gamma = p(Y = 0)/p(Y = 1)$ is the ratio of the marginals of the labels for the two classes. In practice, we do not have direct access to either $p_{\rm data}$ or $p_{\rm bias}$; hence, our training data consists of points sampled from the empirical data distributions defined uniformly over $\mathcal{D}_{\rm ref}$ and $\mathcal{D}_{\rm bias}$. Further, we may not be able to learn a Bayes optimal classifier, and we denote the importance weights estimated by the learned classifier for a point $\mathbf{x}$ as $\hat{w}(\mathbf{x})$.
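Turning calibrated classifier probabilities into importance weights is a one-liner. In the sketch below, the probabilities are placeholders for the output of a trained classifier, and the dataset sizes determine the class-marginal ratio when classes are sampled in proportion to dataset size (an assumption; with balanced minibatches the ratio would be 1):

```python
import numpy as np

# Density-ratio estimate from a probabilistic classifier: w(x) is proportional
# to gamma * c(Y=1 | x) / c(Y=0 | x), where Y = 1 marks points drawn from the
# small unbiased dataset. `p_y1_given_x` would come from a trained, calibrated
# classifier; here it is just a placeholder array.
def importance_weights(p_y1_given_x, n_bias, n_ref):
    gamma = n_bias / n_ref                      # ratio of class marginals p(Y=0)/p(Y=1)
    p = np.clip(p_y1_given_x, 1e-6, 1 - 1e-6)   # guard against division by zero
    return gamma * p / (1 - p)

probs = np.array([0.5, 0.2, 0.8])
w = importance_weights(probs, n_bias=9000, n_ref=1000)
```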
Our overall procedure is summarized in Algorithm 1. We use deep neural networks to parameterize the binary classifier and the generative model. Given the biased and unbiased datasets along with the network architectures and other standard hyperparameters (e.g., learning rate, optimizer, etc.), we first learn a probabilistic binary classifier (Line 2). The learned classifier provides importance weights for the datapoints from $\mathcal{D}_{\rm bias}$ via estimates of the density ratios (Line 3). For the datapoints from $\mathcal{D}_{\rm ref}$, we do not need to perform any reweighting and set the importance weights to 1 (Line 4). Using the combined dataset, we then learn the generative model, where the minibatch loss for every gradient update weights the contribution from each datapoint (Lines 6-12).

For a practical implementation, it is best to account for some diagnostics and best practices while executing Algorithm 1. For density ratio estimation, we check that the classifier is calibrated on a held-out set. This is a necessary (but insufficient) condition for the estimated density ratios to be meaningful. If the classifier is miscalibrated, standard recalibration techniques such as Platt scaling can be applied before estimating the importance weights. Furthermore, while optimizing the model with a weighted objective, importance weighting can increase the variance of the loss contributions across examples in a minibatch. We did not observe this in our experiments, but techniques such as normalizing the weights within a batch can potentially help control this unintended variance [12].

Theoretical Analysis. The performance of Algorithm 1 critically depends on the quality of the estimated density ratios, which in turn is dictated by the training of the binary classifier. We define the expected negative cross-entropy (NCE) objective for a classifier $c$ as:

$$\mathrm{NCE}(c) := p(Y{=}1)\,\mathbb{E}_{\mathbf{x} \sim p_{\rm data}}\left[\log c(Y{=}1 \mid \mathbf{x})\right] + p(Y{=}0)\,\mathbb{E}_{\mathbf{x} \sim p_{\rm bias}}\left[\log c(Y{=}0 \mid \mathbf{x})\right] \tag{6}$$
In the following result, we characterize the NCE loss for the Bayes optimal classifier.
Theorem 1. Let $\mathbf{u}$ denote a set of unobserved bias variables. Suppose there exist two joint distributions $p_{\rm data}(\mathbf{x}, \mathbf{u})$ and $p_{\rm bias}(\mathbf{x}, \mathbf{u})$ over $\mathbf{x}$ and $\mathbf{u}$, with marginals $p_{\rm data}(\mathbf{x})$, $p_{\rm data}(\mathbf{u})$ for the first joint and similar notation for the second, such that

$$p_{\rm data}(\mathbf{x} \mid \mathbf{u}) = p_{\rm bias}(\mathbf{x} \mid \mathbf{u}) =: p(\mathbf{x} \mid \mathbf{u}) \tag{7}$$

and the conditionals $p(\mathbf{x} \mid \mathbf{u})$ have disjoint supports for different values of $\mathbf{u}$. Then the negative cross-entropy of the Bayes optimal classifier $c^\ast$ is given by:

$$\mathrm{NCE}(c^\ast) = \sum_{\mathbf{u}} \frac{p_{\rm bias}(\mathbf{u})}{1 + \gamma} \left[ b(\mathbf{u}) \log \frac{b(\mathbf{u})}{b(\mathbf{u}) + \gamma} + \gamma \log \frac{\gamma}{b(\mathbf{u}) + \gamma} \right] \tag{8}$$

where $b(\mathbf{u}) := \frac{p_{\rm data}(\mathbf{u})}{p_{\rm bias}(\mathbf{u})}$ and $\gamma = p(Y = 0)/p(Y = 1)$.
For example, as we shall see in our experiments in the following section, the inputs $\mathbf{x}$ can correspond to face images, whereas the unobserved $\mathbf{u}$ represents sensitive bias factors for a subgroup, such as gender or race. The proportion of examples belonging to a subgroup can differ across the biased and unbiased datasets, with the relative proportions given by $b(\mathbf{u})$. Note that the above result only requires knowing these relative proportions, and not the true $\mathbf{u}$ for each $\mathbf{x}$. The practical implication is that under the assumptions of Theorem 1, we can check the quality of the density ratios estimated by an arbitrary learned classifier by comparing its empirical NCE with the theoretical NCE of the Bayes optimal classifier in Eq. (8) (see Section 4.1).
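The theoretical NCE can be computed from the subgroup proportions alone. The sketch below evaluates one way of writing the Bayes optimal cross-entropy loss for the single-attribute split with bias 0.9 (50/50 unbiased vs. 90/10 biased), assuming equal class priors (gamma = 1, matching balanced classifier minibatches, which is an assumption on our part); up to the sign convention of reporting a positive loss, the result agrees with the 0.591 entry in Table 1:

```python
import numpy as np

# Bayes-optimal cross-entropy loss from subgroup proportions alone, assuming
# equal class priors (gamma = 1). b(u) = p_data(u) / p_bias(u) is the ratio of
# subgroup marginals between the unbiased and biased datasets.
def bayes_optimal_ce(p_ref, p_bias, gamma=1.0):
    nce = 0.0
    for pr, pb in zip(p_ref, p_bias):
        b = pr / pb  # density ratio for this subgroup
        nce += (pb / (1 + gamma)) * (
            b * np.log(b / (b + gamma)) + gamma * np.log(gamma / (b + gamma))
        )
    return -nce      # report as a (positive) cross-entropy loss

loss = bayes_optimal_ce(p_ref=[0.5, 0.5], p_bias=[0.9, 0.1])  # ~ 0.591
```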
In this section, we are interested in empirically investigating two broad questions:
How well can we estimate density ratios for the proposed weak supervision setting?
How effective is the reweighting technique for learning fair generative models on the fairness discrepancy metric proposed in Section 2.3?
We further demonstrate the usefulness of our generated data in downstream applications such as data augmentation for learning a fair classifier in Supplement D.1.
Dataset. We consider the CelebA [14] dataset, which is commonly used for benchmarking deep generative models and comprises images of faces annotated with 40 labeled binary attributes. We use this attribute information to construct three different settings for partitioning the full dataset into $\mathcal{D}_{\rm bias}$ and $\mathcal{D}_{\rm ref}$.
Setting 1 (single, bias=0.9): We set $\mathbf{u}$ to be a single bias variable corresponding to "gender", with values 0 (female) and 1 (male), and a bias level of 0.9. Specifically, this means that $\mathcal{D}_{\rm ref}$ contains the same fraction of male and female images, whereas in $\mathcal{D}_{\rm bias}$ a 0.9 fraction of the images are female and the rest male.

Setting 2 (single, bias=0.8): We use the same bias variable (gender) as Setting 1, with a bias level of 0.8.

Setting 3 (multi): We set $\mathbf{u}$ as two bias variables corresponding to "gender" and "black hair". In total, we have four subgroups: females without black hair (00), males without black hair (01), females with black hair (10), and males with black hair (11). The subgroup proportions in $\mathcal{D}_{\rm bias}$ are skewed relative to $\mathcal{D}_{\rm ref}$.
We emphasize that the attribute information is used only for designing controlled biased and unbiased datasets and faithful evaluation. Our algorithm does not explicitly require such labeled information.
(Figure: standard error in (b) and (c) is computed over 10 independent evaluation sets of 10,000 samples each drawn from the models. Lower fairness discrepancy and FID are better.)
Models. We train two classifiers for our experiments: (1) the attribute (e.g., gender) classifier, which we use to assess the level of bias present in our final samples; and (2) the density ratio classifier. For both models, we use a variant of ResNet-18 [25] on the standard train and validation splits of CelebA. For the generative model, we use a Progressive GAN [26] trained to minimize the Wasserstein GAN [27, 28] objective with gradient penalty. Additional details regarding the architectural design and hyperparameters are in Supplement B.
For each of the three experimental settings, we evaluate the quality of the estimated density ratios by comparing empirical estimates of the cross-entropy loss of the density ratio classifier with the cross-entropy loss of the Bayes optimal classifier derived in Eq. (8). We show the results in Table 1, where we find that the two losses are very close, suggesting that we obtain high-quality density ratio estimates for subsequently training fair generative models. In Supplement C, we show a more fine-grained analysis of the 0-1 accuracies and calibration of the learned models.
Model | Bayes optimal | Empirical
---|---|---
single, bias=0.9 | 0.591 | 0.606
single, bias=0.8 | 0.604 | 0.652
multi | 0.611 | 0.666
In Figure 2, we show the distribution of our importance weights for the various latent subgroups. We find that across all the considered settings, the underrepresented subgroups (e.g., males in Figures 2(a) and 2(b), females with black hair in 2(c)) are upweighted on average (mean density ratio estimate > 1), while the overrepresented subgroups are downweighted on average (mean density ratio estimate < 1). Also, as expected, the density ratio estimates are closer to 1 when the bias is lower (see Figure 2(a) vs. 2(b)).
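When such per-example density-ratio estimates are used in the weighted training loss, one simple variance-control option (mentioned in the practical notes of Section 3) is to self-normalize the weights within each minibatch. A minimal sketch, with illustrative function name and batch values:

```python
import numpy as np

# Self-normalized importance weights within a minibatch: rescale so that the
# weights sum to the batch size, keeping the average gradient magnitude
# comparable to unweighted training while preserving relative weighting.
def normalize_batch_weights(w):
    w = np.asarray(w, dtype=float)
    return w * (len(w) / w.sum())

batch_w = normalize_batch_weights([9.0, 2.25, 36.0, 1.0])
# batch_w sums to 4 (the batch size); the relative ordering is unchanged.
```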
We compare our importance-weighted approach against three baselines: (1) equi-weight: a WGAN-GP trained on the full dataset that weighs every point equally; (2) unbiased-only: a WGAN-GP trained only on the unbiased dataset $\mathcal{D}_{\rm ref}$; and (3) conditional: a conditional WGAN-GP where the conditioning label indicates whether a data point is from $\mathcal{D}_{\rm ref}$ or $\mathcal{D}_{\rm bias}$. In all our experiments, the unbiased-only variant failed to produce any recognizable samples. For a clean presentation of the results, we therefore omit this baseline below and refer the reader to the supplementary material for its results.
To evaluate performance under a variety of settings, we also vary the size of the balanced dataset $\mathcal{D}_{\rm ref}$ relative to the unbalanced dataset size $|\mathcal{D}_{\rm bias}|$: perc = {0.1, 0.25, 0.5, 1.0}. Here, perc = 0.1 denotes $|\mathcal{D}_{\rm ref}|$ equal to 10% of $|\mathcal{D}_{\rm bias}|$, and perc = 1.0 denotes $|\mathcal{D}_{\rm ref}| = |\mathcal{D}_{\rm bias}|$.
We train our gender classifier for evaluation on the entire CelebA training set, and achieve a level of 98% accuracy on the held-out set. For each experimental setting, we evaluate bias mitigation based on the fairness discrepancy metric (Eq. (2)) and also report sample quality based on FID [20].
For the bias = 0.9 split, we show the samples generated via imp-weight in Figure 3a and the resulting fairness discrepancies in Figure 3b. Our framework generates samples of slightly lower quality than the equi-weight baseline samples shown in Figure 1, but produces an almost identical proportion of samples across the two genders. Similar observations hold for bias = 0.8, as shown in Figure 4. While importance weighting outperforms all baselines on the fairness discrepancy metric, as seen in Figure 4a, the baselines do better in a couple of cases on FID (for some settings of perc in Figure 4b).
We conduct a similar experiment with a multi-attribute split based on gender and the presence of black hair. The attribute classifier for the purpose of evaluation is now trained with a 4-way classification task instead of 2, and achieves an accuracy of roughly 88% on the test set. We refer the reader to Supplement D.2 for corresponding results and analysis.
Fairness & generative modeling.
There is a rich body of work in fair ML, which focuses on different notions of fairness (e.g., demographic parity, equality of odds and opportunity) and studies methods by which models can perform tasks such as classification in a non-discriminatory way [5, 29, 30, 31]. Our focus is on fair generative modeling. The vast majority of related work in this area is centered around fair and/or privacy-preserving representation learning, exploiting tools from adversarial learning and information theory, among others [32, 33, 34, 35, 36, 37]. A unifying principle among these methods is that a discriminator is trained to perform poorly in predicting an outcome based on a protected attribute. [38] considers transfer learning of race and gender identities as a form of weak supervision for predicting other attributes on datasets of faces. While the end goal of the above works is classification, our focus is on data generation in the presence of dataset bias, and we do not require explicit supervision for the protected attributes.
The most relevant prior works in the data generation setting are FairGAN [39] and FairnessGAN [40]. The goal of both methods is to generate fair datapoints and their corresponding labels as a preprocessing technique. This allows for learning a useful downstream classifier and obscures information about protected attributes. Again, these works are not directly comparable to ours as we do not assume explicit supervision regarding the protected attributes during training, and our goal is fair generation given unlabelled biased datasets where the bias factors are latent.
Reweighting. Reweighting datapoints is a common algorithmic technique for problems such as dataset bias and class imbalance [41]. It has often been used in the context of fair classification [42]; for example, [43] details reweighting as a way to remove discrimination without relabeling instances. For reinforcement learning, [44] used an importance sampling approach for selecting fair policies. There is also a body of work on fair clustering [45, 46, 47, 48], which ensures that the clustering assignments are balanced with respect to some sensitive attribute.

Density ratio estimation using classifiers. The use of classifiers for estimating density ratios has a rich history across ML [12]. For deep generative modeling, density ratios estimated by classifiers have been used for expanding the class of various learning objectives [49, 13, 50], for evaluation metrics based on two-sample tests [51, 52, 53, 54, 55, 56, 57], and for improved Monte Carlo inference via these models [58, 59, 60, 61]. [58] use importance reweighting for alleviating model bias between $p_{\rm data}$ and $p_\theta$. Closest to our work is the proposal of [62] to use importance reweighting for learning generative models where training and test distributions differ, but where explicit importance weights are provided for at least a subset of the training examples. We consider a more realistic, weakly-supervised setting where we estimate the importance weights using a small unbiased dataset.

We considered the task of fair data generation given access to a (potentially small) unbiased dataset and a large biased dataset. For data-efficient learning, we proposed an importance-weighted objective that corrects bias by reweighting the biased datapoints. These weights are estimated by a binary classifier. Empirically, we showed that our technique outperforms baselines by 57.1% on average in reducing dataset bias on CelebA without incurring a significant reduction in sample quality.
We stress the need for caution in using our techniques and interpreting the empirical findings. For scaling our evaluation, we relied on a pretrained attribute classifier for inferring the bias in the generated data samples. The classifiers we considered are highly accurate w.r.t. the train/test splits, but can have blind spots especially when evaluated on generated data. For future work, we would like to investigate human evaluations on different bias-mitigation approaches as well [63].
Our work presents an initial foray into the field of fair data generation with weak supervision, highlighting several challenges which also serve as opportunities for future work. As a case in point, our work calls for rethinking sample quality metrics for generative models in the presence of dataset bias. On one hand, our approach increases the diversity of generated samples in the sense that the different subgroups are more balanced; at the same time, however, variation across other image features decreases because the newly generated underrepresented samples are learned from a smaller dataset of underrepresented subgroups.
We proposed evaluation metrics in this work based on absolute measures of sample quality w.r.t. $p_{\rm data}$, as well as relative discrepancies in sample quality and protected attributes respectively. We duly acknowledge imperfections in these metrics. Besides requiring additional supervision for evaluation (e.g., an accurate gender classifier), it is possible for simple tricks to 'game' these metrics when viewed in isolation. For example, current metrics such as FID show limited use in evaluating the mitigation of bias, as they prefer models trained on larger datasets without any bias correction, to avoid even slight compromises in perceptual sample quality. This suggests that such metrics may be unsuitable for model selection in high-stakes scenarios where fairness across subgroups is crucial.
In summary, the need for better unsupervised metrics for evaluating fairness is an open and critical direction for future work. Finally, it would be interesting to explore whether even weaker forms of supervision would be possible for this task, e.g., when the biased dataset has a somewhat disjoint but related support from the small, unbiased dataset – this would be highly reflective of the diverse data sources used for training many current and upcoming large-scale ML systems [64].
Domain Adaptation in Computer Vision Applications, pages 37–55. Springer, 2017.
MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Automatic differentiation in PyTorch. 2017.
Proof of Theorem 1. Since the conditionals $p(\mathbf{x} \mid \mathbf{u})$ have disjoint supports for different values of $\mathbf{u}$, we know that for every $\mathbf{x}$ there exists a deterministic mapping $M$ such that $M(\mathbf{x}) = \mathbf{u}$. Further, for all $\mathbf{x}$ with $\mathbf{u} = M(\mathbf{x})$:
$$p_{\rm data}(\mathbf{x}) = p_{\rm data}(\mathbf{x} \mid \mathbf{u})\,p_{\rm data}(\mathbf{u}) \tag{9}$$
$$p_{\rm bias}(\mathbf{x}) = p_{\rm bias}(\mathbf{x} \mid \mathbf{u})\,p_{\rm bias}(\mathbf{u}) \tag{10}$$

Combining Eqs. (9) and (10) above with the assumption in Eq. (7), we can simplify the density ratio as:

$$w(\mathbf{x}) = \frac{p_{\rm data}(\mathbf{x})}{p_{\rm bias}(\mathbf{x})} \tag{11}$$
$$= \frac{p_{\rm data}(\mathbf{x} \mid \mathbf{u})\,p_{\rm data}(\mathbf{u})}{p_{\rm bias}(\mathbf{x} \mid \mathbf{u})\,p_{\rm bias}(\mathbf{u})} \tag{12}$$
$$= \frac{p_{\rm data}(\mathbf{u})}{p_{\rm bias}(\mathbf{u})} \tag{13}$$
$$= b(\mathbf{u}) \tag{14}$$
From Eq. (5) and Eq. (11), the Bayes optimal classifier can hence be expressed as:

$$c^\ast(Y = 1 \mid \mathbf{x}) = \frac{b(\mathbf{u})}{b(\mathbf{u}) + \gamma}, \quad \mathbf{u} = M(\mathbf{x}) \tag{15}$$
The optimal cross-entropy loss of a binary classifier for density ratio estimation (DRE) can then be expressed as:

$$\mathrm{NCE}(c^\ast) = p(Y{=}1)\,\mathbb{E}_{p_{\rm data}}\left[\log c^\ast(Y{=}1 \mid \mathbf{x})\right] + p(Y{=}0)\,\mathbb{E}_{p_{\rm bias}}\left[\log c^\ast(Y{=}0 \mid \mathbf{x})\right] \tag{16}$$
$$= \frac{1}{1+\gamma} \sum_{\mathbf{u}} p_{\rm data}(\mathbf{u}) \log \frac{b(\mathbf{u})}{b(\mathbf{u}) + \gamma} + \frac{\gamma}{1+\gamma} \sum_{\mathbf{u}} p_{\rm bias}(\mathbf{u}) \log \frac{\gamma}{b(\mathbf{u}) + \gamma} \tag{17}$$
$$= \frac{1}{1+\gamma} \sum_{\mathbf{u}} p_{\rm bias}(\mathbf{u})\,b(\mathbf{u}) \log \frac{b(\mathbf{u})}{b(\mathbf{u}) + \gamma} + \frac{\gamma}{1+\gamma} \sum_{\mathbf{u}} p_{\rm bias}(\mathbf{u}) \log \frac{\gamma}{b(\mathbf{u}) + \gamma} \tag{18}$$
$$= \sum_{\mathbf{u}} \frac{p_{\rm bias}(\mathbf{u})}{1+\gamma} \left[ b(\mathbf{u}) \log \frac{b(\mathbf{u})}{b(\mathbf{u}) + \gamma} + \gamma \log \frac{\gamma}{b(\mathbf{u}) + \gamma} \right] \tag{19}$$
∎
We used PyTorch [65] for all our experiments. Our overall experimental framework involved three different kinds of models which we describe below.
We use the same architecture and hyperparameters for both the single- and multi-attribute classifiers. Both are variants of ResNet-18 where the output number of classes correspond to the dataset split (e.g. 2 classes for single-attribute, 4 classes for the multi-attribute experiment).
We provide the architectural details in Table 2 below:
Name | Component
---|---
conv1 | conv, 64 filters, stride 2
Residual Block 1 | max pool, stride 2
Residual Block 2 |
Residual Block 3 |
Residual Block 4 |
Output Layer | average pool stride 1, fully-connected, softmax
During training, we use a batch size of 64 and the Adam optimizer with learning rate = 0.001. The classifiers learn relatively quickly for both scenarios and we only needed to train for 15 epochs. We used early stopping with the validation set in CelebA to determine the best model to use for downstream evaluation.
We provide the architectural details in Table 3 below:
Name | Component
---|---
conv1 | conv, 64 filters, stride 2
Residual Block 1 | max pool, stride 2
Residual Block 2 |
Residual Block 3 |
Residual Block 4 |
Output Layer | average pool stride 1, fully-connected, softmax
We also use a batch size of 64, the Adam optimizer with learning rate = 0.0001, and a total of 15 epochs to train the density ratio estimate classifier.
We note a few steps we had to take during the training and validation procedure. Because of the imbalance in both (a) the unbalanced/balanced dataset sizes and (b) the gender ratios, we found that a naive training procedure encouraged the classifier to predict all data points as belonging to the biased, unbalanced dataset. To prevent this phenomenon from occurring, two minor modifications were necessary:
We balance the distribution between the two datasets in each minibatch: that is, we ensure that the classifier sees equal numbers of data points from the balanced and unbalanced datasets in each batch. This provides enough signal for the classifier to learn meaningful density ratios, as opposed to a trivial mapping of all points to the larger dataset.
We apply a similar balancing technique when testing against the validation set. However, instead of balancing each minibatch, we weight the contributions of the losses from the two datasets so that each contributes equally: the validation loss averages the mean loss over examples from the balanced dataset (subscript pos) and the mean loss over examples from the unbalanced dataset (subscript neg).
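The reweighting above can be sketched as follows; the equal 1/2 weighting of the two per-dataset mean losses is our reading of the text rather than the paper's exact code, and the toy loss values are illustrative:

```python
def balanced_val_loss(losses_pos, losses_neg):
    """Weight per-example losses so each dataset contributes equally,
    regardless of how many validation examples it has.

    losses_pos: per-example losses on the balanced (reference) dataset
    losses_neg: per-example losses on the unbalanced (biased) dataset
    """
    mean_pos = sum(losses_pos) / len(losses_pos)
    mean_neg = sum(losses_neg) / len(losses_neg)
    return 0.5 * (mean_pos + mean_neg)

# each dataset contributes equally despite 10x more biased examples:
print(round(balanced_val_loss([0.2, 0.4], [0.1] * 20), 6))  # -> 0.2
```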
The architectural details for the Progressive GAN are provided in Table 4 below:
| Generator | Discriminator |
|---|---|
| Noise | Image |
| – | conv, filters, LReLU |
| conv, filters, LReLU | conv, filters, LReLU |
| conv, filters, LReLU | conv, filters, LReLU |
| Upsample | Downsample |
| conv, filters, LReLU | conv, filters, LReLU |
| conv, filters, LReLU | conv, filters, LReLU |
| Upsample | Downsample |
| conv, filters, LReLU | conv, filters, LReLU |
| conv, filters, LReLU | conv, filters, LReLU |
| Upsample | Downsample |
| conv, filters, LReLU | conv, filters, LReLU |
| conv, filters, LReLU | conv, filters, LReLU |
| Upsample | Downsample |
| conv, filters, LReLU | Minibatch Stddev |
| conv, filters, LReLU | conv, filters, LReLU |
| conv, filters | linear |
Each convolutional layer is followed by a leaky ReLU activation. Pixel normalization is applied before each leaky ReLU activation in the generator. Upsampling is performed via nearest-neighbor interpolation, whereas downsampling is performed via mean pooling.
We use the Adam optimizer and train the model using a progressively grown WGAN-GP objective with a gradient penalty. Noise is distributed uniformly on the surface of a hypersphere.
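Sampling noise uniformly on the surface of a hypersphere can be done by normalizing an isotropic Gaussian draw, a standard trick; the dimensionality below is illustrative, not the paper's latent size:

```python
import math
import random

def hypersphere_noise(dim, radius=1.0, rng=random):
    """Sample a point uniformly on the surface of a `dim`-dimensional
    hypersphere by normalizing an isotropic Gaussian sample."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [radius * x / norm for x in g]

z = hypersphere_noise(512)
# the sample lies on the unit sphere: ||z|| == 1 up to float error
print(round(math.sqrt(sum(x * x for x in z)), 6))  # -> 1.0
```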
We construct such dataset splits from the full CelebA training set using the following procedure. We initially fix our dataset size to roughly 135K out of the total 162K examples, based on the total number of females present in the data. Then, for each level of bias, we partition 1/4 of the males and 1/4 of the females into the balanced (reference) dataset to achieve the 50-50 ratio. The remaining examples are used for the biased dataset, where the numbers of males and females are adjusted to match the desired level of bias (e.g., 90-10). Finally, at each level of the unbiased dataset size perc, we discard the appropriate fraction of data points from both the male and female categories in the balanced dataset. For example, for perc = 0.5, we discard half of the females and half of the males from the balanced dataset.
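The counting logic of this procedure can be sketched as follows; the function and variable names are ours, and the numbers in the example are only illustrative of the 135K / 50-50 / bias-level construction described above:

```python
def build_splits(n_total, n_ref, bias, perc=1.0):
    """Return (ref_female, ref_male, bias_female, bias_male) counts.

    n_ref examples form the balanced reference set (50-50 gender split,
    scaled down by perc); the remaining examples form the biased set in
    which a `bias` fraction belongs to the majority gender.
    """
    ref_f = ref_m = int(n_ref * perc) // 2   # 50-50 reference split
    n_bias = n_total - n_ref                 # remainder goes to the biased set
    bias_f = int(n_bias * bias)              # majority gender count
    bias_m = n_bias - bias_f
    return ref_f, ref_m, bias_f, bias_m

# 135K examples, 1/4 of them reserved for the reference split, bias = 0.9:
print(build_splits(135_000, 33_750, 0.9))  # -> (16875, 16875, 91125, 10125)
```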
In Figure 5, we show the calibration curves of the density ratio classifiers for each dataset size across all levels of bias. As is evident from the plots, the classifiers are already well calibrated and did not require any post-training recalibration.
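A calibration curve such as those in Figure 5 can be computed by binning predicted probabilities and comparing each bin's mean prediction to its empirical positive rate. This is a generic sketch rather than the paper's plotting code:

```python
def calibration_curve(probs, labels, n_bins=10):
    """Group predictions into equal-width bins and return, per non-empty
    bin, (mean predicted probability, empirical fraction of positives).

    A well-calibrated classifier yields points close to the diagonal."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    curve = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            frac_pos = sum(y for _, y in b) / len(b)
            curve.append((mean_p, frac_pos))
    return curve

# a confident, well-calibrated toy classifier: low bin all-negative,
# high bin all-positive
print(calibration_curve([0.05] * 10 + [0.95] * 10, [0] * 10 + [1] * 10))
```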
| Model | Optimal | Empirical |
|---|---|---|
| single, bias=0.9 | 70% | 68.6% |
| single, bias=0.8 | 65% | 63.7% |
| multi | 67.6% | 60.4% |
Table 5 shows that our density ratio classifier achieves an accuracy close to the optimal 0-1 accuracy on this dataset.
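The optimal accuracies in the single-attribute rows of Table 5 can be reproduced in closed form: with equal sampling priors over the two datasets, the Bayes-optimal classifier assigns each example to the dataset with the higher gender density, so its accuracy is the prior-weighted sum of the per-gender maxima. The sketch below assumes the 50-50 reference split and equal priors described above:

```python
def optimal_accuracy(bias):
    """0-1 accuracy of the Bayes classifier distinguishing a 50-50
    reference dataset from a biased dataset whose majority gender has
    fraction `bias`, assuming equal priors over the two datasets."""
    p_ref = (0.5, 0.5)
    p_bias = (bias, 1.0 - bias)
    # pick the dataset with the larger density for each gender
    return 0.5 * sum(max(r, b) for r, b in zip(p_ref, p_bias))

print(round(optimal_accuracy(0.9), 4))  # -> 0.7, matching Table 5
print(round(optimal_accuracy(0.8), 4))  # -> 0.65
```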
We note that although it is difficult to directly compare our model to supervised baselines such as FairGAN [39] and FairnessGAN [40], given the weakly supervised nature of our work, we conduct further evaluations on a relevant downstream classification task adapted to a fairness setting.
In this task, we augment a biased dataset (165K examples) with a "fair" dataset (135K examples) generated by a pre-trained GAN, train a classifier on the combined data, and then evaluate the classifier's performance on a held-out dataset of true examples. We train a conditional GAN using the AC-GAN objective [66], where the conditioning is on an arbitrary downstream attribute of interest (e.g., the "attractiveness" attribute of CelebA, as in [40]). Our goal is to learn a classifier that predicts the attribute of interest in a way that is fair with respect to gender, the sensitive attribute.
As an evaluation metric, we use the demographic parity distance, defined as the absolute difference in positive-prediction rates between the two groups of the sensitive attribute.
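This metric can be computed directly from a classifier's hard predictions and the sensitive-attribute labels; the following is a standard formalization as a sketch, not the paper's exact evaluation code:

```python
def dp_distance(preds, groups):
    """Demographic parity distance: absolute difference in positive
    prediction rates between the two sensitive groups (0 and 1)."""
    rate = {}
    for g in (0, 1):
        group_preds = [p for p, s in zip(preds, groups) if s == g]
        rate[g] = sum(group_preds) / len(group_preds)
    return abs(rate[0] - rate[1])

# group 0 is predicted positive 75% of the time, group 1 only 25%:
print(dp_distance([1, 1, 1, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1]))  # -> 0.5
```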
We consider two AC-GAN variants: (1) equi-weight, trained on the combined balanced and unbalanced data with equal weights on all examples; and (2) imp-weight, which reweights the loss by the density ratio estimates. For both AC-GAN variants, the classifier is trained on both real and generated images, with labels for the generated images given by the attractiveness values they were conditioned on. The classifier is then asked to predict attractiveness on the CelebA test set.
As shown in Table 6, we find that the classifier trained on both real data and synthetic data generated by our imp-weight AC-GAN achieves a much lower demographic parity distance than the equi-weight baseline, demonstrating that our method comes closer to demographic parity with respect to the sensitive attribute, despite the fact that we did not explicitly use gender labels during training.
| Model | Accuracy | NLL | Demographic parity distance |
|---|---|---|---|
| Baseline classifier, no data augmentation | 79% | 0.7964 | 0.038 |
| equi-weight | 79% | 0.7902 | 0.032 |
| imp-weight (ours) | 75% | 0.7564 | 0.002 |
The results for the multi-attribute split based on gender and the presence of black hair are shown in Figure 6. As a caveat, evaluation of the generated data may be affected by imperfections in the multi-attribute classifier: for instance, in Figure 6a, some samples classified as females without black hair appear, on visual inspection, to have darker hair shades. Even in this challenging setup involving two latent bias factors, we find that the importance-weighted approach again outperforms the baselines in mitigating bias in the generated data in almost all cases, while admitting a small overall deterioration in image quality.
Additional samples for other experimental configurations are displayed in the following pages.