1 Introduction
Decomposing data into disjoint independent factors of variation, i.e., learning disentangled representations, is essential for interpretable and controllable machine learning
(bengio2013representation). Recent works have shown that disentangled representations are useful for abstract reasoning (SteenkisteLSB19), fairness (locatello2019fairness; creager2019flexibly) (HigginsPRMBPBBL17), and general predictive performance (locatello2019challenging). While there is no consensus on the definition of disentanglement, existing works define it as learning to separate all factors of variation in the data (bengio2013representation). According to this definition, altering a single underlying factor of variation should affect only a single factor in the learned representation. However, works on learning disentangled representations (higgins2016beta; ChenLGD18; locatello2019challenging) have shown that this setting comes with a trade-off between the precision of the representation and the fidelity of the samples. Therefore, learning precise representations for finer factors, i.e., each individual factor of variation, may not be practical or desirable.

We deviate from this stringent assumption and instead learn group-disentangled representations, in which a group may include several factors of variation that can covary. For instance, groups of interest may be content, style, or background. As a result, a change in one component might affect other variables within the same group, but not variables in other groups.

We present GroupVAE, a VAE-based framework that leverages weak supervision to learn group-disentangled representations. In particular, we use paired observations that always share a group of factors. Existing group-disentanglement approaches (bouchacourt2018multi; hosoya2019group) enforce disentangled group representations by using an average or product of approximate group posteriors. However, as the group representation depends on the observations used for the average or product, observations belonging to the same group may not be encoded to the same latent representation.
We address this inconsistency challenge by incorporating a simple but effective regularization based on the KL divergence. Our idea builds on maximizing the ELBO of the VAE while minimizing the KL divergence between the latent variables that correspond to the group shared by the paired observations.
In summary, we make the following contributions:

We propose a way of learning disentangled representations from paired observations that employs a KL regularization between the corresponding groups of latent variables.

We propose groupMIG, a mutual information-based metric for evaluating the effectiveness of group-disentanglement methods.

Through extensive evaluation, we show the effectiveness of GroupVAE on a wide range of applications. Our evaluation shows significant improvements for group disentanglement, fair facial attribute classification, and 3D shape-related tasks, including generation, classification, and transfer learning.
2 Background & Notation
VAE.
Consider observations $x$ sampled i.i.d. from a distribution $p(x)$ and latent variables $z$. A VAE learns the joint distribution
$p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$,
where $p_\theta(x \mid z)$ is the likelihood function of observations $x$ given $z$, $\theta$ are the model parameters, and $p(z)$ is the prior of the latent variable $z$. VAEs are trained to maximize the evidence lower bound (ELBO) on the log-likelihood $\log p(x)$. This objective, averaged over the empirical distribution, is given as
$\mathcal{L}_{\text{ELBO}}(x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z)), \quad (1)$
where $q_\phi(z \mid x)$ denotes the learned approximate posterior, $\phi$ the variational parameters, and KL denotes the Kullback-Leibler (KL) divergence. VAEs (KingmaW13) are frequently used for learning disentangled representations and serve as the basis of our approach.
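As a concrete reference for the objective in (1), the following is a minimal NumPy sketch of a single-sample ELBO estimate, assuming a factorized Gaussian posterior, a standard Normal prior, and a Bernoulli decoder; the function names are illustrative, not from the paper's code.

```python
import numpy as np

def gaussian_kl_to_prior(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def bernoulli_log_likelihood(x, x_recon, eps=1e-7):
    # log p(x | z) for a Bernoulli decoder, summed over pixels.
    x_recon = np.clip(x_recon, eps, 1.0 - eps)
    return np.sum(x * np.log(x_recon) + (1.0 - x) * np.log(1.0 - x_recon), axis=-1)

def elbo(x, x_recon, mu, log_var):
    # Single-sample Monte Carlo estimate of (1): reconstruction minus KL term.
    return bernoulli_log_likelihood(x, x_recon) - gaussian_kl_to_prior(mu, log_var)
```

A perfect reconstruction with a posterior matching the prior makes both terms (approximately) vanish, which is the upper limit of this bound for binary data.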
Weakly-supervised group disentanglement.
We assume the observations and the data-generating process can be described by distinct groups. The groups partition the factors of variation into disjoint sets of arbitrary sizes, i.e., each group consists of a non-overlapping set of factors. For example, images of 3D shapes (3dshapes18; samples are shown in Figure 1) can be described through three groups: shape (containing factors such as shape category, shape size, and shape color), background (containing factors such as floor color and wall color), and view. Without loss of generality, we define two groups, content and style (independent of content), to describe the generative and inference process. We assume access to paired observations for training in a weakly-supervised setting. Each pair of observations shares the same group, i.e., in our case either content or style. During inference, the exact values of content and style are unknown; only which group a pair shares is known. For each observation $x$, we define two latent variables: $z_c$ for content and $z_s$ for style. The goal of group-based disentanglement is that the representations for the same group are close to each other to ensure consistency.
3 Learning Group-Disentangled Representations
In the following, we introduce GroupVAE, a deep generative model which learns disentangled representations for each group of factors. For simplicity, we limit the formulation of GroupVAE to two groups, content and style, although GroupVAE can be applied to any number of groups. This section first describes the generative and inference model and then introduces our main contributions: the KL regularization and the inference scheme. We visualize the generative and inference models in Figures 1(b) and 1(c).
Inference and generative model.
Our model uses paired observations in a weakly-supervised setting. We sample $x_1$ from the empirical data distribution and conditionally sample $x_2$ in an i.i.d. manner, so that $x_1$ and $x_2$ belong to the same group $g$, i.e.,
$x_1 \sim p_{\text{data}}(x), \qquad x_2 \sim p_{\text{data}}(x \mid g_{x_2} = g_{x_1}). \quad (2)$
Given $x$, we define two latent variables, $z_c$ as content and $z_s$ as style variables. The data is explained by the generative process
$p_\theta(x, z_c, z_s) = p_\theta(x \mid z_c, z_s)\, p(z_c)\, p(z_s). \quad (3)$
Both $p(z_c)$ and $p(z_s)$ are assumed to be independent of each other, and both priors are Normal distributions with zero mean and diagonal unit variance. $p_\theta(x \mid z_c, z_s)$ is a suitable likelihood function (e.g., a Bernoulli likelihood for binary values or a Gaussian likelihood for continuous values) which is parameterized by a deep neural network. The generative model shown in Figure 1(b) is also known as the decoding part seen in Figure 1(a).

To perform inference, we approximate the true posterior with a factorized approximate posterior that uses a neural network to amortize the variational parameters. We specify the inference model as
$q_\phi(z_c, z_s \mid x) = q_\phi(z_c \mid x)\, q_\phi(z_s \mid x), \quad (4)$
where both approximate posteriors are assumed to be factorized Normal distributions with mean $\mu$ and diagonal covariance. The inference model is visualized as a graphical model in Figure 1(c) and as the encoding part in Figure 1(a). The generative and inference models visualized in Figure 1 apply to the paired observation as well.
VAE objective for paired observations.
Given paired observations $(x_1, x_2)$, the VAE framework maximizes the ELBO
$\mathcal{L}_{\text{ELBO}}(x_1, x_2) = \mathbb{E}_{q_\phi}[\log p_\theta(x_1 \mid z_c^1, z_s^1)] + \mathbb{E}_{q_\phi}[\log p_\theta(x_2 \mid z_c^2, z_s^2)] - \mathrm{KL}(q_\phi(z_c^1, z_s^1 \mid x_1) \,\|\, p(z_c)\,p(z_s)) - \mathrm{KL}(q_\phi(z_c^2, z_s^2 \mid x_2) \,\|\, p(z_c)\,p(z_s)), \quad (5)$
which consists of the reconstruction losses of the observations $x_1$ and $x_2$ (first two terms) and the KL divergences between the approximate posteriors and the priors of the latent variables (third and fourth terms). This is a straightforward application of the original ELBO in (1) to two sets of observations, $x_1$ and $x_2$.
KL regularization for group similarity.
Rather than defining an average representation for groups as in (bouchacourt2018multi; hosoya2019group), we propose to enforce consistency between the latent variables by minimizing the KL divergence between the latent variables $z_g^1$ and $z_g^2$. Here, $g$ denotes the group shared between observations $x_1$ and $x_2$, and $z_g^1$ and $z_g^2$ denote the corresponding group variables, e.g., if $x_1$ and $x_2$ share the group content, then the corresponding latent variables are $z_c^1$ and $z_c^2$. Given paired observations from the same group $g$, our objective is to minimize
$\mathrm{KL}\big(q_\phi(z_g^1 \mid x_1) \,\|\, q_\phi(z_g^2 \mid x_2)\big). \quad (6)$
The KL divergence has analytical solutions for Gaussian and Categorical approximate posteriors and is unaffected by the number of shared observations. The analytical solutions can be found in Appendix A.2.
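For the Gaussian case, the closed-form solution referenced here (spelled out in Appendix A.2) can be sketched in NumPy as follows; the function name is illustrative, and the inputs follow the usual (mean, log-variance) parameterization of diagonal Gaussians.

```python
import numpy as np

def kl_two_gaussians(mu1, log_var1, mu2, log_var2):
    # Closed-form KL( N(mu1, diag(s1^2)) || N(mu2, diag(s2^2)) ), summed over
    # latent dimensions; this is the regularizer of (6) for Gaussian posteriors.
    var1, var2 = np.exp(log_var1), np.exp(log_var2)
    return 0.5 * np.sum(
        log_var2 - log_var1 + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0, axis=-1
    )
```

The divergence is zero exactly when the two posteriors coincide, which is what pushes paired observations of the same group towards identical group representations.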
GroupVAE objective and inference.
Given a paired observation $(x_1, x_2)$ sharing group $g$, we combine the ELBO in (5) and our proposed KL regularization in (6). Our proposed model, GroupVAE, has the following minimization objective:
$\mathcal{L}_{\text{GroupVAE}}(x_1, x_2) = -\mathcal{L}_{\text{ELBO}}(x_1, x_2) + \lambda\, \mathrm{KL}\big(q_\phi(z_g^1 \mid x_1) \,\|\, q_\phi(z_g^2 \mid x_2)\big), \quad (7)$
where we treat the degree of regularization $\lambda$ as a hyperparameter.

We propose an alternating inference strategy to encourage variation in both latent variables. If we only utilize observations that belong to one group, e.g., paired observations that always share content, we can obtain a trivial solution for the content latent variable by encoding constant latent variables. We overcome this collapse by alternating the group that the observations belong to during training. In particular, during inference we randomly sample a group $g$ and a paired observation according to group $g$. We then minimize the KL divergence of the corresponding latent variable. The inference's pseudo-code is shown in Algorithm 1.

3.1 Related Work
Unsupervised learning of disentangled representations.
Various regularization methods for unsupervised disentangled representation learning have been presented in existing works (higgins2016beta; kim2018disentangling; ChenLGD18). Even though unsupervised methods have shown promising results in learning disentangled representations, locatello2019challenging showed in a rigorous study that it is impossible to disentangle factors of variation without any supervision or inductive bias. Since then, there has been a shift towards weakly-supervised disentanglement learning. Our work follows this stream and focuses on the weakly-supervised regime instead of an unsupervised one.
Weakly-supervised learning of disentangled representations.
shu2019weakly investigated different types of weak supervision and provided a theoretical framework to evaluate disentangled representations. locatello2020weakly proposed to disentangle groups of variations knowing only the number of shared groups, which can be considered a complementary component to our method. Similar to ours, both these works follow a weakly-supervised setup. However, both approaches focus on the disentanglement of fine-grained factors, whereas our focus is to disentangle groups. Before the concept of paired observations was coined by shu2019weakly as "match pairing", it was already used for geometry and appearance disentanglement (KossaifiTPP18; tran2019disentangling) and group-based disentanglement (bouchacourt2018multi; hosoya2019group). Closest to our work are MLVAE (bouchacourt2018multi) and GVAE (hosoya2019group). For group-disentangled representations, MLVAE uses a product of approximate posteriors, whereas GVAE uses an empirical average of the parameters of the approximate posteriors. A thorough analysis of both works is in Appendix B. In contrast, we employ a simple and effective KL regularization that has no dependency on the batch size.
Alignment between factors of variation and learned representations.
Closely related to our work and group-based disentanglement concepts are studies that learn specific latent variables corresponding to one or several factors of variation (or labels). Dupont18 used both continuous and discrete latent variables to improve unsupervised disentanglement of mixed-type latent factors. creager2019flexibly proposed to minimize the mutual information between the sensitive latent variable and sensitive labels. Similarly, KlysSZ18 proposed to minimize the mutual information between the latent variable and a conditional subspace. Both works (creager2019flexibly; KlysSZ18) require either supervision, sensitive labels, or conditions to estimate the mutual information, whereas we only use weak supervision for learning disentangled group representations. Concurrent to our work, sinha2021consistency proposed to use a KL regularization for learning a VAE whose representation is consistent with augmented data. While sinha2021consistency use the KL regularization to enforce the encoding to be consistent with changes in the input, our goal is to split the representation into subspaces that correspond to the different groups of variation.

4 Evaluation
Here, we evaluate GroupVAE and compare it to existing approaches. We show that our approach outperforms existing approaches for group disentanglement and for disentanglement on existing benchmarks. Within the context of evaluating group disentanglement, we propose an MI-based evaluation metric to assess the degree of group disentanglement. Further, we demonstrate that our approach is generic and can be applied to various applications, including fair classification and 3D shape-related tasks (reconstruction, classification, and transfer learning).
4.1 Weakly-supervised group disentanglement
Experimental settings.
We used three standard datasets for disentangled representation learning: 3D Cars (reed2014learning), 3D Shapes (3dshapes18), and dSprites (dsprites17). Despite the fact that these image datasets are synthetic, disentangling the factors of variation remains a difficult and unresolved task (locatello2019challenging; locatello2020weakly). We use MIG (ChenLGD18) and our proposed metric groupMIG to quantitatively evaluate the different approaches. We compare our model, GroupVAE, to unsupervised methods (β-VAE (higgins2016beta) and FactorVAE (kim2018disentangling)) as well as weakly-supervised methods (AdaGVAE (locatello2019challenging), MLVAE (bouchacourt2018multi), and GVAE (hosoya2019group)). For all methods, we ran a hyperparameter sweep varying the regularization strength for five different seeds. We report the median groupMIG and MIG.
groupMIG for evaluating group disentanglement.
The MIG (ChenLGD18) is a commonly used evaluation metric for disentanglement. It measures the normalized difference between the latent dimensions with the highest and second-highest MI values. The higher the MIG, the greater the degree of disentanglement. However, MIG can still be high if the style latent variable disentangles all factors of variation while the content variable collapses to a constant value. An example of such a failure in group disentanglement is shown in Figure 2. Therefore, we introduce groupMIG, a metric based on MIG which addresses this issue and quantitatively estimates the mutual information between groups and their corresponding latent variables. We define groupMIG as
$\text{groupMIG} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{H(g_k)} \Big( I(z_{g_k}; g_k) - \max_{k' \neq k} I(z_{g_{k'}}; g_k) \Big), \quad (8)$
where $K$ is the number of groups, $g_k$ is the $k$-th ground-truth group, and $I(z; g)$ is an empirical estimate of the MI between the continuous latent variable $z$ and $g$. The value of groupMIG is small if the group factors are not represented in the corresponding latent vectors, even if the factor is disentangled within the other variables.
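Assuming the MI estimates between each group latent variable and each group's factors have already been computed, a groupMIG-style score can be sketched as follows. This is a simplification of the paper's description, not the authors' implementation; the matrix layout and normalization are illustrative.

```python
import numpy as np

def group_mig(mi, entropy):
    # mi[a, b]: estimated MI between the latent variable of group a and the
    # ground-truth factors of group b; mi[b, b] is the "corresponding" latent.
    # entropy[b]: entropy of group b, normalizing each gap.
    K = mi.shape[0]
    scores = []
    for b in range(K):
        others = [a for a in range(K) if a != b]
        scores.append((mi[b, b] - mi[others, b].max()) / entropy[b])
    return float(np.mean(scores))
```

A collapsed content variable yields a small (or negative) score even when the style variable captures every factor, which is exactly the failure mode plain MIG misses.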
Group labeling.
We define the following groups based on the fine-grained factors of each dataset:

dSprites:

3D Shapes:

3D Cars:
Results.
We consistently outperform weakly-supervised disentanglement models w.r.t. median groupMIG over the hyperparameter sweeps with five different seeds by at least 25% (3D Shapes). Further, we also improve on disentanglement w.r.t. MIG for two out of three datasets (3D Cars, dSprites). In addition, we show interpolation samples of MLVAE, GVAE, and GroupVAE (we selected the models with median groupMIG over the hyperparameter sweeps with five different seeds) for 3D Shapes in Figure 3. Both MLVAE and GVAE are not able to capture azimuth in the latent representations. Moreover, GVAE encodes almost all factors into the style part and collapses to a constant representation in the content part. The interpolations of GroupVAE show content and style disentanglement, although some factors such as object size and type for 3D Shapes remain entangled. As we assume that factors in a group can covary, this result is expected, since object size and type are in the same group.

Type | Model | 3D Cars groupMIG | 3D Cars MIG | 3D Shapes groupMIG | 3D Shapes MIG | dSprites groupMIG | dSprites MIG
unsup. | β-VAE | – | 0.08 | – | 0.22 | – | 0.10
unsup. | FactorVAE | – | 0.10 | – | 0.27 | – | 0.14
weakly-sup. | AdaGVAE | – | 0.15 | – | 0.56 | – | 0.26
weakly-sup. | MLVAE | 0.24 | 0.07 | 0.47 | 0.32 | 0.11 | 0.22
weakly-sup. | GVAE | 0.27 | 0.08 | 0.45 | 0.31 | 0.14 | 0.21
weakly-sup. | GroupVAE (ours) | 0.48 | 0.18 | 0.60 | 0.31 | 0.54 | 0.27



4.2 Application to fair classification



We report test accuracy and DP for each sensitive attribute, averaged over five experiments. We report the standard error for all test accuracies but leave out the standard error for all DP results, as they were negligible. We highlight the best results in bold. The column "Fair learning" refers to whether a model uses any supervision during the fair representation learning phase. For the final classification, all models use full supervision.

We examine the problem of learning fair representations for classification as an application of our method. In particular, we want to learn fair group representations in which members of any (demographic) group have an equal probability of being assigned to the positive predicted class. Deep learning algorithms have been shown to be biased against specific demographic groups or populations (mehrabi2021survey). It is critical that classification models produce accurate predictions without discriminating against certain groups in high-stakes and safety-related applications. In this context, we propose to learn fair representations by learning two distinct groups of representations: a predictive representation for evaluating the downstream task and a representation that accounts for the sensitive factors, e.g., gender- or age-specific attributes. The latter representation is solely utilized for training and not for downstream tasks.

Learning fair representations consists of a two-step optimization scheme. First, we train GroupVAE with pairs of observations sharing either sensitive or non-sensitive attributes. Second, we train a simple MLP for attribute classification using the non-sensitive mean representation. We measure classification accuracy and demographic parity (DP). DP measures whether the predictive outcome is independent of a sensitive attribute. A completely fair model would attain a DP value of 0.0, whereas a biased model can have a DP of up to 1.0. We compare against MLP and CNN baselines, and against FFVAE (creager2019flexibly), which learns fair representations by using a supervised loss on the sensitive attributes and a total correlation loss.

We used two datasets: dSpritesUnfair (creager2019flexibly; trauble2020independence) and CelebA (liu2015deep). dSpritesUnfair is a modified image dataset based on dSprites with binarized factors of variation, sampled such that shape and x-position are highly correlated. CelebA is an image dataset of celebrity faces with 40 binary attribute labels; we predict "bald" and "attractive" in two separate experiments. For predicting "bald", we use the attributes "Male" and "Young" as sensitive attributes, whereas for predicting "attractive" we use the attributes "BigNose", "HeavyMakeup", "Male", and "WearingLipstick". We argue that these attributes have a weak correlation with each other but a strong correlation with the predictive attribute.
However, several CelebA attributes correlate significantly, making this a difficult dataset for fairness classification. We refer to Appendix C.2 for the detailed experimental settings.

Results.
We report the fair classification results in Table 2. Overall, the results show that weakly-supervised fair representation learning (GroupVAE) outperforms supervised fair representation learning (FFVAE). Further, we are either competitive with or even outperform the supervised baselines (MLP, CNN). Surprisingly, when evaluating dSpritesUnfair, the demographic parity for all models is relatively low, and the strong correlation between shape and x-position does not seem to affect the classification. The test accuracies and DPs of the sensitive attributes of all the competitive models are very close to each other. Nevertheless, among all models, our method achieves the highest test accuracy and lowest DP. For predicting "bald" in CelebA, even though both the MLP and CNN baselines achieve high test accuracy, the DP shows an extremely biased classification towards gender-specific and male-specific attributes. In contrast, our GroupVAE achieves the lowest DP while still attaining competitive classification accuracy, i.e., the second-highest test accuracy after the CNN. When predicting "attractive", GroupVAE decreases the bias for all sensitive attributes and increases the test accuracy compared to all other models.
4.3 Application to 3D point cloud tasks
In addition to evaluating image datasets, we show experiments on 3D point clouds for reconstruction and classification. We experimented with FoldingNet (yang2018foldingnet), a deep autoencoder that learns to reconstruct 3D point clouds in an unsupervised way. Unlike VAEs, the FoldingNet autoencoder is deterministic and does not optimize the representation to be a probability distribution. Instead of converting the autoencoder into a VAE, we use a similar approach to ghosh2019variational: we assume the embedding of the autoencoder to be Normally distributed with constant variance. Under this assumption, the KL divergence between the corresponding embeddings reduces to a simple L2 regularization, and we can inject noise to regularize the decoding. We evaluate three tasks: 3D point cloud reconstruction, classification, and transfer learning. We measure the Chamfer Distance (CD) and the Earth Mover's Distance (EMD) to assess reconstruction quality and report accuracy to assess classification and transfer learning performance. We compare to FoldingNet (unsupervised) and to DGCNN (wang2019dynamic), a supervised dynamic graph-based classification approach. To assess transfer learning capability, we use a linear SVM classifier on the extracted representation. We used two datasets for training: FG3D (liu2021fine) and ShapeNetV2 (chang2015shapenet). FG3D contains 24,730 shapes with annotations of basic categories (Airplane, Car, and Chair) and fine-grained subcategories. ShapeNetV2 contains 51,127 shapes with annotations of 55 categories. For transfer learning, we also use ModelNet40 (wu20153d).
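The reduction mentioned above can be made explicit: when both embeddings are treated as Gaussians with the same constant variance, the variance terms of the closed-form Gaussian KL cancel and only a scaled squared Euclidean distance between the means survives. A minimal sketch (function name illustrative):

```python
import numpy as np

def embedding_consistency_loss(e1, e2, sigma=1.0):
    # Treating two deterministic embeddings as N(e_i, sigma^2 I), the KL
    # divergence between them collapses to a scaled squared distance:
    # KL = ||e1 - e2||^2 / (2 sigma^2), since the (co)variances are identical.
    return np.sum((e1 - e2) ** 2, axis=-1) / (2.0 * sigma ** 2)
```

This is why the group regularizer can be attached to a deterministic autoencoder such as FoldingNet without converting it into a VAE.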
Results.
Table 3(a) shows that weakly-supervised training improves 3D point cloud reconstruction for both FG3D and ShapeNetV2. Table 3(b) shows the classification and transfer learning results. Our approach, GroupFoldingNet, improves point cloud classification compared to the original FoldingNet and is competitive with the supervised approach when training on FG3D. We outperform both supervised and unsupervised transfer learning performance when training on FG3D and evaluating on ShapeNetV2 and ModelNet40. We are competitive with the supervised approach when training on ShapeNetV2 and evaluating on ModelNet40. In particular, the transfer learning performance with FG3D as the training set highlights the capabilities of weakly-supervised group disentanglement, as the model learns from 3D point clouds of three classes and transfers to ShapeNetV2, a large-scale dataset with 55 classes. We also visualize point cloud reconstructions and interpolations of three different classes using our approach in Figure 4. The reconstructions show that our approach is better than FoldingNet at reconstructing finer details. Further, the interpolations show that our approach learns an interpretable representation.




5 Conclusion & discussion
We proposed a simple KL regularization for VAEs to enforce group disentanglement through weak supervision. We empirically showed that our model outperforms existing approaches in group disentanglement. Further, we demonstrated that learning group-disentangled representations improves performance on fair image classification and 3D shape-related tasks (reconstruction, classification, and transfer learning) and is even competitive with supervised approaches.
There are several possible directions for future work. In comparison to unsupervised representation learning, weakly-supervised learning, by definition, requires some weak form of supervision. Although we only need knowledge of whether two observations share a specific group, this limits the approach. Further, we require group labels for the entire dataset for training and evaluation. For real-life applications, datasets may not be fully labeled, and performance may suffer in this setting. Future investigation of group disentanglement in a low-data or "semi" weakly-supervised regime could allow group-disentanglement learning to transfer to large-scale and more realistic settings. Another promising direction is investigating models with more than two groups: even though we chose to focus on applications with two groups in this work, our method can generalize to any number of groups.
Acknowledgments
We thank Hooman Shayani and Tonya Custis for useful discussions and comments on the paper.
References
Appendix A GroupVAE
A.1 Joint Learning of Continuous and Discrete Groups
The generative model defined in the main paper assumes both content and style representations to be Gaussian distributed. However, many data-generating processes rely on discrete factors, which are usually difficult to capture with continuous variables. In these cases, we can define the generative model with a Categorical latent variable for the discrete group (shown here for content) as
$z_c \sim \text{Categorical}(\pi), \quad (9)$
$z_s \sim \mathcal{N}(0, I), \quad (10)$
$x \sim p_\theta(x \mid z_c, z_s). \quad (11)$
For inference, we use a Gumbel-Softmax reparameterization (JangGP17; MaddisonMT17), a continuous distribution on the simplex that can approximate Categorical samples, for the discrete latent variable. Similar to the KL divergence between two Normal distributions, the KL divergence between two Categorical distributions can also be computed in closed form.
A.2 Closed-form Solutions for the KL Regularization
In the case of both $q_\phi(z_g^1 \mid x_1) = \mathcal{N}(\mu_1, \sigma_1^2 I)$ and $q_\phi(z_g^2 \mid x_2) = \mathcal{N}(\mu_2, \sigma_2^2 I)$ being factorized Gaussian distributions, the KL regularization has the analytical solution
$\mathrm{KL} = \sum_{d} \Big( \log \frac{\sigma_{2,d}}{\sigma_{1,d}} + \frac{\sigma_{1,d}^2 + (\mu_{1,d} - \mu_{2,d})^2}{2\,\sigma_{2,d}^2} - \frac{1}{2} \Big). \quad (12)$
In the case of $q_\phi(z_g^1 \mid x_1) = \text{Categorical}(\pi^1)$ and $q_\phi(z_g^2 \mid x_2) = \text{Categorical}(\pi^2)$, the KL has the analytical solution
$\mathrm{KL} = \sum_{k} \pi^1_k \log \frac{\pi^1_k}{\pi^2_k}. \quad (13)$
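The Categorical case in (13) is a one-liner in practice; the sketch below clips probabilities for numerical stability, which is an implementation choice rather than part of the formula.

```python
import numpy as np

def kl_two_categoricals(p, q, eps=1e-12):
    # Closed-form KL( Cat(p) || Cat(q) ) = sum_k p_k * log(p_k / q_k),
    # summed over the categories (last axis); eps avoids log(0).
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)
```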
Appendix B Analysis of Existing Group-Disentanglement Approaches
In this section, we give further details about the content approximate posteriors proposed by Bouchacourt et al. (bouchacourt2018multi) and Hosoya (hosoya2019group). Further, we analyze the proposed approaches and show their limitations.
B.1 MLVAE and GVAE
As described in Section 3, we restrict ourselves to two groups and define corresponding latent variables $z_c^1$ and $z_s^1$ given observation $x_1$ (in similar fashion, we define two latent variables $z_c^2$ and $z_s^2$ for observation $x_2$). However, both works also apply to any number of groups. For paired observations with shared group factor (here, content), the loss objectives for MLVAE (bouchacourt2018multi) and GVAE (hosoya2019group) are
$\mathcal{L}_{\text{MLVAE}} = -\mathbb{E}[\log p_\theta(x_1 \mid z_c, z_s^1)] - \mathbb{E}[\log p_\theta(x_2 \mid z_c, z_s^2)] + \beta\,\mathrm{KL}\big(q_{\text{ML}}(z_c \mid x_1, x_2)\, q_\phi(z_s^1 \mid x_1) \,\|\, p(z_c)\,p(z_s)\big) + \beta\,\mathrm{KL}\big(q_{\text{ML}}(z_c \mid x_1, x_2)\, q_\phi(z_s^2 \mid x_2) \,\|\, p(z_c)\,p(z_s)\big), \quad (14)$
$\mathcal{L}_{\text{GVAE}} = -\mathbb{E}[\log p_\theta(x_1 \mid z_c, z_s^1)] - \mathbb{E}[\log p_\theta(x_2 \mid z_c, z_s^2)] + \beta\,\mathrm{KL}\big(q_{\text{G}}(z_c \mid x_1, x_2)\, q_\phi(z_s^1 \mid x_1) \,\|\, p(z_c)\,p(z_s)\big) + \beta\,\mathrm{KL}\big(q_{\text{G}}(z_c \mid x_1, x_2)\, q_\phi(z_s^2 \mid x_2) \,\|\, p(z_c)\,p(z_s)\big). \quad (15)$
The loss objectives $\mathcal{L}_{\text{MLVAE}}$ and $\mathcal{L}_{\text{GVAE}}$ are very similar. The only exceptions are the group approximate posteriors, $q_{\text{ML}}$ for MLVAE and $q_{\text{G}}$ for GVAE.
bouchacourt2018multi assume the group approximate posterior to be a product of the individual approximate posteriors sharing the same group:
$q_{\text{ML}}(z_c \mid x_1, x_2) \propto q_\phi(z_c \mid x_1)\, q_\phi(z_c \mid x_2). \quad (16)$
The product of two or more Normal distributions is Normal distributed, and thus the KL term can still be calculated in closed form.
hosoya2019group uses an empirical average over the parameters of the individual approximate posteriors. The group approximate posterior is defined as
$q_{\text{G}}(z_c \mid x_1, x_2) = \mathcal{N}\big(\tfrac{1}{2}(\mu_1 + \mu_2),\; \tfrac{1}{2}(\sigma_1^2 + \sigma_2^2)\big), \quad (17)$
where $(\mu_i, \sigma_i^2)$ denote the parameters of $q_\phi(z_c \mid x_i)$.
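Both group posteriors have simple closed forms for factorized Gaussians, which can be sketched as follows: a precision-weighted product for MLVAE and parameter averaging for GVAE. Array layouts (observations along the first axis) are illustrative.

```python
import numpy as np

def mlvae_group_posterior(mus, variances):
    # Product of factorized Normal posteriors (MLVAE): the result is Normal
    # with pooled precision and a precision-weighted mean.
    precisions = 1.0 / variances
    var_g = 1.0 / precisions.sum(axis=0)
    mu_g = var_g * (precisions * mus).sum(axis=0)
    return mu_g, var_g

def gvae_group_posterior(mus, variances):
    # Empirical average over the posterior parameters (GVAE).
    return mus.mean(axis=0), variances.mean(axis=0)
```

Note the qualitative difference: for identical inputs the MLVAE product shrinks the variance as the group grows, whereas the GVAE average leaves it unchanged; in both cases the result depends on which observations happen to be in the group, which is the inconsistency our KL regularization avoids.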
B.2 Analysis
Both MLVAE and GVAE enforce disentanglement through the regularization in the last two terms of (14) and (15). This regularization was also used in β-VAE (higgins2016beta), where it controls a trade-off between disentanglement and reconstruction. The two KL terms in (14) and (15) can be decomposed similarly to the ELBO and KL decompositions in hoffman2016elbo and ChenLGD18. We consider the objective in (14) averaged over the empirical distribution $p_{\text{data}}$. Each training sample is denoted by a unique index and treated as a random variable. We simplify and refer to $q(z_c, z_s)$ as the aggregated posterior (hoffman2016elbo). We can decompose the first KL in (14) (the KL of GVAE in (15) decomposes similarly) as
$\mathbb{E}_{p_{\text{data}}}\big[\mathrm{KL}(q(z_c, z_s \mid x) \,\|\, p(z_c)\,p(z_s))\big] = I(x; z_c, z_s) + \mathrm{KL}\big(q(z_c, z_s) \,\|\, q(z_c)\,q(z_s)\big) + \sum_j \mathrm{KL}\big(q(z_j) \,\|\, p(z_j)\big) + \mathrm{TC}(z_c) + \mathrm{TC}(z_s), \quad (18)$
where $\mathrm{TC}(z) = \mathrm{KL}(q(z) \,\|\, \prod_j q(z_j))$ denotes the total correlation. We show the full derivation in Subsection B.3. Minimizing the averaged KL between the content and style latent variables and the prior thus also minimizes the total correlation of the content variables and of the style variables (the last two terms in (18)). The total correlation quantifies the amount of information shared between multiple random variables, i.e., low total correlation indicates high independence between the variables. Even though this objective motivates disentangled content and style representations, the group representation depends on the number of samples used for the averaging. Further, both bouchacourt2018multi and hosoya2019group only average over the content group. There are no structural nor optimization constraints that prevent the style latent variable from encoding all factors of variation.



Sensitivity to group batch size.
MLVAE and GVAE use different types of averaging over group latent variables. In realistic settings, always having a certain number of observations that share the same group variations can be difficult. For instance, when training MLVAE and GVAE on dSprites, the performance and its variance are correlated with the number of shared observations. We visualize these findings in Figure 5(c).
Visualization of collapse.
We visualize such behavior in Figure 5(a) for a GVAE model trained on 3D Shapes with two groups of variations, {object color, object size, object type} and {floor color, wall color, azimuth}. Ideally, the content latent variable contains high mutual information with the first group's factors, and the style latent variable contains high mutual information with the second group's factors. However, most information is captured in the style part, whereas only a little information about object type is contained in the content part.
B.3 KL Decomposition
Here, we show the full derivation for (18). For a given group, the KL decomposes as follows:
(19)  
(20)  
(21)  
(22)  
(23)  
(24) 
where $p_{\text{data}}$ denotes the empirical data distribution.
Appendix C Experimental Setup
C.1 Disentanglement Study
All hyperparameters for optimization and model architectures are listed in Table 4. We compare our approach, GroupVAE, to four different models: β-VAE (higgins2016beta), AdaGVAE (locatello2020weakly), MLVAE (bouchacourt2018multi), and GVAE (hosoya2019group). To fairly compare all models, we used the same architecture and optimization settings for all models and only varied the range of the regularization strength. We ran five experiments for every hyperparameter set with different random seeds. In total, we ran 240 experiments. Each experiment ran on GPU clusters consisting of Nvidia V100 or RTX 6000 GPUs for approximately 2-3 hours.
Datasets and group sampling.
We evaluated our approach on three datasets: 3D Cars (reed2014learning), 3D Shapes (3dshapes18), and dSprites (dsprites17). All datasets contain images with pixels normalized between 0 and 1. For training, we sample a group uniformly from all groups and then sample the paired observation uniformly from all observations which share the same group values as the first observation.
Evaluating disentanglement.
In addition to comparing group disentanglement, we also used MIG (ChenLGD18) to compare the models' ability to disentangle all factors of variation. ChenLGD18 proposed MIG as an unbiased evaluation metric that measures the mutual information between each ground-truth factor and each dimension of the computed representation. The MIG is calculated as the average difference between the highest and second-highest normalized mutual information of each factor. The score is computed as
$\text{MIG} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{H(v_k)} \Big( I(z_{j^{(k)}}; v_k) - \max_{j \neq j^{(k)}} I(z_j; v_k) \Big), \quad (25)$
where $j^{(k)} = \arg\max_j I(z_j; v_k)$ and $K$ is the number of known factors.



C.2 Fairness



We ran five experiments for every hyperparameter set with different random seeds. In total, we ran 550 experiments. Each experiment ran on GPU clusters consisting of Nvidia V100 or RTX 6000 GPUs for approximately 2-3 hours.
Models.
For the fair classification experiments, we used the same common hyperparameters and model architecture as in the disentanglement studies (Table 4 (a) and (b)) for GroupVAE, GVAE, and MLVAE. In addition, we implemented two simple baselines, an MLP and a CNN. The architectures of these two models are described in Table 5. For the supervised fair classification, we implemented FFVAE (creager2019flexibly) with the same encoder and decoder networks as in Table 4 (b) and the FFVAE discriminator as in Table 5. The baselines are trained with a cross-entropy loss between the logits of the network and the binary label "HeavyMakeup". We used different numbers of latent dimensions, which are shown in Table 5 (c).

Sensitive and non-sensitive latent variables.
Similar to the content and style disentanglement setup, we define two groups, sensitive and non-sensitive. GroupVAE can be optimized to learn from weakly-supervised observations sharing either sensitive or non-sensitive group values. FFVAE (creager2019flexibly) can be seen as the supervised approach to learning sensitive and non-sensitive representations. FFVAE maximizes the ELBO objective (reconstruction loss and KL divergence between approximate posterior and prior). In addition, the objective regularizes, in a supervised manner, the discriminative ability of the sensitive latent variable (how well can the model classify sensitive labels from the sensitive latent variable?) and the disentanglement (how well is the sensitive latent variable disentangled from the non-sensitive latent variable?).
Datasets.
For comparability with FFVAE (creager2019flexibly), we used similar dataset settings for CelebA (li2018deep) and dSpritesUnfair. Both datasets contain images with pixels normalized between 0 and 1. We used the predefined train, validation, and test split of CelebA, whereas for dSpritesUnfair we use a random split of 80% train, 5% validation, and 15% test.
dSpritesUnfair.
dSpritesUnfair is a modified version of dSprites (dsprites17). The two modifications are the binarization of the factors of variation and a biased sampling. dSprites contains images which are described by five factors of variation. We binarized the factors of variation following these criteria (creager2019flexibly):

Shape

Scale

Rotation

X-position

Y-position
Similar to trauble2020independence, we enforce a correlation between shape and x-position through biased sampling. In the training set, we sample these two factors from a joint distribution
$p(v_{\text{shape}}, v_{\text{x-pos}}) \propto \exp\!\Big( -\frac{(v_{\text{shape}} - v_{\text{x-pos}})^2}{2\sigma^2} \Big), \quad (26)$
where $\sigma$ determines the strength of the correlation and is set to a fixed value in our experiments. The smaller $\sigma$, the higher the correlation between the two factors.
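The biased sampling for two binarized factors can be sketched as follows, assuming a joint weight of the form exp(-(v1 - v2)^2 / (2 sigma^2)), which matches the described behavior (smaller sigma, stronger correlation); the exact distribution used in our pipeline may differ in implementation details.

```python
import numpy as np

def sample_correlated_pairs(n, sigma, rng):
    # Draw binary (shape, x-position) pairs with probability proportional to
    # exp(-(v1 - v2)^2 / (2 sigma^2)); a small sigma concentrates probability
    # mass on v1 == v2, i.e., on strongly correlated factor values.
    values = np.array([(a, b) for a in (0, 1) for b in (0, 1)])
    weights = np.exp(-((values[:, 0] - values[:, 1]) ** 2) / (2.0 * sigma ** 2))
    idx = rng.choice(len(values), size=n, p=weights / weights.sum())
    return values[idx]
```

With a large sigma, the weights become nearly uniform and the two factors decorrelate, recovering the unbiased dSprites sampling.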
Model selection.
As shown in creager2019flexibly, there is a trade-off between classification accuracy and demographic parity. Thus, model selection based on only one of these metrics compromises the other. We propose to use the difference between the two metrics for model selection. We coin this metric FairGap (FG) and define it as
$\text{FG} = \text{Acc} - \frac{1}{|S|} \sum_{s \in S} \text{DP}_s, \quad (27)$
where $S$ is the set of sensitive attributes. FG is high if accuracy is high and the average demographic parity is low, resulting in a fair classifier. We select the model for the test sets of CelebA and dSpritesUnfair based on the FG of the validation set.
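The FairGap criterion reduces to a single line; a minimal sketch (names illustrative) that would be evaluated per candidate model on the validation set:

```python
import numpy as np

def fair_gap(accuracy, demographic_parities):
    # FairGap: accuracy minus the mean demographic parity over all sensitive
    # attributes; higher values indicate a model that is both accurate and
    # fair, which is what we select for on the validation set.
    return accuracy - float(np.mean(demographic_parities))
```

Selecting the model with the highest FG balances the two metrics instead of optimizing either one in isolation.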