
Learning Unbiased Representations via Mutual Information Backpropagation

We are interested in learning data-driven representations that can generalize well, even when trained on inherently biased data. In particular, we face the case where some attributes (bias) of the data, if learned by the model, can severely compromise its generalization properties. We tackle this problem through the lens of information theory, leveraging recent findings for a differentiable estimation of mutual information. We propose a novel end-to-end optimization strategy, which simultaneously estimates and minimizes the mutual information between the learned representation and the data attributes. When applied to standard benchmarks, our model shows comparable or superior classification performance with respect to state-of-the-art approaches. Moreover, our method is general enough to be applicable to the problem of “algorithmic fairness”, with competitive results.





1 Introduction

The need for proper data representations is ubiquitous in machine learning and computer vision [Bengio2013RLR]. Indeed, given a learning task, the competitiveness of the proposed models crucially depends upon the data representation one relies on. In the last decade, the mainstream strategy for designing feature representations switched from hand-crafting to learning them in a data-driven fashion [collobert2011natural, Alex, szegedy2015going, mnih2015human, he2016deep, huang2017densely, xie2018aggregated]. In this context, deep neural networks have shown an extraordinary efficacy in learning hierarchical representations via backpropagation [BackProp]. However, while learning representations from data allows achieving remarkable results in a broad variety of tasks, it comes with the following shortcoming: a representation may inherit the intrinsic bias of the dataset used for training.

Figure 1: Problem setting. When learning a feature representation from the data itself (top), we may undesirably capture the inherent bias of the dataset (here, exemplified by colors), as opposed to learning the desired patterns (here, represented by shapes). This results in models that poorly generalize when deployed into unbiased scenarios (bottom).

This is highly undesirable, because it leads a model to poorly generalize in scenarios different from the training one (the so-called “domain shift” issue [NameTheDataset]).

In this paper, we are interested in learning representations that are discriminative for the supervised learning task of interest, while being invariant to certain specified biased attributes of the data. By “biased attribute”, we mean an inherent bias of the dataset, which is assumed to be known and follows a certain distribution during training. At test time, the distribution of such an attribute may abruptly change, thus hampering the generalization capability of the model and affecting its performance on the given task [zisserman2018, moyer2018neurips, kim2019cvpr].

One intuitive example is provided in Figure 1: we seek to train a shape classifier, but each shape has a distinct color – the biased attribute. Unfortunately, a model can fit the training distribution by discriminating either the color or the shape. Among the two options, we are interested in the latter only, because the former does not allow generalizing to shapes with different colors. Thus, if we were capable of learning a classifier while unlearning the color, we posit that it would better generalize to shapes with arbitrary colors. Like other prior works [VFAE_2016_fairness, moyer2018neurips, kim2019cvpr, zisserman2018], we operate in a scenario where the labels of the biased attributes are assumed to be known. An example of an application domain in which this hypothesis holds is algorithmic fairness [kleinberg2016inherent, donini2018empirical, zhang_2018_fairness, wang_2019_fairness], where the user specifies which attributes the algorithm has to be invariant to (e.g., learning a face recognition system which is not affected by gender or ethnicity biases).

In this paper, we tackle this problem through the lens of information theory. Since mutual information can be used to quantify the nonlinear dependency of the learned feature space with respect to the dataset bias, we argue that a good strategy to face the aforementioned problem is minimizing the mutual information between the learned representation and the biased attributes. This would result in a data representation that is statistically independent from the specified bias, and that, in turn, would generalize better.

Unfortunately, the estimation of the mutual information is not a trivial problem [poole2019variational]. In the context of representation learning, two bodies of work proposed solutions to the problem of learning unbiased representations via information theoretic measures: one that relies on adversarial training [zisserman2018, kim2019cvpr], and one based on variational inference [moyer2018neurips]. Adversarial methods [zisserman2018, kim2019cvpr] learn unbiased representations by “fooling” a classifier trained to predict the attribute from the learned representation. Such a condition is argued to be a proxy for the minimization of the mutual information [kim2019cvpr]. However, since the mathematical principles that govern adversarial training are still elusive [jin2019local, beyondNash], a key difficulty is how to properly balance learning the task against unlearning the attribute. Better control on this aspect can be achieved through the sound theoretical framework of variational inference, which properly formalizes the prior and the conditional dependences among variables. However, when implementing those methods in practice, approximations must be made to replace the computationally intractable posterior with an auxiliary distribution, at the cost of several independence assumptions among the variables. Moreover, such methods are harder to scale to complex computer vision tasks, and have been applied mostly on synthetic or toy datasets [VFAE_2016_fairness, moyer2018neurips].

Due to the aforementioned difficulties, in this paper, we seek to leverage the mathematical soundness of mutual information as a means to avoid adversarial training. To this end, we devise a computational pipeline that relies on a neural estimator for the mutual information (MINE [belghazi18a]). This module provides a more reliable estimate of the mutual information [poole2019variational], while still being fully differentiable and, therefore, trainable via backpropagation [BackProp]. Endowed with this model, we propose a training scheme where we alternate between (i) optimizing the estimator and (ii) learning a representation that is both discriminative for the desired task and statistically independent from the specified bias. In practice, first, we train a classifier to minimize the discriminative loss for the given task, regularized by the mutual information between the feature representation and the attributes. Second, we update the MINE parameters in order to tailor the mutual information to the current learned representation.

A key and strong aspect of the proposed approach is that – in contrast with adversarial methods – the module that estimates the mutual information is not competing with the feature extractor. For this reason, MINE can be trained until convergence at every training step, avoiding the need to carefully balance between steps (i) and (ii), and guaranteeing an updated estimate of the mutual information throughout the training process. In adversarial methods such as [kim2019cvpr], where the estimate for the mutual information is modeled via a discriminator that the feature extractor seeks to fool [Ganin, Ganin2], one cannot train an optimal discriminator at every training iteration. Indeed, if one trains an optimal bias discriminator, the feature extractor will no longer be able to fool it, due to the fact that gradients will become too small [arjovsky2017iclr] – and the adversarial game will not reach optimality. This difference is a key novelty of the proposed computational pipeline, which scores favorably with respect to prior work on different computer vision benchmarks, from color-biased classification to age-invariant recognition of people attributes.

Furthermore, a critical aspect of this line of work [zisserman2018, kim2019cvpr] is how to balance learning the desired task against “unlearning” the dataset bias, which is a core, open issue [zhang_2018_fairness]. The training strategy proposed in this paper allows for a very simple way to govern this trade-off. Indeed, as we will show later in the experimental analysis, a very effective approach is selecting the models whose learned representation has the lowest mutual information with the biased attribute. We empirically show that these models are also the ones that better generalize to unbiased settings. Most notably, this also provides us with a simple cross-validation strategy for the crucial hyper-parameters: without using any validation data, we can select the optimal model as the one that achieves the best fit to the data, while better minimizing the mutual information. The importance of this contribution is that, when dealing with biased datasets, the validation set will likely suffer from the same bias, making hyper-parameter selection a thorny problem. Our proposed method properly responds to this problem, whereas former works have not addressed the issue [kim2019cvpr].

Paper outline. In Section 2, we discuss the related literature. In Sections 3 and 4, we formalize the problem and describe the proposed method, which is empirically validated in Section 5. Concluding remarks are drawn in Section 6.

2 Related Work

The problem of learning unbiased representations has been explored in several sub-fields. In the following, we cover the most related literature, with particular focus on works that share our problem formulation, highlighting similarities and differences.

In domain adaptation [Daume2006, Blitzer2006, Saenko2010], the goal is learning representations that generalize well to a (target) domain of interest, for which only unlabeled – or partially labeled – samples are available at training time, leveraging annotations from a different (source) distribution. In domain generalization, the goal is to better generalize to unseen domains, by relying on one or more source distributions [muandet2013icml, li2017iccv]. Adversarial approaches for domain adaptation [Ganin, Ganin2, ADDA, volpi2018cvpr] and domain generalization [shankar2018iclr, Zunino2019] are very related to our work: their goal is indeed learning representations that do not contain the domain bias, and therefore better generalize in out-of-distribution settings. In contrast, our problem formulation aims at learning representations that are invariant to specific attributes given at training time.

A similar formulation underlies so-called “algorithmic fairness” [kleinberg2016inherent]. The problem here is learning representations that do not rely on sensitive attributes (such as, e.g., gender, age or ethnicity), in order to prevent the model from discriminating against such protected categories. Our method can be applied in this setting, in order to minimize the mutual information between the learned representation and the sensitive attribute (interpreted as a bias). In these settings, it is important to notice that a “fairer” representation does not necessarily generalize better than a standard one: the trade-off between accuracy and fairness is termed the “fairness price” [kleinberg2016inherent, donini2018empirical, zhang_2018_fairness, wang_2019_fairness].

A number of works share our goal and problem formulation. Alvi et al. [zisserman2018] learn unbiased representations through the minimization of a confusion loss, learning a representation that does not inherit information related to specified attributes. Kim et al. [kim2019cvpr], similar to us, propose to minimize the mutual information between learned features and the bias. However, they face the optimization problem through adversarial training: in practice, in their implementation [kim2019cvpr-code], the authors rely on a discriminator trained to detect the bias as an estimator for the mutual information, and learn unbiased representations by trying to fool this module, drawing inspiration from the solution proposed by Ganin and Lempitsky [Ganin] for domain adaptation. Moyer et al. [moyer2018neurips] also introduce a penalty term based on mutual information, to achieve representations that are invariant to some factors. In contrast with adversarial works [zisserman2018, kim2019cvpr], they show that adversarial training is not necessary to minimize such an objective [moyer2018neurips], and approach the problem in terms of variational inference, relying on Variational Auto-Encoders (VAEs [VAE]). Closely related to Moyer et al., other works [VFAE_2016_fairness, Zemel_2019_fairness] impose a prior on the representation and the underlying data generative factors (e.g., feature vectors distributed as a factorized Gaussian).

Our proposed solution does not fit under the class of adversarial approaches [zisserman2018, kim2019cvpr], nor is it based on VAEs [moyer2018neurips], and it provides several advantages over both. With respect to adversarial strategies, our method has the advantage of relying on a module estimating the mutual information [belghazi18a] that is not competing with the network trained to learn an unbiased representation. In our computational pipeline, we do not learn unbiased representations by “fooling” the estimator, but by minimizing the information that it measures. The difference is subtle, but brings a crucial advantage: in adversarial methods, the discriminator (estimator) cannot be trained until convergence at every training step, otherwise gradients flowing through it would be close to zero almost everywhere in the parameter space [arjovsky2017iclr], preventing the model from learning an unbiased representation. In our case, the estimator can be trained until convergence at every training step, improving the quality of its measure without any drawbacks. Furthermore, our solution can easily scale to large architectures (e.g., for complex computer vision tasks) in a straightforward fashion. While this is true also for adversarial methods [zisserman2018, kim2019cvpr], we posit that it might not be the case for methods based on VAEs [moyer2018neurips], where one has to simultaneously train a feature extractor/encoder and a decoder.

3 Problem Formulation

We operate in a setting where data are shaped as triplets $(x, y, a)$, where $x$ represents a generic datapoint, $y$ denotes the ground truth label related to a task of interest and $a$ encodes a vector of given attributes. We are interested in learning a representation $z$ of $x$ that allows performing well on the given task, with the constraint of not retaining information related to $a$. In other words, we desire to learn a model that, when fed with $x$, produces a representation $z$ which is maximally discriminative with respect to $y$, while being invariant with respect to $a$.

In this work, we formalize the invariance of $z$ with respect to $a$ through the lens of information theory, imposing a null mutual information $I(Z; A) = 0$. Specifically, we constrain the discriminative training (finalized to learn the task of interest) by imposing $I(Z; A) = 0$, where $Z$ and $A$ are the random variables associated with $z$ and $a$, respectively. In formulæ, we obtain the following constrained optimization

$$\min_{\theta, \psi} \; \mathcal{L}(\theta, \psi) \quad \text{s.t.} \quad I(Z; A) = 0, \qquad (1)$$

where $\theta$ and $\psi$ define the two sets of parameters of the objective $\mathcal{L}$, which can be tailored to learn the task of interest. With $\theta$, we refer to the trainable parameters of a module $f_\theta$ that maps a datapoint $x$ into the corresponding feature representation (that is, $z = f_\theta(x)$). With $\psi$, we denote the trainable parameters of a classifier $g_\psi$ that predicts $y$ from a feature vector (that is, $\hat{y} = g_\psi(z)$). The constraint does not depend upon $\psi$, but only upon $\theta$, since $Z$ obeys to $z = f_\theta(x)$ and $A$ is fixed by the data.

In order to optimize the objective in (1), we must adopt an estimator of the mutual information. Before detailing our approach, in the following paragraph we cover the background required for a basic understanding of mutual information estimation, with focus on the path we pursue in this work.

Background on information theory. The mutual information between two random variables $X$ and $Y$ is given by

$$I(X; Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \, \log \frac{p(x, y)}{p(x)\, p(y)} \; dx \, dy,$$

where $p(x, y)$ denotes the joint probability of the two variables and $p(x)$, $p(y)$ represent the two marginals. As an alternative to covariance and other linear indicators of statistical dependence, mutual information can account for generic inter-relationships between $X$ and $Y$, going beyond simple correlation [cavazza2016kernelized, CAVAZZA201925].
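For intuition, when the joint distribution is known and discrete, the integral above reduces to a finite sum that can be evaluated directly. A minimal numpy sketch over a toy joint table (the numbers are illustrative, not from the paper):

import numpy as np

# Toy joint distribution p(x, y) over two binary variables.
# Rows index x, columns index y; entries sum to 1.
p_xy = np.array([[0.40, 0.10],
                 [0.10, 0.40]])

p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y)

# I(X;Y) = sum_{x,y} p(x,y) * log[ p(x,y) / (p(x) p(y)) ]
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(f"I(X;Y) = {mi:.4f} nats")  # positive: the two variables are dependent

In the representation learning setting of this paper, however, the joint density of features and attributes is unknown and continuous, which is exactly the difficulty discussed next.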

The main drawback with mutual information relates to its difficult computation, since the probability distributions $p(x, y)$, $p(x)$ and $p(y)$ are not known in practice. Recently, a general purpose and efficient estimator for mutual information has been proposed by Belghazi et al. [belghazi18a]. They propose a neural network based approximation to compute the following lower bound for the mutual information $I(X; Y)$:

$$I(X; Y) \;\geq\; \sup_{\theta \in \Theta} \; \mathbb{E}_{p(x, y)}\left[T_\theta(x, y)\right] - \log \mathbb{E}_{p(x) p(y)}\left[e^{T_\theta(x, y)}\right]. \qquad (2)$$

When implementing $T_\theta$ as a feed-forward neural network, the maximization in Eq. (2) can be efficiently solved via backpropagation [belghazi18a]. As a result, we can approximate $I(X; Y)$ with its neural estimate, the so-called “Mutual Information Neural Estimator” (MINE [belghazi18a]). An appealing aspect of MINE is its fully differentiable nature, which enables end-to-end optimization of objectives that rely on mutual information computations.
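To make the estimator concrete, here is a minimal PyTorch sketch of the bound in Eq. (2); the batch-shuffling trick for sampling the product of marginals follows Belghazi et al. [belghazi18a], but layer sizes and names are our illustrative choices, not the authors' exact implementation (which used TensorFlow).

import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """A small MLP scoring concatenated (feature, attribute) pairs."""
    def __init__(self, z_dim, a_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=1))

def dv_lower_bound(T, z, a):
    """Donsker-Varadhan estimate of I(Z;A) on one mini-batch (Eq. (2)).

    Joint term: aligned (z_i, a_i) pairs. Marginal term: z paired with a
    shuffled copy of a, approximating samples from p(z)p(a).
    """
    joint = T(z, a).mean()
    a_shuffled = a[torch.randperm(a.size(0))]
    # log E[exp T] over the product of marginals: logsumexp - log(batch size)
    marginal = torch.logsumexp(T(z, a_shuffled).squeeze(-1), dim=0) - math.log(a.size(0))
    return joint - marginal

Maximizing this quantity with respect to the statistics network tightens the bound; since it is also differentiable in the input features z, a feature extractor can later descend on the very same estimate.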

Endowed with all relevant background, in the following section we detail our approach, which is based on the optimization of a Lagrangian for the objective (1). By relying on MINE [belghazi18a], we can efficiently estimate the mutual information and backpropagate through the different modules, in order to unbias the feature representation which is learnt to solve a given supervised learning task.

4 Method

In the following, we detail how we approach Eq. (1), both in terms of theoretical foundations and practical implementation.

4.1 Optimization problem

Figure 2: Model overview. The neural network devised for the given task is the concatenation of the blue module (feature extractor $f_\theta$) and the green module (logit layer $g_\psi$). Solid lines indicate the forward flow, dashed lines indicate gradient backpropagation. The feature extractor takes in input samples $x$ and outputs feature vectors $z$. The logit layer takes in input the feature vectors and outputs predictions $\hat{y}$. To optimize for the given task, these modules can be trained by minimizing the cross-entropy between predictions $\hat{y}$ and labels $y$. The orange module (MINE [belghazi18a]) estimates the mutual information between the feature vectors $z$ and the attributes $a$. To estimate the mutual information, the statistics network processes the concatenation of feature vectors and attributes from the joint distribution and the marginals. Following Belghazi et al. [belghazi18a], we approximate sampling from the marginal by shuffling the batch of attributes ($\tilde{a}$). The estimate of the mutual information is the maximum w.r.t. the statistics network parameters of the output of the orange module.

In order to proceed with a more tractable problem, we consider the Lagrangian of Eq. (1)

$$\min_{\theta, \psi} \; \mathcal{L}(\theta, \psi) + \lambda \, I(Z; A), \qquad (3)$$

where the first term is a loss associated with the task of interest, whose minimization ensures that the learned representation is sufficient for our purposes. The second term is the mutual information between the learned representation and the given attributes. The hyper-parameter $\lambda$ balances the trade-off between optimizing for a given task and minimizing the mutual information.

Concerning the first term of the objective, we will consider classification tasks throughout this work, and thus we assume that our aim is minimizing the cross-entropy loss between the output of the model and the ground truth $y$:

$$\mathcal{L}(\theta, \psi) = -\frac{1}{N} \sum_{i=1}^{N} y_i \cdot \log \sigma\big(g_\psi(f_\theta(x_i))\big), \qquad (4)$$

where $\sigma$ is the softmax function and $N$ is the number of given datapoints.

Concerning the second term of the objective in Eq. (1), as already mentioned, the analytical formulation of the mutual information is of scarce utility to evaluate $I(Z; A)$. Indeed, we do not explicitly know the probability distributions that the learned representation and the attributes obey. Therefore, we need an estimator for the mutual information $I(Z; A)$, with the requirement of being differentiable with respect to the model parameters $\theta$.

In order to attain our targeted goal, we take advantage of the work by Belghazi et al. [belghazi18a] (Eq. (2)), and exploit a second neural network $T_\phi$ (“statistics network”) to estimate the mutual information. We therefore introduce an additional loss function

$$\mathcal{J}(\theta, \phi) = \hat{\mathbb{E}}_{p(z, a)}\left[T_\phi(z \frown a)\right] - \log \hat{\mathbb{E}}_{p(z) p(a)}\left[e^{T_\phi(z \frown a)}\right], \qquad (5)$$

that, once maximized, provides an estimate of the mutual information

$$\hat{I}_{ne}(Z; A) = \max_{\phi} \; \mathcal{J}(\theta, \phi). \qquad (6)$$

In Eq. (5), the hat notation reflects that we rely on the empirical distributions of features and attributes, the operator “$\frown$” indicates vector concatenation and “ne” stands for “neural estimator” [belghazi18a]. The loss $\mathcal{J}$ also depends on $\theta$, since Eq. (5) depends on $z = f_\theta(x)$. Combining the pieces together, we obtain the following problem

$$\min_{\theta, \psi} \; \mathcal{L}(\theta, \psi) + \lambda \max_{\phi} \mathcal{J}(\theta, \phi). \qquad (7)$$
Intuitively, the inner maximization problem ensures a reliable estimate of the mutual information between the learned representation and the attributes. The outer minimization problem is aimed at learning a representation that is at the same time optimal for the given task and unbiased with respect to the attributes.

4.2 Implementation Details

Concerning the modules introduced in Section 3, we implement the feature extractor $f_\theta$ (which computes features $z$ from datapoints $x$) and the classifier $g_\psi$ (which predicts labels $\hat{y}$ from $z$) as feed-forward neural networks. The classifier is implemented as a shallow logit layer to accomplish predictions on the task of interest. As already mentioned, the statistics network $T_\phi$ is also a neural network; it accepts in input the concatenation of feature vectors $z$ and attribute vectors $a$, and through Eq. (5) allows estimating the mutual information between the two random variables. The nature of these modules allows optimizing the objective functions in (7) via backpropagation [BackProp]. Figure 2 portrays the connections between the different elements, and how the losses (4) and (5) originate.
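For concreteness, a minimal PyTorch sketch of the two task modules follows, shaped after the digit experiment of Section 5.1; layer sizes and the input resolution are illustrative assumptions, the exact architectures being reported in Figure 7 of the Supplementary Material.

import torch.nn as nn

class FeatureExtractor(nn.Module):
    """f: conv-pool-conv-pool-fc-fc, as in the digit experiment (Sec. 5.1)."""
    def __init__(self, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),   # infers the flattened size
            nn.Linear(256, z_dim), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)  # feature vector z

class Classifier(nn.Module):
    """g: a shallow logit layer predicting the task label from z."""
    def __init__(self, z_dim=64, n_classes=10):
        super().__init__()
        self.logits = nn.Linear(z_dim, n_classes)

    def forward(self, z):
        return self.logits(z)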

A crucial point that needs to be addressed when jointly optimizing the two terms of Eq. (7) is that, while the distribution of the attributes $a$ is static, the distribution of the feature embeddings $z$ depends on $\theta$, which changes throughout the learning trajectory. For this reason, the mutual information estimator needs to be constantly updated during training, because an estimate associated with the parameters $\theta_t$ at step $t$ is no longer reliable at step $t + 1$. To cope with this issue, we devise an iterative procedure where, prior to every gradient descent update on $(\theta, \psi)$, we update MINE on the current model, through the inner maximization in Eq. (7). This guarantees a reliable mutual information estimate.

1: Input: dataset $\mathcal{D}$, initialized weights $\theta$, $\psi$, $\phi$, learning rates $\eta_1$, $\eta_2$, hyper-parameters $\lambda$, $K$.
2: Output: learned weights $\theta$, $\psi$
3: Initialize: $\theta$, $\psi$, $\phi$
4: for $t = 1, \dots, T$ do
5:     for $k = 1, \dots, K$ do (train MINE)
6:         sample mini-batches from the joint $(z, a)$ and the marginals $(z, \tilde{a})$
7:         evaluate $\mathcal{J}(\theta, \phi)$ (Eq. (5))
8:         update $\phi \leftarrow \phi + \eta_2 \nabla_\phi \mathcal{J}(\theta, \phi)$
9:     sample mini-batches from the joint $(z, a)$ and the marginals $(z, \tilde{a})$
10:    evaluate $\mathcal{L}(\theta, \psi)$ (Eq. (4)) and $\mathcal{J}(\theta, \phi)$ (Eq. (5))
11:    update $(\theta, \psi) \leftarrow (\theta, \psi) - \eta_1 \nabla_{\theta, \psi} \big(\mathcal{L}(\theta, \psi) + \lambda \mathcal{J}(\theta, \phi)\big)$
Algorithm 1 Learning Unbiased Representations

As already mentioned, one key difference with adversarial methods is that we can train MINE until convergence prior to each gradient descent step on the feature extractor, without the risk of obtaining gradients whose magnitude is close to zero [arjovsky2017iclr], since our estimator is not a discriminator (the mutual information being unbounded, gradient clipping is actually sometimes beneficial [belghazi18a]). The full training procedure is detailed in Algorithm 1.
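As a companion to Algorithm 1, the sketch below shows one way to implement the alternation in PyTorch, reusing the StatisticsNetwork, dv_lower_bound, FeatureExtractor and Classifier sketched earlier; the optimizers, number of inner iterations and clipping threshold are placeholders rather than the authors' exact settings.

import torch
import torch.nn.functional as F

def train_step(batch, feat_ext, classifier, mine_T, opt_task, opt_mine,
               lam=1.0, mine_iters=20, clip=1.0):
    x, y, a = batch  # datapoint, task label, bias attribute

    # (i) Fit MINE on the *current* features (in practice, resampling
    # a fresh mini-batch at each inner iteration, cf. Algorithm 1).
    for _ in range(mine_iters):
        with torch.no_grad():
            z = feat_ext(x)  # features frozen while fitting the estimator
        loss_mine = -dv_lower_bound(mine_T, z, a)  # ascent on the DV bound
        opt_mine.zero_grad()
        loss_mine.backward()
        opt_mine.step()

    # (ii) Update feature extractor and classifier on the task loss plus
    # lambda times the mutual information estimate (gradients flow into z).
    z = feat_ext(x)
    task_loss = F.cross_entropy(classifier(z), y)
    mi_est = dv_lower_bound(mine_T, z, a)
    loss = task_loss + lam * mi_est
    opt_task.zero_grad()
    loss.backward()
    # The MI term is unbounded, so clipping its gradients helps (Sec. 4.2).
    torch.nn.utils.clip_grad_norm_(feat_ext.parameters(), clip)
    opt_task.step()
    return task_loss.item(), mi_est.item()

Note how the two players never compete: step (ii) minimizes the very quantity that step (i) estimates, so the estimator can be trained to convergence without starving the feature extractor of gradients.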

Training techniques. Before discussing our results, we briefly comment on some techniques that we found to generally increase the stability of the proposed training procedure. While code and hyper-parameters can be found in the Supplementary Material, we believe that the reader can benefit from the discussion. (a) Although MINE [belghazi18a] can estimate the mutual information between continuous random variables, we observed that the estimation is eased (in terms of speed and stability) if the attribute labels $a$ are discrete. (b) We observed an increased stability in training MINE [belghazi18a] for lower-dimensional representations $z$ and attributes $a$. For this reason, as we will discuss in Section 5, feature extractors with a low-dimensional embedding layer are favored in our settings. (c) The feature extractor receives gradients related to both $\mathcal{L}$ and $\mathcal{J}$: since the mutual information is unbounded, the latter may dominate the former. Following Belghazi et al. [belghazi18a], we overcome this issue via gradient clipping (we refer to the original work for details). (d) We observed that training MINE requires large mini-batches: when this was unfeasible due to memory issues, we relied on gradient accumulation, as sketched below. (e) We observed that using vanilla gradient descent over the Adam optimizer [AdamOptimizer] eases training MINE [belghazi18a] in most of our experiments.

5 Experiments

Figure 3: Left: digit examples for each class from the training set (for one value of σ) and the test set. Right: Women and Men images from the two splits of the training set of the IMDB dataset.

In the following, we show the effectiveness of models trained via Algorithm 1 on a series of benchmarks. First, we report results related to the setup proposed by Kim et al. [kim2019cvpr] – learning to recognize color-biased digits without relying on color information. Next, we show that our proposed solution can scale to higher-capacity models and more difficult tasks, through the IMDB benchmark [zisserman2018, kim2019cvpr], where the goal is classifying people's gender from images of their face, without relying on the age bias. Finally, we show that our method can also be applied as-is to learn “fair” classifiers, by training models on the German dataset [german-dataset].

5.1 Digit Recognition

Experimental setup. Following the setting defined by Kim et al. [kim2019cvpr], we consider a digit classification task where each digit, originally from MNIST [MNIST], shows an artificially induced color bias. More specifically, in the training set, digit colors are drawn from Gaussian distributions, whose mean values are different for each class. In the test set, digits show random colors. The benchmark is designed with seven different, equally spaced standard deviation values σ: the lower the value, the more difficult the task, since the model can fit the training set by recognizing colors instead of shapes, thus poorly generalizing (see Figure 3). To extract the color information (the attribute $a$, recalling notation from Section 3), the maximum pixel value is encoded in a 24-bit binary vector (8 bits per channel). Since the background is always black, the maximum value reflects the digit color.
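As an illustration of this attribute encoding, a small numpy sketch (our reading of the protocol; the exact preprocessing of Kim et al. [kim2019cvpr] may differ in details):

import numpy as np

def color_attribute(img):
    """Encode the digit color of an RGB image (H, W, 3), uint8 in [0, 255],
    as a 24-bit binary vector: 8 bits per channel of the maximum pixel value.
    The background is black, so the per-channel maximum reflects the color."""
    max_rgb = img.reshape(-1, 3).max(axis=0)  # brightest value per channel
    bits = [(int(v) >> i) & 1 for v in max_rgb for i in range(7, -1, -1)]
    return np.array(bits, dtype=np.float32)   # shape: (24,)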

Concerning the model, we exploit a convolutional neural network [LeNet] with architecture conv-pool-conv-pool-fc-fc-softmax. The output of the second fully connected layer is given in input to both the logit layer and MINE (Figure 2). The architecture of the statistics network in MINE is a multi-layer perceptron (MLP) with 3 layers. More architectural details can be found in the Supplementary Material. We compare models trained via Algorithm 1 with the solutions proposed by Kim et al. [kim2019cvpr] and Alvi et al. [zisserman2018], averaging across runs and using accuracy as a metric. Before comparing against related work, we discuss how crucial hyper-parameters can be selected in our setting.

Hyper-parameter choice. We discuss in the following the model behavior as we modify $\lambda$, which governs the trade-off between learning a task and minimizing the mutual information between features and attributes.

Figure 4: Digit experiment – ablation study. Evolution of the mutual information estimate (left), test accuracy (middle) and training accuracy (right) for models trained on digits with two values of σ (top and bottom rows, respectively). Models are trained with Algorithm 1 with $\lambda = 0$ (baseline, blue) and two increasing values of $\lambda$ (orange and green). Increasing the value of the hyper-parameter $\lambda$ allows reducing the mutual information between the learned representation $z$ and the attributes $a$. In turn, models better generalize to unbiased samples (test set). Further plots in the Supplementary Material.

Figure 4 reports the evolution of the mutual information estimate (left), accuracy on test samples (middle) and accuracy on training samples (right) for models trained with increasing values of $\lambda$ (blue, orange and green, respectively), for two values of σ (top and bottom, respectively). It can be observed that the mutual information between embeddings and color attributes can be reduced by increasing $\lambda$. Importantly, this results in a significantly higher accuracy on (unbiased) test samples. The importance of this result is twofold: on the one hand, it is a proof of concept of the intuition that lowering the mutual information does help generalizing to unbiased sources; on the other, it provides us with a possible cross-validation strategy to pick a proper value of $\lambda$ (the one that allows minimizing the mutual information more efficiently). As can be observed in the plots on the right, the training procedure becomes more unstable when we increase $\lambda$. Therefore, in order to select the proper hyper-parameter, we can choose the highest value that allows the model to fit the data (i.e., minimize $\mathcal{L}$) while reducing the mutual information (i.e., minimize the estimate $\hat{I}_{ne}$).

Another important hyper-parameter is the number of iterations used to train MINE [belghazi18a] prior to each gradient update on the feature extractor ($K$ in Algorithm 1). We observed that, in general, the higher the number of iterations the better. This was expected, because the estimate of the mutual information becomes more reliable. In the results proposed in the following paragraph, we set $K = 80$. We refer to the Supplementary Material for details regarding the other, less critical hyper-parameters.

Color variance σ (increasing from left to right)

ERM (λ = 0):                  0.476 ± 0.005 | 0.542 ± 0.004 | 0.664 ± 0.007 | 0.720 ± 0.010 | 0.785 ± 0.003 | 0.838 ± 0.002 | 0.870 ± 0.001
Alvi et al. [zisserman2018]:  0.676 | 0.713 | 0.794 | 0.825 | 0.868 | 0.890 | 0.917
Kim et al. [kim2019cvpr]:     0.818 | 0.882 | 0.911 | 0.929 | 0.936 | 0.954 | 0.955
Ours:

Table 1: Digit experiment – comparison with related work. Experimental results on colored digit classification for different levels of variance (σ) in the color distribution. The first row reports results related to models trained via standard Empirical Risk Minimization (ERM). The second row reports results obtained by Alvi et al. [zisserman2018]. The third row reports the results published by Kim et al. [kim2019cvpr]. The last row reports results achieved with our method.

Comparison with related work. We report in Table 1 the comparison between our method and related works [kim2019cvpr, zisserman2018]. We can observe consistently improved results in all the benchmarks (different σ's). We emphasize that our method is more effective as the bias is more severe (small σ's). It is also important to stress that Kim et al. [kim2019cvpr] do not introduce any strategy to search the hyper-parameters that balance the adversarial game, whereas in this work the hyper-parameter search is efficiently resolved. Furthermore, the authors do not report any statistics around their results (e.g., average and standard deviation across different runs), making a fair comparison difficult.

5.2 IMDB: Removing the Age Bias

Experimental setup. Following related works [zisserman2018, kim2019cvpr], we consider the IMDB dataset [imdb_dataset] as benchmark. It contains cropped images of celebrity faces with ground truth annotations related to gender and age. Alvi et al. [zisserman2018] consider two subsets of the training set that are severely biased with respect to age: the EB1 (“Extreme Bias”) split only contains images of women with an age in the range 0-30, and men who are older than 40; vice versa, the EB2 split only contains images of men with an age in the range 0-30, and women older than 40 (see Figure 3). The test set contains faces sampled uniformly, without any restrictions on age/gender. The goal here is learning an age-agnostic model, to overcome the bias present in the dataset.

Following previous work [zisserman2018, kim2019cvpr], we encode the age attribute (our biased attribute $a$) using bins of 5 years, via one-hot encoding. We use a ResNet-50 [he2016deep] model pre-trained on ImageNet [ImageNet] as classifier, modified with a 128-dimensional fully connected layer before the logit layer. This narrower embedding serves as our feature representation $z$, and the reduced dimension eases the estimation of the mutual information, while not causing any detrimental effect in terms of accuracy. For each split (EB1 and EB2), we train the model through Algorithm 1 and evaluate it on the test set and on the split not used for training. We followed the same procedure detailed in Section 5.1 to choose the hyper-parameter $\lambda$, obtaining different values for the EB1 and EB2 splits; the number of MINE iterations $K$ is reported in the Supplementary Material. We compare our results with the ones published by related works [zisserman2018, kim2019cvpr], using accuracy as a metric. We limited the training sets to a subset of the samples: this choice was due to the fact that with the whole training sets we could observe baselines ($\lambda = 0$) significantly higher than published results [kim2019cvpr], whereas they are comparable for models trained on a subset.

Results. The table in Figure 5 reports our results. In all our experiments, we observe accuracy improvements with respect to the baseline ($\lambda = 0$). In general, training on one split and testing on the other is more challenging than testing on the (neutral) test set, as confirmed by the baseline results (ERM, first row). In all the different protocols, our method (last row) achieves superior performance with respect to Alvi et al. [zisserman2018], and comparable performance with Kim et al. [kim2019cvpr].

These results confirm that our method can effectively remove biased, detrimental information even when modeling more complex data with higher-capacity models. In this case though, the improvements are more limited than the ones we showed in the digit experiment. One of the reasons might be that age and gender information cannot be decoupled as efficiently as shape and color. In other words, removing age information may not always bring accuracy improvements.

Figure 5: IMDB experiment. (Table on the left) Comparison against related work: rows report the ERM baseline (λ = 0), Alvi et al. [zisserman2018], variants of Kim et al. [kim2019cvpr], and our method; columns report accuracy on EB2 and on the test set when training on EB1, and on EB1 and on the test set when training on EB2. The last row reports results obtained with our method, with λ chosen separately for EB1 and EB2. (Plots on the right) Training on EB2: evolution over iterations of the test accuracy for our method (green) and baseline models (λ = 0, blue) (bottom), and of the mutual information estimate, which is closer to zero when using our method (top). Our results were averaged over different runs.

5.3 Learning Fair Representations

Experimental setup. We explored the potential of our method in the context of algorithmic fairness with the popular UCI German dataset [german-dataset]. The dataset is composed of 1,000 samples of customer descriptions with both categorical and continuous attributes. The binary ground truth label is the credit risk associated with a customer, either good or bad. The goal is to learn a model to predict the customer rating, with the constraint of removing the information about the customer age (binarized according to a fixed threshold). This problem is different with respect to the previous ones: here the invariance towards the sensitive attribute does not imply a better generalization on the test set, as it happens with, e.g., digit recognition. The removal of the protected attribute is done for the sake of obtaining a fair representation [kleinberg2016inherent, donini2018empirical, zhang_2018_fairness, wang_2019_fairness].

Figure 6: Fairness experiment – comparison with related work and ablation study. (Table on the left) We compare against results as reported in [mary2019fairness]: columns report SVM [donini2018empirical], FERM [donini2018empirical], NN [mary2019fairness], NN with a fairness regularizer [mary2019fairness] and our method; rows report accuracy (the higher the better) and EO (the lower the better, i.e., the “fairer”). (Plots on the right) The bar plots show how the two considered metrics vary as we modify the hyper-parameter λ: EO (top) is significantly reduced as we set higher values of λ, whereas test accuracy (bottom) is only slightly affected. Our results were averaged across different runs.

Following previous work, we implemented the feature extractor as a single-layer MLP with 64 units in the hidden layer. MINE's statistics network is a shallow network with 64 hidden units. We randomly split the dataset in 70% training samples and 30% test samples, and use accuracy and Equal Opportunity (EO) as comparison metrics, averaging across different runs; Equal Opportunity measures the discrepancy between the true positive rates of the “protected” and “non-protected” populations. The goal is to find a balance between reducing EO (i.e., learning a fairer representation) and not observing a too severe decrease in accuracy.
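The EO metric can be computed directly from model predictions; a minimal sketch (function and variable names are ours):

import numpy as np

def equal_opportunity_gap(y_true, y_pred, protected):
    """|TPR(protected) - TPR(non-protected)|, i.e. the discrepancy between
    true positive rates, computed among positive ground-truth samples.
    y_true, y_pred: binary arrays; protected: boolean array (the age group)."""
    def tpr(mask):
        pos = (y_true == 1) & mask
        return (y_pred[pos] == 1).mean() if pos.any() else 0.0
    return abs(tpr(protected) - tpr(~protected))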

Results. In the right plots of Figure 6, we show how the performance varies when increasing $\lambda$ from $0$ (standard Empirical Risk Minimization) to higher values. It can be observed that our method allows training fairer models (i.e., with reduced EO), while maintaining a good performance on the test set. For a suitable value of $\lambda$, the fairness price is close to zero (i.e., the accuracy does not decrease), while the fairness is substantially improved. We report the comparison with related works in the table of Figure 6 (left). Notice that the FERM method [donini2018empirical] directly optimizes for fairness, while we do not. This experiment is a proof of concept to show that the fairness community might benefit from our approach, although our main goal is bias removal in contexts where it can improve the model's generalization capabilities.

6 Conclusions

We propose a training procedure to learn representations that are not biased towards dataset-specific attributes. We leverage a neural estimator for the mutual information [belghazi18a], devising a method that can be easily implemented in arbitrary architectures, and that relies on a training procedure which is more principled and reliable than adversarial training. When compared with the state of the art [zisserman2018, kim2019cvpr], it shows competitive results, with the advantage of a robust hyper-parameter selection procedure. Moreover, the proposed solution has competitive performance even in the fairness setting, where the goal is to find a trade-off between attribute invariance and accuracy.


7 Implementation Details

In the following paragraphs, we provide the implementation details. We carried out all of our experiments using TensorFlow. Concerning the architectures used, please refer to Figure 7.

To ease the discussion, we can divide the optimization problem presented in our work into the following two sub-problems

$$\min_{\theta, \psi} \; \mathcal{L}(\theta, \psi) + \lambda \, \mathcal{J}(\theta, \phi), \qquad (8)$$

$$\max_{\phi} \; \mathcal{J}(\theta, \phi), \qquad (9)$$

where the learning rates associated to (8) and (9) are $\eta_1$ and $\eta_2$, respectively. We use the same notation of Algorithm 1 (in the paper).

Digit experiment. We train our models for 150 epochs. The learning rates $\eta_1$ and $\eta_2$ are both set to the same value, and we use Adam [AdamOptimizer] as optimizer for both (8) and (9). For each gradient update to optimize (8) with respect to $(\theta, \psi)$, we update the MINE parameters $\phi$ for $K = 80$ steps, i.e., we perform 80 update steps to optimize (9), as to better train MINE (see Section 8 for a detailed discussion around this choice).

IMDB experiment. For both training splits (EB1 and EB2), we restrict the training set to a subset of the samples. This choice is motivated by the fact that, using the whole training sets, we observed higher baseline results than the ones published in previous art [kim2019cvpr]. We use Adam [AdamOptimizer] as optimizer for (8) and vanilla gradient descent for (9). We found a moderate number of MINE iterations $K$ to be sufficient in order to estimate the mutual information throughout training.

German experiment. We adopted the same settings as previous art that uses this benchmark [moyer2018neurips]. The available data samples are split in training and test sets (randomly picked in each run). We use Adam [AdamOptimizer] as optimizer for (8), and vanilla gradient descent for (9). The number of MINE iterations $K$ follows the discussion in Section 8.
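Across the three experiments, the common pattern is one optimizer per sub-problem; a hedged sketch reusing the module names from the earlier sketches (learning-rate values are placeholders):

import torch

# Sub-problem (8): task loss + lambda * J, over feature extractor and classifier.
opt_task = torch.optim.Adam(
    list(feat_ext.parameters()) + list(classifier.parameters()), lr=1e-4)

# Sub-problem (9): maximize J over the statistics network; for IMDB and German,
# vanilla gradient descent was preferred over Adam to stabilize MINE training.
opt_mine = torch.optim.SGD(mine_T.parameters(), lr=1e-2)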


Figure 7: Description of the architectures (classifiers and statistics networks) for the experiments on Digits (left), IMDB (middle) and German (right).

8 Discussion on the Hyper-Parameters

In this section, we discuss the hyper-parameters that we adopted throughout the experiments reported in this work.

Choice of the number of iterations to update MINE. We found that increasing the number of iterations $K$ used to estimate the mutual information stabilizes the overall training procedure, as shown in Figure 8. Our intuition is that the better the estimate of the mutual information provided by MINE, the more precise and effective the resulting gradients. The only drawback we observed is the increased computational cost, since the training time increases linearly with the number of iterations employed to estimate the mutual information.

Choice of the hyper-parameter λ. The hyper-parameter $\lambda$ regulates the trade-off in (8) between minimizing the task loss and reducing the mutual information between the biased attribute and the learned representation. In Section 5 of the paper, we describe how to properly tune it. We report in Figure 9 the complete version of the analysis reported in the manuscript for the Digit experiment: the evolution of mutual information, test accuracy and training accuracy for different values of the hyper-parameter $\lambda$, with σ fixed to each of the considered values.

Figure 8: Training (cross-entropy) loss (left) and training accuracy (right) for different numbers of MINE iterations $K$ on the digit recognition task. An increased number of iterations (blue, orange and green, in increasing order) has the effect of stabilizing the training procedure, allowing the model to minimize the loss function and fit the training data. The charts report the average of 3 runs.


Figure 9: Values of the mutual information (left column), test accuracy (middle column) and training accuracy (right column). We account for the different color biases, modelled by different values of σ (see Section 5 of the paper) and here represented by different rows. It is visible how a decrease in the (estimated) mutual information correlates with improved performance.