In this context, deep neural networks have shown extraordinary efficacy in learning hierarchical representations via backpropagation [BackProp]. However, while learning representations from data allows achieving remarkable results in a broad range of tasks, it comes with a shortcoming: a representation may inherit the intrinsic bias of the dataset used for training.
This is highly undesirable, because it causes a model to generalize poorly in scenarios different from the training one (the so-called “domain shift” issue [NameTheDataset]).
In this paper, we are interested in learning representations that are discriminative for the supervised learning task of interest, while being invariant to certain specified biased attributes of the data. By “biased attribute”, we mean an inherent bias of the dataset, which is assumed to be known and to follow a certain distribution during training. At test time, the distribution of such an attribute may abruptly change, thus hampering the generalization capability of the model and affecting its performance on the given task [zisserman2018, moyer2018neurips, kim2019cvpr].
One intuitive example is provided in Figure 1: we seek to train a shape classifier, but each shape has a distinct color – the biased attribute. Unfortunately, a model can fit the training distribution by discriminating either the color or the shape. Among the two options, we are interested in the latter only, because the former does not allow generalizing to shapes with different colors. Thus, if we were capable of learning a classifier while unlearning the color, we posit that it would better generalize to shapes with arbitrary colors. Like other prior works [VFAE_2016_fairness, moyer2018neurips, kim2019cvpr, zisserman2018], we operate in a scenario where the labels of biased attributes are assumed to be known. An application domain in which the hypothesis of known bias labels holds is algorithmic fairness [kleinberg2016inherent, donini2018empirical, zhang_2018_fairness, wang_2019_fairness], where the user specifies which attributes the algorithm has to be invariant to (e.g., learning a face recognition system which is not affected by gender or ethnicity biases).
In this paper, we tackle this problem through the lens of information theory. Since mutual information can quantify the nonlinear dependency of the learned feature space on the dataset bias, we argue that a good strategy to address this problem is minimizing the mutual information between the learned representation and the biased attributes. This results in a data representation that is statistically independent from the specified bias and that, in turn, generalizes better.
Unfortunately, estimating the mutual information is not a trivial problem [poole2019variational]. In the context of representation learning, two bodies of work proposed solutions to the problem of learning unbiased representations via information-theoretic measures: one relies on adversarial training [zisserman2018, kim2019cvpr], the other on variational inference [moyer2018neurips]. Adversarial methods [zisserman2018, kim2019cvpr] learn unbiased representations by “fooling” a classifier trained to predict the attribute from the learned representation. This condition is argued to be a proxy for the minimization of the mutual information [kim2019cvpr]. However, since the mathematical principles that govern adversarial training are still elusive [jin2019local, beyondNash], a key difficulty is how to properly balance learning the task and unlearning the attribute. Better control over this aspect can be achieved through the sound theoretical framework of variational inference, which properly formalizes the prior and the conditional dependences among variables. However, when implementing those methods in practice, approximations must be made to replace the computationally intractable posterior with an auxiliary distribution, at the cost of several independence assumptions among the variables. Moreover, such methods are harder to scale to complex computer vision tasks, and have been applied mostly to synthetic or toy datasets [VFAE_2016_fairness, moyer2018neurips].
Due to the aforementioned difficulties, in this paper, we seek to leverage the mathematical soundness of mutual information as a means to avoid adversarial training. To this end, we devise a computational pipeline that relies on a neural estimator for the mutual information (MINE [belghazi18a]). This module provides a more reliable estimate of the mutual information [poole2019variational], while still being fully differentiable and, therefore, trainable via backpropagation [BackProp]. Endowed with this model, we propose a training scheme where we alternate between (i) optimizing the estimator and (ii) learning a representation that is both discriminative for the desired task and statistically independent from the specified bias. In practice, first, we train a classifier to minimize the discriminative loss for the given task, regularized by the mutual information between the feature representation and the attributes. Second, we update the MINE parameters in order to tailor the mutual information to the current learned representation.
A key strength of the proposed approach is that – in contrast with adversarial methods – the module that estimates the mutual information is not competing with the feature extractor. For this reason, MINE can be trained until convergence at every training step, avoiding the need to carefully balance steps (i) and (ii), and guaranteeing an up-to-date estimate of the mutual information throughout the training process. In adversarial methods such as [kim2019cvpr], where the estimate of the mutual information is modeled via a discriminator that the feature extractor seeks to fool [Ganin, Ganin2], one cannot train an optimal discriminator at every training iteration. Indeed, if one trains an optimal bias discriminator, the feature extractor will no longer be able to fool it, because the gradients become too small [arjovsky2017iclr] – and the adversarial game will not reach optimality. This difference is a key novelty of the proposed computational pipeline, which scores favorably with respect to prior work on different computer vision benchmarks, from color-biased classification to age-invariant recognition of people attributes.
Furthermore, a critical aspect of this line of work [zisserman2018, kim2019cvpr] is how to balance learning the desired task and “unlearning” the dataset bias, which is a core, open issue [zhang_2018_fairness]. The training strategy proposed in this paper allows for a very simple way to govern this problem. Indeed, as we will show in the experimental analysis, a very effective approach is selecting the models whose learned representation has the lowest mutual information with the biased attribute. We empirically show that these models are also the ones that generalize better to unbiased settings. Most notably, this also provides us with a simple cross-validation strategy for the crucial hyper-parameters: without using any validation data, we can select the optimal model as the one that best fits the data while most reducing the mutual information. The importance of this contribution is that, when dealing with biased datasets, the validation set will likely suffer from the same bias, making hyper-parameter selection a thorny problem. Our proposed method properly responds to this problem, whereas former works have not addressed the issue [kim2019cvpr].
2 Related Work
The problem of learning unbiased representations has been explored in several sub-fields. In the following section, we cover the most related literature, with particular focus on works that share our problem formulation, highlighting similarities and differences.
In domain adaptation [Daume2006, Blitzer2006, Saenko2010], the goal is learning representations that generalize well to a (target) domain of interest, for which only unlabeled – or partially labeled – samples are available at training time, leveraging annotations from a different (source) distribution. In domain generalization, the goal is to better generalize to unseen domains, by relying on one or more source distributions [muandet2013icml, li2017iccv]. Adversarial approaches for domain adaptation [Ganin, Ganin2, ADDA, volpi2018cvpr] and domain generalization [shankar2018iclr, Zunino2019] are very related to our work: their goal is indeed learning representations that do not contain the domain bias, and therefore better generalize in out-of-distribution settings. In contrast, in our problem formulation we aim at learning representations that are invariant to specific attributes that are given at training time.
A similar formulation arises in so-called “algorithmic fairness” [kleinberg2016inherent]. The problem here is learning representations that do not rely on sensitive attributes (such as, e.g., gender, age or ethnicity), in order to prevent the model from discriminating against such protected categories. Our method can be applied in this setting, in order to minimize the mutual information between the learned representation and the sensitive attribute (interpreted as a bias). In these settings, it is important to notice that a “fairer” representation does not necessarily generalize better than a standard one: the trade-off between accuracy and fairness is termed the “fairness price” [kleinberg2016inherent, donini2018empirical, zhang_2018_fairness, wang_2019_fairness].
A number of works share our goal and problem formulation. Alvi et al. [zisserman2018] learn unbiased representations through the minimization of a confusion loss, learning a representation that does not inherit information related to specified attributes. Kim et al. [kim2019cvpr], similar to us, propose to minimize the mutual information between learned features and the bias. However, they face the optimization problem through adversarial training: in practice, in their implementation [kim2019cvpr-code], the authors rely on a discriminator trained to detect the bias as an estimator of the mutual information, and learn unbiased representations by trying to fool this module, drawing inspiration from the solution proposed by Ganin and Lempitsky [Ganin] for domain adaptation. Moyer et al. [moyer2018neurips] also introduce a penalty term based on mutual information, to achieve representations that are invariant to some factors. In contrast with related works [zisserman2018, kim2019cvpr], they show that adversarial training is not necessary to minimize such an objective, and approach the problem in terms of variational inference, relying on Variational Auto-Encoders (VAEs [VAE]). Closely related to Moyer et al., other works [VFAE_2016_fairness, Zemel_2019_fairness] impose a prior on the representation and the underlying data generative factors (e.g., feature vectors are distributed as a factorized Gaussian).
Our proposed solution does not fit under the class of adversarial approaches [zisserman2018, kim2019cvpr], nor is it based on VAEs [moyer2018neurips], and provides several advantages over both. With respect to adversarial strategies, our method has the advantage of relying on a module estimating the mutual information [belghazi18a] that is not competing with the network trained to learn an unbiased representation. In our computational pipeline, we do not learn an unbiased representation by “fooling” the estimator, but by minimizing the information that it measures. The difference is subtle, but brings a crucial advantage: in adversarial methods, the discriminator (estimator) cannot be trained until convergence at every training step, otherwise gradients flowing through it would be close to zero almost everywhere in the parameter space [arjovsky2017iclr], preventing the model from learning an unbiased representation. In our case, the estimator can be trained until convergence at every training step, improving the quality of its measure without any drawbacks.
Furthermore, our solution easily scales to large architectures (e.g., for complex computer vision tasks). While this is true also for adversarial methods [zisserman2018, kim2019cvpr], we posit that it might not be the case for methods based on VAEs [moyer2018neurips], where one must simultaneously train a feature extractor/encoder and a decoder.
3 Problem Formulation
We operate in a setting where data are shaped as triplets $(x, y, c)$, where $x$ represents a generic datapoint, $y$ denotes the ground truth label related to a task of interest and $c$ encodes a vector of given attributes. We are interested in learning a representation $z$ of $x$ that allows performing well on the given task, with the constraint of not retaining information related to $c$. In other words, we desire to learn a model that, when fed with $x$, produces a representation $z$ which is maximally discriminative with respect to $y$, while being invariant with respect to $c$.
In this work, we formalize the invariance of $z$ with respect to $c$ through the lens of information theory, imposing a null mutual information. Specifically, we constrain the discriminative training (finalized to learn the task of interest) by imposing $I(Z, C) = 0$, where $Z$ and $C$ are the random variables associated with $z$ and $c$, respectively. In formulæ, we obtain the following constrained optimization

$$\min_{\theta, \phi} \; \mathcal{L}(\theta, \phi) \quad \text{s.t.} \quad I(Z, C) = 0, \tag{1}$$

where $\theta$ and $\phi$ define the two sets of parameters of the objective $\mathcal{L}$, which can be tailored to learn the task of interest. With $\theta$, we refer to the trainable parameters of a module $f_\theta$ that maps a datapoint into the corresponding feature representation (that is, $z = f_\theta(x)$). With $\phi$, we denote the trainable parameters of a classifier $g_\phi$ that predicts $y$ from a feature vector (that is, $\hat{y} = g_\phi(z)$). The constraint does not depend upon $\phi$, but only upon $\theta$, since $Z$ is fully determined by $X$ through $z = f_\theta(x)$.
In order to optimize the objective in (1), we must adopt an estimator of the mutual information. Before detailing our approach, in the following paragraph we cover the background required for a basic understanding of mutual information estimation, with focus on the path we pursue in this work.
Background on information theory. The mutual information between two random variables $X$ and $C$ is given by

$$I(X, C) = \int p(x, c) \log \frac{p(x, c)}{p(x)\,p(c)} \, dx \, dc,$$

where $p(x, c)$ denotes the joint probability of the two variables and $p(x)$, $p(c)$ represent the two marginals. As an alternative to covariance and other linear indicators of statistical dependence, mutual information can account for generic inter-relationships between $X$ and $C$, going beyond simple correlation [cavazza2016kernelized, CAVAZZA201925].
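For discrete variables, the integral above reduces to a double sum, which makes the definition easy to verify numerically. Below is a minimal pure-Python sketch; the 2x2 joint tables are hypothetical toy examples, not data from this work:

```python
from math import log

def mutual_information(joint):
    """I(X, C) = sum_{x,c} p(x,c) * log(p(x,c) / (p(x) p(c))), in nats.

    `joint` is a 2D list: joint[x][c] = p(x, c)."""
    px = [sum(row) for row in joint]          # marginal over x
    pc = [sum(col) for col in zip(*joint)]    # marginal over c
    mi = 0.0
    for x, row in enumerate(joint):
        for c, pxc in enumerate(row):
            if pxc > 0:                       # 0 * log(0) := 0
                mi += pxc * log(pxc / (px[x] * pc[c]))
    return mi

# Independent joint, p(x,c) = p(x) p(c): the mutual information vanishes.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))   # 0.0
# Perfectly dependent variables (X determines C): I = log(2) nats.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))       # log(2) ≈ 0.6931
```

Note how independence gives exactly zero, which is precisely the condition the constraint in this paper enforces on features and attributes.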
The main drawback of mutual information relates to its difficult computation, since the probability distributions $p(x, c)$, $p(x)$ and $p(c)$ are not known in practice. Recently, a general purpose and efficient estimator for mutual information has been proposed by Belghazi et al. [belghazi18a]. They propose a neural-network-based approximation to compute the following lower bound for the mutual information $I(X, C)$:

$$I(X, C) \geq I_\Psi(X, C) = \sup_{\psi \in \Psi} \; \mathbb{E}_{p(x, c)}[T_\psi] - \log\big(\mathbb{E}_{p(x)p(c)}[e^{T_\psi}]\big). \tag{2}$$

By modeling $T_\psi$ as a feed-forward neural network, the maximization in Eq. (2) can be efficiently solved via backpropagation [belghazi18a]. As a result, we can approximate $I(X, C)$ with $I_\Psi(X, C)$, the so-called “Mutual Information Neural Estimator” (MINE [belghazi18a]). An appealing aspect of MINE is its fully differentiable nature, which enables end-to-end optimization of objectives that rely on mutual information computations.
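The bound in Eq. (2) is the Donsker-Varadhan representation of the KL divergence: any function $T$ yields a lower bound, and the optimizer $T^* = \log(p(x,c)/(p(x)p(c)))$ attains the mutual information exactly. A small numeric sanity check on a toy discrete joint (values chosen purely for illustration):

```python
from math import log, exp

# Toy joint over (x, c) in {0,1}^2, and its product of marginals.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {0: 0.5, 1: 0.5}
pc = {0: 0.5, 1: 0.5}
prod = {k: px[k[0]] * pc[k[1]] for k in joint}

def dv_lower_bound(T):
    """Donsker-Varadhan value E_p[T] - log E_q[e^T] for a candidate T."""
    e_p = sum(joint[k] * T[k] for k in joint)
    e_q = sum(prod[k] * exp(T[k]) for k in prod)
    return e_p - log(e_q)

true_mi = sum(p * log(p / prod[k]) for k, p in joint.items())

# Any T gives a lower bound; the optimal T* = log(p/q) attains it exactly.
T_arbitrary = {(0, 0): 0.3, (0, 1): -0.2, (1, 0): 0.1, (1, 1): 0.5}
T_optimal = {k: log(joint[k] / prod[k]) for k in joint}

assert dv_lower_bound(T_arbitrary) <= true_mi
assert abs(dv_lower_bound(T_optimal) - true_mi) < 1e-12
```

MINE replaces the exhaustive search over functions $T$ with gradient ascent over the parameters of a neural network $T_\psi$.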
Endowed with all relevant background, in the following section we detail our approach, which is based on the optimization of a Lagrangian for the objective (1). By relying on MINE [belghazi18a], we can efficiently estimate the mutual information and backpropagate through the different modules, in order to unbias the feature representation which is learnt to solve a given supervised learning task.
4 Method

In the following, we detail how we approach Eq. (1), both in terms of theoretical foundations and practical implementation.
4.1 Optimization Problem
In order to proceed with a more tractable problem, we consider the Lagrangian of Eq. (1)

$$\min_{\theta, \phi} \; \mathcal{L}(\theta, \phi) + \lambda \, I(Z, C), \tag{3}$$

where the first term is a loss associated with the task of interest, whose minimization ensures that the learned representation is sufficient for our purposes. The second term is the mutual information between the learned representation and the given attributes. The hyper-parameter $\lambda$ balances the trade-off between optimizing for a given task and minimizing the mutual information.
Concerning the first term of the objective, we will consider classification tasks throughout this work, and thus we assume that our aim is minimizing the cross-entropy loss between the output of the model and the ground truth $y$:

$$\mathcal{L}_C(\theta, \phi) = -\frac{1}{N} \sum_{i=1}^{N} y_i \cdot \log \sigma\big(g_\phi(f_\theta(x_i))\big), \tag{4}$$

where $\sigma$ is the softmax function and $N$ is the number of given datapoints.
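For reference, the cross-entropy of Eq. (4) can be sketched in a few lines of pure Python; this is a didactic version (the actual experiments use a deep-learning framework):

```python
from math import exp, log

def softmax(logits):
    m = max(logits)                          # subtract max for numerical stability
    exps = [exp(v - m) for v in logits]
    s = sum(exps)
    return [v / s for v in exps]

def cross_entropy(logits_batch, labels):
    """Mean of -log softmax(logits)[y] over a mini-batch, as in Eq. (4)."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        total -= log(softmax(logits)[y])
    return total / len(labels)

# A confident, correct prediction yields a loss close to zero;
# a uniform prediction over K classes yields log(K).
print(cross_entropy([[10.0, 0.0, 0.0]], [0]))  # ~ 0
print(cross_entropy([[0.0, 0.0, 0.0]], [1]))   # log(3) ≈ 1.0986
```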
Concerning the second term of the objective in Eq. (1), as already mentioned, the analytical formulation of the mutual information is of scarce utility to evaluate $I(Z, C)$. Indeed, we do not explicitly know the probability distributions that the learned representation $z$ and the attributes $c$ obey. Therefore, we need an estimator for the mutual information, with the requirement of being differentiable with respect to the model parameters $\theta$.
In order to attain our targeted goal, we take advantage of the work by Belghazi et al. [belghazi18a] (Eq. (2)), and exploit a second neural network $T_\psi$ (the “statistics network”) to estimate the mutual information. We therefore introduce an additional loss function

$$\mathcal{L}_{MI}(\theta, \psi) = \mathbb{E}_{\hat{p}(z, c)}\big[T_\psi(z \frown c)\big] - \log\Big(\mathbb{E}_{\hat{p}(z)\hat{p}(c)}\big[e^{T_\psi(z \frown c)}\big]\Big) \tag{5}$$

that, once maximized, provides an estimate of the mutual information

$$I_{ne}(Z, C) = \max_\psi \; \mathcal{L}_{MI}(\theta, \psi). \tag{6}$$

In Eq. (5), the notation $\hat{p}$ reflects that we rely on the empirical distributions of features and attributes, the operator “$\frown$” indicates vector concatenation and “ne” stands for “neural estimator” [belghazi18a]. The loss $\mathcal{L}_{MI}$ also depends on $\theta$, since Eq. (5) depends on $z = f_\theta(x)$. Combining the pieces together, we obtain the following problem

$$\min_{\theta, \phi} \; \mathcal{L}_C(\theta, \phi) + \lambda \max_\psi \mathcal{L}_{MI}(\theta, \psi). \tag{7}$$
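In practice, the two expectations in Eq. (5) are estimated from mini-batches: joint samples are the actual $(z, c)$ pairs, while samples from the product of marginals are commonly obtained by shuffling the attributes within the batch. A sketch of this pairing step (the function name and the toy batch are our own, not the paper's code):

```python
import random

def joint_and_marginal_pairs(features, attributes, seed=0):
    """Build the two sample sets needed by Eq. (5):
    - joint pairs (z_i, c_i), drawn from the empirical joint;
    - marginal pairs (z_i, c_{pi(i)}), with attributes shuffled within
      the batch to approximate samples from the product of marginals."""
    joint = [z + c for z, c in zip(features, attributes)]   # concatenation
    shuffled = attributes[:]
    random.Random(seed).shuffle(shuffled)
    marginal = [z + c for z, c in zip(features, shuffled)]
    return joint, marginal

z_batch = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # toy 2-D features
c_batch = [[1.0], [0.0], [1.0]]                  # toy 1-D attributes
joint, marginal = joint_and_marginal_pairs(z_batch, c_batch)
assert all(len(pair) == 3 for pair in joint + marginal)  # dim(z) + dim(c)
```

Both sets are then fed through the statistics network to evaluate the two terms of Eq. (5).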
Intuitively, the inner maximization problem ensures a reliable estimate of the mutual information between the learned representation and the attributes. The outer minimization problem is aimed at learning a representation that is at the same time optimal for the given task and unbiased with respect to the attributes.
4.2 Implementation Details
Concerning the modules introduced in Section 3, we implement the feature extractor $f_\theta$ (which computes features $z$ from datapoints $x$) and the classifier $g_\phi$ (which predicts labels from $z$) as feed-forward neural networks. The classifier is implemented as a shallow logit layer to accomplish predictions on the task of interest. As already mentioned, the statistics network $T_\psi$ is also a neural network; it accepts as input the concatenation of feature vectors $z$ and attribute vectors $c$, and through Eq. (5) allows estimating the mutual information between the two random variables. The nature of these modules allows optimizing the objective functions in (7) via backpropagation [BackProp]. Figure 2 portrays the connections between the different elements, and how the losses (4) and (5) originate.
A crucial point that needs to be addressed when jointly optimizing the two terms of Eq. (7) is that, while the distribution of the attributes $c$ is static, the distribution of the feature embeddings $z$ depends on $\theta$, which changes throughout the learning trajectory. For this reason, the mutual information estimator needs to be constantly updated during training, because an estimate associated with the model at step $t$ is no longer reliable at step $t+1$. To cope with this issue, we devise an iterative procedure where, prior to every gradient descent update on $\theta$, we update MINE on the current model, through the inner maximizer in Eq. (7). This guarantees a reliable mutual information estimate.
As already mentioned, one key difference with adversarial methods is that we can train MINE until convergence prior to each gradient descent step on the feature extractor, without the risk of obtaining gradients whose magnitude is close to zero [arjovsky2017iclr], since our estimator is not a discriminator (although, since the mutual information is unbounded, gradient clipping can actually be beneficial [belghazi18a]). The full training procedure is detailed in Algorithm 1.
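The alternation just described can be sketched schematically; the callbacks below are placeholders standing in for one gradient step on Eq. (5) and Eq. (7) respectively, so the structure (not the arithmetic) is what this illustrates:

```python
def train(n_steps, mine_iters, update_mine, update_model):
    """Schematic of the alternating procedure (Algorithm 1): before every
    gradient step on the feature extractor and classifier, the MINE
    statistics network is refreshed on the *current* features.
    `update_mine` / `update_model` are illustrative callbacks."""
    for _ in range(n_steps):
        for _ in range(mine_iters):   # (i) tighten the MI estimate
            update_mine()
        update_model()                # (ii) task loss + lambda * MI estimate

# Count how often each module is updated, to make the schedule explicit.
counters = {"mine": 0, "model": 0}
train(n_steps=5, mine_iters=80,
      update_mine=lambda: counters.__setitem__("mine", counters["mine"] + 1),
      update_model=lambda: counters.__setitem__("model", counters["model"] + 1))
assert counters == {"mine": 400, "model": 5}
```

Unlike a discriminator in an adversarial game, the inner loop can safely run to (near) convergence at every outer step.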
Training techniques. Before discussing our results, we briefly comment below on some techniques that we found to generally increase the stability of the proposed training procedure. While code and hyper-parameters can be found in the Supplementary Material, we believe that the reader can benefit from this discussion.
(a) Although MINE [belghazi18a] can estimate the mutual information between continuous random variables, we observed that the estimation is eased (in terms of speed and stability) if the attribute labels $c$ are discrete. (b) We observed increased stability in training MINE [belghazi18a] for lower-dimensional representations $z$ and attributes $c$. For this reason, as we will discuss in Section 5, feature extractors with a low-dimensional embedding layer are favored in our settings. (c) The feature extractor receives gradients related to both $\mathcal{L}_C$ and $\mathcal{L}_{MI}$: since the mutual information is unbounded, the latter may dominate the former. Following Belghazi et al. [belghazi18a], we overcome this issue via gradient clipping (we refer to the original work for details). (d) We observed that training MINE requires large mini-batches: when this was unfeasible due to memory constraints, we relied on gradient accumulation. (e) We observed that using vanilla gradient descent instead of the Adam optimizer [AdamOptimizer] eases training MINE [belghazi18a] in most of our experiments.
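Points (c) and (d) can be sketched as follows; `clip_by_global_norm` and `accumulate` are our own illustrative helpers on flat gradient lists, not the exact routines used in the experiments:

```python
def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient components so their global L2 norm is at
    most `max_norm`, keeping the unbounded MI term from dominating."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm <= max_norm or norm == 0.0:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

def accumulate(grad_batches):
    """Average gradients over several small batches, emulating the large
    mini-batches that stabilize MINE when memory is limited."""
    n = len(grad_batches)
    return [sum(col) / n for col in zip(*grad_batches)]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)   # norm 5 -> norm 1
assert abs(sum(g * g for g in clipped) ** 0.5 - 1.0) < 1e-12
assert accumulate([[1.0, 2.0], [3.0, 4.0]]) == [2.0, 3.0]
```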
5 Experiments

In the following, we show the effectiveness of models trained via Algorithm 1 in a series of benchmarks. First, we report results related to the setup proposed by Kim et al. [kim2019cvpr] – learning to recognize color-biased digits without relying on color information. Next, we show that our proposed solution can scale to higher-capacity models and more difficult tasks, through the IMDB benchmark [zisserman2018, kim2019cvpr], where the goal is classifying people attributes from images of their faces, without relying on the age bias. Finally, we show that our method can also be applied as-is to learn “fair” classifiers, by training models on the German dataset [german-dataset].
5.1 Digit Recognition
Experimental setup. Following the setting defined by Kim et al. [kim2019cvpr], we consider a digit classification task where each digit, originally from MNIST [MNIST], shows an artificially induced color bias. More specifically, in the training set, digit colors are drawn from Gaussian distributions, whose mean values are different for each class. In the test set, digits show random colors. The benchmark is designed with seven different, equally spaced standard deviation values $\sigma$: the lower the value, the more difficult the task, since the model can fit the training set by recognizing colors instead of shapes, thus poorly generalizing (see Figure 3). To extract the color information (the attribute $c$, recalling notation from Section 3), the maximum pixel value is encoded in a 24-bit binary vector (8 bits per channel). Since the background is always black, the maximum value reflects the digit color.
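The color attribute just described can be computed as follows, under our reading of the 8-bits-per-channel encoding (a sketch, not the paper's code):

```python
def encode_color(r, g, b):
    """Encode an RGB value (the per-channel maxima of an image) as a
    24-dimensional binary attribute vector, 8 bits per channel."""
    bits = []
    for channel in (r, g, b):
        assert 0 <= channel <= 255
        bits.extend(int(x) for x in format(channel, "08b"))
    return bits

c = encode_color(255, 0, 128)
assert len(c) == 24
assert c[:8] == [1] * 8                          # red channel saturated
assert c[8:16] == [0] * 8                        # green channel empty
assert c[16:24] == [1, 0, 0, 0, 0, 0, 0, 0]      # 128 = 0b10000000
```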
Concerning the model, we exploit a convolutional neural network [LeNet] with architecture conv-pool-conv-pool-fc-fc-softmax. The output of the second fully connected layer (our feature $z$) is fed to both the logit layer and MINE (Figure 2). The architecture of the statistics network $T_\psi$ in MINE is a multi-layer perceptron (MLP) with 3 layers. More architectural details can be found in the Supplementary Material. We compare models trained via Algorithm 1 with the solutions proposed by Kim et al. [kim2019cvpr] and Alvi et al. [zisserman2018], averaging across runs and using accuracy as a metric. Before comparing against related work, we discuss how crucial hyper-parameters can be selected in our setting.
Hyper-parameter choice. We discuss in the following the model behavior as we modify $\lambda$, which governs the trade-off between learning a task and minimizing the mutual information between features and attributes.
Figure 4 reports the evolution of the mutual information estimate (left), accuracy on test samples (middle) and accuracy on training samples (right) for models trained with increasing values of $\lambda$ (blue, orange and green), for two values of $\sigma$ (top and bottom, respectively). It can be observed that the mutual information between embeddings and color attributes can be reduced by increasing $\lambda$. Importantly, this results in a significantly higher accuracy on (unbiased) test samples. The importance of this result is twofold: on the one hand, it is a proof of concept of the intuition that lowering the mutual information does help generalizing to unbiased sources; on the other, it provides us with a possible cross-validation strategy to pick a proper value of $\lambda$ (the one that minimizes the mutual information most efficiently). As can be observed in the plots on the right, the training procedure becomes more unstable as we increase $\lambda$. Therefore, in order to select the proper hyper-parameter, we can choose the highest value of $\lambda$ that still allows the model to fit the data (i.e., minimizing $\mathcal{L}_C$) while reducing the mutual information (i.e., minimizing $\mathcal{L}_{MI}$).
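The selection rule just described – among the runs that still fit the training data, pick the one with the lowest estimated mutual information – can be sketched as a simple argmin; the fitting threshold and the run statistics below are hypothetical:

```python
def select_lambda(runs, fit_threshold=0.95):
    """Hypothetical model-selection rule: among runs whose training
    accuracy is still above `fit_threshold`, pick the value of lambda
    with the lowest estimated mutual information.
    `runs` maps lambda -> (train_accuracy, estimated_mi)."""
    fitting = {lam: mi for lam, (acc, mi) in runs.items() if acc >= fit_threshold}
    return min(fitting, key=fitting.get)

runs = {
    0.0: (1.00, 0.80),   # fits, but representation is heavily biased
    0.5: (0.99, 0.10),   # fits, low residual MI  ->  selected
    2.0: (0.70, 0.02),   # lowest MI, but training no longer converges
}
assert select_lambda(runs) == 0.5
```

Note that this rule uses only training statistics, which is what makes it usable when no unbiased validation set is available.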
Another important hyper-parameter is the number of iterations used to train MINE [belghazi18a] prior to each gradient update on the feature extractor ($n$ in Algorithm 1). We observed that, in general, the higher the number of iterations, the better. This was expected, because the estimate of the mutual information becomes more reliable. In the results reported in the following paragraph, we set $n = 80$. We refer to the Supplementary Material for details regarding the other, less critical hyper-parameters.
Table 1. Accuracy (mean ± standard deviation across runs) for increasing values of $\sigma$:

ERM ($\lambda = 0$): 0.476 ± 0.005 | 0.542 ± 0.004 | 0.664 ± 0.007 | 0.720 ± 0.010 | 0.785 ± 0.003 | 0.838 ± 0.002 | 0.870 ± 0.001
Alvi et al. [zisserman2018]
Kim et al. [kim2019cvpr]
Comparison with related work. We report in Table 1 the comparison between our method and related works [kim2019cvpr, zisserman2018]. We observe consistently improved results in all the benchmarks (different $\sigma$'s). We emphasize that our method is more effective as the bias becomes more severe (small $\sigma$'s). It is also important to stress that Kim et al. [kim2019cvpr] do not introduce any strategy to search the hyper-parameters that balance the adversarial game, whereas in this work the hyper-parameter search is efficiently resolved. Furthermore, the authors do not report any statistics for their results (e.g., average and standard deviation across different runs), making a fair comparison difficult.
5.2 IMDB: Removing the Age Bias
Experimental setup. Following related works [zisserman2018, kim2019cvpr], we consider the IMDB dataset [imdb_dataset] as benchmark. It contains cropped images of celebrity faces with ground truth annotations of gender and age. Alvi et al. [zisserman2018] consider two subsets of the training set that are severely age-biased: the EB1 (“Extreme Bias”) split only contains images of women aged 0-30 and men older than 40; vice versa, the EB2 split only contains images of men aged 0-30 and women older than 40 (see Figure 3). The test set contains uniformly sampled faces, without any restrictions on age/gender. The goal here is learning an age-agnostic model, to overcome the bias present in the dataset.
Following previous work [zisserman2018, kim2019cvpr], we encode the age attribute (our biased attribute $c$) using bins of 5 years, via one-hot encoding. We use a ResNet-50 [ResNet] model pre-trained on ImageNet [ImageNet] as classifier, modified with a 128-dimensional fully connected layer before the logit layer. This narrower embedding serves as our $z$, and the reduced dimension eases the estimation of the mutual information, while not causing any detrimental effect in terms of accuracy. For each split (EB1 and EB2), we train the model through Algorithm 1 and evaluate it on the test set and on the split not used for training. We followed the same procedure detailed in Section 5.1 to choose the hyper-parameter $\lambda$, obtaining different values for the EB1 and EB2 splits. We compare our results with the ones published by related works [zisserman2018, kim2019cvpr], using accuracy as a metric. We limited the training sets to a subset of the samples: with the whole training sets we observed baselines ($\lambda = 0$) significantly higher than published results [kim2019cvpr], whereas they are comparable for models trained on a subset.
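The 5-year one-hot binning of the age attribute might look as follows; `max_age` is our assumption for the purpose of this sketch (the number of bins is not specified above):

```python
def age_to_onehot(age, bin_width=5, max_age=100):
    """One-hot encode an age into 5-year bins (0-4, 5-9, ...), as used for
    the biased attribute c in the IMDB experiment; ages beyond `max_age`
    are clamped into the last bin."""
    n_bins = max_age // bin_width
    idx = min(age // bin_width, n_bins - 1)
    return [1 if i == idx else 0 for i in range(n_bins)]

v = age_to_onehot(27)
assert len(v) == 20 and sum(v) == 1
assert v.index(1) == 5           # 27 falls in the 25-29 bin
```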
Results. Table 5 reports our results. In all our experiments, we observe accuracy improvements with respect to the baseline ($\lambda = 0$). In general, training on one split and testing on the other is more challenging than testing on the (neutral) test set, as confirmed by the baseline results (ERM, first row). In all the different protocols, our method (last row) outperforms Alvi et al. [zisserman2018], and performs comparably to Kim et al. [kim2019cvpr].
These results confirm that our method can effectively remove biased, detrimental information even when modeling more complex data with higher-capacity models. In this case though, the improvements are more limited than the ones we showed in the digit experiment. One of the reasons might be that age and gender information cannot be decoupled as efficiently as shape and color. In other words, removing age information may not always bring accuracy improvements.
5.3 Learning Fair Representations
Experimental setup. We explored the potential of our method in the context of algorithmic fairness on the popular UCI German dataset [german-dataset]. The dataset is composed of customer descriptions with both categorical and continuous attributes. The binary ground truth label is the risk degree associated with a customer, either good or bad. The goal is to learn a model that predicts the customer rating, with the constraint of removing the information about the customer's age (binarized according to an age threshold). This problem is different from the previous ones: here the invariance towards the sensitive attribute does not imply a better generalization on the test set, as it happens with, e.g., digit recognition. The removal of the protected attribute is done for the sake of obtaining a fair representation [kleinberg2016inherent, donini2018empirical, zhang_2018_fairness, wang_2019_fairness].
Following previous work, we implemented the feature extractor as a single-layer MLP with 64 units in the hidden layer. MINE's statistics network is a shallow network with 64 hidden units. We randomly split the dataset in 70% training samples and 30% test samples, and use accuracy and Equal Opportunity (EO) as comparison metrics, averaging across different runs; EO measures the discrepancy between the true positive rates of the “protected” and “non-protected” populations (here, the binarized age defines the protected group). The goal is to find a balance between reducing EO (i.e., learning a fairer representation) and not observing a too severe decrease in accuracy.
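For reference, the EO metric used here can be computed as below; the toy labels and group assignments are illustrative:

```python
def true_positive_rate(y_true, y_pred):
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positives) / len(positives)

def equal_opportunity_gap(y_true, y_pred, protected):
    """EO as used above: absolute difference between the true positive
    rates of the protected and non-protected groups (lower is fairer)."""
    groups = {0: ([], []), 1: ([], [])}
    for t, p, a in zip(y_true, y_pred, protected):
        groups[a][0].append(t)
        groups[a][1].append(p)
    tpr0 = true_positive_rate(*groups[0])
    tpr1 = true_positive_rate(*groups[1])
    return abs(tpr1 - tpr0)

y_true    = [1, 1, 1, 1, 0, 0]
y_pred    = [1, 0, 1, 1, 0, 1]
protected = [0, 0, 1, 1, 0, 1]
# TPR non-protected = 1/2, TPR protected = 2/2  ->  gap = 0.5
assert equal_opportunity_gap(y_true, y_pred, protected) == 0.5
```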
Results. In the right plots of Table 6, we show how the performance varies when increasing $\lambda$ from 0 (standard Empirical Risk Minimization). It can be observed that our method allows training fairer models (i.e., with reduced EO), while maintaining good performance on the test set. For suitable values of $\lambda$, the fairness price is close to zero (i.e., the accuracy does not decrease), while the fairness is substantially improved. We report the comparison with related works in Table 6 (left). Notice that the FERM method [donini2018empirical] directly optimizes for fairness, while we do not. This experiment is a proof of concept showing that the fairness community might benefit from our approach, although our main goal is bias removal in contexts where it can improve the model's generalization capabilities.
6 Conclusions

We propose a training procedure to learn representations that are not biased towards dataset-specific attributes. We leverage a neural estimator of the mutual information [belghazi18a], devising a method that can be easily implemented in arbitrary architectures, and that relies on a training procedure which is more principled and reliable than adversarial training. When compared with the state of the art [zisserman2018, kim2019cvpr], it shows competitive results, with the advantage of a robust hyper-parameter selection procedure. Moreover, the proposed solution achieves competitive performance even in the fairness setting, where the goal is to find a trade-off between attribute invariance and accuracy.
7 Implementation Details
In the following paragraphs, we provide the implementation details. We carried out all of our experiments using TensorFlow (https://www.tensorflow.org/). Concerning the architectures used, please refer to Figure 7.
To ease the discussion, we can divide the optimization problem presented in our work into the following two sub-problems:

$$\min_{\theta, \phi} \; \mathcal{L}_C(\theta, \phi) + \lambda \, \mathcal{L}_{MI}(\theta, \psi), \tag{8}$$

$$\max_{\psi} \; \mathcal{L}_{MI}(\theta, \psi). \tag{9}$$
Digit experiment. We train our models for 150 epochs. The learning rates for (8) and (9) are both set to the same value. We use Adam [AdamOptimizer] as optimizer for (8) and (9). For each gradient update to optimize (8) with respect to $\theta$ and $\phi$, we update the MINE parameters $\psi$ 80 times. That is, we perform 80 update steps to optimize (9), so as to better train MINE (see Section 8 for a detailed discussion of this choice).
IMDB experiment. For both training splits (EB1 and EB2), we restrict the training set to a subset of the samples. This choice is motivated by the fact that, using the whole training sets, we observed higher baseline results than the ones published in previous art [kim2019cvpr]. We use Adam [AdamOptimizer] as optimizer for (8) and vanilla gradient descent for (9). We found the chosen number of MINE iterations to be sufficient to estimate the mutual information throughout training.
German experiment. We adopted the same settings as previous art using this benchmark [moyer2018neurips]. The available data samples are split into training and test sets (randomly picked in each run). We use Adam [AdamOptimizer] as optimizer for (8), and vanilla gradient descent for (9).
8 Discussion on the Hyper-Parameters
In this section, we discuss the hyper-parameters that we adopted throughout the experiments reported in this work.
Choice of the number of iterations to update MINE. We found that increasing the number of iterations used to estimate the mutual information stabilizes the overall training procedure, as shown in Figure 8. Our intuition is that the better the MINE estimate of the mutual information, the more precise and effective the resulting gradients. The only drawback we observed is the increased computational cost, since training time grows linearly with the number of iterations employed to estimate the mutual information.
Choice of the hyper-parameter λ. The hyper-parameter $\lambda$ regulates the trade-off between minimizing the task loss and reducing the mutual information between the biased attribute and the learned representation in (8). In Section 5 of the paper, we describe how to properly tune it. We report in Figure 9 the complete version of the analysis reported in the manuscript for the digit experiment: the evolution of mutual information, test accuracy and training accuracy for different values of the hyper-parameter $\lambda$, with $\sigma$ fixed to one of its seven values.