
Invariant Representations with Stochastically Quantized Neural Networks

by Mattia Cerrato et al.
University of Mainz

Representation learning algorithms offer the opportunity to learn invariant representations of the input data with regard to nuisance factors. Many authors have leveraged such strategies to learn fair representations, i.e., vectors where information about sensitive attributes is removed. These methods are attractive as they may be interpreted as minimizing the mutual information between a neural layer's activations and a sensitive attribute. However, the theoretical grounding of such methods relies either on the computation of infinitely accurate adversaries or on minimizing a variational upper bound of a mutual information estimate. In this paper, we propose a methodology for direct computation of the mutual information between a neural layer and a sensitive attribute. We employ stochastically-activated binary neural networks, which lets us treat neurons as random variables. We are then able to compute (not bound) the mutual information between a layer and a sensitive attribute and use this information as a regularization factor during gradient descent. We show that this method compares favorably with the state of the art in fair representation learning and that the learned representations display a higher level of invariance compared to full-precision neural networks.





1. Introduction

Representation learning algorithms based on neural networks are being employed extensively in information retrieval and data mining applications. The social impact of what the general public refers to as “AI” is now a topic of much discussion, with regulators in the EU even putting forward legal proposals which would require practitioners to “[…] minimise the risk of unfair biases embedded in the model […]” (aiproposal). Such proposals refer to biases with regard to individual characteristics which are protected by the law, such as gender and ethnicity. The concern is that models trained on biased data might then learn those biases (barocas-hardt-narayanan), therefore perpetuating historical discrimination against certain groups of individuals.

In recent years, research on group fairness in machine learning has focused on formalizing definitions and measuring model bias (verma2018fairness), introducing concepts such as disparate impact, disparate mistreatment, and statistical parity. As an example, a classification model might display disparate impact if it assigns positive outcomes (e.g. getting a loan) at different rates to different groups of people (e.g. men and women). Disparate mistreatment, on the other hand, is the situation where a classifier misassigns negative outcomes at different rates across groups.


Figure 1. Left: sketch of a Stochastically Quantized Neural Network. The stochastic layer Z holds the quantized neurons. During the forward step, features are extracted via feature-extraction layers before they enter the stochastic layer; after the stochastic layer, further layers predict the class label y. During backpropagation, the loss function is evaluated, e.g., via binary cross-entropy for the class label y and via the mutual information I(Z; s) between the stochastic layer and the sensitive attribute s. The influence of the two loss terms is controlled via a trade-off parameter λ. Right: sketch of a stochastically quantized neuron. The neuron is sigmoid-activated, but the sigmoid output is employed as the parameter of a Bernoulli distribution, which we then sample from. This enables the interpretation of each activation z_i as a random variable and of the layer Z as a random vector, in turn allowing us to compute I(Z; s).

One possible approach to the group fairness problem is fair representation learning, a set of techniques which learn new representations of the original data where information about protected characteristics such as gender or ethnicity has been removed. While the first work in the field (zemel2013learning) employed probabilistic modeling, numerous authors have since employed neural networks as the base learning algorithm (xie2017controllable; moyer2018invariant; cerrato2020constraining; fair_pair_metric; madras). While different authors have opted for different mathematical formulations of the “fair representation learning problem”, the core idea is to remove sensitive information from the network’s representations. If one is able to guarantee that the network has no information about, for instance, an individual’s gender identity, it follows that the decisions undertaken by the model are independent of it. One possible formalization of the above desideratum relies on computing the mutual information I(Z_k; s) between the k-th neural layer Z_k and a sensitive attribute s representing the protected information. If this mutual information equals 0, one obtains group-invariant representations and therefore unbiased decisions which do not rely on s. Estimating mutual information over high-dimensional distributions is, however, highly challenging in general. One line of research which has attracted much attention is based on the Information Bottleneck framework (tishby2015deep) and relies on quantizing the neural representations after model training. After quantization, it is possible to use density estimation techniques to approximate the distribution of a layer Z_k. However, this technique suffers from theoretical issues, as a neural network is usually a deterministic function of its input and the mutual information is therefore infinite (goldfeld2019estimating).
If the interest is not exact computation but optimization, authors have obtained differentiable upper bounds which rely on a variational approximation of the encoding distribution (moyer2018invariant; belghazi2018mutual). Another possible theoretical grounding for fair representation learning relies on learning an adversary network which tries to predict sensitive information from the representations (madras). The loss of an “infinitely powerful” adversary may then be used to provide a bound on the mutual information.

In this work, we provide an alternative theoretical justification for fair representation learning by directly computing the mutual information term and then minimizing it. Our approach avoids the theoretical issues in the Information Bottleneck framework by employing stochastically-quantized neural networks. These networks have low-precision (1-bit) activations which may be interpreted as random variables following the Bernoulli distribution. Thus, we are able to obtain a finite, exact value for the mutual information between a neuron and the sensitive attribute without relying on variational approximation (moyer2018invariant), adversarial learning (madras) or adding noise to the representations (goldfeld2019estimating). We then employ this exact value to upper bound the overall mutual information I(Z_k; s) in layer k by assuming independence between the random variables representing each neuron in the layer. We then use this term in optimization, obtaining invariant representations and thus fair decisions. Furthermore, we show that it is possible to relax the independence assumption and still compute I(Z_k; s) via density estimation. Our experimentation shows that quantized, low-precision models are able to obtain group-invariant representations in a fair classification setting and avoid training set biases in image data. Our contributions can be summarized as follows:

  • We show how to compute the mutual information I(z_i; s) between a stochastic binary neuron z_i in layer k and a discrete random variable s representing a sensitive attribute.

  • We employ the computation above and an independence assumption to compute the mutual information I(Z_k; s) for the whole layer. This computation may be used as an optimization objective.

  • We relax the assumption above and employ density estimation to compute the mutual information I(Z_k; s) without assuming independence between neurons.

  • We perform experiments on three fair classification datasets and one invariant representation learning dataset, showing that our low-precision model is competitive with the state of the art.

2. Related Work

Fair and Invariant Representation Learning. Algorithmic fairness has attracted significant attention from the academic community and general public in recent years, thanks in no small part to the ProPublica/COMPAS debate. COMPAS is a software tool which was intended to assist US courtroom judges in evaluating the risk of recidivism in people seeking to be released on bail. ProPublica (machine_bias) reported the COMPAS software as being discriminatory towards black people. More specifically, ProPublica found that the false positive rates were different for black and white people. We refer the reader to a contribution by Rudin et al. for a summary of the debate (Rudin2020Age). To the best of our knowledge, however, the first contribution in this area dates back to 1996, when Friedman and Nissenbaum (friedman1996bias) argued that automatic decision systems need to be developed with particular attention to systemic discrimination and moral reasoning. The need to tackle automatic discrimination is also part of EU-level law in the GDPR, Recital 71 in particular (malgieri2020gdpr).

One possible way to tackle the issues described above is to employ fair representation learning. In fair representation learning, sensitive and law-protected information (ethnicity, gender identity, etc.) is treated as a “nuisance factor” s which needs to be removed from the data x. While excluding s from training could be regarded as a “color-blind” approach to fairness (we owe this definition to Zehlike et al. (zehlike2018reducing)), it is often not enough to obtain unbiased decisions, as complex statistical correlations between x and s may still exist. Fair representation learning techniques instead learn a projection of x into a latent feature space where all information about s has been removed. One seminal contribution in this area is due to Zemel et al. (zemel2013learning), who employed probabilistic modeling. Since then, neural networks have been extensively used in this space. Some proposals (xie2017controllable; madras) employ adversarial learning, a technique due to Ganin et al. (ganin2016jmlr) in which two networks are pitted against one another in predicting and removing information about s. Another line of work (Louizos2016TheVF; moyer2018invariant) employs variational inference to approximate the intractable encoding distribution. A combination of architecture design (Louizos2016TheVF) and information-theoretic loss functions (moyer2018invariant; gretton2012kernel) may then be employed to encourage invariance of the neural representations with regard to s. Our proposal differs in that we employ a stochastically quantized neural network for which it is possible to compute the mutual information I(z_i; s) between the i-th neuron at layer k and the sensitive attribute s. This lets us avoid variational approximations of the target distribution and provides a more stable training objective for representation invariance compared to adversarial training.

Mutual Information and Neural Networks. Our approach is related to the Deep Information Bottleneck framework proposed by Tishby et al. (tishby2015deep). In this line of work, the core proposal is to employ mutual information to explain the generalization properties of deep neural networks. The authors found that as the number of layers grows, neural networks display faster compression rates of the input. This is observed by computing the mutual information I(x; Z_k) throughout gradient descent steps. As classical learning theory is unable to explain the generalization capabilities of very deep, overparametrized neural networks (zhang2021understanding), the authors’ contribution garnered significant attention and scrutiny. Goldfeld and Polyanskiy (goldfeld2020information) have taken exception in particular to Tishby et al.’s strategy for computing mutual information in a neural network, which relies on quantization and binning of the neural activations. This creates a discrepancy between the analysis model (the quantized network) and the actual model (which is not quantized). The number of bins is also observed to be critical to the result of the analysis (a well-known issue with mutual information estimation (kraskov2004estimating)). Furthermore, neural networks are deterministic functions, and in this situation I(x; Z_k) is ill-defined as Z_k is not a random variable. To solve these issues, Goldfeld et al. have put forward a proposal to compute mutual information in a neural network by adding a small amount of Gaussian noise to the activations (goldfeld2019estimating), which makes it possible to draw parallels to Gaussian noise channels and derive a rate-optimal mutual information estimator.

We take inspiration from Goldfeld and Polyanskiy’s criticism of Tishby et al. and propose a stochastically-quantized neural network for which I(Z_k; s) is computable, as each neuron follows a Bernoulli distribution during training. We propose two different strategies to compute the layer-wise mutual information and minimize this measure during training.

3. Method

Our contribution deals with learning fair (group-invariant) representations in a principled way by employing stochastically quantized neural networks. In this section, we provide a theoretical motivation for our work by contextualizing it in an information-theoretic framework similar to the one introduced in the “Information Bottleneck” literature (tishby2015deep; goldfeld2019estimating) (Section 3.1). As previously mentioned, the work done so far in this space has approximated the relevant information-theoretic quantities in neural networks via post-training quantization (tishby2015deep), adversarial bounding (madras), variational approximations (moyer2018invariant) or the addition of stochastic noise to the representations (goldfeld2019estimating; cerrato2020constraining). Our approach is instead to employ stochastically quantized neural networks to exactly compute the mutual information between a sensitive attribute s and any neuron in the network. We show how this approach may be used to compute group-invariant representations in Sections 3.2 and 3.3.

3.1. Invariant Representations and Mutual Information

A feedforward neural network with L layers may be formalized by employing a sequence of “layer functions” f_1, …, f_L which compute the neural activations given an input x:

    f_k(x) = σ_k(W_k f_{k−1}(x) + b_k),   f_0(x) = x,     (1)

where W_k is a weight matrix, b_k is a bias vector, σ_k is an activation function, and d_k is the size of the k-th layer. We now define neural representations Z_k as applications of f_k to the random variable x which follows the empirical distribution of the samples:

    Z_k = f_k(x).     (2)

We note that, in a supervised learning setting, the last layer Z_L is a reproduction of the label y. Representation invariance may be formalized as an information-theoretic objective in which the representations of the k-th neural layer display minimal mutual information with regard to a sensitive attribute s:

    I(Z_k; s) ≤ ε,     (3)

where ε is a threshold under which the information can be considered minimal, and I(Z_k; s) is the mutual information between Z_k and s. In general, given two jointly continuous random variables A and B taking values over the sample space Ω, the mutual information between them is defined as

    I(A; B) = ∫∫ p(a, b) log [ p(a, b) / ( p(a) p(b) ) ] da db,     (4)

where the integrals may be replaced by sums if A and B are discrete random variables. Mutual information is an attractive measure in invariant representation learning, as two random variables are statistically independent if and only if I(A; B) = 0. Thus, one may obtain s-invariant representations by minimizing the mutual information between Z_k and s, therefore certifying that no information about the sensitive attribute is contained in the representations. Minimizing this objective might, however, remove all information about the original data x and, if available, the labels y. This may be understood easily by assuming a constant representation function f_k and seeing that I(Z_k; s) = 0 in that case: setting all network parameters to 0 solves the problem in Equation 3 for any value of ε. To avoid this issue, one might want to guarantee that some information about y is preserved. This changes the setting to a problem which closely resembles the Information Bottleneck problem (tishby2015deep):


where is a positive real number. However, constrained optimization is highly challenging in neural networks: in practice, previous work in this area has focused on, e.g., minimizing a reconstruction loss as a surrogate for the constraint (madras).

At the same time, computing I(Z_k; s) is highly nontrivial. Since the distribution of Z_k is non-parametric and in general unknown, one would need to resort to density estimation to approximate it from data samples. However, the number of needed samples scales exponentially with the dimensionality of Z_k (paninski2003estimation). Furthermore, mutual information is in general ill-defined in neural networks: if x is a random variable representing the empirical data distribution and f is a deterministic injective function, the mutual information I(x; f(x)) is either constant (when x is discrete) or infinite (when x is continuous) (goldfeld2019estimating). As activation functions in neural networks such as sigmoid and tanh are injective and real-valued, I(x; Z_k) is infinite. When the ReLU activation function is employed, the mutual information is finite but still vacuous¹.

¹ Proper contextualization of this result requires some additional preliminary results, which we avoid reporting here. We refer the interested reader to Goldfeld and Polyanskiy (goldfeld2020information).

Previous work in fair representation learning has circumvented this issue via different techniques. One possible approach is to perform density estimation by grouping the real-valued activations into a finite number of bins (tishby2015deep). Some authors have instead relied on variational approximations of the intractable encoding distribution, which lead to a tractable upper bound (moyer2018invariant). Lastly, it is possible to bound the mutual information term with the loss of an adversary network which tries to predict s from Z_k (madras; ganin2016jmlr; xie2017controllable).

Our approach is instead to employ stochastically-quantized neural networks, a particular setup of binary networks (binarynet) that enables the interpretation of neural activations as (discrete) random variables. This approach has multiple benefits: it does not rely on variational approximations or adversarial training; it lets us treat Z_k as a random variable, avoiding the infinite mutual information issue described above; and, lastly, it avoids adding noise to the representations as a way to obtain stochasticity (goldfeld2019estimating).

3.2. Mutual Information Computation via Bernoulli activations

Our methodology relies on stochastically quantized neural networks, i.e., neural network models in which the activations are stochastic. More specifically, we employ binary neural networks, i.e., quantized nets in which either the weights or the activations have 1-bit precision. These networks were originally developed for deployment on low-memory systems or where real-time performance is of critical importance (binarynet). Let Z_k be the k-th layer in a binary-activated network and z_i be the activation of its i-th neuron. z_i may be computed deterministically from the activations of the previous layer and the learnable weights and biases connecting the two:

    z_i = 1[w_i · Z_{k−1} + b_i ≥ 0],

where 1[·] is the indicator function.

In this paper, however, we employ stochastic quantization of binary neurons, which we compute by sampling from Bernoulli(p_i), where

    p_i = σ(w_i · Z_{k−1} + b_i)     (5)

and σ(t) = 1 / (1 + e^{−t}) is the sigmoid function. We then sample from this distribution to compute the actual activation z_i. Thus, z_i may be interpreted as a random variable following the Bernoulli distribution: z_i ∼ Bernoulli(p_i). The entropy of a random variable following the Bernoulli distribution has a closed, analytical form:

    H(z_i) = −p_i log p_i − (1 − p_i) log(1 − p_i).     (6)
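As a concrete illustration, the stochastic activation and its closed-form entropy can be sketched as follows (a minimal sketch; the function names are ours and entropy is computed in nats):

```python
import math
import random

def bernoulli_entropy(p, eps=1e-12):
    """Closed-form entropy (Eq. 6, in nats) of a Bernoulli(p) variable."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def stochastic_binary_neuron(weights, inputs, bias, rng=random):
    """Sigmoid pre-activation used as a Bernoulli parameter (Eq. 5), then sampled."""
    pre = sum(w * x for w, x in zip(weights, inputs)) + bias
    p = 1.0 / (1.0 + math.exp(-pre))  # sigmoid
    z = 1 if rng.random() < p else 0  # stochastic 1-bit activation
    return z, p
```

The entropy is maximal (log 2) at p = 0.5 and vanishes as p approaches 0 or 1, which is what allows the regularizer below to drive neurons towards s-independent behavior.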
We then consider the whole layer as a stochastic random vector Z = (z_1, …, z_d). If one assumes independence between the activations, it is then easy to compute the relevant mutual information measure for the whole layer. For instance, for the random vector Z = (z_1, z_2):

    I(Z; s) = H(Z) − H(Z | s) = H(z_1) + H(z_2) − H(z_1 | s) − H(z_2 | s).

This derivation can be easily expanded to a layer of any size. The conditional entropy H(z_i | s) may be computed by selecting those representations which are obtained from data points sharing the same value of s and then computing the entropy as described in Equation 6. Under the assumption that activations are independent, it is therefore possible to exactly compute the mutual information in stochastically-activated quantized neural networks. Thus, given a stochastically quantized binary neural network, we can impose representation invariance with respect to s by stochastic gradient descent over a loss function that directly incorporates I(Z; s):

    J = D_KL(ŷ ‖ y) + λ I(Z; s),     (7)

where D_KL is the Kullback-Leibler divergence and λ is a trade-off parameter weighting the importance of representation invariance and accuracy on y.
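Under the independence assumption, the regularizer reduces to per-neuron Bernoulli entropies. A minimal sketch (function names are ours), exploiting the fact that a mixture of Bernoulli variables is itself Bernoulli with the averaged parameter:

```python
import math

def _bern_entropy(p, eps=1e-12):
    """Closed-form Bernoulli entropy in nats."""
    p = min(max(p, eps), 1.0 - eps)
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def layer_mi(probs, s):
    """Estimate I(Z; s) = sum_i [H(z_i) - H(z_i | s)], assuming independent neurons.

    probs: per-sample Bernoulli parameter vectors of the stochastic layer.
    s:     per-sample discrete sensitive-attribute values, aligned with probs.
    """
    n, d = len(probs), len(probs[0])
    groups = sorted(set(s))
    mi = 0.0
    for i in range(d):
        p_marg = sum(pv[i] for pv in probs) / n   # marginal Bernoulli parameter
        h_marg = _bern_entropy(p_marg)            # H(z_i)
        h_cond = 0.0
        for g in groups:
            idx = [j for j in range(n) if s[j] == g]
            p_g = sum(probs[j][i] for j in idx) / len(idx)
            h_cond += len(idx) / n * _bern_entropy(p_g)  # p(s=g) * H(z_i | s=g)
        mi += h_marg - h_cond                     # I(z_i; s)
    return mi
```

In training, the same quantity would be computed from the sigmoid outputs of the stochastic layer on a batch and added to the classification loss with weight λ.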

It is worthwhile to mention that only a single layer needs to be stochastically quantized for the above loss function to be computable. Therefore, employing “hybrid” networks with a mix of binary-precision and full-precision layers is a feasible option which we find to be a strong performer in most cases (see Table 1).

The independence assumption, which is critical to the derivation above, warrants some discussion, however. While the sampling of each Bernoulli variable is indeed independent, the parameters of the underlying distributions are not independent, as they depend on the activations at the previous layer Z_{k−1}, as made apparent by Equation 5. Thus, independence holds only conditionally on the Bernoulli parameter vector p, e.g., z_1 ⊥ z_2 | p. Independence assumptions are relatively common in practice in variational approximation, where, e.g., the encoding distribution is taken to be an isotropic Gaussian (Louizos2016TheVF). Our methodology is able to relax this constraint by taking advantage of the quantization intrinsic to our base model. This is the topic of the next section.

3.3. Mutual Information Computation via Density Estimation

First, we report here the general formula for entropy when A is a discrete variable:

    H(A) = −Σ_a p(a) log p(a).     (8)
Recalling the definition of mutual information in Equation 4 and its relationship to entropy in Equation 8, we see that an estimate of the joint p(Z) and the conditional joint p(Z | s) is needed to compute the mutual information in a network. This is highly nontrivial in full-precision neural networks: there, Z is not a random variable (see Section 3.1) and each neuron may take any of 2^b values, where b is the precision the activations are computed with. In our setup, however, each neuron may only take values in {0, 1}. Thus, we are able to estimate the joint distribution of the random vector Z from data via a simple density estimation scheme: plainly put, we count the occurrences of each possible activation vector.

In general, density estimation may be performed, in its simplest form, by building a histogram of the data at hand. Let us define h as the histogram function, g as a binning function and w as the bin width. Formally, h can be defined as

    h(x) = Σ_j 1[g(x_j) = g(x)],

where j iterates over the data samples and g(x) is the center of the bin where x lies. Given the function above, one is able to estimate the probability density for a data sample x by normalizing h(x) by the number of samples N. Density estimation may also be computed with adaptive, data-dependent bin sizes or by using kernels to replace the binning function g. Parametric density estimation may also be employed when one has prior knowledge about the data following some parametric distribution, e.g. a Gaussian. These techniques have been extensively employed when estimating mutual information for high-dimensional distributions (kraskov2004estimating) such as neural activations. In our setup, however, we employ a stochastic quantization scheme which returns binary values. This lets us abstract away the choice of w, which is known to have a major influence on mutual information estimation in neural networks (goldfeld2019estimating; goldfeld2020information). In the binary setting, we are able to estimate p(Z) by counting the frequencies of all possible activation vectors, i.e. the realizations of the random vector Z, or, equivalently, the outputs of the layer function f_k. The general density computation for a layer of size d follows:

    p(Z = b_m) = (1/N) Σ_{j=1}^{N} 1[z^{(j)} = b_m],

where, with some abuse of notation, z^{(j)} = b_m indicates that a realization of the activation vector is equal to the m-th element of the sequence² of binary vectors associated with the set {0, 1}^d. This procedure may also be employed to compute p(Z | s) by only selecting those data samples sharing a given value of s and adjusting the normalization factor accordingly. Therefore, we are able to compute the I(Z; s) term in Equation 7 by estimating the underlying probabilities.

² Here, any ordering of the set is viable. Thus, we avoid specifying the sequence formally.
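The counting scheme above can be sketched directly (a plug-in estimate with names of our choosing, not the authors’ exact implementation):

```python
from collections import Counter
import math

def joint_mi(samples, s):
    """Plug-in estimate of I(Z; s) from binary activation vectors.

    samples: realizations of the stochastic layer, e.g. tuples like (0, 1, 1).
    s:       aligned discrete sensitive-attribute values.
    """
    n = len(samples)
    p_z = Counter(samples)           # empirical counts for p(Z)
    p_s = Counter(s)                 # empirical counts for p(s)
    p_zs = Counter(zip(samples, s))  # empirical counts for the joint p(Z, s)
    mi = 0.0
    for (z, a), c in p_zs.items():
        joint = c / n
        mi += joint * math.log(joint / ((p_z[z] / n) * (p_s[a] / n)))
    return mi
```

With d binary neurons there are at most 2^d distinct vectors to count, which is why this estimate is tractable only for small stochastic layers.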

One limitation of the methodology presented in this subsection is that the number of samples required to estimate the joint distribution scales exponentially with the number of neurons d in the layer. In practice, d may be chosen as a hyperparameter and kept as low as needed. However, a very small layer placed at any point in the network may limit the network’s expressivity. Another possible solution is to perform multiple forward passes so as to obtain better estimates for p(Z) and p(Z | s), which we however did not investigate in the present work.

In the next section, we test our methodology on three fair classification datasets and one image dataset in which invariant representations are beneficial for classification accuracy. We show that we are able to obtain solid performance with both of the strategies described above, therefore showing that the independence assumption is both realistic and non-critical.

4. Experiments

Our experimental setting focuses on analyzing the performance of the methodology presented in Section 3.2 in fair classification and invariant representation learning settings. Our experimentation aims to answer the following questions:

Q1. Is the present methodology able to learn fair models which avoid discrimination? A1. Yes. We analyze our models’ accuracy and fairness by measuring the area under curve (AUC) and their disparate impact / disparate mistreatment. We compare them with full-precision neural networks trained for fair classification (Section 4.4) and see that they are able to find strong accuracy/fairness tradeoffs under the assumption that both are equally important. Furthermore, we show that they are able to remove nuisance factors from image datasets (Section 4.7).

Q2. Do stochastically quantized neural networks learn invariant representations? A2. Yes. Our models display comparable levels of representation invariance when compared with full-precision invariant representation learning methods. We show in Section 4.1 that supervised classifiers trained with the learned representations and the sensitive data are unable to generalize to the test set with any sort of consistency, as their accuracy is close to random guessing performance.

The datasets we employed are presented in Section 4.1. We then discuss the fairness metrics in Section 4.2. Finally, we describe the general experimental setup in Section 4.3.

Figure 2. Experiment results for our Bernoulli entropy model (BinaryBernoulli), our joint density estimation model (BinaryMI), an adversarial classifier (AdvCls) and a fair variational autoencoder (VFAE). The dotted line represents the line of equivalent fairness/accuracy tradeoffs and goes through the model closest (in norm) to perfect accuracy and perfect fairness. We show on top our best performing models for the AUC/1-GPA tradeoff, and on the bottom the best performing models for AUC/1-AUDC.

4.1. Datasets

COMPAS. This dataset was released by ProPublica, a US-based nonprofit newsroom (machine_bias), as part of an empirical study on ML software which US judges employ to evaluate the risk of further crimes by individuals who have been previously arrested. The ground truth here is whether an individual committed a crime in the following two years. The sensitive attribute is the individual’s ethnicity. Machine learning models trained on this dataset may display disparate impact (zafar2017fairness), thus our objective is to minimize GPA while maximizing accuracy (AUC).

Adult. This dataset is part of the UCI repository (dua2019uci). The ground truth is whether an individual’s annual salary is over 50K$ per year or not (adult). This dataset has been shown to be biased against gender (Louizos2016TheVF; zemel2013learning; cerrato2020constraining).

Bank marketing. In this dataset, the classification goal is to predict whether an individual will subscribe to a term deposit. Models trained on this dataset may display both disparate impact and disparate mistreatment with regard to age, more specifically on whether individuals are under 25 or over 65 years of age.

Figure 3.

Top row: Biased-MNIST training images. Each class is associated with a specific background color. Bottom row: testing images. The testing background color is selected at random.

Biased-MNIST. This is an image dataset based on the well-known MNIST Handwritten Digits database in which the background has been modified so as to display a color bias (bahng2020learning). More specifically, a nuisance factor b is introduced which is highly correlated with the ground truth y and whose values represent the background color in the training set. Ten different colors are pre-selected, one for each value of y, and inserted as a background in the training images with high probability. The test images, on the other hand, have background color chosen at random. In Figure 3, samples from the training and test data are shown. Thus, the background color/nuisance factor provides a very strong training bias. The simplest strategy for a model to achieve high accuracy on the training set is to overfit the background color. Therefore, models that are unable to learn invariant representations and decisions will inevitably overfit the training set (bahng2020learning).
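The construction can be sketched as follows (a hypothetical re-implementation; the palette, the bias probability q and the function name are illustrative and not taken from bahng2020learning):

```python
import random

# One pre-selected background colour per digit class (illustrative RGB palette).
PALETTE = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0), (255, 0, 255),
           (0, 255, 255), (128, 0, 0), (0, 128, 0), (0, 0, 128), (128, 128, 0)]

def background_colour(label, q=0.95, train=True, rng=random):
    """Return the bias-aligned colour with probability q at training time;
    otherwise (and always at test time) return a random palette colour."""
    if train and rng.random() < q:
        return PALETTE[label]      # bias-aligned background
    return rng.choice(PALETTE)     # random background
```

A model can reach high training accuracy by keying on the background alone, which is exactly the shortcut that invariant representations are meant to prevent.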

Figure 4. Representation results for our Bernoulli entropy model (BinaryBernoulli), our joint density estimation model (BinaryMI), an adversarial classifier (AdvCls) and a fair variational autoencoder (VFAE). The dotted line represents the line of equivalent fairness/accuracy tradeoffs and goes through the model closest (in norm) to (1, 1). A Random Forest (RF) classifier is trained on the extracted representations to predict the sensitive attribute s. In the top row, model AUC is compared with RF 1-ADRG, i.e. the absolute distance to random guess. This is computed by subtracting the majority-class ratio in the dataset from the RF accuracy and taking the absolute value. In the bottom row, we compare the model AUC and the RF AUC.

4.2. Metrics

Group-dependent Pairwise Accuracy.

We employ this metric to test the disparate mistreatment of our models. We first present it in its original formulation, which was developed for fair ranking applications (fair_pair_metric).

Let G_1, ..., G_k be a set of protected groups such that every instance inside the dataset belongs to exactly one group. The group-dependent pairwise accuracy A_{i>j} is then defined as the accuracy of a ranker on pairs consisting of an instance labeled more relevant belonging to group G_i and an instance labeled less relevant belonging to group G_j. Since a fair ranker should not discriminate against protected groups, the difference A_{i>j} - A_{j>i} should be close to zero. In the following, we call the Group-dependent Pairwise Accuracy GPA.

This metric may be employed in classification experiments by using a classifier's accuracy on the corresponding cross-group pairs when computing A_{i>j} and A_{j>i}.
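A minimal sketch of how this metric can be computed for a binary classifier with real-valued scores follows; the helper names and the A_{i>j} notation are ours, and higher scores are assumed to mean "more positive".

```python
import numpy as np

def pairwise_accuracy(scores, labels, groups, gi, gj):
    """A_{i>j}: fraction of cross-group pairs -- a positive instance from
    group gi and a negative instance from group gj -- that the model
    scores in the correct order."""
    pos = scores[(labels == 1) & (groups == gi)]
    neg = scores[(labels == 0) & (groups == gj)]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    # Compare every qualifying positive against every qualifying negative.
    return float((pos[:, None] > neg[None, :]).mean())

def gpa_gap(scores, labels, groups, gi, gj):
    """|A_{i>j} - A_{j>i}|: close to zero for a fair model."""
    return abs(pairwise_accuracy(scores, labels, groups, gi, gj)
               - pairwise_accuracy(scores, labels, groups, gj, gi))
```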

Area under Discrimination Curve (AUDC)

We take the discrimination as a measure of disparate impact, as previously done in the literature (zemel2013learning), which is given by:

| (1/N_1) Σ_{n : s_n = 1} ŷ_n  −  (1/N_0) Σ_{n : s_n = 0} ŷ_n |

where s_n = 1 denotes that the n-th example has a value of the sensitive attribute s equal to 1, ŷ_n is the model's prediction for the n-th example, and N_1 (N_0) is the number of examples with s_n = 1 (s_n = 0). We then generalize this metric in a similar fashion to how accuracy may be generalized to obtain a classifier's area under the curve (AUC): we evaluate the measure above at different classification thresholds and then compute the area under the resulting curve. In the following, we will refer to this measure as AUDC. Contrary to AUC, lower values are better.
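The thresholded discrimination and the area under its curve can be sketched as follows; this is a simple illustration assuming scores in [0, 1], a uniform threshold grid, and trapezoidal integration, none of which are pinned down by the text.

```python
import numpy as np

def discrimination(scores, s, threshold):
    """Discrimination at a fixed decision threshold: the absolute gap
    between the positive-prediction rates of the two groups."""
    y_hat = scores >= threshold
    return abs(y_hat[s == 1].mean() - y_hat[s == 0].mean())

def audc(scores, s, n_thresholds=101):
    """Area under the discrimination curve, swept over a uniform grid of
    thresholds and integrated with the trapezoidal rule. Lower is better."""
    ts = np.linspace(0.0, 1.0, n_thresholds)
    d = np.array([discrimination(scores, s, t) for t in ts])
    return float(((d[1:] + d[:-1]) / 2 * np.diff(ts)).sum())
```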

4.3. Experimental setup

We split all datasets into 3 internal and 3 external folds. On the 3 internal folds, we employ a Bayesian optimization technique to find the best hyperparameters for our model. A summary of our models' best hyperparameters may be found in Table 1. As our interest is to obtain models which are both fair and accurate, we employ Bayesian optimization to maximize the sum of the models' AUC, 1-GPA and 1-AUDC, capping the number of optimization iterations. The best hyperparameter setting found this way is then evaluated on the 3 external folds and reported. We relied on the Weights & Biases platform for an implementation of Bayesian optimization and overall experiment tracking (wandb). On the fairness datasets, we compare with an adversarial classifier (AdvCls in the figures) trained as described by Xie et al. (xie2017controllable) and a variational fair autoencoder (VFAE) (Louizos2016TheVF). We obtained publicly available implementations of these models and optimized their hyperparameters with the same strategy we employed for our own. We report our model relying on the Bernoulli entropy computation (Section 3.2) as BinaryBernoulli in the figures, while the model relying on density estimation to compute the joint (Section 3.3) is referred to as BinaryMI.
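The scalar objective maximized by the hyperparameter sweep can be written out explicitly; this trivial helper is ours and simply assumes the equal weighting described above.

```python
def model_selection_score(auc, gpa, audc):
    """Objective maximized by the Bayesian optimization sweep: the sum of
    the model's AUC and its two fairness terms, 1-GPA and 1-AUDC, so that
    accuracy and fairness are weighted equally."""
    return auc + (1.0 - gpa) + (1.0 - audc)
```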

BinaryMI BinaryBernoulli
Batch size COMPAS
N. hidden layers COMPAS
Hidden layer size COMPAS
Hybrid COMPAS yes yes
Banks yes yes
Adult yes yes
Learning Rate
Optimizer ADAM
Epochs 100
Table 1. Best hyperparameter combinations for BinaryMI and BinaryBernoulli. Learning rate, optimizer and number of epochs were kept fixed for both models across datasets.

4.4. Fair and Invariant Classification

We report plots analyzing the accuracy/fairness tradeoff of our models trained for classification in Figure 2. We take AUC as our measure of accuracy and both 1-GPA and 1-AUDC as fairness metrics. The ideal model in the fair classification setting displays maximal AUC and little to no GPA/AUDC, and would appear at the very top right of Figure 2. This result is not attainable on the datasets we consider, as there is usually some correlation between the target label and the sensitive attribute, which prevents us from obtaining perfectly fair and accurate decisions. Thus, one needs to consider possible accuracy/fairness tradeoffs. We assume a balanced accuracy/fairness tradeoff and consider as the "best" model the one closest to (1, 1). We then show all equivalent tradeoff points as a dotted line. We see that on both COMPAS and Adult, our BinaryMI model is able to find a stronger tradeoff than the competitors, with BinaryBernoulli very close to an equivalent tradeoff. The same may be said for the third dataset, Banks, on which however our two models find very different tradeoffs: BinaryBernoulli prefers an almost perfectly fair result, while BinaryMI finds a very accurate model. Compared to adversarial learning and variational inference, our models either find the best tradeoff or lie closest to the best tradeoff line.
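The "closest to (1, 1)" selection rule above can be sketched as a small helper; the function name is ours, and the Euclidean norm is used here as one reasonable choice since the text does not fix the norm.

```python
import numpy as np

def best_tradeoff(points):
    """Index of the (accuracy, fairness) point closest to the ideal (1, 1).
    `points` holds (AUC, 1-GPA) or (AUC, 1-AUDC) pairs, each in [0, 1]."""
    pts = np.asarray(points, dtype=float)
    dists = np.linalg.norm(pts - np.array([1.0, 1.0]), axis=1)
    return int(np.argmin(dists))
```

The dotted line in Figures 2 and 4 is then the level set of points whose distance to (1, 1) equals that of the selected model.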

4.5. Fair and Invariant Representations

We also analyze the accuracy/fairness tradeoff of the representations learned by our models. We extract neural activations from all the networks considered at the penultimate layer. We then split them into folds as described in Section 4.3 and report in Figure 4 the performance of a Random Forest algorithm with 1000 base estimators trained to predict the sensitive attribute associated with each representation. The rationale here is that representations that are invariant to the sensitive attribute should provide no useful training signal to a supervised classifier trying to predict it, whose performance should then be close or equal to random guessing (i.e. predicting the majority value of the sensitive attribute). We measure this by computing the absolute distance to random guess (ADRG). In Figure 4 we see that BinaryMI is able to find the best tradeoff between informativeness on the target label (bottom row) and invariance to the sensitive attribute. Our best BinaryBernoulli model representations, which are comparatively not very invariant on COMPAS, performed strongly in this regard on both Adult and Banks.
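The auditing protocol above can be sketched with scikit-learn; `probe_leakage` is a hypothetical helper mirroring the described evaluation (1000 estimators as in the text; a smaller forest works for quick checks).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def probe_leakage(representations, sensitive, n_estimators=1000):
    """Train a Random Forest to recover the sensitive attribute from frozen
    representations; return (cross-validated accuracy, ADRG). An ADRG near
    zero means the probe does no better than majority-class guessing."""
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    acc = cross_val_score(rf, representations, sensitive, cv=3).mean()
    majority = np.bincount(sensitive).max() / len(sensitive)
    return acc, abs(acc - majority)
```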

(a) 1-GPA for BinaryBernoulli and AdvCls while varying the tradeoff parameter.
(b) AUC for BinaryBernoulli and AdvCls while varying the tradeoff parameter.
Figure 5. Stability analysis for BinaryBernoulli and AdvCls.

4.6. Stability and Sensitivity Analysis

It is important that fair classification models are able to find different tradeoffs between fairness and accuracy depending on the application requirements. In practical terms, this tradeoff may be regulated via a parameter which weights the relative importance of accuracy and invariance in the loss function (see Section 3.2). This idea is common, to the best of our knowledge, to all the fair representation learning algorithms developed so far. We explore how the performance of our model changes with this parameter in Figures 4(a) and 4(b), where we report the performance of a BinaryBernoulli model trained on the COMPAS dataset and, for comparison, an adversarial classifier. What we observe in Figure 4(a) is that our model displays relatively stable behavior: as the parameter grows, so does the fairness of the model in terms of 1-GPA, with a strong positive linear correlation, and the model is most sensitive in an intermediate range of parameter values. The same trend, but reversed, may be observed for AUC in Figure 4(b). We observed similar trends for BinaryMI. Thus, we reason that our proposal may be employed under different fairness requirements with minimal changes (a tweak of the tradeoff parameter). The adversarial classifier, on the other hand, displays little correlation between the parameter and its performance. While its best models are on par or almost on par with BinaryBernoulli, this happens for arbitrary parameter values. The adversarial model does not seem to be able to explore the fairness/accuracy tradeoff quite as well, and the effect of the parameter is unpredictable. We posit that this behavior may be due to the difficulty of striking a balance between the predictive power of the two sub-networks which predict the target and the sensitive attribute in alternation (xie2017controllable). This is a well-known issue for generative adversarial models which pit different networks against each other (Chu2020Smoothness).
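The stability summary used above reduces to a Pearson correlation between the swept parameter values and the resulting metric; a minimal helper (names are ours) would be:

```python
import numpy as np

def sensitivity_correlation(param_values, metric_values):
    """Pearson correlation between the tradeoff parameter and a metric.
    Values near +1 or -1 indicate the parameter steers the tradeoff
    predictably; values near 0 indicate erratic behavior."""
    return float(np.corrcoef(param_values, metric_values)[0, 1])
```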

4.7. Biased-MNIST

Bias level  Vanilla  ReBias  LearnedMixin  RUBi  BinaryMI  BinaryBernoulli
0.995       72.1     76.0    78.2          90.4  89.08     90.64
0.990       89.1     88.1    88.3          93.6  88.54     96.02
Table 2. Accuracies for the Biased-MNIST experiments. Results for other methodologies as reported by Bahng et al. (bahng2020learning). We report results for two bias levels, 0.995 and 0.990; a higher bias level implies a higher chance of a biased sample in the training set.

We report our method’s performance on the Biased-MNIST dataset in Table 2, together with results from Bahng et al. (bahng2020learning) as a comparison. To enable this comparison, we experimented with the same setup as the authors, training our model for 80 epochs. We, however, selected our best hyperparameters with a Bayesian optimization strategy and employed an AlexNet-like structure (krizhevsky2012imagenet), alternating (binary) convolutional layers and max-pooling layers. We then tested with two different bias levels, i.e. probabilities of a training sample displaying the class-specific color bias. We observe that BinaryBernoulli is the strongest performer at both bias levels, even when compared with the full-precision strategies. BinaryMI has performance comparable to a “vanilla” convolutional network at the lower bias level of 0.990, but scales better than the baseline to the higher bias level. On this dataset, we see that our methodology is also a strong performer when removing biases from image data is necessary for classification accuracy.

5. Conclusion and Future Work

In this paper we proposed a methodology to compute the mutual information between a stochastically activated neuron and a sensitive attribute. We then generalized this methodology into two different strategies to compute the mutual information between a layer of neural activations and a sensitive attribute. Both strategies perform strongly on fair classification datasets and on invariant image classification. Furthermore, our methodology displays high stability to changes of the accuracy/fairness tradeoff parameter, especially when compared to adversarial learning (xie2017controllable; madras). A possible direction for further development is to employ the methodologies discussed in this paper to revisit the debate on the information bottleneck problem introduced by Tishby et al. (tishby2015deep). As our models are stochastically quantized, they naturally lend themselves to mutual information computation, avoiding many of the common issues in estimating information measures in neural networks (goldfeld2019estimating). While the methodology presented in Section 3.3 does not seem to scale to very wide networks, it is possible to repeat the sampling (the stochastic quantization) as many times as needed. Furthermore, we would like to test the capabilities of the presented models in domain adaptation scenarios, where adversarial models are still used extensively (ganin2016jmlr; madras).