1. Introduction
Representation learning algorithms based on neural networks are being employed extensively in information retrieval and data mining applications. The social impact of what the general public refers to as “AI” is now a topic of much discussion, with regulators in the EU even putting forward legal proposals which would require practitioners to “[…] minimise the risk of unfair biases embedded in the model […]” (aiproposal). Such proposals refer to biases with regard to individual characteristics which are protected by the law, such as gender and ethnicity. The concern is that models trained on biased data might then learn those biases (barocashardtnarayanan), therefore perpetuating historical discrimination against certain groups of individuals.
In recent years, research on group fairness in machine learning has focused on formalizing definitions and measuring model bias (verma2018fairness), introducing concepts such as disparate impact, disparate mistreatment, and statistical parity. As an example, a classification model displays disparate impact if it assigns positive outcomes (e.g. getting a loan) at different rates to different groups of people (e.g. men and women). Disparate mistreatment, on the other hand, is the situation where a classifier misassigns negative outcomes at different rates across groups
(zafar2017fairness). One possible approach to the group fairness problem is fair representation learning: a set of techniques which learn new representations of the original data where information about protected characteristics such as gender or ethnicity has been removed. While the first work in the field (zemel2013learning) employed probabilistic modeling, numerous authors have since employed neural networks as the base learning algorithm (xie2017controllable; moyer2018invariant; cerrato2020constraining; fair_pair_metric; madras). While different authors have opted for different mathematical formulations of the "fair representation learning problem", the core idea is to remove sensitive information from the network's representations. If one is able to guarantee that the network has no information about, for instance, an individual's gender identity, it follows that the decisions undertaken by the model are independent of it. One possible formalization of the above desideratum relies on computing the mutual information between the k-th neural layer and a sensitive attribute s representing the protected information. If this mutual information equals 0, one obtains group-invariant representations and therefore unbiased decisions which do not rely on s. Estimating mutual information over high-dimensional distributions is, however, highly challenging in general. One line of research which has attracted much attention is based on the Information Bottleneck framework (tishby2015deep) and relies on quantizing the neural representations after model training. After quantization, it is possible to use density estimation techniques to approximate the distribution of a layer. However, this technique suffers from theoretical issues, as a neural network is usually a deterministic function of its input and the mutual information is therefore infinite (goldfeld2019estimating).
If the interest is not exact computation but optimization, authors have obtained differentiable upper bounds which rely on a variational approximation of the encoding distribution (moyer2018invariant; belghazi2018mutual). Another possible theoretical grounding for fair representation learning relies on learning an adversary network which tries to predict sensitive information from the representations (madras). The loss of an “infinitely powerful” adversary may then be used to provide a bound on the mutual information.
In this work, we provide an alternative theoretical justification for fair representation learning by directly computing the mutual information term and then minimizing it. Our approach avoids the theoretical issues in the Information Bottleneck framework by employing stochastically quantized neural networks. These networks have low-precision (1-bit) activations which may be interpreted as random variables following the Bernoulli distribution. Thus, we are able to obtain a finite, exact value for the mutual information between a neuron and the sensitive attribute without relying on variational approximation (moyer2018invariant), adversarial learning (madras) or adding noise to the representations (goldfeld2019estimating). We then employ this exact value to upper bound the overall mutual information I(Z_k; s) in layer k by assuming independence between the random variables representing each neuron in the layer. We then use this term in optimization, obtaining invariant representations and thus fair decisions. Furthermore, we also show that it is possible to relax the independence assumption and still compute I(Z_k; s) via density estimation. Our experiments show that quantized, low-precision models are able to obtain group-invariant representations in a fair classification setting and avoid training set biases in image data. Our contributions can be summarized as follows:

We show how to compute the mutual information I(a_i; s) between a stochastic binary neuron a_i in layer k and a discrete random variable s representing a sensitive attribute.
We employ the computation above and an independence assumption to compute the mutual information I(Z_k; s) for the whole layer k. This computation may be used as an optimization objective.

We relax the assumption above and employ density estimation to compute the mutual information I(Z_k; s).

We perform experiments on three fair classification datasets and one invariant representation learning dataset, showing that our low-precision model is competitive with the state of the art.
2. Related Work
Fair and Invariant Representation Learning. Algorithmic fairness has attracted significant attention from the academic community and the general public in recent years, thanks in no small part to the ProPublica/COMPAS debate. COMPAS is a software tool intended to assist US courtroom judges in evaluating the risk of recidivism of people seeking to be released on bail. ProPublica (machine_bias) reported the COMPAS software as being discriminatory towards black people; more specifically, ProPublica found that false positive rates differed between black and white people. We refer the reader to a contribution by Rudin et al. for a summary of the debate (Rudin2020Age). To the best of our knowledge, however, the first contribution in this area dates back to 1996, when Friedman and Nissenbaum (friedman1996bias) argued that automatic decision systems need to be developed with particular attention to systemic discrimination and moral reasoning. The need to tackle automatic discrimination is also part of EU-level law in the GDPR, Recital 71 in particular (malgieri2020gdpr).
One possible way to tackle the issues described above is to employ fair representation learning. In fair representation learning, sensitive and law-protected information s (ethnicity, gender identity, etc.) is treated as a "nuisance factor" which needs to be removed from the data x. While excluding s from training could be regarded as a "colorblind" approach to fairness (we owe this definition to Zehlike et al. (zehlike2018reducing)), it is often not enough to obtain unbiased decisions, as complex statistical correlations between x and s may still exist. Fair representation learning techniques learn a projection of x into a latent feature space where all information about s has been removed. One seminal contribution in this area is due to Zemel et al. (zemel2013learning), who employed probabilistic modeling. Since then, neural networks have been extensively used in this space. Some proposals (xie2017controllable; madras) employ adversarial learning, a technique due to Ganin et al. (ganin2016jmlr) in which two networks are pitted against one another in predicting and removing information about s. Another line of work (Louizos2016TheVF; moyer2018invariant) employs variational inference to approximate the intractable encoding distribution. A combination of architecture design (Louizos2016TheVF) and information-theoretic loss functions (moyer2018invariant; gretton2012kernel) may then be employed to encourage invariance of the neural representations with regard to s. Our proposal differs in that we employ a stochastically quantized neural network for which it is possible to compute the mutual information between the i-th neuron at layer k and the sensitive attribute s. This lets us avoid variational approximations of the target distribution and provides a more stable training objective for representation invariance compared to adversarial training.
Mutual Information and Neural Networks. Our approach is related to the Deep Information Bottleneck framework proposed by Tishby et al. (tishby2015deep). In this line of work, the core proposal is to employ mutual information to explain the generalization properties of deep neural networks. The authors found that as the number of layers grows, neural networks display faster compression rates of the input, observed by computing the mutual information between the input and each layer throughout gradient descent. As classical learning theory is unable to explain the generalization capabilities of very deep, overparametrized neural networks (zhang2021understanding), the authors' contribution garnered significant attention and scrutiny. Goldfeld and Polyanskiy (goldfeld2020information) have taken exception in particular to Tishby et al.'s strategy for computing mutual information in a neural network, which relies on quantization and binning of the neural activations. This creates a discrepancy between the analysis model (the quantized network) and the actual model (which is not quantized). The number of bins is also observed to be critical to the result of the analysis, a well-known issue with mutual information estimation (kraskov2004estimating). Furthermore, neural networks are deterministic functions, and in this situation the mutual information is ill-defined as the representation is not a random variable. To solve these issues, Goldfeld et al. have put forward a proposal to compute mutual information in a neural network by adding a small amount of Gaussian noise to the activations (goldfeld2019estimating), which makes it possible to draw parallels to Gaussian noise channels and derive a rate-optimal mutual information estimator.
We take inspiration from Goldfeld and Polyanskiy's criticism of Tishby et al. and propose a stochastically quantized neural network for which the mutual information between any layer and a sensitive attribute is computable, as each neuron follows a Bernoulli distribution during training. We propose two different strategies to compute the layer-wise mutual information and minimize this measure during training.
3. Method
Our contribution deals with learning fair (group-invariant) representations in a principled way by employing stochastically quantized neural networks. In this section, we provide a theoretical motivation for our work by contextualizing it in an information-theoretic framework similar to the one introduced in the "Information Bottleneck" literature (tishby2015deep; goldfeld2019estimating) (Section 3.1). As previously mentioned, work done so far in this space has approximated the relevant information-theoretic quantities in neural networks via post-training quantization (tishby2015deep), adversarial bounding (madras), variational approximations (moyer2018invariant) or the addition of stochastic noise to the representations (goldfeld2019estimating; cerrato2020constraining). Our approach is instead to employ stochastically quantized neural networks to compute exactly the mutual information between a sensitive attribute s and any neuron in the network. We show how this approach may be used to compute group-invariant representations in Sections 3.2 and 3.3.
3.1. Invariant Representations and Mutual Information
A feedforward neural network with K layers may be formalized via a sequence of "layer functions" f_1, …, f_K which compute the neural activations z_k given an input x:

(1) z_1 = f_1(x) = σ_1(W_1 x + b_1)
(2) z_k = f_k(z_{k−1}) = σ_k(W_k z_{k−1} + b_k),  k = 2, …, K

where W_k is a weight matrix, b_k is a bias vector, σ_k is an activation function, and d_k is the size of the k-th layer. We now define neural representations Z_k as applications of f_k ∘ … ∘ f_1 to the random variable X which follows the empirical distribution of the samples. We note that, in a supervised learning setting, the last layer Z_K is a reproduction ŷ of the labels y. Representation invariance may be formalized as an information-theoretic objective in which the representations of the k-th neural layer display minimal mutual information with regard to a sensitive attribute s:

(3) I(Z_k; s) ≤ ε

where ε is a threshold below which the information can be considered minimal, and I(Z_k; s) is the mutual information between Z_k and s. In general, given two jointly distributed continuous random variables A and B taking values over the sample space Ω_A × Ω_B, the mutual information between them is defined as

(4) I(A; B) = ∫_{Ω_A} ∫_{Ω_B} p(a, b) log ( p(a, b) / (p(a) p(b)) ) da db
where the integrals may be replaced by sums if A and B are discrete random variables and p(a, b) is the density of the joint distribution with respect to the underlying measure. Mutual information is an attractive measure in invariant representation learning, as two random variables are statistically independent if and only if their mutual information is 0. Thus, one may obtain invariant representations by minimizing the mutual information between Z_k and s, therefore certifying that no information about the sensitive attribute is contained in the representations. Minimizing this objective might however remove all information about the original data X and, if available, the labels y. This may be understood easily by assuming a constant representation function f_k and seeing that I(Z_k; s) = 0 while Z_k then carries no information about X or y either. The simplest example is setting all network parameters to 0, which solves the problem in Equation 3 for any value of ε. To avoid this issue, one might want to guarantee that some information about X is preserved. This changes the setting to a problem which closely resembles the Information Bottleneck problem (tishby2015deep):
minimize I(Z_k; s)
s.t. I(Z_k; X) ≥ Γ

where Γ is a positive real number. However, constrained optimization is highly challenging in neural networks: in practice, previous work in this area has focused on, e.g., minimizing a reconstruction loss as a surrogate for the constraint (madras).
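To make the quantities above concrete, the discrete form of Equation 4 can be evaluated directly from a joint probability table. The sketch below (our illustration, not part of the method) confirms the degenerate case discussed above: a constant representation yields I(Z; S) = 0, while a representation that copies a balanced binary S attains I(Z; S) = H(S) = 1 bit.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(Z; S) in bits from a joint probability table p(z, s).

    Rows index values of Z, columns index values of S (discrete form of Eq. 4).
    """
    pz = joint.sum(axis=1, keepdims=True)   # marginal p(z)
    ps = joint.sum(axis=0, keepdims=True)   # marginal p(s)
    mask = joint > 0                        # skip zero-probability cells (0 log 0 = 0)
    return float((joint[mask] * np.log2(joint[mask] / (pz @ ps)[mask])).sum())

# Constant representation: Z takes a single value regardless of S, so I(Z; S) = 0.
constant = np.array([[0.5, 0.5]])
print(mutual_information(constant))  # → 0.0

# Representation that copies S exactly: I(Z; S) = H(S) = 1 bit.
copy = np.array([[0.5, 0.0], [0.0, 0.5]])
print(mutual_information(copy))  # → 1.0
```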
At the same time, computing I(Z_k; s) is highly non-trivial. Since the distribution of Z_k is non-parametric and in general unknown, one would need to resort to density estimation to approximate it from data samples. However, the number of samples needed scales exponentially with the dimensionality of Z_k (paninski2003estimation). Furthermore, mutual information is in general ill-defined in neural networks: if X is a random variable representing the empirical data distribution and f is a deterministic injective function, the mutual information I(X; f(X)) is either constant (when X is discrete) or infinite (when X is continuous) (goldfeld2019estimating). As activation functions such as the sigmoid and tanh are injective and real-valued, I(X; Z_k) is infinite. When the ReLU activation function is employed, the mutual information is finite but still vacuous (proper contextualization of this result requires some additional preliminary results, which we avoid reporting here; we refer the interested reader to Goldfeld and Polyanskiy (goldfeld2020information)).
Previous work in fair representation learning has circumvented this issue by employing different techniques. One possible approach is to perform density estimation by grouping the real-valued activations into a finite number of bins (tishby2015deep). Some authors have instead relied on variational approximations of the intractable encoding distribution, which lead to a tractable upper bound (moyer2018invariant). Lastly, it is possible to bound the mutual information term with the loss of an adversary network which tries to predict s from Z_k (madras; ganin2016jmlr; xie2017controllable).
Our approach is instead to employ stochastically quantized neural networks, a particular setup of binary networks (binarynet) that enables the interpretation of neural activations as (discrete) random variables. This approach has multiple benefits: it does not rely on variational approximations or adversarial training; it lets us treat Z_k as a random variable, avoiding the infinite mutual information issue described above; and it avoids adding noise to the representations as a way to obtain stochasticity (goldfeld2019estimating).
3.2. Mutual Information Computation via Bernoulli Activations
Our methodology relies on stochastically quantized neural networks, i.e., neural network models in which the activations are stochastic. More specifically, we employ binary neural networks, i.e., quantized nets in which either the weights or the activations have 1-bit precision. These networks were originally developed for deployment on low-memory systems or where real-time performance is of critical importance (binarynet). Let z_k be the k-th layer in a binary-activated network and a_i the activation of its i-th neuron. In deterministic binary networks, a_i is computed from the activations of the previous layer and the learnable weights and biases connecting the two:

a_i = 𝟙[ w_i^T z_{k−1} + b_i ≥ 0 ],

where 𝟙[·] is the indicator function.
In this paper, however, we employ stochastic quantization of binary neurons: each activation is sampled from Bernoulli(p_i), where

(5) p_i = σ(w_i^T z_{k−1} + b_i)

and σ is the sigmoid function σ(x) = 1 / (1 + e^{−x}). We then sample from this distribution to compute the actual activation a_i. Thus, a_i may be interpreted as a random variable following the Bernoulli distribution: a_i ∼ Bernoulli(p_i). The entropy of a random variable following the Bernoulli distribution has a closed, analytical form:

(6) H(a_i) = −p_i log p_i − (1 − p_i) log(1 − p_i).
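In code, the stochastic binarization of Equation 5 and the closed-form entropy of Equation 6 amount to only a few lines. The following is a minimal NumPy sketch (ours, with hypothetical toy weight shapes), sampling one layer's 1-bit activations and evaluating each neuron's Bernoulli entropy:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli_entropy(p, eps=1e-12):
    """Closed-form Bernoulli entropy in bits (Equation 6); p clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def stochastic_binary_layer(z_prev, W, b):
    """Sample 1-bit activations a_i ~ Bernoulli(sigmoid(W z_prev + b)) (Equation 5)."""
    p = sigmoid(W @ z_prev + b)
    a = (rng.random(p.shape) < p).astype(float)  # stochastic quantization
    return a, p

a, p = stochastic_binary_layer(np.ones(4), rng.normal(size=(3, 4)), np.zeros(3))
print(a)                       # a binary vector in {0, 1}^3
print(bernoulli_entropy(0.5))  # → 1.0 (the maximum-entropy neuron)
```

During training, the sampled `a` is the forward activation while the parameters `p` remain differentiable, which is what makes the entropy term below usable as a loss.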
We then consider the whole layer as a stochastic random vector Z_k = (a_1, …, a_{d_k}). If one assumes independence between the activations, it is then easy to compute the relevant mutual information measure for the whole layer:

(7) I(Z_k; s) = H(Z_k) − H(Z_k | s)
(8)          = Σ_i H(a_i) − Σ_i H(a_i | s)

The conditional entropy H(a_i | s = v) may be computed by selecting those representations which are obtained from data points for which s = v and then computing the entropy as described in Equation 6. Under the assumption that activations are independent, it is therefore possible to compute exactly the mutual information in stochastically-activated quantized neural networks. Thus, given a stochastically quantized binary neural network, we can impose representation invariance with respect to s by stochastic gradient descent over a loss function that directly incorporates I(Z_k; s):

(9) L = D_KL(ŷ ‖ y) + β I(Z_k; s)

where D_KL is the Kullback–Leibler divergence and β is a trade-off parameter weighting the importance of representation invariance and accuracy on y. It is worthwhile to mention that only a single layer needs to be stochastically quantized for the above loss function to be computable. Therefore, employing "hybrid" networks with a mix of binary-precision and full-precision layers is a feasible option which we find to be a strong performer in most cases (see Table 1).
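The layer-wise objective of Equations 7-8 can be estimated from a batch of Bernoulli parameters. The sketch below is ours: it treats each neuron's batch-averaged parameter as its marginal, under the independence assumption discussed next, and computes the invariance penalty for a discrete sensitive attribute.

```python
import numpy as np

def bernoulli_entropy(p, eps=1e-12):
    """Closed-form Bernoulli entropy in bits (Equation 6)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def layer_mutual_information(P, s):
    """I(Z_k; s) under the neuron-independence assumption (Equations 7-8).

    P: (n_samples, n_neurons) Bernoulli parameters of one quantized layer.
    s: (n_samples,) discrete sensitive attribute.
    """
    h_marginal = bernoulli_entropy(P.mean(axis=0)).sum()  # H(Z) = sum_i H(a_i)
    h_conditional = sum(
        (s == v).mean() * bernoulli_entropy(P[s == v].mean(axis=0)).sum()
        for v in np.unique(s)  # H(Z|s) = sum_v p(s=v) sum_i H(a_i | s=v)
    )
    return h_marginal - h_conditional

s = np.array([0] * 50 + [1] * 50)
P_fair = np.full((100, 2), 0.5)                 # parameters independent of s
P_biased = np.where(s[:, None] == 0, 0.9, 0.1)  # parameters encode s
print(layer_mutual_information(P_fair, s))      # → 0.0
print(layer_mutual_information(P_biased, s) > 0.5)  # → True
```

Minimizing this quantity in the loss drives the per-group activation statistics together, which is exactly the invariance objective of Equation 9.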
The independence assumption, which is critical to the derivation above, warrants some discussion, however. While the sampling of each Bernoulli variable is indeed independent, the parameters of the underlying distributions are not: they depend on the activations at the previous layer z_{k−1}, as Equation 5 makes apparent. Thus, independence holds only conditionally on the Bernoulli parameter vector p, i.e., a_i ⊥ a_j | p for i ≠ j. Independence assumptions are relatively common in practice in variational approximation, where, e.g., the encoding distribution is modeled as an isotropic Gaussian (Louizos2016TheVF). Our methodology is able to relax this constraint by taking advantage of the quantization intrinsic to our base model. This is the topic of the next section.

3.3. Mutual Information Computation via Density Estimation
First, we report here the general formula for entropy when Z is a discrete random variable:

(10) H(Z) = − Σ_z p(z) log p(z)
Recalling the definition of mutual information in Equation 4 and its relationship to entropy in Equation 8, we see that an estimate of the joint distribution p(Z_k) and the conditional joint p(Z_k | s) is needed to compute the mutual information in a network. This is highly non-trivial in full-precision neural networks: in that situation, Z_k is not a random variable (see Section 3.1) and each neuron may take any of 2^b values, where b is the precision the activations are computed with. In our setup, however, each neuron may only take values in {0, 1}. Thus, we are able to estimate the joint distribution of the random vector Z_k from data via a simple density estimation scheme: plainly put, we count the occurrences of each possible activation vector.

In general, density estimation may be performed, in its simplest form, by building a histogram of the data at hand. Let h be the histogram estimator, b a binning function, and Δ the bin width. For a data sample x, h can be defined as

h(x) = (1 / (N Δ)) Σ_{j=1}^{N} 𝟙[ b(x_j) = b(x) ],

where j iterates over the N data samples and b(x) is the center of the bin where x lies. Given the function above, one is able to estimate the probability density at a data sample x by computing h(x). Density estimation may also be performed with adaptive, data-dependent bin sizes or by using kernels to replace the binning function b. Parametric density estimation may also be employed when one has prior knowledge about the data following some parametric distribution, e.g. a Gaussian. These techniques have been extensively employed when estimating mutual information for high-dimensional distributions (kraskov2004estimating) such as neural activations. In our setup, however, we employ a stochastic quantization scheme which returns binary values. This lets us abstract away the choice of Δ, which is known to have a major influence on mutual information estimation in neural networks (goldfeld2019estimating; goldfeld2020information). We are able to estimate p(Z_k) by counting the frequencies of all possible activation vectors, i.e. the realizations of the random vector Z_k or, equivalently, the outputs of the layer function f_k. The general density computation for a layer of size d_k follows:

(11) p̂(Z_k = ζ_m) = (1 / N) Σ_{j=1}^{N} 𝟙[ z_k^{(j)} = ζ_m ]

where, with some abuse of notation, ζ_m indicates the m-th element of a sequence enumerating the set {0, 1}^{d_k} of binary vectors (any ordering of the set is viable, so we avoid specifying the sequence formally). This procedure may also be employed to compute p(Z_k | s) by only selecting those data samples for which s takes a given value and adjusting the normalization factor accordingly. Therefore, we are able to compute the mutual information term in Equation 7 by estimating the underlying probabilities.
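Because each activation vector lives in the finite set {0, 1}^{d_k}, the counting estimator of Equation 11 reduces to a frequency table. The following is a minimal sketch of the resulting mutual information estimate (our illustration, not the paper's implementation):

```python
import numpy as np
from collections import Counter

def empirical_entropy(rows):
    """Entropy in bits of binary activation vectors, estimated by counting (Equation 11)."""
    counts = Counter(map(tuple, rows.tolist()))
    probs = np.array(list(counts.values())) / len(rows)
    return float(-(probs * np.log2(probs)).sum())

def layer_mi_density(Z, s):
    """I(Z_k; s) = H(Z_k) - sum_v p(s=v) H(Z_k | s=v), without independence assumptions."""
    h_cond = sum((s == v).mean() * empirical_entropy(Z[s == v]) for v in np.unique(s))
    return empirical_entropy(Z) - h_cond

s = np.array([0] * 50 + [1] * 50)
Z_copy = np.array([[0, 0]] * 50 + [[1, 1]] * 50)  # layer output mirrors s
Z_fair = np.array([[0], [1]] * 50)                # layer output independent of s
print(layer_mi_density(Z_copy, s))  # → 1.0
print(layer_mi_density(Z_fair, s))  # → 0.0
```

The frequency table has at most 2^{d_k} entries, which is what makes the sample-complexity limitation discussed next unavoidable for wide layers.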
One limitation of the methodology presented in this subsection is that the number of samples required to estimate the joint distribution scales exponentially with the number of neurons d_k in the layer. In practice, d_k may be chosen as a hyperparameter and kept as low as needed. However, a very small layer placed at any point in the network may limit the network's expressivity. Another possible solution is to perform multiple forward passes so as to obtain better estimates for p(Z_k) and p(Z_k | s), which we did not investigate presently.

In the next section, we test our methodology on three fair classification datasets and one image dataset in which invariant representations are beneficial for classification accuracy. We show that we are able to obtain solid performance with both strategies described above, therefore showing that the independence assumption is both realistic and non-critical.
4. Experiments
Our experimental setting focuses on analyzing the performance of the methodology presented in Section 3.2 in fair classification and invariant representation learning settings. Our experimentation aims to answer the following questions:
Q1. Is the present methodology able to learn fair models which avoid discrimination? A1. Yes. We analyze our models’ accuracy and fairness by measuring the area under curve (AUC) and their disparate impact / disparate mistreatment. We compare them with fullprecision neural networks trained for fair classification (Section 4.4) and see that they are able to find strong accuracy/fairness tradeoffs under the assumption that both are equally important. Furthermore, we show that they are able to remove nuisance factors from image datasets (Section 4.7).
Q2. Do stochastically quantized neural networks learn invariant representations? A2. Yes. Our models display comparable levels of representation invariance when compared with fullprecision invariant representation learning methods. We show in Section 4.1 that supervised classifiers trained with the learned representations and the sensitive data are unable to generalize to the test set with any sort of consistency, as their accuracy is close to random guessing performance.
The datasets we employed are presented in Section 4.1. We then discuss the fairness metrics in Section 4.2. Finally, we describe the general experimental setup in Section 4.3.
4.1. Datasets
COMPAS. This dataset was released by ProPublica, a US-based nonprofit newsroom (machine_bias), as part of an empirical study on ML software which US judges employ to evaluate the risk of further crimes by individuals who have been previously arrested. The ground truth here is whether an individual committed a crime in the following two years. The sensitive attribute is the individual's ethnicity. Machine learning models trained on this dataset may display disparate impact (zafar2017fairness), thus our objective is to minimize GPA while maximizing accuracy (AUC).
Adult. This dataset is part of the UCI repository (dua2019uci). The ground truth is whether an individual’s annual salary is over 50K$ per year or not (adult). This dataset has been shown to be biased against gender (Louizos2016TheVF; zemel2013learning; cerrato2020constraining).
Bank marketing. In this dataset, the classification goal is whether an individual will subscribe to a term deposit. Models trained on this dataset may display both disparate impact and disparate mistreatment with regard to age, more specifically on whether individuals are under 25 or over 65 years of age.
BiasedMNIST. This is an image dataset based on the well-known MNIST Handwritten Digits database in which the background has been modified so as to display a color bias (bahng2020learning). More specifically, a nuisance factor is introduced which is highly correlated with the ground truth and whose values determine the background color in the training set. Ten different colors are preselected, one for each class value, and inserted as backgrounds in the training images with high probability. The test images, on the other hand, have background colors chosen at random. In Figure 3, samples from the training and test data are shown. Thus, the background color/nuisance factor provides a very strong training bias: the simplest strategy for a model to achieve high accuracy on the training set is to overfit the background color. Therefore, models that are unable to learn invariant representations and decisions will inevitably overfit the training set (bahng2020learning).
4.2. Metrics
Group-dependent Pairwise Accuracy. We employ this metric to test the disparate mistreatment of our models. We first present it in its original formulation, which was developed for fair ranking applications (fair_pair_metric).
Let G = {G_1, …, G_g} be a set of protected groups such that every instance in the dataset belongs to one of these groups. The group-dependent pairwise accuracy A_{G_i > G_j} is then defined as the accuracy of a ranker on pairs of instances in which the instance labeled more relevant belongs to group G_i and the instance labeled less relevant belongs to group G_j. Since a fair ranker should not discriminate against protected groups, the difference |A_{G_i > G_j} − A_{G_j > G_i}| should be close to zero. In the following, we call the group-dependent pairwise accuracy GPA. This metric may be employed in classification experiments by considering a classifier's pairwise accuracy when computing A_{G_i > G_j} and A_{G_j > G_i}.
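As an illustration, GPA for a binary classification task can be computed over cross-group pairs as follows (a sketch under our reading of the metric, with hypothetical toy scores):

```python
import numpy as np
from itertools import product

def pairwise_accuracy(scores, y, groups, gi, gj):
    """A_{Gi > Gj}: fraction of pairs (a, b) with y_a = 1, y_b = 0, a in Gi, b in Gj,
    that the model orders correctly (score_a > score_b)."""
    pos = np.where((groups == gi) & (y == 1))[0]
    neg = np.where((groups == gj) & (y == 0))[0]
    correct = sum(scores[a] > scores[b] for a, b in product(pos, neg))
    return correct / (len(pos) * len(neg))

def gpa_gap(scores, y, groups, gi, gj):
    """|A_{Gi>Gj} - A_{Gj>Gi}|: close to zero for a fair model."""
    return abs(pairwise_accuracy(scores, y, groups, gi, gj)
               - pairwise_accuracy(scores, y, groups, gj, gi))

y = np.array([1, 0, 1, 0])
groups = np.array([0, 0, 1, 1])
print(gpa_gap(np.array([0.9, 0.1, 0.8, 0.2]), y, groups, 0, 1))  # → 0.0 (group-consistent)
print(gpa_gap(np.array([0.9, 0.4, 0.3, 0.1]), y, groups, 0, 1))  # → 1.0 (one-sided errors)
```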
Area under the Discrimination Curve (AUDC).
We take the discrimination as a measure of disparate impact, as previously done in the literature (zemel2013learning). It is given by

disc = | (1 / N_{s=1}) Σ_{i : s_i = 1} ŷ_i − (1 / N_{s=0}) Σ_{i : s_i = 0} ŷ_i |,

where s_i = 1 denotes that the i-th example has a value of s equal to 1 and N_{s=1} is the number of such examples. We then generalize this metric in a similar fashion to how accuracy may be generalized to obtain a classifier's area under the curve (AUC): we evaluate the measure above for different classification thresholds and then compute the area under the resulting curve. In the following, we refer to this measure as AUDC. Contrary to AUC, lower values are better.
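Concretely, AUDC can be obtained by sweeping the decision threshold and integrating the discrimination curve; a minimal sketch (ours) using the trapezoidal rule:

```python
import numpy as np

def discrimination(scores, s, threshold):
    """|P(yhat = 1 | s = 1) - P(yhat = 1 | s = 0)| at one decision threshold."""
    yhat = scores >= threshold
    return abs(yhat[s == 1].mean() - yhat[s == 0].mean())

def audc(scores, s, n_thresholds=101):
    """Area under the discrimination curve over thresholds in [0, 1]; lower is better."""
    ts = np.linspace(0.0, 1.0, n_thresholds)
    d = [discrimination(scores, s, t) for t in ts]
    width = ts[1] - ts[0]
    return float(sum((d[i] + d[i + 1]) / 2.0 * width for i in range(len(d) - 1)))

s = np.array([0] * 50 + [1] * 50)
fair_scores = np.tile(np.linspace(0.0, 1.0, 50), 2)  # identical score distributions per group
biased_scores = s.astype(float)                      # scores leak s completely
print(audc(fair_scores, s))          # → 0.0
print(audc(biased_scores, s) > 0.9)  # → True
```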
4.3. Experimental setup
We split all datasets into 3 internal and 3 external folds.
On the 3 internal folds, we employ a Bayesian optimization technique to find the best hyperparameters for our model.
A summary of our models’ best hyperparameters may be found in Table 1.
As our interest is to obtain models which are both fair and accurate, we employ Bayesian optimization to maximize the sum of the models' AUC, 1 − GPA and 1 − AUDC, with a fixed maximum number of optimization iterations.
The best hyperparameter setting found this way is then evaluated on the 3 external folds and reported. We relied on the Weights & Biases platform for an implementation of Bayesian optimization and overall experiment tracking (wandb).
On the fairness datasets, we compare with an adversarial classifier (AdvCls in the figures) trained as described by Xie et al. (xie2017controllable) and a variational fair autoencoder (VFAE) (Louizos2016TheVF). We obtained publicly available implementations of these models and optimized their hyperparameters with the same strategy we employed for our own. We report the model relying on the Bernoulli entropy computation (Section 3.2) as BinaryBernoulli in the figures, while the model relying on density estimation to compute the joint distribution (Section 3.3) is referred to as BinaryMI.
Table 1. Best hyperparameters found for each model and dataset.

Hyperparameter     Dataset   BinaryMI   BinaryBernoulli
Batch size         COMPAS    —          —
Batch size         Banks     —          —
Batch size         Adult     —          —
N. hidden layers   COMPAS    —          —
N. hidden layers   Banks     —          —
N. hidden layers   Adult     —          —
Hidden layer size  COMPAS    —          —
Hidden layer size  Banks     —          —
Hidden layer size  Adult     —          —
Hybrid             COMPAS    yes        yes
Hybrid             Banks     yes        yes
Hybrid             Adult     yes        yes
Learning rate      all       —          —
Optimizer          all       ADAM       ADAM
Epochs             all       100        100
4.4. Fair and Invariant Classification
We report plots analyzing the accuracy/fairness trade-off of our models trained for classification in Figure 2. We take AUC as our measure of accuracy and both 1 − GPA and 1 − AUDC as fairness metrics. The ideal model in the fair classification setting displays maximal AUC and little to no GPA/AUDC and would appear at the very top right of Figure 2. This result is not attainable on the datasets we consider, as there is usually some correlation between y and s, which prevents us from obtaining perfectly fair and accurate decisions. Thus, one needs to consider possible accuracy/fairness trade-offs. We assume a balanced accuracy/fairness trade-off and consider as the "best" model the one closest to the (1, 1) point under the Euclidean norm. We then show all equivalent trade-off points as a dotted line. We see that on both COMPAS and Adult, our BinaryMI model is able to find a stronger trade-off than the competitors, with BinaryBernoulli very close to an equivalent trade-off. The same may be said for the third dataset, Banks, on which however our two models find very different trade-offs, with BinaryBernoulli preferring an almost perfectly fair result and BinaryMI finding a very accurate model. Compared to adversarial learning and variational inference, our models either find the best trade-off or lie closest to the best trade-off line.
4.5. Fair and Invariant Representations
We also analyze the accuracy/fairness trade-off of the representations learned by our models. We extract neural activations from all the networks considered at the penultimate layer. We then split them into folds as described in Section 4.3 and report in Figure 4 the performance of a Random Forest with 1000 base estimators trained to predict the sensitive attribute s associated with each representation. The rationale here is that representations that are invariant to s should provide no useful training signal to a supervised classifier trying to predict it, leaving performance close or equal to random guessing (i.e. predicting the majority value for s). We measure this by computing the absolute distance to random guess (ADRG). In Figure 4 we see that BinaryMI is able to find the best trade-off between informativeness on y (bottom row) and invariance to s. Our best BinaryBernoulli model representations, which are comparatively not very invariant on COMPAS, performed strongly in this regard on both Adult and Banks.
4.6. Stability and Sensitivity Analysis
It is important that fair classification models are able to find different tradeoffs between fairness and accuracy depending on the application requirements. This tradeoff may be regulated, in practical terms, via a parameter λ which weighs the importance of accuracy and invariance in the loss function (see Section 3.2). This idea is common, to the best of our knowledge, to all the fair representation learning algorithms developed so far. We explore how the performance of our model changes with λ in Figures 4(a) and 4(b), where we report the performance of a BinaryBernoulli model trained on the COMPAS dataset and an adversarial classifier for comparison. What we observe in Figure 4(a) is that our model displays relatively stable performance. As λ grows, so does the fairness of the model in terms of 1 − GPA, with a strong positive linear correlation. The model is most sensitive to intermediate values of λ. The same trend, but reversed, may be observed for AUC in Figure 4(b). We observed similar trends for BinaryMI, with strong correlations for both 1 − GPA and AUC. Thus, we reason that our proposal may be employed under different fairness requirements with minimal changes (a tweak of the λ parameter). The adversarial classifier, on the other hand, displays little correlation between λ and its performance. While its best models are on par or almost on par with BinaryBernoulli, this happens for arbitrary values of λ. The adversarial model does not seem to be able to explore the fairness/accuracy tradeoff quite as well, and the effect of the λ parameter is unpredictable. We posit that this behavior may be due to the difficulty of striking a balance between the predictive power of the two subnetworks which predict y and s alternatively (xie2017controllable). This is a well-known issue for generative adversarial models which pit different networks against each other (Chu2020Smoothness).
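The role of λ can be sketched as a simple weighted sum; the function below is illustrative (the paper's exact loss is defined in Section 3.2), with hypothetical loss values standing in for the task loss and the invariance penalty:

```python
def fair_objective(task_loss, invariance_penalty, lam):
    """Accuracy/fairness tradeoff as a weighted sum (sketch).

    `task_loss` is the usual prediction loss, `invariance_penalty` is
    any estimate of the information the representation retains about
    the sensitive attribute (e.g. a mutual information estimate), and
    `lam` is the tradeoff parameter discussed in the text: larger
    values push the model towards invariance at the cost of accuracy.
    """
    return task_loss + lam * invariance_penalty

# Sweeping lam traces out the fairness/accuracy curve (toy values):
for lam in (0.0, 0.5, 1.0, 2.0):
    print(lam, fair_objective(0.30, 0.12, lam))
```

Stability here means the resulting model's fairness and accuracy vary smoothly and monotonically with λ, so a practitioner can dial in a required tradeoff rather than search blindly.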
4.7. Biased-MNIST
Table 2. Performance on Biased-MNIST at two training bias levels.

Bias   Vanilla  ReBias  LearnedMixin  RUBi  BinaryMI  BinaryBernoulli
0.995  72.1     76.0    78.2          90.4  89.08     90.64
0.990  89.1     88.1    88.3          93.6  88.54     96.02
We report our method’s performance on the Biased-MNIST dataset in Table 2. We also report results from Bahng et al. (bahng2020learning) as a comparison. To enable this comparison, we experimented with the same setup as the authors’ by training our model for 80 epochs. We, however, selected our best hyperparameters with a Bayesian optimization strategy and employed an AlexNet-like structure (krizhevsky2012imagenet) alternating (binary) convolutional layers and max-pooling layers. We then tested with two different bias levels, i.e. the probability of a training sample displaying a specific color bias. We observe that BinaryBernoulli is the strongest performer at both bias levels, even when compared with the full-precision strategies. BinaryMI has performance comparable to a “vanilla” convolutional network at the lower bias level. However, it scales better to the higher bias level than the baseline method. On this dataset, we see that our methodology is also a strong performer when removing biases from image data is necessary for classification accuracy.

5. Conclusion and Future Work
In this paper we proposed a methodology to compute the mutual information between a stochastically activated neuron and a sensitive attribute. We then generalized this methodology into two different strategies to compute the mutual information between a layer of neural activations and a sensitive attribute. Both our strategies perform strongly on both fair classification datasets and invariant image classification. Furthermore, our methodology displays high stability to changes of the accuracy/fairness tradeoff parameter λ, especially when compared to adversarial learning (xie2017controllable; madras). A possible direction for further development is to employ the methodologies discussed in this paper to revisit the debate on the information bottleneck problem introduced by Tishby et al. (tishby2015deep). As our models are stochastically quantized, they naturally lend themselves to mutual information computation, avoiding many of the common issues in estimating information measures in neural networks (goldfeld2019estimating). While the methodology presented in Section 3.3 does not seem to scale to very wide networks, it is possible to repeat the sampling (the stochastic quantization) as many times as needed. Furthermore, we would like to test the capabilities of the presented models in domain adaptation scenarios, where adversarial models are still used extensively (ganin2016jmlr; madras).
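To illustrate why quantized activations lend themselves to mutual information computation, a plug-in estimate for a single binary neuron can be sketched as follows; this is a generic discrete estimator under our assumptions, not the paper's exact formulation, and the sample lists are illustrative:

```python
import math

def mi_binary(samples):
    """Plug-in mutual information I(B; S) in nats between a binary
    neuron activation B and a binary sensitive attribute S, estimated
    from (b, s) samples. Because the quantization is stochastic, the
    samples can simply be redrawn as many times as needed (sketch).
    """
    n = len(samples)
    joint, pb, ps = {}, {}, {}
    for b, s in samples:
        joint[(b, s)] = joint.get((b, s), 0) + 1
        pb[b] = pb.get(b, 0) + 1
        ps[s] = ps.get(s, 0) + 1
    mi = 0.0
    for (b, s), c in joint.items():
        # p(b,s) * log( p(b,s) / (p(b) p(s)) ), in count form.
        mi += (c / n) * math.log(c * n / (pb[b] * ps[s]))
    return max(mi, 0.0)  # guard against tiny negative float error

# Independent B and S give zero MI; B == S gives log 2 nats.
ind = [(b, s) for b in (0, 1) for s in (0, 1)]  # uniform product
dep = [(0, 0), (1, 1)] * 2                      # perfectly dependent
print(mi_binary(ind), mi_binary(dep))
```

A layer-level estimate then only requires applying the same counting argument to the joint distribution of several binary neurons, which is where the width limitation mentioned above comes from.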