Fair Meta-Learning: Learning How to Learn Fairly

11/06/2019 ∙ by Dylan Slack, et al.

Data sets for fairness-relevant tasks can lack examples or be biased with respect to a specific label of a sensitive attribute. We demonstrate the usefulness of weight-based meta-learning approaches in such situations. For models that can be trained through gradient descent, we demonstrate that there are parameter configurations from which models can be optimized, in a few gradient steps and with minimal data, to be both fair and accurate. To learn such weight sets, we adapt the popular MAML algorithm to Fair-MAML by including a fairness regularization term. In practice, Fair-MAML allows practitioners to train fair machine learning models from only a few examples when data from related tasks is available. We empirically demonstrate the value of this technique by comparing it to relevant baselines.




1 Introduction

Advances in the field of meta-learning provide methods to train machine learning models that can better generalize to new tasks using previous experiences. A known issue in developing fair machine learning classifiers is data collection. Data can be biased in collection, have minimal training examples, or otherwise be unrepresentative of the true testing population Kallus and Zhou (2018); Holstein et al. (2019); Coston et al. (2019). Can we adapt meta-learning approaches to handle issues in fairness related to minimal data and bias in the distribution of training data when there is related task data available?

We note that, overall, the problem of training fair machine learning models with very little task-specific training data is relatively unstudied. Related work includes methods to transfer fair machine learning models. Madras et al. (2018) propose an adversarial learning approach called LAFTR for fair transfer learning. Schumann et al. (2019) provide theoretical guarantees surrounding transfer fairness related to equalized odds and equal opportunity and suggest another adversarial approach aimed at transferring into new domains with different sensitive attributes. Additionally, Lan and Huan (2017) observe that the predictive accuracy of transfer learning across domains can be improved at the cost of fairness. Related to fair transfer learning, Dwork et al. (2018) use a decoupled classifier technique to train a selection of classifiers fairly for each sensitive group in a data set. Developing models that are able to achieve satisfactory levels of fairness and accuracy with only minimal data available in the desired task could be an important avenue for future work. To address the proposed question, we introduce a fair meta-learning approach: Fair-MAML.

2 Fairness Setting

We assume a binary fair classification scenario. We have features $X$, labels $Y$, and sensitive attributes $A$. We consider the positive outcome $1$ in $Y$ and the protected group $0$ in $A$. We train a classifier $f_\theta$, parameterized by $\theta$, to be accurate and fair with respect to $Y$ and $A$.

We consider two notions of fairness: demographic parity and equal opportunity. Demographic parity (or disparate impact Feldman et al. (2015)) can be described as:

$$\frac{P(\hat{Y} = 1 \mid A = 0)}{P(\hat{Y} = 1 \mid A = 1)} \tag{1}$$

Here, we present demographic parity Dwork et al. (2012) as a ratio, where $\hat{Y}$ is the predicted label and $A = 0$ marks the protected group. The closer the ratio is to $1$, the fairer the classifier. Next, equal opportunity Hardt et al. (2016) requires that the protected and unprotected groups have equivalent true positive rates. Similarly, a value closer to $1$ indicates a higher level of fairness:

$$\frac{P(\hat{Y} = 1 \mid A = 0, Y = 1)}{P(\hat{Y} = 1 \mid A = 1, Y = 1)} \tag{2}$$

Another oft-noted notion of fairness, equalized odds Hardt et al. (2016), also includes the false positive rate. In this work, however, we only consider the two earlier mentioned definitions of group fairness.
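Both ratios can be estimated directly from hard predictions. The following is a minimal sketch, assuming that the label value 1 is the positive outcome and that the sensitive value 0 marks the protected group; the function names are hypothetical and are not taken from the paper's code.

```python
def positive_rate(preds, mask):
    """Fraction of positive predictions among the rows selected by mask."""
    selected = [p for p, m in zip(preds, mask) if m]
    if not selected:
        raise ValueError("no rows selected for this group")
    return sum(selected) / len(selected)


def demographic_parity_ratio(preds, sensitive, protected=0):
    """Positive-prediction rate of the protected group over the unprotected group."""
    prot = positive_rate(preds, [a == protected for a in sensitive])
    unprot = positive_rate(preds, [a != protected for a in sensitive])
    return prot / unprot


def equal_opportunity_ratio(preds, labels, sensitive, protected=0):
    """True positive rate of the protected group over the unprotected group."""
    pairs = list(zip(sensitive, labels))
    prot = positive_rate(preds, [a == protected and y == 1 for a, y in pairs])
    unprot = positive_rate(preds, [a != protected and y == 1 for a, y in pairs])
    return prot / unprot


# Toy data: four protected rows (A = 0) followed by four unprotected rows (A = 1).
preds     = [1, 0, 0, 1, 1, 1, 0, 1]
labels    = [1, 0, 1, 1, 1, 1, 0, 0]
sensitive = [0, 0, 0, 0, 1, 1, 1, 1]

dp = demographic_parity_ratio(preds, sensitive)         # 0.5 / 0.75
eo = equal_opportunity_ratio(preds, labels, sensitive)  # (2/3) / 1.0
```

Both functions raise an error when a group is empty or when the unprotected rate is zero, which can easily happen with very small fine-tuning sets.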

3 Fair-MAML

3.1 Fair-MAML Algorithm

In the meta-learning scenario used in this paper, we train a model $f_\theta$ to learn a new task $\mathcal{T}_i$ drawn from a distribution of tasks $p(\mathcal{T})$ using $K$ examples drawn from $\mathcal{T}_i$. Additionally, we assume $f_\theta$ can be optimized through gradient descent. We define a task $\mathcal{T}_i$ by four components. Each task includes a fairness regularization term $\mathcal{R}_{\mathcal{T}_i}$ and a fairness hyperparameter $\gamma$. Additionally, it has a dataset $\mathcal{D}_{\mathcal{T}_i}$ consisting of features, labels, and sensitive attributes, as well as a loss function $\mathcal{L}_{\mathcal{T}_i}$.

In order to train a fair meta-learning model, we adapt Model-Agnostic Meta-Learning (MAML) Finn et al. (2017) to our fair meta-learning framework and introduce Fair-MAML by including the regularization term $\mathcal{R}_{\mathcal{T}_i}$ and the fairness hyperparameter $\gamma$. MAML is trained by optimizing the performance of $f_\theta$ across a variety of tasks after one gradient step. The Fair-MAML algorithm is given in Algorithm 1.

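Require: $p(\mathcal{T})$: distribution over tasks
Require: $\alpha$, $\beta$: step size hyperparameters
  randomly initialize $\theta$
  while not done do
     Sample batch of tasks $\mathcal{T}_i \sim p(\mathcal{T})$
     for all $\mathcal{T}_i$ do
        Sample $K$ datapoints $\mathcal{D}$ from $\mathcal{T}_i$
        Evaluate $\nabla_\theta \left[\mathcal{L}_{\mathcal{T}_i}(f_\theta) + \gamma \mathcal{R}_{\mathcal{T}_i}(f_\theta)\right]$ using $\mathcal{D}$ and $\mathcal{L}_{\mathcal{T}_i}$
        Compute updated parameters: $\theta_i' = \theta - \alpha \nabla_\theta \left[\mathcal{L}_{\mathcal{T}_i}(f_\theta) + \gamma \mathcal{R}_{\mathcal{T}_i}(f_\theta)\right]$
        Sample new datapoints $\mathcal{D}_i'$ from $\mathcal{T}_i$ to be used in the meta-update
     end for
     Update $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i} \left[\mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'}) + \gamma \mathcal{R}_{\mathcal{T}_i}(f_{\theta_i'})\right]$ using each $\mathcal{D}_i'$
  end while
Algorithm 1 Fair-MAML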
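To make the loop structure concrete, the following is a minimal sketch of a Fair-MAML-style meta-update in plain Python. It rests on several assumptions that are not from the paper: the model is a logistic regression rather than a neural network, the sensitive value 0 marks the protected group, the fairness term is one minus the mean predicted positive probability over protected examples, every task is used in every meta-batch, and the meta-update uses plain gradient descent rather than Adam. All function names and default values are hypothetical. The second-order term is computed with an explicit Hessian-vector product.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fair_grad(w, data, gamma):
    """Gradient of cross-entropy + gamma * (1 - mean P(y=1) over protected rows).

    data is a list of (x, y, a); a == 0 marks the protected group (assumption).
    """
    n = len(data)
    n_prot = sum(1 for _, _, a in data if a == 0)
    grad = [0.0] * len(w)
    for x, y, a in data:
        p = sigmoid(dot(w, x))
        coeff = (p - y) / n
        if a == 0:
            coeff -= gamma * p * (1.0 - p) / n_prot
        for j, xj in enumerate(x):
            grad[j] += coeff * xj
    return grad


def fair_hvp(w, data, gamma, v):
    """Hessian-vector product of the same objective, evaluated at w."""
    n = len(data)
    n_prot = sum(1 for _, _, a in data if a == 0)
    out = [0.0] * len(w)
    for x, _, a in data:
        p = sigmoid(dot(w, x))
        curv = p * (1.0 - p) / n
        if a == 0:
            curv -= gamma * p * (1.0 - p) * (1.0 - 2.0 * p) / n_prot
        scale = curv * dot(x, v)
        for j, xj in enumerate(x):
            out[j] += scale * xj
    return out


def fair_maml(tasks, dim, alpha=0.1, beta=0.05, gamma=1.0, k=5, meta_iters=100, seed=0):
    """Meta-train initial weights with one inner gradient step per task."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    for _ in range(meta_iters):
        meta_grad = [0.0] * dim
        for task in tasks:
            support = rng.sample(task, k)  # K points for the inner step
            query = rng.sample(task, k)    # new points for the meta-update
            g = fair_grad(w, support, gamma)
            w_adapted = [wj - alpha * gj for wj, gj in zip(w, g)]
            gq = fair_grad(w_adapted, query, gamma)
            hv = fair_hvp(w, support, gamma, gq)
            # Chain rule through the inner step: (I - alpha * H) applied to gq.
            for j in range(dim):
                meta_grad[j] += gq[j] - alpha * hv[j]
        w = [wj - beta * mj for wj, mj in zip(w, meta_grad)]
    return w


def toy_task(rng, slope, n=40):
    """Hypothetical task: label is 1 above the line x2 = slope * x1."""
    task = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        a = int(rng.random() < 0.5)
        task.append(([x1, x2, 1.0], int(x2 > slope * x1), a))
    return task


rng = random.Random(1)
tasks = [toy_task(rng, s) for s in (-1.0, 0.0, 1.0)]
theta = fair_maml(tasks, dim=3)
```

Dropping the `alpha * hv[j]` term would give a first-order approximation that avoids the Hessian-vector product at the cost of ignoring how the inner step depends on the initial weights.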

3.2 Fairness Regularizers

Fair-MAML requires that second derivatives be computed through a Hessian-vector product in order to differentiate the meta-loss function, which can be computationally intensive and time-consuming. We propose two regularization terms, for demographic parity and equal opportunity, that are quick to compute. First, considering demographic parity, let $\mathcal{D}_0$ denote the protected instances in the dataset $\mathcal{D}$:

$$\mathcal{R}_{dp}(f_\theta, \mathcal{D}) = 1 - P(\hat{Y} = 1 \mid A = 0) \approx 1 - \frac{1}{|\mathcal{D}_0|} \sum_{x \in \mathcal{D}_0} P(f_\theta(x) = 1) \tag{3}$$

This regularizer incurs a penalty if the probability that the protected group receives positive outcomes is low. Our value assumption is that we attempt to adjust upwards the rate at which the protected class receives positive outcomes.

We also propose a regularizer for equal opportunity. Let $\mathcal{D}_0^1$ denote the instances within $\mathcal{D}$ that are both protected and have the positive outcome in $Y$.

$$\mathcal{R}_{eo}(f_\theta, \mathcal{D}) = 1 - P(\hat{Y} = 1 \mid A = 0, Y = 1) \approx 1 - \frac{1}{|\mathcal{D}_0^1|} \sum_{x \in \mathcal{D}_0^1} P(f_\theta(x) = 1) \tag{4}$$

We make a similar value assumption for equal opportunity: we adjust the true positive rate of the protected class upwards.
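As an illustration of how cheap these terms are to evaluate, the sketch below computes both regularizers from a model's predicted positive-class probabilities. It assumes that each regularizer is one minus the mean predicted positive probability over the relevant protected instances, that the sensitive value 0 marks the protected group, and that the label 1 is the positive outcome. Returning zero for an empty group is a choice made here for convenience, not something the paper specifies, and the function names are hypothetical.

```python
def dp_regularizer(probs, sensitive, protected=0):
    """1 - mean predicted P(positive) over protected instances."""
    group = [p for p, a in zip(probs, sensitive) if a == protected]
    if not group:
        return 0.0  # assumption: no penalty when no protected rows are present
    return 1.0 - sum(group) / len(group)


def eo_regularizer(probs, labels, sensitive, protected=0, positive=1):
    """1 - mean predicted P(positive) over protected instances with a positive label."""
    group = [p for p, y, a in zip(probs, labels, sensitive)
             if a == protected and y == positive]
    if not group:
        return 0.0  # assumption: no penalty when the subgroup is empty
    return 1.0 - sum(group) / len(group)


probs     = [0.9, 0.2, 0.6, 0.7]
labels    = [1, 0, 1, 1]
sensitive = [0, 0, 1, 0]

r_dp = dp_regularizer(probs, sensitive)          # 1 - mean(0.9, 0.2, 0.7)
r_eo = eo_regularizer(probs, labels, sensitive)  # 1 - mean(0.9, 0.7)
```

Because both terms are simple averages of model outputs, they are differentiable with respect to the model parameters and add little cost to each gradient step.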

4 Experiments

4.1 Synthetic Experiment

We illustrate the usefulness of Fair-MAML as opposed to a regularized pre-trained model through a synthetic example based on Zafar et al. (2017). We generate two Gaussian distributions using the means and covariances from Zafar et al. To simulate a variety of tasks, we generate labels by letting all the positively labeled points be those above a randomly generated line through with slope randomly selected on the range . We add a sensitive feature using the same technique as Zafar et al. Full data generation details can be found in section 6.1 in the appendix.

In order to assess the fine-tuning capacity of Fair-MAML and the pre-trained neural network, we introduce a more difficult fine-tuning task. During training, the two classes were separated clearly by a line. For fine-tuning, we set each of the binary class labels to a specific distribution. In this scenario, a straight line cannot clearly divide the two classes. We assigned sensitive attributes using the same strategy as above. Additionally, we only fine-tuned with positive-outcome examples from the protected class.

We trained Fair-MAML using a neural network consisting of two hidden layers of nodes and the ReLU activation function. Full training details can be found in section 6.2 in the appendix. We present the biased example task in figure 1. We give comprehensive results over a variety of examples in figure 3 in the appendix. In the new task, there is an unseen configuration of positively labeled points. It was not possible for positively labeled points to fall below during training. Fair-MAML is able to perform well with respect to both fairness and accuracy on the fine-tuning task when only biased fine-tuning data is available, while the pre-trained network fails.

Figure 1: An example decision boundary from the pre-trained neural network, MAML, and Fair-MAML on the synthetic example (note: Fair-MAML is MAML with ). Points that are colored the same as the side of the boundary are correct. Blue is the positive class. Only points in the positive outcome and protected class are given for the fine-tuning task. Fair-MAML is able to handle such an imbalance of training points on a previously unseen task while the pre-trained neural network fails—illustrating that Fair-MAML has learned a more useful internal representation.

4.2 Communities and Crime Experiment

Next we consider an example using the Communities and Crime data set Lichman (2013). The goal is to predict the violent crime rate from community demographic and crime relevant information. We convert this data set to a multi-task format by using each state as a different task. We convert violent crime rate into a binary label by whether the community is in the top of violent crime rate within a state. We add a binary sensitive column for whether African-Americans are the highest or second highest population in a community in terms of percentage racial makeup.
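The conversion to a multi-task format can be sketched as follows. This is only an illustration under stated assumptions: the row layout, the field names, and the `top_fraction` and `min_communities` parameters are hypothetical, and no particular threshold value from the paper is implied. Each state becomes one task whose examples are (features, binary label, sensitive flag) triples.

```python
from collections import defaultdict


def sensitive_flag(race_pct, group="black"):
    """1 if `group` has the highest or second highest share in the community."""
    ranked = sorted(race_pct, key=race_pct.get, reverse=True)
    return int(group in ranked[:2])


def build_state_tasks(rows, top_fraction, min_communities):
    """Group communities by state and binarize violent crime rate within each state.

    rows: dicts with keys 'state', 'violent_crime_rate', 'features', 'sensitive'.
    States with fewer than min_communities rows are dropped.
    """
    by_state = defaultdict(list)
    for row in rows:
        by_state[row["state"]].append(row)

    tasks = {}
    for state, group in by_state.items():
        if len(group) < min_communities:
            continue
        rates = sorted((r["violent_crime_rate"] for r in group), reverse=True)
        n_top = max(1, round(top_fraction * len(group)))
        threshold = rates[n_top - 1]
        tasks[state] = [
            (r["features"], int(r["violent_crime_rate"] >= threshold), r["sensitive"])
            for r in group
        ]
    return tasks


rows = [
    {"state": "A", "violent_crime_rate": 0.9, "features": [1.0], "sensitive": 1},
    {"state": "A", "violent_crime_rate": 0.1, "features": [2.0], "sensitive": 0},
    {"state": "A", "violent_crime_rate": 0.4, "features": [3.0], "sensitive": 1},
    {"state": "A", "violent_crime_rate": 0.2, "features": [4.0], "sensitive": 0},
    {"state": "B", "violent_crime_rate": 0.5, "features": [5.0], "sensitive": 0},
    {"state": "B", "violent_crime_rate": 0.7, "features": [6.0], "sensitive": 1},
]
state_tasks = build_state_tasks(rows, top_fraction=0.25, min_communities=3)
```

Binarizing within each state, rather than across the whole data set, keeps the label balance comparable between tasks even when states differ widely in their overall crime rates. Communities tied at the threshold all receive the positive label in this sketch.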

We trained two Fair-MAML models: one with the demographic parity regularizer from equation 3 and another with the equal opportunity regularizer from equation 4. We used a neural network with two hidden layers of nodes. Additionally, we trained two transfer LAFTR models Madras et al. (2018) on the transfer tasks using the demographic parity and equal opportunity adversarial objectives. We also pre-trained a baseline network regularized for both demographic parity and equal opportunity. We set $K = 10$ in Fair-MAML and fine-tuned on communities for both Fair-MAML and the pre-trained network. We used the fairness regularizers for both Fair-MAML and the pre-trained network while fine-tuning. Full training details can be found in section 6.3 in the appendix.

When training an MLP from the encoder on each of the transfer tasks, we found that LAFTR struggled to produce useful results with only

training points from the new task over any number of training epochs. We found that we were able to get reasonable results from LAFTR using

fine-tuning points and epochs of optimization. It makes sense that a smaller number of training epochs for the new task is unsuccessful because the MLP trained on the fairly encoded data is trained from scratch. The results are presented in figure 2. Fair-MAML is able to achieve the best balance of fairness and accuracy and demonstrates strong ability to generalize to new states using only communities.

Figure 2: The accuracy/fairness trade-off for the Communities and Crime example, sweeping over a range of values of the fairness hyperparameter $\gamma$. The data presented is the mean across three runs for each value, using randomly selected held-out tasks. Higher accuracy and fairness values closer to $1$ indicate more successful outcomes. The pre-trained neural network and Fair-MAML received fine-tuning points and were optimized for epoch. We did not find useful results using LAFTR with only fine-tuning points or with a minimal number of fine-tuning epochs, so the LAFTR example given here is with fine-tuning points and epochs of optimization. Fair-MAML achieves the strongest levels of accuracy and fairness.

5 Conclusions

We propose studying transfer fairness in situations with minimal task-specific training data. Through the introduction of Fair-MAML, we showed the usefulness of weight-based meta-learning for training accurate and fair models on both synthetic and real data sets from only a few training points. We leave evaluating other meta-learning approaches, including first-order MAML, to future work.


  • [1] A. Coston, K. N. Ramamurthy, D. Wei, K. R. Varshney, S. Speakman, Z. Mustahsan, and S. Chakraborty (2019) Fair transfer learning with missing protected attributes. In Proceedings of the AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society, Honolulu, HI, USA. Cited by: §1.
  • [2] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012) Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS ’12, New York, NY, USA, pp. 214–226. External Links: ISBN 978-1-4503-1115-1, Link, Document Cited by: §2.
  • [3] C. Dwork, N. Immorlica, A. T. Kalai, and M. Leiserson (2018) Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency, pp. 119–133. Cited by: §1.
  • [4] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015) Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259–268. Cited by: §2.
  • [5] C. Finn, P. Abbeel, and S. Levine (2017-06–11 Aug) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 1126–1135. External Links: Link Cited by: §3.1.
  • [6] M. Hardt, E. Price, and N. Srebro (2016) Equality of opportunity in supervised learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, USA, pp. 3323–3331. External Links: ISBN 978-1-5108-3881-9, Link Cited by: §2, §2.
  • [7] K. Holstein, J. Wortman Vaughan, H. Daumé,III, M. Dudik, and H. Wallach (2019) Improving fairness in machine learning systems: what do industry practitioners need?. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, New York, NY, USA, pp. 600:1–600:16. External Links: ISBN 978-1-4503-5970-2, Link, Document Cited by: §1.
  • [8] N. Kallus and A. Zhou (2018-10–15 Jul) Residual unfairness in fair machine learning from prejudiced data. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 2439–2448. External Links: Link Cited by: §1.
  • [9] C. Lan and J. Huan (2017) Discriminatory transfer. Workshop on Fairness, Accountability, and Transparency in Machine Learning. Cited by: §1.
  • [10] M. Lichman (2013) UCI machine learning repository. External Links: Link Cited by: §4.2.
  • [11] D. Madras, E. Creager, T. Pitassi, and R. Zemel (2018) Learning adversarially fair and transferable representations. International Conference on Machine Learning. Cited by: §1, §4.2, §6.3.3.
  • [12] C. Schumann, X. Wang, A. Beutel, J. Chen, H. Qian, and E. H. Chi (2019) Transfer of machine learning fairness across domains. CoRR abs/1906.09688. External Links: Link, 1906.09688 Cited by: §1.
  • [13] M. B. Zafar, I. Valera, M. Gomez-Rodriguez, and K. P. Gummadi (2017) Fairness constraints: mechanisms for fair classification. AISTATS. Cited by: §4.1.

6 Appendix

Figure 3: Additional synthetic examples.

6.1 Data Generation Synthetic Task

The first distribution (1) is set to and the second (2) is set to . During training, we simulate a variety of tasks by dividing the class labels along a line with y-intercept of and a slope randomly selected on the range . All points above the line, in terms of their vertical coordinate, receive a positive outcome while those below are negative. Using the formulation from Zafar et al., we create a sensitive feature by drawing from a Bernoulli distribution where the probability of the example being in the protected class is:

where . Here, controls the correlation between the sensitive attribute and class labels. The lower , the more correlation and unfairness. We randomly select from the range to simulate a variety in fairness between tasks.
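The generation procedure can be sketched in plain Python as below. This is an illustration under explicit assumptions, not the paper's code. Following the formulation in Zafar et al. (2017), the point is rotated by an angle `phi`, and the protected-class probability is taken to be the density of the rotated point under the first distribution divided by the sum of its densities under both distributions; which distribution appears in the numerator is an assumption here. The means and covariances in `DIST1` and `DIST2` are placeholders chosen for the example, not the values used in the experiments, and all function names are hypothetical.

```python
import math
import random


def gaussian_pdf_2d(x, mean, cov):
    """Density of a 2-D Gaussian at x; cov is ((a, b), (c, d))."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form using the closed-form inverse of a 2x2 matrix.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def rotate(x, phi):
    """Rotate a 2-D point counter-clockwise by phi radians."""
    return (math.cos(phi) * x[0] - math.sin(phi) * x[1],
            math.sin(phi) * x[0] + math.cos(phi) * x[1])


def protected_probability(x, phi, dist1, dist2):
    """Bernoulli parameter for membership in the protected class (assumed form)."""
    xr = rotate(x, phi)
    p1 = gaussian_pdf_2d(xr, *dist1)
    p2 = gaussian_pdf_2d(xr, *dist2)
    return p1 / (p1 + p2)


def sample_is_protected(x, phi, dist1, dist2, rng):
    """Draw the sensitive attribute for one point."""
    return rng.random() < protected_probability(x, phi, dist1, dist2)


def line_label(x, slope, intercept):
    """1 if the point lies above the task's dividing line, else 0."""
    return int(x[1] > slope * x[0] + intercept)


# Placeholder (mean, covariance) pairs, NOT the values from the experiments.
DIST1 = ((1.0, 1.0), ((1.0, 0.0), (0.0, 1.0)))
DIST2 = ((-1.0, -1.0), ((1.0, 0.0), (0.0, 1.0)))

rng = random.Random(0)
point = (0.8, 1.2)
label = line_label(point, slope=0.5, intercept=0.0)
is_protected = sample_is_protected(point, math.pi / 4, DIST1, DIST2, rng)
```

With a rotation angle of zero, the protected-class probability in this sketch is driven entirely by which distribution the point most plausibly came from, and a nonzero rotation changes that dependence. For points extremely far from both means, both densities can underflow to zero and the division fails.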

6.2 Training Details Synthetic Task

We randomly generated synthetic tasks that we cached before training. We sampled examples from each task during meta-training, used a meta-batch size of for Fair-MAML, and performed a single epoch of optimization within the internal MAML loop. We trained Fair-MAML for meta-iterations. For the pre-trained neural network, we performed a single epoch of optimization for each task. We trained over batches of tasks per batch to match the training set size used by Fair-MAML.

The loss used is the cross-entropy loss between the prediction and the true value, combined with the demographic parity regularizer from equation 3. We use a neural network with two hidden layers consisting of nodes and the ReLU activation function. We used the softmax activation function on the last layer. When training with Fair-MAML, we used examples and performed one gradient step update. We set the step size to , used the Adam optimizer to update the meta-loss with learning rate set to . We pre-trained a baseline neural network with the same architecture as Fair-MAML. To one-shot update the pre-trained neural network, we experimented with step sizes of and ultimately found that yielded the best trade-offs between accuracy and fairness. Additionally, we tested values during training and fine-tuning of .

6.3 Communities and Crime Experiment Details

6.3.1 Additional pre-processing details

The Communities and Crime data set has data from states ranging in number of communities from to communities per state. We only used states with or more communities leaving states. We held out randomly selected states for testing and trained using states.

6.3.2 Fair-MAML Training Details

We set and cached meta-batches of size states for training. For testing, we randomly selected communities from the hold out task that we used for fine-tuning and evaluated on whatever number of communities were left over. The number of evaluation communities is guaranteed to be at least because we only included states with or more communities.
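The hold-out protocol described above can be sketched as a simple random split. The function name and the toy task below are hypothetical; `k` stands for the number of fine-tuning communities, whose value is not assumed here.

```python
import random


def split_holdout_task(task, k, seed=None):
    """Randomly pick k examples for fine-tuning; evaluate on all remaining ones."""
    if k >= len(task):
        raise ValueError("need at least one example left for evaluation")
    rng = random.Random(seed)
    indices = list(range(len(task)))
    rng.shuffle(indices)
    fine_tune = [task[i] for i in indices[:k]]
    evaluate = [task[i] for i in indices[k:]]
    return fine_tune, evaluate


# Toy hold-out task with 12 examples of the form (features, label, sensitive).
holdout = [([float(i)], i % 2, i % 3 == 0) for i in range(12)]
fine_tune_set, eval_set = split_holdout_task(holdout, k=5, seed=0)
```

Because the evaluation set is whatever remains after sampling, its size varies from state to state, which is why a minimum number of communities per state is needed to guarantee a usable evaluation set.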

Each of the two layers of nodes used the ReLU activation function. We trained Fair-MAML with one gradient step using a step size of and a meta-learning rate of using the Adam optimizer. We trained the model for meta-iterations. In order to assess Fair-MAML, we trained a neural network regularized for fairness using the same architecture and training data. We fine-tuned the neural network for each of the assessment tasks. We used a learning rate of for training and assessed learning rates of for fine-tuning. We found the fine-tuning rate of to give the best trade-off between accuracy and fairness and present results using this learning rate. We varied the fairness hyperparameter $\gamma$ over incremented by for the demographic parity regularizer. We found higher values of $\gamma$ to work better for the equal opportunity regularizer and varied $\gamma$ from incremented by .

6.3.3 LAFTR Training Details

We used the same transfer methodology and hyperparameters as described in Madras et al. [11] and used a neural network with a hidden layer of nodes as the encoder. We used another neural network with a hidden layer of nodes as the MLP to be trained on the fairly encoded representation. We used the demographic parity and equal opportunity adversarial objectives for the first and second LAFTR models, respectively.

We trained each encoder for epochs and swept over a range of : . We trained with all the data not held out as one of the testing tasks. Though we do not include the results in this presentation, we were able to generate results with LAFTR similar to those of Fair-MAML using training points from the new task after epochs of optimization.