
Fair AutoML

We present an end-to-end automated machine learning system that finds machine learning models that are not only accurate but also fair. The system is desirable for the following reasons. (1) Compared to traditional AutoML systems, this system incorporates fairness assessment and unfairness mitigation organically, which makes it possible to quantify the fairness of the machine learning models tried and to mitigate their unfairness when necessary. (2) The system is designed to have good anytime ‘fair’ performance, such as the accuracy of a model satisfying the necessary fairness constraints. To achieve this, the system includes a strategy that dynamically decides when and on which models to conduct unfairness mitigation according to the prediction accuracy, fairness, and resource consumption observed on the fly. (3) The system is flexible to use. It can be used together with most existing fairness metrics and unfairness mitigation methods.



1 Introduction

Machine learning (ML) is increasingly adopted to make decisions or provide services backed by predictions learned from data in various real-world applications. The effectiveness of machine learning largely depends on the proper configuration of many design choices, such as the choice of learners and hyperparameters. That motivates software solutions for Automated Machine Learning (AutoML). A good AutoML solution is expected to serve as an end-to-end pipeline from data to an ML model ready to be consumed, preferably at low computation cost. Effective AutoML systems have been developed to automatically tune the configurations and find ML models with high prediction accuracy from given training data. However, in many real-world applications where the decisions to be made have a direct impact on the well-being of human beings, it is not sufficient to only have high prediction accuracy. We also expect the ML-based decisions to be ethical and not to put certain unprivileged groups or individuals at a systematic disadvantage.

One example of such human-centered applications where machine learning is heavily used is modern financial services, including financial organizations’ activities such as credit scoring and lending. According to a 2019 survey by UK regulators, Bank of England (2019), two-thirds of UK financial industry participants rely on AI today to make decisions. Moreover, an Economist Intelligence Unit research report John (2020) found that 86% of financial services executives plan on increasing their AI-related investments through 2025. Laws and regulations, for example the U.S. Fair Credit Reporting Act (FCRA) and Equal Credit Opportunity Act (ECOA), have been enforced to prohibit unfairness and discrimination in related financial activities. More examples of such fairness-critical applications include (but are not limited to) education, hiring, criminal justice, and health care. In these applications, despite the critical importance of fairness, there has been increasing evidence of various unfairness issues associated with machine-made decisions or predictions O’Neil (2016); Barocas and Selbst (2016); Angwin et al. (2016).

To incorporate fairness into the AutoML pipeline, we propose to build fair AutoML systems that can produce models that are both accurate and fair. In recent years, a few methods have been proposed to mitigate the unfairness or fairness-related harms of machine learning, making an originally unfair model fair under a specified fairness definition. To the best of our knowledge, they have not been used by any existing AutoML system. We investigate the opportunity of leveraging these mitigation methods, beginning with several research questions. First, while these methods are shown to be overall effective (but also computationally expensive) for mitigating the unfairness of a single machine learning model with fixed hyperparameters, are they useful in the AutoML context where models with varying hyperparameters are searched? Does hyperparameter search negate the necessity of unfairness mitigation because some hyperparameter configurations lead to fairer models than others? If unfairness mitigation is useful, what is the most efficient strategy to combine hyperparameter search and unfairness mitigation, given that both are already much more expensive than training a single model?

To gain insights into these questions, we analyzed the impact of a state-of-the-art unfairness mitigation method on models trained under different hyperparameter configurations proposed in an AutoML process, in terms of accuracy, fairness and computational cost. We have several observations. First, hyperparameter search can already find models with varying degrees of fairness, but it has a limited potential to reach the best accuracy under fairness constraints, and mitigation methods can break that limit. Second, applying unfairness mitigation to a model costs 1 to 2 orders of magnitude more than training that model under the same hyperparameter configuration. A trade-off between allocating resources to hyperparameter search and to unfairness mitigation needs to be made. We further observe that the accuracy before and after mitigation has a strong linear correlation, while the fairness score before and after mitigation has little correlation.

Figure 1: A demonstration of FairAutoML’s performance on a classification task over time. We used the open-source AutoML library, FLAML, as the base AutoML system, to obtain the results. The loss is calculated as 1 - accuracy, and fair loss is the loss when the required fairness constraint is satisfied.

With these observations, we design an AutoML system that incorporates fairness assessment and unfairness mitigation organically. During the AutoML process, at each trial, the system evaluates both the accuracy and fairness of the machine learning model associated with each configuration tried. The fairness evaluation leverages existing efforts on quantifying the fairness of machine learning models. Based on the evaluation result, the system makes dynamic decisions on searching for more hyperparameter configurations, or conducting unfairness mitigation with one of the configurations evaluated. The former is helpful in finding models with good prediction accuracy, and the latter is helpful in producing models with good fairness properties. The system makes an online trade-off between the two choices with the goal of efficiently finding a final model with good accuracy while satisfying a user-specified fairness constraint. Finally, we acknowledge that there is still no single definition of fairness that applies equally well to different applications of machine learning. Different application scenarios may need different fairness metrics and unfairness mitigation methods. Thus we designed the system such that new fairness metrics and mitigation methods can be easily adopted. Empirical evaluation on three machine learning fairness benchmark datasets with two different fairness metrics and two different fairness thresholds shows the effectiveness and efficiency of the system. Figure 1 includes a performance demonstration of our proposed system, named FairAutoML. The demonstration shows that (1) FairAutoML can quickly and effectively reduce the fair loss of the system without affecting the convergence in terms of the original loss; (2) FairAutoML can successfully break the vanilla AutoML’s limit on the fair loss, yielding a better final performance that satisfies fairness constraints.

2 Related literature

In this section, we review background literature about machine learning fairness and work related to fair AutoML.

2.1 Machine learning fairness

There are mainly two types of fairness measurements for machine learning tasks: group fairness and individual fairness. Group fairness is usually framed regarding a particular attribute, called the sensitive attribute or protected attribute. Data can be categorized into groups according to the different values of the sensitive attribute. Examples of commonly used sensitive attributes include ‘gender’, ‘race’, ‘age’ and so on. Under this group fairness framing, fairness, generally speaking, requires some aspects of the machine learning system’s behavior to be comparable or even equalized across different groups. Such requirement(s) can usually be quantitatively expressed in certain forms of statistical parity across groups. One typical example of the aspects of interest is the positive outcome, or error rate. The corresponding parity form is the vanilla statistical parity or Demographic Parity (DP). Another commonly used parity form in group fairness is Equalized Odds (EO) Hardt et al. (2016), which corresponds to equalized error rate conditioned on different response labels. Individual fairness Dwork et al. (2012); Lahoti et al. (2019) ensures that individuals who are ‘similar’ with respect to the learning task receive similar outcomes. There are debates about the conflicts and potential reconciliation between group and individual fairness measures Binns (2020). In this work, to simplify the discussion and considering the maturity of research development and understanding regarding these two types of measurement, we primarily focus on the group fairness measurement.
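To make the Equalized Odds parity form concrete, the following minimal sketch computes an equalized odds difference with NumPy. The function names and the choice to aggregate by the largest between-group gap are assumptions made for illustration; they are not taken from this paper, and fairness toolkits may aggregate differently. The sketch also assumes binary labels and that every group contains samples of both labels.

```python
import numpy as np

def group_rate(y_pred, mask):
    """Fraction of positive predictions among the samples selected by mask."""
    return float(np.mean(y_pred[mask]))

def equalized_odds_difference(y_true, y_pred, sensitive):
    """Largest between-group gap in positive-prediction rate, conditioned on the true label.

    Assumption: binary labels, and every group has samples of both labels.
    """
    y_true, y_pred, sensitive = (np.asarray(a) for a in (y_true, y_pred, sensitive))
    worst_gap = 0.0
    for label in (0, 1):  # label 0 compares false positive rates, label 1 true positive rates
        rates = [group_rate(y_pred, (sensitive == g) & (y_true == label))
                 for g in np.unique(sensitive)]
        worst_gap = max(worst_gap, max(rates) - min(rates))
    return worst_gap
```

A value of 0 means both the true positive rate and the false positive rate are identical across groups; larger values indicate a larger disparity.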

Various methods have been proposed to mitigate the unfairness of machine learning. Those unfairness mitigation methods can be roughly categorized into pre-processing Kamiran and Calders (2012); Calmon et al. (2017), in-processing Kamishima et al. (2011); Woodworth et al. (2017); Zafar et al. (2017); Agarwal et al. (2018); Zhang et al. (2021), and post-processing approaches Friedler et al. (2014), depending on whether the methods should be applied before, during or after the model training process.

Software toolkits have been developed for machine learning fairness. Representative ones include the open-source Python libraries AIF360 Bellamy et al. (2018) and FairLearn Bird et al. (2020). Both provide implementations for state-of-the-art fairness measurement metrics, unfairness mitigation methods, and collections of datasets commonly used for machine learning fairness research.

2.2 Automated machine learning

Many AutoML toolkits and systems have been developed, including open-source AutoML libraries. Notable ones include Auto-sklearn Feurer et al. (2015), TPOT Olson et al. (2016), H2O AutoML, AutoGluon Erickson et al. (2020), and FLAML Wang et al. (2021).

There is little work in the AutoML field that takes fairness into consideration, despite some recent preliminary attempts in the broad regime of AutoML. FairHO (Fair Hyperparameter Optimization) Cruz et al. (2021) tries to make hyperparameter optimization fairness-aware by including the fairness score in the optimization objective through a weighted summation. It relies on user-specified weights to balance the trade-off between prediction accuracy and fairness. FairBO (Fair Bayesian optimization) Perrone et al. (2021) proposes to reduce the unfairness of the learning outcome by varying the hyperparameter configurations and solving a constrained Bayesian optimization problem. However, a thorough analysis in our paper shows that hyperparameter search does not totally negate the necessity of unfairness mitigation, which indicates a potential drawback of this approach. To the best of our knowledge, there is no AutoML solution that is both fairness-aware and able to leverage unfairness mitigation methods to effectively reduce the unfairness of the output model in an end-to-end manner.

3 Framing, formulation and analysis

3.1 Fairness framing

Since the notion of machine learning fairness is deeply contextual, we find it necessary to clearly state the scope of the fairness framing used in this work, which covers a large family of commonly used fairness definitions. We acknowledge, however, that our framing does not cover all existing ways to approach machine learning fairness.

Fairness assessment. In this paper, we primarily consider general group fairness as the fairness criterion. The other main notion of fairness, individual fairness, is not covered for now. Under this group fairness context, the extent of fairness (or unfairness) can be quantitatively calculated given a specific disparity metric and sensitive attribute(s). A model is considered ‘fair’ if the output of the disparity measurement regarding the sensitive attribute(s) equals 0 or does not exceed a particular threshold. We denote by $A$ the sensitive attribute variable, by $D_A$ a disparity function parameterized by the sensitive attribute $A$, and by $\tau$ the fairness threshold constant. The fairness, denoted by $\mathrm{Fair}(\cdot)$, is then determined by $D_A$ and $\tau$ as stated in Eq. (1).

$$\mathrm{Fair}(X, Y, \hat{Y}) = \mathbb{1}\{D_A(X, Y, \hat{Y}) \le \tau\} \tag{1}$$

in which $X$, $Y$, and $\hat{Y}$ are the feature vectors of the data, the ground-truth labels in the data, and the predicted labels, respectively, and $\mathbb{1}\{\cdot\}$ is an indicator function.
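The thresholded fairness assessment described above can be sketched in a few lines. This is a minimal illustration, assuming demographic parity difference as the disparity metric and a non-strict comparison against the threshold; the function names `demographic_parity_disparity` and `fair` are hypothetical and are not part of any library.

```python
import numpy as np

def demographic_parity_disparity(y_pred, sensitive):
    """Largest gap in positive-prediction rate between any two sensitive groups."""
    y_pred, sensitive = np.asarray(y_pred), np.asarray(sensitive)
    rates = [y_pred[sensitive == g].mean() for g in np.unique(sensitive)]
    return float(max(rates) - min(rates))

def fair(y_pred, sensitive, threshold, disparity=demographic_parity_disparity):
    """Indicator of fairness: 1 if the disparity does not exceed the threshold, else 0.

    Assumption: the comparison is non-strict (<=), so a threshold of 0 demands exact parity.
    """
    return int(disparity(y_pred, sensitive) <= threshold)
```

Passing a different callable as `disparity` swaps the fairness metric without changing the rest of the assessment, which mirrors the flexibility the framing aims for.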

Unfairness mitigation. Various methods have been proposed to mitigate fairness-related harms of machine learning as introduced in Section 2. In this paper, we use the state-of-the-art unfairness mitigation method Exponentiated Gradient reduction Agarwal et al. (2018) as the mitigation method by default. This method is used as it is empirically effective, theoretically grounded, open-sourced, and model-agnostic. It can be replaced with other mitigation methods without affecting our problem formulation.

3.2 The fair AutoML formulation

We denote by $\mathcal{X}$ the data feature space, and by $\mathcal{Y}$ the target or outcome (e.g., label in the classification task) space. Given a training dataset $(X_{\text{train}}, Y_{\text{train}})$, a validation dataset $(X_{\text{val}}, Y_{\text{val}})$ and a loss function $\mathrm{Loss}$, with the framing of fairness assessment and unfairness mitigation introduced, we formulate fair AutoML as the process of finding a fair machine learning model minimizing the loss by searching over a set of hyperparameters and deciding whether to do unfairness mitigation, as mathematically presented in the following equation, i.e., Eq. (2).

$$\min_{c \in \mathcal{C},\ m \in \{0, 1\}} \mathrm{Loss}\big(f_c^{(m)}(X_{\text{val}}), Y_{\text{val}}\big) \quad \text{s.t.} \quad \mathrm{Fair}\big(X_{\text{val}}, Y_{\text{val}}, f_c^{(m)}(X_{\text{val}})\big) = 1 \tag{2}$$

in which $c$ denotes a hyperparameter configuration in the hyperparameter search space $\mathcal{C}$, and $f_c^{(m)}$ denotes the resulting machine learning model associated with configuration $c$, trained on the training dataset. On a special note, the superscript $m$, taking value 0 or 1, in $f_c^{(m)}$ denotes whether unfairness mitigation is applied during model training: $m = 0$ means regular model training without unfairness mitigation, and $m = 1$ means model training with unfairness mitigation. $f(X)$ denotes the prediction outcome of the machine learning model $f$ on input data $X$. Under this formulation, we can consider fair AutoML as an AutoML problem optimizing for the fair loss, i.e., the loss when the fairness constraint is satisfied. This fair loss notion will be used throughout this paper to evaluate the effectiveness of the AutoML solutions.

We acknowledge that in some scenarios an AutoML system may include components other than hyperparameter search, for example feature engineering and data cleaning. The fairness consideration proposed is generally compatible with those components. To simplify the discussion, we omit those components in our fair AutoML formulation.

Compared to the original AutoML problem, the fair AutoML problem includes a constraint about fairness, which involves fairness assessment, and the possible choice of unfairness mitigation. As in a regular AutoML problem, the total computation cost for finding a good model is of critical importance to fair AutoML systems in practice. It is desirable to have a fair AutoML system that is able to find a good model at low computation cost and has good anytime performance (note that ‘good’ here is measured by the loss conditioned on the satisfaction of the fairness constraint).
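The fair loss and its anytime behaviour can be illustrated with a short sketch. It assumes each trial is summarized as a hypothetical `(loss, is_fair)` pair in chronological order, and treats the fair loss as infinite until the first fair model is found; both choices are illustrative assumptions rather than the paper's implementation.

```python
import math

def best_fair_loss(trials):
    """Lowest loss among trials whose model satisfies the fairness constraint (inf if none)."""
    return min((loss for loss, is_fair in trials if is_fair), default=math.inf)

def anytime_fair_loss(trials):
    """Best fair loss after each trial, i.e. the curve used to judge anytime performance."""
    curve, best = [], math.inf
    for loss, is_fair in trials:
        if is_fair and loss < best:
            best = loss
        curve.append(best)
    return curve
```

Note that an unfair model with a very low loss never improves the curve, which is why a regular AutoML run can converge in the original loss while its fair loss stalls.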

3.3 Analysis of the fair AutoML problem

Figure 2: Unfairness mitigation’s impact on AutoML, in terms of fairness, loss and computation cost. In the first column, the points labeled ‘No mitigation’ correspond to models trained in the first experiment, where mitigation is not used at all, and the rest correspond to models from the second experiment, where mitigation is always enforced and each configuration tried is always associated with two models, the one ‘Before mitigation’ and the one ‘After mitigation’. The second and third columns show the unfairness scores (measured by demographic parity) and the loss before and after mitigation is applied, together with the corresponding Pearson correlation score. The fourth column shows the distribution of the ‘computation cost ratio’ (the cost of training a model with mitigation divided by that of training a regular model with the same configuration) for the trials where mitigation is applied.

With the fair AutoML problem formulated, we now provide analysis of and insights into this problem.

Necessity of fairness mitigation: We first claim that to achieve fair AutoML, fairness mitigation is generally needed in addition to regular training and hyperparameter search in AutoML. Regular AutoML may fail to find a single model that both satisfies the fairness constraint and has good loss for an ad-hoc task (specified by the dataset and loss metric). This is essentially because there may exist multiple sources of unfairness Dudík et al. (2020), many of which simply cannot be alleviated by varying model hyperparameters. At the same time, effective unfairness mitigation methods have been proposed to improve the fairness of machine learning for general supervised classification tasks, as reviewed in Section 2.

To further support the claim above with empirical evidence, we conduct the following two experiments on a typical dataset used for the evaluation of machine learning fairness, Adult (details about this dataset are deferred to Section 5). In the first experiment, we perform regular AutoML, i.e., using the state-of-the-art AutoML library FLAML, for a long enough CPU time (such that a large enough number of configurations can be tried and the AutoML process has converged). In the second experiment, we perform an ‘exhaustively’-fair AutoML with the same amount of resource, in which unfairness mitigation is enforced for all the configurations tried in the regular AutoML process. We show the ‘unfairness scores’ and loss of the models evaluated in the two experiments in the first column of Figure 2. We consider the output of the disparity measurement D (without considering the threshold yet) as the ‘unfairness score’, for which lower is better. Each scatter marker in the first column corresponds to the performance of a machine learning model trained in the first or second experiment. From the results we can first observe an obvious overall reduction of unfairness scores after mitigation is applied (comparing the ‘Before mitigation’ and ‘After mitigation’ points, which have a one-to-one mapping), which shows the effectiveness of the unfairness mitigation method in reducing the unfairness of each single machine learning model. This conclusion is more obvious from the second column of Figure 2. Then, by comparing the ‘Before mitigation’ and ‘After mitigation’ points together with the ‘No mitigation’ points, we can see the effectiveness of performing mitigation in improving the performance of the AutoML system: when no unfairness mitigation is applied, it is difficult to find configurations that have both a low unfairness score and a low loss. These results show the necessity of including the unfairness mitigation step explicitly as an additional choice of operation in the original AutoML pipeline. In addition, doing so also makes it convenient to leverage current and future research efforts on the explicit mitigation of machine learning unfairness.

Now we analyze how to incorporate unfairness mitigation into the AutoML pipeline effectively and efficiently.

Observation 1. The high computation cost of unfairness mitigation. One naive solution is to always perform unfairness mitigation after each configuration is suggested by the original AutoML, as is done in the ‘exhaustively’-fair AutoML experiment mentioned above. However, we notice that due to the interventional nature of the in-processing mitigation approach, the unfairness mitigation step is usually much more resource-consuming than regular model training. For example, the exponentiated gradient reduction method involves multiple iterations of data sample reweighting and model retraining, which makes the new training process up to hundreds of times more expensive. The histograms in the fourth column of Figure 2 show the distribution of the ratio between the new training cost (with mitigation) and the original training cost (without mitigation). The ratio is more than 120 on this tested dataset. It means that the ‘exhaustively’-fair AutoML system is, in expectation, 120 times slower to try the same set of configurations than the original AutoML system. According to our observations, given the same resource (1 hour of wall-clock time and 16 parallel cores), it can only evaluate 11% to 55% of the total number of models evaluated in the original AutoML system. This high computation cost makes the exhaustive approach impractical, especially in scenarios where the computation cost of the regular AutoML is already high or the total computation budget is small. At the same time, there is room to further improve the ‘exhaustively’-fair approach: we calculated the fraction of resource spent on mitigation that does not lead to a better fair loss, and found it to be large on the tested dataset. This large fraction of wasted resources indicates that it is possible to skip unfairness mitigation on some of the configurations without sacrificing the fair performance.
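One way to compute the fraction of mitigation resource that does not improve the fair loss is sketched below. The accounting rule is an assumption made for illustration (a mitigation trial counts as wasted if it does not lower the best fair loss seen so far), and the record layout `(mitigated, cost, loss, is_fair)` is hypothetical; the paper does not spell out its exact procedure in this section.

```python
import math

def wasted_mitigation_fraction(trials):
    """Share of mitigation cost spent on trials that did not improve the best fair loss.

    trials: chronological list of (mitigated, cost, loss, is_fair) tuples.
    """
    best, spent, wasted = math.inf, 0.0, 0.0
    for mitigated, cost, loss, is_fair in trials:
        improved = is_fair and loss < best
        if improved:
            best = loss
        if mitigated:
            spent += cost
            if not improved:
                wasted += cost  # mitigation ran but the fair loss did not get better
    return wasted / spent if spent else 0.0
```

For example, if three mitigation trials cost 100, 120 and 80 units and only the first one improves the fair loss, two thirds of the mitigation resource is counted as wasted.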

Observation 2. Non-monotonic but strong correlation between the losses before and after unfairness mitigation. Unfairness mitigation methods usually try to improve the fairness of machine learning without sacrificing the prediction accuracy. This general objective of unfairness mitigation methods means that there is usually a strong correlation between the loss of the original model and the loss of the model after mitigation is applied. We also verified this empirically: we visualize the losses of the configurations tried in the AutoML process with and without unfairness mitigation in the third column of Figure 2, where we also report the exact Pearson correlation score. The scores verify the existence of a strong correlation. This strong correlation should be leveraged when deciding when and on which configurations to perform the mitigation, such that the computation resources are efficiently used toward the objective of the fair AutoML system. Following this rationale, one potential approach is to simply apply the mitigation to the most accurate model the original AutoML system found. There are pitfalls in this naive approach: (1) Despite the strong correlation, the relationship between the losses before and after mitigation is non-monotonic as shown in the figures, which indicates that the most accurate model before mitigation is not necessarily the most accurate ‘fair’ model; (2) To remain an end-to-end solution, the system still needs to decide when to stop the original AutoML process and apply the mitigation operation, which is non-trivial, especially if we want to apply the mitigation to multiple models. If this stopping time is not set properly, the system may not be able to produce a single fair model.
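The distinction between a strong correlation and a monotonic relationship can be seen on a small example. The loss values below are made up purely for illustration and are not results from the paper; `np.corrcoef` is used to obtain the Pearson correlation.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical losses of six configurations before and after mitigation (illustrative only).
loss_before = np.array([0.140, 0.145, 0.150, 0.160, 0.180, 0.200])
loss_after = np.array([0.162, 0.158, 0.170, 0.181, 0.197, 0.224])

r = pearson(loss_before, loss_after)
best_before = int(np.argmin(loss_before))  # most accurate configuration before mitigation
best_after = int(np.argmin(loss_after))    # most accurate configuration after mitigation
```

In this toy data the correlation is high, yet the configuration with the lowest loss before mitigation is not the one with the lowest loss after mitigation, which is the first pitfall of mitigating only the single most accurate model.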

Observation 3. Non-monotonic and weak correlation between the unfairness scores before and after unfairness mitigation. We also notice a weak correlation between the unfairness scores in the second column of Figure 2. This weak correlation suggests that the magnitude of the unfairness scores of the original models is of little importance in deciding whether to apply unfairness mitigation.

4 Fair AutoML

Figure 3: FairAutoML Flowchart.

Based on the analysis in the previous section, we propose to explicitly include the unfairness mitigation operation into the AutoML pipeline. And instead of always performing unfairness mitigation, we further take advantage of the observed impact of mitigation in cost, accuracy and fairness to adaptively decide when and on which configurations to do unfairness mitigation such that computation cost is allocated judiciously. Our proposed FairAutoML system is presented in Figure 3 and formally described in Algorithm 1.

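1:  Inputs: Machine learning task related inputs: training and validation datasets $(X_{\text{train}}, Y_{\text{train}})$ and $(X_{\text{val}}, Y_{\text{val}})$, and a loss metric Loss. AutoML related inputs: (1) Hyperparameter search space $\mathcal{C}$, and the computation budget $B$ (optional); (2) The original AutoML searcher $S$, which makes hyperparameter suggestions through a suggest function; (3) The fairness assessment function Fair. An example of Fair can be found in Eq. (1) under the group fairness context.
2:  Initialization: Configurations evaluated $\mathcal{H} = \emptyset$.
3:  while Budget left, i.e., $B > 0$ do
4:     $c' \leftarrow$ the configuration in $\mathcal{H}$ whose model is unfair and has the best loss
5:     if Mitigate($c'$, $\mathcal{H}$) then
6:        Apply unfairness mitigation ($m = 1$) during model training and get model $f_{c'}^{(1)}$, which incurs computation cost $\kappa$.
7:     else
8:        $c \leftarrow$ S.suggest()
9:        Perform regular model training ($m = 0$) and get model $f_c^{(0)}$, which incurs computation cost $\kappa$.
10:     end if
11:     Performance assessment and related statistics update: (1) Assess model loss and fairness: Loss($\hat{Y}$, $Y_{\text{val}}$) and Fair($X_{\text{val}}$, $Y_{\text{val}}$, $\hat{Y}$), in which $\hat{Y} = f(X_{\text{val}})$ is the prediction of the newly trained model $f$. (2) Update related statistics and model: $B \leftarrow B - \kappa$, add the evaluated configuration and its results to $\mathcal{H}$, and update the original AutoML searcher $S$ with the new observation.
12:  end while
Algorithm 1 FairAutoML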
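The control flow of Algorithm 1 can be sketched as a plain Python loop. This is a simplified illustration under stated assumptions, not the authors' implementation: `suggest`, `train`, and `should_mitigate` are hypothetical callables supplied by the caller, configurations are assumed hashable, a configuration is mitigated at most once, and the step that feeds results back to the searcher is omitted.

```python
def fair_automl(suggest, train, should_mitigate, budget):
    """Interleave hyperparameter search and unfairness mitigation until the budget runs out.

    suggest() -> config
    train(config, use_mitigation) -> (loss, is_fair, cost)
    should_mitigate(candidate_config, history) -> bool
    Returns the fair trial record with the lowest loss, or None if no fair model was found.
    """
    history = []
    while budget > 0:
        # Candidate: the unfair, not-yet-mitigated configuration with the best loss.
        mitigated = {r["config"] for r in history if r["mitigated"]}
        unfair = [r for r in history
                  if not r["mitigated"] and not r["fair"] and r["config"] not in mitigated]
        candidate = min(unfair, key=lambda r: r["loss"])["config"] if unfair else None

        if candidate is not None and should_mitigate(candidate, history):
            config, use_mitigation = candidate, True   # one step of unfairness mitigation
        else:
            config, use_mitigation = suggest(), False  # one step of hyperparameter search

        loss, is_fair, cost = train(config, use_mitigation)
        history.append({"config": config, "mitigated": use_mitigation,
                        "loss": loss, "fair": is_fair, "cost": cost})
        budget -= cost

    fair_runs = [r for r in history if r["fair"]]
    return min(fair_runs, key=lambda r: r["loss"]) if fair_runs else None
```

Keeping the decision rule behind a single `should_mitigate` callable mirrors the role of the Mitigate condition in the algorithm: the loop itself stays the same when the strategy changes.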

Candidate configuration to perform unfairness mitigation. At each round of the fair AutoML iterations, the system considers the unfair model that has the best loss as the next candidate model to perform unfairness mitigation (line 4 of Algorithm 1). This design is based on Observation 2 and Observation 3: the loss of the original model is a strong indicator of the loss after unfairness mitigation is applied; and the original magnitude of unfairness score, i.e., the output of disparity D, does not inform the fairness score after mitigation.

Hyperparameter search vs unfairness mitigation. The system then makes a choice (lines 5-10 of the algorithm) between doing one step of unfairness mitigation on this candidate configuration and doing one step of hyperparameter search, as a regular AutoML system would. These two choices may yield two models with different fairness and loss, and are associated with dramatically different computation consumption, i.e., the actual value of the computation cost $\kappa$ in line 6 and line 9 of the algorithm can be very different, according to our analysis in the previous section. A subtle balance between these two choices is needed. The benefit of doing unfairness mitigation is that it can potentially make an originally unfair model fair and thus contribute to the objective of finding fair models. However, there is also a potential drawback: it consumes a large amount of computation resource, which could instead be used for hyperparameter search. Note that hyperparameter search can find better candidate models on which to perform unfairness mitigation and sometimes can even directly find better fair models. On the other hand, concentrating too much resource on hyperparameter search is also not always desirable because it may lead to a situation where no fair model is found and there is not enough resource left to finish unfairness mitigation. To balance the trade-off, we design an adaptive strategy that makes the decision according to the online performance of the system. The strategy is realized by the Mitigate condition (line 5 of the algorithm), which takes the set of historical observations into consideration and decides whether to perform mitigation on the candidate configuration based on the three conditions stated below.

Condition (1) A minimum resource guarantee for hyperparameter search when parallel resource is available. When there is parallel computation resource available, e.g., more than 1 processing unit, we ensure there is at least 1 processing unit running hyperparameter search before proceeding with unfairness mitigation. This design is to guarantee at least a certain amount of resource on hyperparameter search. It avoids the undesirable case where the regular hyperparameter search is totally blocked by the unfairness mitigation, which takes tens to hundreds of times longer than regular model training. When there is only 1 processing unit available, this condition is not enforced.

Condition (2) Doing unfairness mitigation is more efficient in terms of improving the fair loss of the system than doing hyperparameter search. In general, there may exist cases where hyperparameter search is already good enough to find fair models with good loss, for example, when the fairness constraint is easy to satisfy. In this case, the benefit of doing unfairness mitigation is not obvious unless it is more efficient in improving the fair loss than hyperparameter search. Based on this insight, and inspired by recent work about economical hyperparameter search Wang et al. (2021), we propose to use the notion $\mathrm{ECFair}(l)$, which is the Estimated Cost for producing a model with a fair loss better than the input target $l$, to characterize their fair loss improvement efficiency according to historical observations. The exact definition of ECFair is analogous to the function described in Eq. (2) of the paper Wang et al. (2021) by replacing the original loss with fair loss. Due to the space limit, we provide the formula of ECFair in the appendix. The system keeps track of the ECFair of both the unfairness mitigation operation and hyperparameter search, denoted by $\mathrm{ECFair}^{\text{mit}}$ and $\mathrm{ECFair}^{\text{hpo}}$ respectively. We only perform unfairness mitigation when the following inequality holds,

$$\mathrm{ECFair}^{\text{mit}}(l^{*}_{\text{fair}}) < \mathrm{ECFair}^{\text{hpo}}(l^{*}_{\text{fair}})$$

where $l^{*}_{\text{fair}}$ is the best fair loss so far. One nice property of this strategy is that it can achieve a self-adaptive balance between mitigation and hyperparameter search: once one choice becomes less efficient, we turn to the other choice.

Note that although a single step of unfairness mitigation is usually much more expensive than a single step of hyperparameter search with regular model training, finding a model with a better fair loss may involve multiple steps of hyperparameter search and mitigation, which makes the relationship between the ECFair of mitigation and the ECFair of hyperparameter search non-obvious.

Condition (3) There will be fair loss improvement according to the projected fair loss. Ideally, we want to spend resource on unfairness mitigation only for the configurations that are able to yield a better loss than the current best fair loss, assuming the unfairness mitigation can successfully make the model fair. Although such information, i.e., the loss of a configuration after mitigation, is not available a priori, the original loss before mitigation is a meaningful indicator of it according to Observation 2. Based on this insight, we propose to first project the resulting loss and the resource needed for the unfairness mitigation without actually performing it. We only do unfairness mitigation if the projected loss is better than the current best fair loss.

For a particular configuration, we project the loss it would have after mitigation from its original loss. To obtain this projection, we use the strong correlation between the loss before and after mitigation. For example, the error analysis provided in Agarwal et al. (2018) shows that the loss difference before and after mitigation is bounded (by a term that does not depend on the original loss) with high probability. Based on these insights, a natural choice for modeling the loss change is the Gaussian distribution. We further validated this choice in the appendix. We thus estimate the projected loss according to the following formula.


in which the two statistics are the average and the 95% confidence radius of the loss degradation after mitigation, both estimated from historical observations. The formula also uses the projected resource needed to do the mitigation, which is calculated from the resource used in historical mitigations and the success rate of mitigation; we provide its formula in the appendix. The notion of estimated cost for improvement comes from the employed AutoML system FLAML Wang et al. (2021) (defined in Eq. (1) of that paper); here it is the estimated cost for improvement of hyperparameter search. The formula means that when the estimated cost for improvement from hyperparameter search is too large, we lower the requirement on the projected fair loss improvement when deciding whether to perform the mitigation: we use a high-probability lower bound of the projected fair loss instead of the expectation, to avoid being too conservative in performing the mitigation.
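The projection described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `projected_loss`, its parameter names, the use of 1.96 sample standard deviations as the "95% confidence radius", and the reading of "too large" as "larger than the projected mitigation resource" are all assumptions made for the example.

```python
import statistics


def projected_loss(original_loss, past_degradations, eci_hpo, projected_resource):
    """Project the loss after mitigation without running the mitigation.

    past_degradations: observed (loss after mitigation - loss before) values.
    eci_hpo: estimated cost for improvement of hyperparameter search.
    projected_resource: projected resource needed for a successful mitigation.
    """
    mean_degradation = statistics.mean(past_degradations)
    # Assumption: the radius is 1.96 sample standard deviations of a
    # Gaussian fitted to the historical loss changes.
    if len(past_degradations) > 1:
        radius = 1.96 * statistics.stdev(past_degradations)
    else:
        radius = 0.0
    # Assumption: "too large" means the search cost estimate exceeds the
    # projected mitigation resource.
    if eci_hpo > projected_resource:
        # Optimistic high-probability lower bound of the projected loss.
        return original_loss + mean_degradation - radius
    # Otherwise use the expected loss after mitigation.
    return original_loss + mean_degradation


history = [0.010, 0.014, 0.012, 0.016]
expected = projected_loss(0.20, history, eci_hpo=50.0, projected_resource=120.0)
optimistic = projected_loss(0.20, history, eci_hpo=500.0, projected_resource=120.0)
print(expected, optimistic)
```

When hyperparameter search looks expensive, the optimistic value is used, so mitigation is attempted on more configurations.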

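Taking all three conditions into consideration, the decision to perform unfairness mitigation is made only when: (1) there is at least one processing unit running hyperparameter search when parallel resources are available; (2) the ECFair inequality of Condition (2) holds; and (3) the projected fair loss of Condition (3) is better than the current best fair loss.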
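As a rough illustration, the three conditions can be combined as below. The class `SearchState`, its field names and the function `should_mitigate` are hypothetical. The direction of the ECFair comparison (mitigate when its estimated cost is the smaller one) is an assumption inferred from the self-adaptive behaviour described in the text, not the paper's exact inequality.

```python
from dataclasses import dataclass


@dataclass
class SearchState:
    parallel_available: bool    # whether parallel resources are available
    hpo_worker_running: bool    # at least one unit is running hyperparameter search
    ecfair_mitigation: float    # ECFair of unfairness mitigation
    ecfair_hpo: float           # ECFair of hyperparameter search
    projected_fair_loss: float  # projected loss after mitigation
    best_fair_loss: float       # best fair loss found so far


def should_mitigate(state: SearchState) -> bool:
    # Condition (1): only binding when parallel resources are available.
    cond1 = (not state.parallel_available) or state.hpo_worker_running
    # Condition (2), assumed form: mitigation has the lower estimated cost.
    cond2 = state.ecfair_mitigation < state.ecfair_hpo
    # Condition (3): the projected loss beats the current best fair loss.
    cond3 = state.projected_fair_loss < state.best_fair_loss
    return cond1 and cond2 and cond3


state = SearchState(
    parallel_available=False,
    hpo_worker_running=False,
    ecfair_mitigation=40.0,
    ecfair_hpo=90.0,
    projected_fair_loss=0.21,
    best_fair_loss=0.22,
)
print(should_mitigate(state))
```

If any condition fails, the system falls back to one step of hyperparameter search.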

Once a decision is made, we perform the corresponding unfairness mitigation or do one step of hyperparameter search using the hyperparameter searcher in the original AutoML system. Then we assess the loss and the fairness of the resulting machine learning model, and update related statistics and the original AutoML searcher (line 11 of the algorithm).

5 Experiments

Figure 4: Summary of fair loss under different wall-clock time (indicated by the titles). Panels: (a) Fair loss on Adult; (b) Fair loss on Bank; (c) Fair loss on Compas.

Figure 5: Summary of fair loss under different fairness settings. The title of each subfigure includes the specific fairness setting used. Note that the boxplot summary according to fairness metrics includes results from both settings of the fairness thresholds, and vice versa. Panels: (a) Fair loss on Adult; (b) Fair loss on Bank; (c) Fair loss on Compas.

Figure 6: Total number of trials and resource breakdown in FairFLAML and its ablation FLAML-M. The first column of each subfigure shows the total number of trials created in each method. The second column of each subfigure shows the corresponding resource breakdown analysis for achieving the best fair loss given the one hour budget. Panels: (a) Adult; (b) Bank; (c) Compas.

5.1 Datasets and experiment settings

Datasets. We perform empirical evaluation on three machine learning fairness benchmark datasets: Adult Kohavi (1996), Bank Moro et al. (2014) and Compas Angwin et al. (2016). They are all publicly available and representative of three fairness-sensitive machine learning applications: financial resource allocation, business marketing and criminal sentencing. The Adult dataset is a census dataset whose original prediction task is to determine whether a person makes over 50K a year. The Bank dataset is a classification dataset whose goal is to predict whether a client will subscribe to a term deposit. The Compas dataset is a classification dataset used to predict whether a criminal defendant will re-offend. All three datasets are included in the open-source machine learning fairness library AIF360. Following the setting in AIF360, the protected attribute ‘sex’ is used as the sensitive attribute on the Adult and Compas datasets, and ‘age’ (converted into a binary value where the privileged group is ‘age ≥ 25’ and the unprivileged group is ‘age < 25’) on the Bank dataset. We include detailed statistics about these datasets in Table 1.

                   Adult       Bank       Compas
# of instances     48842       45211      5278
# of attributes    18          16         10
Area               financial   business   criminal sentencing
Table 1: Dataset statistics.

Fair AutoML experiment settings. (1) Fairness settings. We experiment with two commonly used fairness metrics for classification tasks, Demographic Parity (DP) and Equalized Odds (EO). We use two different thresholds, 0.05 and 0.01, to determine whether the disparity is small enough to be fair. These two thresholds correspond to a loose and a harsh fairness requirement respectively. In all the experiments, the default unfairness mitigation method, exponentiated gradient reduction, is used. We leverage existing implementations of the fairness metrics and unfairness mitigation from the open-source library Fairlearn.

(2) AutoML settings. We choose FLAML as the base AutoML system. We refer to our FairAutoML system as FairFLAML in this context. FLAML is notable for being lightweight and able to find accurate machine learning models efficiently and economically. In all the experiments, we use XGBoost from FLAML as the machine learner. All the other AutoML-related components, such as data pre-processing and the choice of search space, remain the same as FLAML’s default options. We run all the experiments for up to 1 hour of wall-clock time with 1 CPU core, using the corresponding default hyperparameter searchers in FLAML. We run all experiments (including those for the baselines) 5 times with different random seeds.
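To make the fairness settings concrete, the sketch below computes a DP and an EO disparity for binary predictions and compares them with a threshold. It is a dependency-free illustration, not the Fairlearn implementation used in the experiments. Measuring disparity as the largest between-group difference in rates, and treating "fair" as "disparity at most the threshold", are assumptions of this example; all function names are hypothetical.

```python
def _rate(values):
    # Fraction of positive predictions in a list of 0/1 values.
    return sum(values) / len(values)


def demographic_parity_difference(y_pred, group):
    # Largest gap in positive-prediction rate between groups.
    rates = [
        _rate([p for p, a in zip(y_pred, group) if a == g])
        for g in sorted(set(group))
    ]
    return max(rates) - min(rates)


def equalized_odds_difference(y_true, y_pred, group):
    # Largest between-group gap in positive-prediction rate, taken over
    # the true-label-0 slice (FPR gap) and the true-label-1 slice (TPR gap).
    gaps = []
    for label in (0, 1):
        rates = []
        for g in sorted(set(group)):
            preds = [p for t, p, a in zip(y_true, y_pred, group)
                     if a == g and t == label]
            if preds:
                rates.append(_rate(preds))
        if len(rates) > 1:
            gaps.append(max(rates) - min(rates))
    return max(gaps)


def is_fair(disparity, threshold):
    return disparity <= threshold


y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
dp = demographic_parity_difference(y_pred, group)
eo = equalized_odds_difference(y_true, y_pred, group)
print(dp, eo, is_fair(dp, 0.05), is_fair(eo, 0.01))
```

In this toy data both disparities are far above either threshold, so the model would count as unfair under both the loose and the harsh setting.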

We include FLAML and FairBO, which are reviewed in Section 2, as the baselines. On all the datasets, we use as the original loss metric. We use fair loss as the final metric for evaluating the performance of the compared methods. This metric is also used in the FairBO paper. The performance demonstration shown in Figure 1 and the analysis in Figure 2 are obtained from the Adult dataset with one random seed, following the experiment configurations described above. All the remaining results reported in this paper are from experiments with all 5 random seeds.

5.2 Experiment results

In Figure 4, we summarize the results under different wall-clock time settings (10m and 1h) as boxplots for each dataset. FairFLAML has an overall much better fair loss than all the baselines in both wall-clock time settings. It is worth mentioning that on all the datasets, FairFLAML can already achieve in ten minutes a better or similar fair loss compared to that of the baselines in one hour. Regarding the baselines, we do observe that FLAML has a performance similar to that of FairFLAML on the Bank dataset. This result is consistent with what we observed in Section 3.3: hyperparameter search itself can help find models with varying degrees of fairness, which makes a good fair loss possible in certain tasks. We also observe that the performance of FairBO is worse than that of both FairFLAML and FLAML. This is caused by the nature of FairBO: finding fair and accurate models through hyperparameter search requires it to try a large number of configurations, which can be very time-consuming when complex models are tried. FLAML, although also doing hyperparameter search, is designed in a cost-effective way, which yields better performance than FairBO. FairFLAML can further improve on FLAML in general because of its incorporation of unfairness mitigation.

To get a better sense of FairFLAML’s performance pattern under different fairness settings, we present the results according to fairness metrics and thresholds in Figure 5. FairFLAML performs robustly well across different settings. We observe that when the fairness requirement is relatively loose, i.e., when the threshold is 0.05, FLAML’s performance is close to that of FairFLAML. However, once the fairness requirement becomes harsher, i.e., when the threshold is 0.01, FLAML’s performance can degrade substantially, for example on the Adult dataset. This again verifies the necessity of introducing unfairness mitigation to AutoML.

5.3 Ablation

In this section, we include an ablation of FairFLAML to show the necessity of having an adaptive strategy to choose between unfairness mitigation and hyperparameter search. As mentioned in Section 3.3, one straightforward approach to include unfairness mitigation is to always enforce it during model training after each hyperparameter configuration is proposed. We name this approach FLAML-M and provide a summary of the result comparison in Table 2. FairFLAML matches or outperforms FLAML-M in all cases except the 1st quartile on the Bank dataset.

Dataset   Method      1st quartile   Median   3rd quartile
Adult     FairFLAML   0.2113         0.2138   0.2154
Adult     FLAML-M     0.2116         0.2148   0.2157
Bank      FairFLAML   0.0930         0.0931   0.0948
Bank      FLAML-M     0.0917         0.0942   0.0956
Compas    FairFLAML   0.3232         0.3280   0.3403
Compas    FLAML-M     0.3232         0.3368   0.3403
Table 2: Fair loss of FairFLAML and its ablation FLAML-M.

In addition to fair loss, we are also interested in understanding how extensive the model search is and how much computation resource is consumed in these two methods. For this purpose, in Figure 6, we present the distribution of the total number of trials (equivalent to the number of models trained) from FairFLAML and FLAML-M in the first column of each subfigure, and the resource breakdown analysis for achieving the best fair loss by FairFLAML and FLAML-M in the second column of each subfigure. Each bar in the resource breakdown analysis consists of the resource consumption of three possible types of trials: trials where unfairness mitigation is not applied, labeled ‘HPO’; trials where unfairness mitigation is applied and the resulting model yields a better fair loss (at that time point), labeled ‘Effective Mitigation’; and trials where unfairness mitigation is applied but the resulting model does not yield a better fair loss, labeled ‘Wasted Mitigation’. From Figure 6, we observe that (1) the total number of trials (or models trained) in FairFLAML is larger than that in FLAML-M by one order of magnitude, indicating a more extensive hyperparameter or model search in FairFLAML; (2) FairFLAML can achieve on average better performance with less than half of the resource needed by FLAML-M, according to the heights of the resource consumption bars; (3) in FLAML-M, ‘Wasted Mitigation’ dominates the resource consumption, which is an important source of its inefficiency. In contrast, in FairFLAML, hyperparameter search and unfairness mitigation are well balanced, and wasted mitigation is not the computational bottleneck.
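The three trial types used in the resource breakdown can be computed from a chronological trial log as sketched below. The record layout (dictionaries with the keys `mitigated`, `fair_loss` and `cost`) and the function name `resource_breakdown` are hypothetical; the sketch sums over whatever trials it is given, so restricting the log to the trials run before the best fair loss was reached is left to the caller.

```python
from collections import defaultdict


def resource_breakdown(trials):
    """Sum resource per trial type for a chronological list of trials.

    Each trial is a dict with keys:
      'mitigated' (bool): whether unfairness mitigation was applied,
      'fair_loss' (float): fair loss of the resulting model,
      'cost' (float): resource consumed by the trial.
    """
    best_so_far = float("inf")
    totals = defaultdict(float)
    for trial in trials:
        # "Better" is judged against the best fair loss at that time point.
        improved = trial["fair_loss"] < best_so_far
        if not trial["mitigated"]:
            label = "HPO"
        elif improved:
            label = "Effective Mitigation"
        else:
            label = "Wasted Mitigation"
        totals[label] += trial["cost"]
        best_so_far = min(best_so_far, trial["fair_loss"])
    return dict(totals)


log = [
    {"mitigated": False, "fair_loss": 0.30, "cost": 2.0},
    {"mitigated": True, "fair_loss": 0.25, "cost": 10.0},
    {"mitigated": True, "fair_loss": 0.27, "cost": 12.0},
    {"mitigated": False, "fair_loss": 0.24, "cost": 3.0},
]
print(resource_breakdown(log))
```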

6 Summary and future work

In this work, we build an end-to-end AutoML system to produce fair machine learning models with good accuracy at low computation cost. We started with a general fair AutoML formulation. We then performed a thorough and novel analysis of the impact of unfairness mitigation and hyperparameter search on the fairness, accuracy and resource consumption of an AutoML system. Through the analysis, we identified opportunities (in terms of producing models with good fair loss) for bringing unfairness mitigation into the AutoML pipeline. We designed our fair AutoML system such that unfairness assessment and unfairness mitigation are organically included, and the trade-off between hyperparameter search and unfairness mitigation is made adaptively in an online manner, depending on the system’s performance, including fairness, accuracy, resource consumption and the amount of available resources.

As an immediate next step, we plan to further extend FairFLAML by supporting more diverse machine learning tasks, for example regression tasks. Different fairness metrics and mitigation methods are needed for the regression task Agarwal et al. (2019). We plan to keep track of the latest research developments in fairness definitions and unfairness mitigation and potentially explore more variants of them. We will open source FairFLAML upon the acceptance of this paper.

References


  • A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. Wallach (2018) A reductions approach to fair classification. In International Conference on Machine Learning, pp. 60–69. Cited by: §2.1, §3.1, §4.
  • A. Agarwal, M. Dudík, and Z. S. Wu (2019) Fair regression: quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning, pp. 120–129. Cited by: §6.
  • J. Angwin, J. Larson, S. Mattu, and L. Kirchner (2016) Machine bias: there’s software used across the country to predict future criminals. And it’s biased against blacks. ProPublica. Cited by: §1, §5.1.
  • BankofEngland (10, 2019) Machine learning in uk financial services. Bank of England. Cited by: §1.
  • S. Barocas and A. D. Selbst (2016) Big data’s disparate impact. Calif. L. Rev. 104, pp. 671. Cited by: §1.
  • R. K. E. Bellamy, K. Dey, M. Hind, S. C. Hoffman, S. Houde, K. Kannan, P. Lohia, J. Martino, S. Mehta, A. Mojsilovic, S. Nagar, K. N. Ramamurthy, J. Richards, D. Saha, P. Sattigeri, M. Singh, K. R. Varshney, and Y. Zhang (2018) AI Fairness 360: an extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. Cited by: §2.1.
  • R. Binns (2020) On the apparent conflict between individual and group fairness. In Proceedings of the 2020 conference on fairness, accountability, and transparency, pp. 514–524. Cited by: §2.1.
  • S. Bird, M. Dudík, R. Edgar, B. Horn, R. Lutz, V. Milan, M. Sameki, H. Wallach, and K. Walker (2020) Fairlearn: a toolkit for assessing and improving fairness in AI. Technical Report MSR-TR-2020-32, Microsoft. Cited by: §2.1.
  • F. Calmon, D. Wei, B. Vinzamuri, K. Natesan Ramamurthy, and K. R. Varshney (2017) Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Cited by: §2.1.
  • A. F. Cruz, P. Saleiro, C. Belém, C. Soares, and P. Bizarro (2021) Promoting fairness through hyperparameter optimization. arXiv preprint arXiv:2103.12715. Cited by: §2.2.
  • M. Dudík, W. Chen, S. Barocas, M. Inchiosa, N. Lewins, M. Oprescu, J. Qiao, M. Sameki, M. Schlener, J. Tuo, and H. Wallach (2020) Assessing and mitigating unfairness in credit models with the Fairlearn toolkit. Technical Report MSR-TR-2020-34, Microsoft. Cited by: §3.3.
  • C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012) Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pp. 214–226. Cited by: §2.1.
  • N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. Smola (2020) Autogluon-tabular: robust and accurate automl for structured data. arXiv preprint arXiv:2003.06505. Cited by: §2.2.
  • M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter (2015) Efficient and robust automated machine learning. In NIPS, Cited by: §2.2.
  • S. Friedler, C. Scheidegger, and S. Venkatasubramanian (2014) Certifying and removing disparate impact. arXiv preprint arXiv:1412.3756. Cited by: §2.1.
  • H2O AutoML. Cited by: §2.2.
  • M. Hardt, E. Price, and N. Srebro (2016) Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems 29, pp. 3315–3323. Cited by: §2.1.
  • D. John (2020) The road ahead: artificial intelligence and the future of financial services. The Economist Intelligence Unit. Cited by: §1.
  • F. Kamiran and T. Calders (2012) Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33 (1), pp. 1–33. Cited by: §2.1.
  • T. Kamishima, S. Akaho, and J. Sakuma (2011) Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 643–650. Cited by: §2.1.
  • R. Kohavi (1996) Scaling up the accuracy of naive-bayes classifiers: a. In Proceedings of the Second International Conference on, Cited by: §5.1.
  • P. Lahoti, K. P. Gummadi, and G. Weikum (2019) Ifair: learning individually fair data representations for algorithmic decision making. In 2019 ieee 35th international conference on data engineering (icde), pp. 1334–1345. Cited by: §2.1.
  • S. Moro, P. Cortez, and P. Rita (2014) A data-driven approach to predict the success of bank telemarketing. Decision Support Systems 62, pp. 22–31. Cited by: §5.1.
  • C. O’neil (2016) Weapons of math destruction: how big data increases inequality and threatens democracy. Crown. Cited by: §1.
  • R. S. Olson, R. J. Urbanowicz, P. C. Andrews, N. A. Lavender, L. C. Kidd, and J. H. Moore (2016) Automating biomedical data science through tree-based pipeline optimization. In Applications of Evolutionary Computation, G. Squillero and P. Burelli (Eds.), pp. 123–137. Cited by: §2.2.
  • V. Perrone, M. Donini, M. B. Zafar, R. Schmucker, K. Kenthapadi, and C. Archambeau (2021) Fair bayesian optimization. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 854–863. Cited by: §2.2.
  • C. Wang, Q. Wu, S. Huang, and A. Saied (2021) Economical hyperparameter optimization with blended search strategy. Cited by: §4.
  • C. Wang, Q. Wu, M. Weimer, and E. Zhu (2021) FLAML: a fast and lightweight automl library. In MLSys, Cited by: §2.2, §4.
  • B. Woodworth, S. Gunasekar, M. I. Ohannessian, and N. Srebro (2017) Learning non-discriminatory predictors. In Conference on Learning Theory, pp. 1920–1953. Cited by: §2.1.
  • M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi (2017) Fairness constraints: mechanisms for fair classification. In Artificial Intelligence and Statistics, pp. 962–970. Cited by: §2.1.
  • H. Zhang, X. Chu, A. Asudeh, and S. B. Navathe (2021) OmniFair: a declarative system for model-agnostic group fairness in machine learning. In Proceedings of the 2021 International Conference on Management of Data, pp. 2076–2088. Cited by: §2.1.

Appendix A Omitted details

The impact of mitigation on loss. We visualize the actual loss change in our experiments to validate our Gaussian assumption about the unfairness mitigation’s impact on the loss in Figure 7.

Figure 7: The distribution of loss change for the trials where unfairness mitigation is applied under the second experiment described for Figure 2 of Section 3.3.

Formal definition of the projected mitigation resource. We calculate the projected resource needed for performing a successful unfairness mitigation on a hyperparameter configuration according to:


in which the terms are: the set of trials in the history where unfairness mitigation was performed, together with its cardinality; the actual resource used for performing unfairness mitigation on a hyperparameter configuration; the resource used for performing regular training on that configuration; and the success rate of performing unfairness mitigation. Under this notation, the factor relating mitigation resource to regular training resource is the ‘computation cost ratio’ visualized in the last column of Figure 2, and the resulting quantity is thus the expected resource consumption for applying mitigation to a given configuration. We further penalize it by the success rate to get an estimate of the resource needed for a successful mitigation.
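Since the displayed formula is not available here, the following sketch shows one way to compute a quantity matching the verbal description. Its specific form is an assumption: the computation cost ratio is taken as the average, over past mitigated trials, of mitigation resource divided by regular training resource, and the penalty is taken as division by the success rate. The function and parameter names are hypothetical.

```python
def projected_mitigation_resource(history, regular_cost, success_rate):
    """Estimate the resource needed for a successful mitigation.

    history: list of (mitigation_cost, regular_training_cost) pairs, one
        per past trial where unfairness mitigation was performed.
    regular_cost: resource used for regular training on the configuration
        being considered.
    success_rate: fraction of past mitigations that succeeded, in (0, 1].
    """
    if not history:
        raise ValueError("history must contain at least one mitigated trial")
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    # Assumed form of the 'computation cost ratio': mean per-trial ratio.
    cost_ratio = sum(m / r for m, r in history) / len(history)
    # Expected resource for applying mitigation to this configuration.
    expected = cost_ratio * regular_cost
    # Assumed penalty: divide by the success rate.
    return expected / success_rate


past = [(20.0, 2.0), (30.0, 5.0)]
print(projected_mitigation_resource(past, regular_cost=3.0, success_rate=0.5))
```

A lower success rate inflates the estimate, which makes mitigation less attractive relative to hyperparameter search.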

Formal definition and desirable properties of ECFair. The definition of ECFair for unfairness mitigation and for hyperparameter search is provided in Eq. (6).


in which the process is either hyperparameter search or unfairness mitigation and the input is the target fair loss. The definition uses the total cost spent in the process so far, the best fair loss obtained in the process, and the total cost that had been spent in the process when its best and its second best fair loss were obtained, respectively. Note that ‘best’ here is defined with respect to the time point at which ECFair is calculated, not necessarily the overall best.

We mentioned the nice self-adaptive property of the decision-making strategy based on ECFair. It is worth mentioning that this property holds even if ECFair is not an accurate estimate of its ground-truth counterpart. When the ECFair of either choice is inaccurate and a wrong choice is made, the consequence of this wrong choice is reflected in the ECFair of the selected choice (it becomes larger), while the ECFair of the other choice (the second choice) remains unchanged. Thus we will turn to the second choice.

Appendix B Datasets and additional experiment results

Please refer to the following links to access the three datasets tested.

Figure 8: Summary of loss (original) from all the experiments. Panels: (a) Loss (original) on Adult; (b) Loss (original) on Bank; (c) Loss (original) on Compas.

One desirable property of FairFLAML worth mentioning is that it does not hurt performance on the original loss while achieving good fair loss. This alleviates potential hesitation about adopting fairness-related components in AutoML due to concerns about a degraded original loss. This property is especially desirable when practitioners are in an exploratory mode regarding machine learning fairness, which is quite common because this topic is still under-developed in practical scenarios. We provide evidence of this desirable property in Figure 8. From this result, we observe that FLAML-M significantly degrades the original loss, while FairFLAML in most cases preserves the good original loss of the original AutoML method FLAML.