Practical Compositional Fairness: Understanding Fairness in Multi-Task ML Systems

11/05/2019 ∙ by Xuezhi Wang, et al. ∙ 23

Most literature in fairness has focused on improving fairness with respect to one single model or one single objective. However, real-world machine learning systems are usually composed of many different components. Unfortunately, recent research has shown that even if each component is “fair,” the overall system can still be “unfair.” In this paper, we focus on how well fairness composes over multiple components in real systems. We consider two recently proposed fairness metrics for rankings: exposure and pairwise ranking accuracy gap. We provide theory that demonstrates a set of conditions under which fairness of individual models does compose. We then present an analytical framework for both understanding whether a system's signals can achieve compositional fairness, and diagnosing which of these signals lowers the overall system's end-to-end fairness the most. Despite previously bleak theoretical results, on multiple datasets, including a large-scale real-world recommender system, we find that the overall system's end-to-end fairness is largely achievable by improving fairness in individual components.




1 Introduction

Recent research has highlighted that even if two machine learning (ML) models are “fair,” a combination of their predictions can still be “unfair” Dwork and Ilvento (2018a, b). This is known as the compositional fairness problem. The problem has been shown to hold over multiple definitions of fairness. Composing many predictive models in the final product, however, is a pervasive design pattern in real production systems Adomavicius and Tuzhilin (2005); Burke (2002); He et al. (2014); Wang et al. (2011); Ma et al. (2018); Yi et al. (2014).

Most existing literature focuses on achieving fairness in the single-task (non-compositional) setting. Dwork and Ilvento (2018a) studied a relatively restricted compositional setting, where the binary outputs of multiple classifiers are combined through logical operations to produce a single output. Regarding group fairness, Dwork and Ilvento (2018a) make the case that classifiers that appear to satisfy group fairness properties may not compose to also satisfy those properties. They also raise the concern that composed systems that do satisfy group fairness properties may not be fair for socially meaningful sub-groups (e.g., a system that is “fair” across gender or race may not be fair from the perspective of a specific gender-race sub-group).

Our high-level objective is to know what group fairness goals we can achieve in ranking type problems. In this case, the end score is a rank derived by composing scores produced by different components. We assume we can only control each individual component independently to create overall system fairness. We use recently proposed ranking fairness metrics Singh and Joachims (2018); Beutel et al. (2019a), each capturing slightly different goals.

More specifically, we study the setting where each component outputs real values; the composition function is the multiplication of these component scores, and so is also real-valued. Mathematically, we frame each component as a function $f_k(x) \in \mathbb{R}$, where $k = 1, \dots, K$. The overall system generates scores $f(x) = \prod_{k=1}^{K} f_k(x)$. These scores are then used for ranking. This design is common in recommender systems such as cascading recommenders He et al. (2014); Wang et al. (2011) and multi-task recommenders McAuley et al. (2012); Ma et al. (2018); Yi et al. (2014).

The fairness metric for each component is evaluated on the order (rank) of its scores. The composed system fairness metric is evaluated on the order (rank) of the product of the component scores. Evaluating the fairness metric on rank order aligns well with most real-world multi-model recommender systems, but can also be applied to classification Kallus and Zhou (2019); Borkan et al. (2019); Narasimhan et al. (2019).

A motivating example.

To better concretize the problem, consider the hypothetical example of a large-scale recommendation system for books, like the one described in Ekstrand et al. (2018). The recommender system has the following components:

  • $f_1$: one that predicts the click-through rate on a book

  • $f_2$: one that predicts the star rating given a clicked book

Let the fairness goal be demographic parity in ranking exposure. For example, the ranking of the composite score should not systematically differ between white and non-white authors. Each component could be made “fair” with respect to author demographics through recent mitigation methods Bolukbasi et al. (2016); Singh and Joachims (2018); Beutel et al. (2019b, a). What does this mean for the demographic parity on the ranked composite scores?

A simple counter-example.

In the example above, it may feel intuitive to assume that if each component gives equal exposure to each group, the overall system should as well. We give a simple example here showing this is not the case. Assume we have the following books in each column:

Component non-white non-white white white

Each component exposes books from each group equally: if we rank by the scores of the first component, we get [non-white, white, white, non-white]. If we rank by the scores of the second component, we also get [non-white, white, white, non-white]. When the two component scores are multiplied together to form the composite score, we get the ranking [white, white, non-white, non-white]. This composite ranking does not have demographic parity, even though each individual component ranking does.
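The counter-example can be reproduced with a few lines of Python. This is only a sketch: the score values below are hypothetical (they are not the values from the paper's table) and were chosen solely so that the three rankings match the ones described above.

```python
# Hypothetical scores for four books; the numbers are illustrative assumptions,
# picked only to reproduce the rankings described in the text.
books = ["nonwhite_1", "nonwhite_2", "white_1", "white_2"]
group = {
    "nonwhite_1": "non-white",
    "nonwhite_2": "non-white",
    "white_1": "white",
    "white_2": "white",
}
f1 = {"nonwhite_1": 1.0, "nonwhite_2": 0.1, "white_1": 0.6, "white_2": 0.5}
f2 = {"nonwhite_1": 0.15, "nonwhite_2": 1.0, "white_1": 0.7, "white_2": 0.4}


def ranked_groups(score):
    """Return the group of each book, ordered from highest to lowest score."""
    order = sorted(books, key=lambda b: score[b], reverse=True)
    return [group[b] for b in order]


# Composite score: product of the two component scores.
composite = {b: f1[b] * f2[b] for b in books}

rank_f1 = ranked_groups(f1)
rank_f2 = ranked_groups(f2)
rank_composite = ranked_groups(composite)

print(rank_f1)
print(rank_f2)
print(rank_composite)
```

Each component places one non-white book at the top and one at the bottom, but the two components disagree on which book that is, so the product pushes both non-white books below both white books.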

In practice, large systems are designed with compositions; there are theoretical fairness risks to such systems. We try to ease this tension between the practical and the theoretical: we describe mathematical conditions under which compositional fairness holds, and we demonstrate how to empirically test a system for compositional fairness. Our contributions in this paper:

  • Theory: We provide theory showing a set of conditions under which fair components can compose into fair systems.

  • System Understanding: We provide a framework for both understanding whether a system’s signals can achieve compositional fairness, and diagnosing which of these signals lowers the overall system fairness the most.

  • Empirical Analysis: Although compositional fairness is theoretically not guaranteed, on multiple datasets, including a large-scale real-world recommender system, we demonstrate that the overall system fairness is largely achievable by improving fairness in individual components.

2 Related Work

Fairness in Classification

The majority of the fairness metric definition literature focuses on classification. Here we cover some examples of fairness metrics in classification. Demographic parity Calders et al. (2009); Žliobaitė (2015); Zafar et al. (2015) is a common way of addressing discrimination against protected attributes. It requires a decision to be independent of the protected attribute.

Equalized odds, proposed by Hardt et al. (2016), is a fairness metric for a specified sensitive attribute in supervised learning. Equalized odds requires a predictor to be independent of the sensitive attribute, conditioned on the true label $Y$. Equivalently, this metric equalizes the true positive rates as well as the false positive rates across the two demographics, to prevent classification models from performing well only on the majority group. Equal opportunity, also proposed by Hardt et al. (2016), is a relaxation of equalized odds. It focuses only on the “advantaged” outcome $Y = 1$. More recently, metrics have been explored for continuous scores from a classifier: Kallus and Zhou (2019); Borkan et al. (2019); Narasimhan et al. (2019) all break down the AUC of these scores into Mann-Whitney U-tests Mann and Whitney (1947).

Fairness in Ranking

Recently, there have been a few definitions proposed for fairness in the ranking setting as well Zehlike et al. (2017); in our work we focus on two recent framings. Singh and Joachims (2018) proposes measuring the exposure an item or group of items gets depending on what position they fall in a ranking. The work offers multiple fairness goals, such as exposure proportional to relevance, but in our usage we build on this notion of exposure to measure group representation throughout a ranked list (as we ignore any label or relevance, this is philosophically closer to the principles of demographic parity above). Beutel et al. (2019a) focuses on measuring accuracy in a recommender system based on pairwise comparisons. The accuracy of a ranking for a pair of items is defined as the probability that the clicked item is ranked higher than the un-clicked item. In this set-up, two items from different groups are used to create a pair, and the difference in accuracy for each group is used as a fairness metric. We use this metric to capture fairness in ranking more closely aligned with equal opportunity (as it measures accuracy with respect to a label, namely clicks).

In Section 3, we will formalize the above two definitions within the same framework. For each of the ranking metrics listed, given per-component fairness, we will show conditions where the compositional fairness holds (and counter-examples where it might not hold).

Ranking diversification is a closely related area to ranking fairness. Here the goal is to diversify the ranking results to improve user satisfaction Slivkins et al. (2010); Radlinski et al. (2008); Gollapudi and Sharma (2009); Capannini et al. (2011); Agrawal et al. (2009); Carbonell and Goldstein (1998). In this paper, we focus on the ranking fairness goal with respect to specified demographic groups. In some cases, general-purpose diversification may not align with fairness for certain sub-groups.

Compositional Fairness

Dwork and Ilvento (2018a) studied general constructions for fair composition, and showed that classifiers that are fair in isolation do not necessarily compose into fair systems, for individuals or for groups. Furthermore, systems that are fair to different types of groups may not be fair to intersectional sub-groups. The authors studied the “Functional Composition” setting, where the assumption is that the binary outputs of multiple classifiers are combined through logical operations to produce a single output for a single task. Specifically, a notion of “OR Fairness” is proposed and relevant theory is developed in this setting.


Many papers have proposed different approaches for achieving fairness in the single-task (non-compositional) setting. The approaches can be partitioned by when they intervene in the creation of an ML system: pre-processing, training, and post-processing.

  • Pre-processing the data, like obfuscation on the protected attributes, or debiasing on the input features like word embeddings Bolukbasi et al. (2016).

  • Incorporating fairness into the learning objective: Numerous approaches have been proposed to improve fairness metrics during training, including constraints Cotter et al. (2016, 2019); Agarwal et al. (2018), regularization Zafar et al. (2017); Beutel et al. (2019b), and adversarial learning Louizos et al. (2016); Zhang et al. (2018); Beutel et al. (2017); Madras et al. (2018). The regularization approaches are most similar to our analysis, encouraging matching the distribution of predictions Zafar et al. (2017); Beutel et al. (2019b) or representation Beutel et al. (2017); Madras et al. (2018) across groups. Beutel et al. (2019a) built on this for ranking, proposing pairwise regularization to the objective function.

  • Post-processing on the predictions, e.g., Pleiss et al. (2017); Kim et al. (2019). For example, Hardt et al. (2016) uses different classification thresholds per group at inference time, and Singh and Joachims (2018) proposes solving a linear program to achieve fair exposure in rankings.

3 Fairness Metrics and Theoretical Analysis

To formalize the problem, we denote $x$ as the item being ranked, and for simplicity assume there are two groups being considered, Group $A$ and Group $B$.

We formulate the key problem we want to study in this paper as follows. We have a system composed of $K$ components, where each component takes an input $x$ and produces its own score $f_k(x)$. The overall composition function that produces the final ranking score is $f(x) = \prod_{k=1}^{K} f_k(x)$. If all components $f_k$ satisfy some fairness metric $M$ by themselves, we ask whether the overall function $f$ satisfies the same fairness metric $M$. Restated, we would like to know whether the system achieves compositional (end-to-end) fairness given that each component has achieved fairness independently.

In the sections below, we consider two commonly-used fairness metrics in ranking: ranking exposure, and pairwise ranking accuracy. We describe each metric, and explore how the function composition affects end-to-end fairness for that metric.

3.1 Ranking exposure as the fairness metric

3.1.1 Definition and Examples

In this section, we focus on the cases where the ranking exposure Singh and Joachims (2018) is the fairness metric.

Formally, we define the ranking exposure for any Group as:

under a certain ranking order . Here denotes the utility function, which is usually a monotonically decreasing function with respect to the rank of the item. For example, one common choice is to use an exponent , and

where is the rank of item under the ranking order . Another common option is , similar to the position discount as defined in Discounted Cumulative Gain (DCG) Järvelin and Kekäläinen (2002).

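Now the fairness exposure metric between Group $A$ and Group $B$ under the ranking $\pi$ is defined as:

$$\mathrm{ExposureGap}(A, B \mid \pi) = \frac{\mathrm{Exposure}(A \mid \pi) - \mathrm{Exposure}(B \mid \pi)}{\mathrm{Exposure}(A \mid \pi) + \mathrm{Exposure}(B \mid \pi)} \tag{1}$$

which denotes the normalized gap between the exposure for Group $A$ and Group $B$. In magnitude, this gap metric ranges from $0$ to $1$, where $0$ means there is an equal exposure (a normalized share of $0.5$ each) of both groups, and $1$ means one group has all the exposure (a share of $1$) while the other group has no exposure at all.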

Note, when the two groups have the same size, i.e., , the ideal exposure gap should reach zero in order to be fair for the two groups. This is not the case when the two groups have different sizes, for example, if , then one might argue that a reasonable exposure for and could be proportional to their sizes, i.e., , so the exposure gap is . In the following for simplicity we always assume the two groups have the same size.

Intuitively, this metric (here unconditioned on relevance) makes the goal providing a diverse ranking with each group being well represented throughout the ranked list. While we build on Singh and Joachims (2018) for framing, similar intuitions were previously proposed in Zehlike et al. (2017) and used in job search applications Geyik et al. (2019).
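As an illustration of the normalized exposure gap described above, the following is a minimal sketch rather than the paper's implementation. The function names, the group labels "A" and "B", the signed form of the gap, and the `top_n` cut-off are assumptions made for the example. Two position utilities are included: a uniform utility and the DCG-style discount $1 / \log_2(1 + \text{rank})$ mentioned in the text.

```python
import math
from typing import Callable, Optional, Sequence


def dcg_discount(rank: int) -> float:
    """DCG-style position discount 1 / log2(1 + rank); rank starts at 1."""
    return 1.0 / math.log2(1 + rank)


def uniform_utility(rank: int) -> float:
    """Every position contributes equally to exposure."""
    return 1.0


def exposure_gap(
    scores: Sequence[float],
    groups: Sequence[str],
    utility: Callable[[int], float] = dcg_discount,
    top_n: Optional[int] = None,
) -> float:
    """Signed normalized exposure gap between groups "A" and "B".

    Items are ranked by descending score; only the first top_n positions
    (all positions if top_n is None) contribute exposure.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if top_n is not None:
        order = order[:top_n]
    exposure = {"A": 0.0, "B": 0.0}
    for position, idx in enumerate(order, start=1):
        exposure[groups[idx]] += utility(position)
    total = exposure["A"] + exposure["B"]
    return (exposure["A"] - exposure["B"]) / total


# Group A holds both top positions: all exposure in the top 2 goes to A.
segregated = exposure_gap([0.9, 0.8, 0.2, 0.1], ["A", "A", "B", "B"],
                          utility=uniform_utility, top_n=2)
# One item from each group in the top 2: equal exposure.
mixed = exposure_gap([0.9, 0.8, 0.7, 0.6], ["A", "B", "B", "A"],
                     utility=uniform_utility, top_n=2)
print(segregated, mixed)
```

A positive value means Group A receives more exposure and a negative value means Group B does; the magnitude is what the fairness goal tries to drive toward zero when the groups are the same size.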


Here we give a counter-example and show that under the fairness exposure metric, per-component fairness does not always guarantee end-to-end fairness.

Consider the following example with a ranking system composed of two components, two groups , and each group has two items. For any , suppose:

For simplicity we assume , i.e., each rank position contributes equally to the exposure metric, and we consider the exposure for the first two positions. Assume are the rankings produced by each individual component scores , respectively, then for each component independently, we have

because within each component, the two highest-scored items (with scores , ) are from , respectively, in other words, each component by itself is fair. But when combined, denote to be the ranking produced by ordering the items based on the composite score , we have

i.e., Group is always ranked below the Group . Specifically, for the first two positions, and , and .

One might think the magnitude of the scores for each component will play a role here, but the above example also suggests this is not the case, since we can make arbitrarily small to make the magnitude of the score for the two components arbitrarily close to each other, while keeping .

Distribution Normalization

In the above example, if we normalize the distribution of $f_k(x)$ for each group (i.e., $\tilde{f}_k(x) = (f_k(x) - \mu) / \sigma$, where $\mu, \sigma$ represent the mean and standard deviation of the scores $f_k$ within the group of $x$), then it is easy to show that we can achieve compositional fairness.

However, we present a slightly modified example that shows sometimes distribution normalization might not work:

After normalization, we have:

Again we can see that when the scores are combined , we have for Group , , and for Group , , i.e., Group is always ranked above the Group .

On the other hand, this problem can simply be solved by adding any constant shift to make sure all the scores are positive, e.g., for a shift of :

Now the two items from Group will be ranked in-between the two items from Group .
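The behavior described in this subsection can be sketched in Python. The raw scores below are hypothetical and are not the paper's example values; they are constructed so that per-group normalization yields scores of $\pm 1$, which is enough to show how negative normalized scores distort the product and how a constant shift (here $2$, also an assumption) restores an interleaved ranking.

```python
from statistics import mean, pstdev

# Hypothetical items: name -> (group, raw f1 score, raw f2 score).
items = {
    "a1": ("A", 30.0, 3.0),
    "a2": ("A", 10.0, 1.0),
    "b1": ("B", 30.0, 1.0),
    "b2": ("B", 10.0, 3.0),
}


def normalize_per_group(data, component):
    """Z-score one component (index 1 or 2) within each group separately."""
    normalized = {}
    for g in {value[0] for value in data.values()}:
        names = [n for n, value in data.items() if value[0] == g]
        values = [data[n][component] for n in names]
        mu, sigma = mean(values), pstdev(values)
        for n, v in zip(names, values):
            normalized[n] = (v - mu) / sigma
    return normalized


def ranking(composite):
    """Item names ordered from highest to lowest composite score."""
    return sorted(composite, key=composite.get, reverse=True)


z1 = normalize_per_group(items, 1)
z2 = normalize_per_group(items, 2)

# Product of normalized scores: two negatives multiply to a positive.
normalized_product = {n: z1[n] * z2[n] for n in items}

# Shift every normalized score by a constant so that all scores are positive.
shift = 2.0
shifted_product = {n: (z1[n] + shift) * (z2[n] + shift) for n in items}

print(normalized_product)
print([items[n][0] for n in ranking(shifted_product)])
```

Without the shift, both Group A items get a product of $1$ and both Group B items get $-1$, so Group A is always ranked above Group B. With the shift, the Group B items land between the two Group A items.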

3.1.2 Condition for composition of ranking exposure

Now we present theory showing under what conditions we will achieve end-to-end fairness given we have per-component fairness, using the ranking exposure (§3.1.1) as the fairness metric.

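Consider a system with two components $f_1$ and $f_2$, and two groups $A, B$. Let $X_A$ represent the random variable defined by $\log f_1(x)$ for $x \in A$, $Y_A$ represent the random variable defined by $\log f_2(x)$ for $x \in A$, and similarly we define $X_B, Y_B$ for $x \in B$. For simplicity we assume the higher half and the lower half of the items receive two different exposure values after sorting over the scores produced by $f_1$ and $f_2$, respectively, i.e., to achieve per-component fairness, we have $\mathrm{Med}(X_A) = \mathrm{Med}(X_B)$ and $\mathrm{Med}(Y_A) = \mathrm{Med}(Y_B)$, where $\mathrm{Med}$ denotes the median. We have the following:

Theorem 1.

If $X_A, Y_A, X_B, Y_B$ are symmetric random variables such that $X_A + Y_A$ and $X_B + Y_B$ are also symmetric, then per-component fairness on $f_1$ and $f_2$ means we have compositional fairness for $f_1 \cdot f_2$.

Proof. From per-component fairness on $f_1$ and $f_2$ we have: $\mathrm{Med}(X_A) = \mathrm{Med}(X_B)$ and $\mathrm{Med}(Y_A) = \mathrm{Med}(Y_B)$. Hence

$$\begin{aligned}
\mathrm{Med}(X_A + Y_A) &= \mathbb{E}[X_A + Y_A] && \text{(by symmetry of } X_A + Y_A\text{)} \\
&= \mathbb{E}[X_A] + \mathbb{E}[Y_A] && \text{(by linearity of expectation)} \\
&= \mathrm{Med}(X_A) + \mathrm{Med}(Y_A) && \text{(by symmetry of } X_A \text{ and } Y_A\text{)} \\
&= \mathrm{Med}(X_B) + \mathrm{Med}(Y_B) && \text{(by per-component fairness)} \\
&= \mathbb{E}[X_B] + \mathbb{E}[Y_B] && \text{(by symmetry of } X_B \text{ and } Y_B\text{)} \\
&= \mathbb{E}[X_B + Y_B] && \text{(by linearity of expectation)} \\
&= \mathrm{Med}(X_B + Y_B) && \text{(by symmetry of } X_B + Y_B\text{)}
\end{aligned}$$

By the monotonicity of the $\log$ function, the above equation gives

$$\mathrm{Med}_{x \in A}\big(f_1(x) f_2(x)\big) = \mathrm{Med}_{x \in B}\big(f_1(x) f_2(x)\big)$$

i.e., the compositional fairness holds for the entire system. ∎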
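A small simulation can illustrate the kind of condition Theorem 1 describes. It is a sketch under stated assumptions, not part of the paper: the component scores are taken to be positive with independent Gaussian logarithms (which makes each log-score and their sum symmetric), the two groups share the same per-component medians but have different variances, and exposure is measured as a group's share of the top half of the composite ranking. All parameter values are hypothetical.

```python
import math
import random
from statistics import median

rng = random.Random(0)
n = 20000  # items per group (illustrative)


def sample_group(mu1, sd1, mu2, sd2):
    """Draw (log f1, log f2) pairs from two independent Gaussians."""
    return [(rng.gauss(mu1, sd1), rng.gauss(mu2, sd2)) for _ in range(n)]


# Same medians per component across groups (0.0 and 1.0), different spreads.
group_a = sample_group(0.0, 1.0, 1.0, 0.5)
group_b = sample_group(0.0, 2.0, 1.0, 0.1)


def composite(log_pairs):
    """Composite score f1 * f2, recovered from the log-scores."""
    return [math.exp(x) * math.exp(y) for x, y in log_pairs]


scores_a = composite(group_a)
scores_b = composite(group_b)

# Medians of the composite scores, compared on the log scale.
log_median_a = math.log(median(scores_a))
log_median_b = math.log(median(scores_b))

# Share of Group A among the top half of the pooled composite ranking.
pooled = sorted(
    [(s, "A") for s in scores_a] + [(s, "B") for s in scores_b],
    reverse=True,
)
share_a = sum(1 for _, g in pooled[:n] if g == "A") / n

print(log_median_a, log_median_b, share_a)
```

Under these assumptions the two composite medians agree up to sampling noise and each group fills roughly half of the top half of the ranking, even though the groups' score distributions differ in spread.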

3.2 Pairwise ranking accuracy as the fairness metric

Recently another fairness metric in ranking has been proposed Beutel et al. (2019a), where the idea is to compute the accuracy of a system ranking a pair of items correctly conditioned on the true feedback information (e.g., one being clicked and another not being clicked). The pair of items is constrained to come from two different groups, and , through randomized experiments.

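Formally, the Pairwise Ranking Accuracy is defined as:

$$\mathrm{PairwiseAcc}(A > B) = P\big(f(x_i) > f(x_j) \mid y_i > y_j,\; x_i \in A,\; x_j \in B\big)$$

Here $y_i$ denotes the observed label for $x_i$, as either being clicked: $y_i = 1$, or not-clicked: $y_i = 0$.

The empirical estimate is given by counting the item pairs $(x_i, x_j)$ with $x_i \in A$, $x_j \in B$, and $y_i > y_j$ where $f(x_i) > f(x_j)$, normalized by the total number of item pairs from $A \times B$ that satisfy $y_i > y_j$.

Correspondingly, the Pairwise Ranking Gap is defined as:

$$\text{Pairwise Ranking Gap} = \mathrm{PairwiseAcc}(A > B) - \mathrm{PairwiseAcc}(B > A) \tag{2}$$

In other words, given a pair of items, one from group $A$ and one from group $B$, conditioned on one item being clicked and the other not being clicked, we would like the system to have the same accuracy of ranking this pair of items correctly, regardless of which group the clicked item is from.

Let $X_A^{+}$ represent the random variable defined by $f(x)$, for $x \in A$ with $y = 1$; $X_A^{-}$ represent the random variable defined by $f(x)$ for $x \in A$ with $y = 0$, and $X_B^{+}, X_B^{-}$ are defined similarly. We can simplify the Pairwise Ranking Gap metric as

$$\text{Pairwise Ranking Gap} = P\big(X_A^{+} > X_B^{-}\big) - P\big(X_B^{+} > X_A^{-}\big)$$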

This pairwise ranking accuracy has a nice connection with the Mann-Whitney U-test Mann and Whitney (1947), and aligns well with the equality gap metric Borkan et al. (2019) and the xAUC metric Kallus and Zhou (2019) for classification.
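The empirical estimate of the pairwise ranking accuracy and the corresponding gap can be sketched as follows. The function names and the signed form of the gap are assumptions made for this example, and the scores are hypothetical; ties are counted as incorrect here, which is one possible convention.

```python
from typing import Sequence


def pairwise_accuracy(clicked: Sequence[float], unclicked: Sequence[float]) -> float:
    """Fraction of (clicked, un-clicked) cross-group pairs ranked correctly.

    `clicked` holds scores of clicked items from one group and `unclicked`
    holds scores of un-clicked items from the other group.
    """
    pairs = [(c, u) for c in clicked for u in unclicked]
    correct = sum(1 for c, u in pairs if c > u)
    return correct / len(pairs)


def pairwise_gap(clicked_a, unclicked_b, clicked_b, unclicked_a) -> float:
    """Accuracy when the clicked item is from A minus accuracy when it is from B."""
    return pairwise_accuracy(clicked_a, unclicked_b) - pairwise_accuracy(clicked_b, unclicked_a)


# Clicked items from A all score below un-clicked items from B.
acc_a = pairwise_accuracy([0.1, 0.2], [0.3, 0.4])
# Clicked items from B beat un-clicked items from A in half of the pairs.
acc_b = pairwise_accuracy([0.5, 0.05], [0.1, 0.2])
gap = pairwise_gap([0.1, 0.2], [0.3, 0.4], [0.5, 0.05], [0.1, 0.2])
print(acc_a, acc_b, gap)
```

A gap of zero means that the system ranks clicked-versus-un-clicked pairs equally well regardless of which group the clicked item comes from.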


In the following we present a simple example that shows per-component fairness might not lead to compositional fairness, using the pairwise ranking gap as the fairness metric.

Consider the following system with two components, two groups, and two items within each group:

Pairwise Ranking Acc 0.0 0.5

For and , i.e., we have two clicked items from group and two un-clicked items from group , the Pairwise Ranking Accuracy under the composite function is 0.0, because both clicked items from receive a lower prediction score () than un-clicked items from (). On the other hand, for and , i.e., when we have two clicked items from group and two un-clicked items from group , the Pairwise Ranking Accuracy is much higher, 0.5. In other words, the predictor does not have equal treatment for ranking the items from and .

4 Analytical Framework

As we can see in the theoretical results, improving the fairness of individual components can improve the fairness of the composite score, but whether we can expect compositional fairness to hold depends on the components and the relationship between them. Therefore we ask: if we have a multi-component system where we observe fairness issues, how much will improving the fairness for each component help the overall system’s fairness?

Taking this data-driven view of the problem, we find there are multiple questions that we can tractably answer:

  1. How much would “fixing” a particular component improve the combined system’s fairness?

  2. Given a system with a fairness issue, improving which components would yield the greatest benefit?

  3. If all components were independently “fixed,” what would be the resulting fairness metrics for the combined system?

We describe below our analytical framework for answering these questions.

4.1 Per-Component Fixes

First, we consider which classes of methods for improving the fairness of a model are realistic. For example, while multiplying all model predictions by zero will result in good fairness metrics, it is also unrealistic in that it will destroy the usefulness of the system. Rather, we consider the following methods, which we believe are realistic as we explain below.

4.1.1 Distribution Matching

A significant amount of academic literature Gretton et al. (2012); Mroueh et al. (2017), as well as publications on what is used in practice Beutel et al. (2019b, a), take the perspective of regularizing the model such that the distributions of predictions from each group (sometimes conditioned on the label) match. Under different formulations this has been done by comparing the covariance Zafar et al. (2017), correlation Beutel et al. (2019b, a), and maximum mean discrepancy Gretton et al. (2012); Muandet et al. (2017) between the distributions. Therefore, we consider whether matching the groups’ distributions of predictions for each model has the desired effect on the combined fairness metrics.

As this is an analytical framework, in contrast to a training framework, we can easily do this offline by directly changing the predictions over our dataset. We consider distribution matching for a component $f_k$. In order to match the distributions, we sort all examples in each group by their scores $f_k(x)$. We define by $s_A$ a sorted vector of scores for examples in Group $A$ and by $r_A$ the mapping of examples to positions in this sorted list, i.e., $s_A[r_A(x)] = f_k(x)$ for all $x \in A$ and $s_A[i] \le s_A[j]$ for all $i < j$; we similarly define $s_B$ and $r_B$ for examples from Group $B$. For simplicity, we assume the number of examples from each group is equal, $|A| = |B|$. Therefore, when matching the distributions, we define the “fixed” component $\tilde{f}_k$ as follows:

$$\tilde{f}_k(x) = \begin{cases} f_k(x) & \text{if } x \in A \\ s_A[r_B(x)] & \text{if } x \in B \end{cases} \tag{3}$$

That is, for examples in $B$, $\tilde{f}_k$ returns the score for the similarly ranked item from $A$ such that the empirical distribution over $A$ and $B$ exactly matches. Note, $r_A(x) / |A|$ and $r_B(x) / |B|$ are the empirical cumulative distribution function (CDF) for $f_k$ over $A$ and $B$ respectively, and as such if $|A| \neq |B|$ then simple interpolation to match the empirical CDFs can be used.

Theorem 2.

The fixed component $\tilde{f}_k$ as defined by Eq. (3) has an exposure gap (defined by Eq. (1)) of zero, assuming the ranking order based on $\tilde{f}_k$ gives exactly the same rank to items that have the same score.


Given the definition in Eq. (1), it is easy to see that

Because there is an exact one-to-one correspondence of and that gives exactly one pair of , which cancels each other given and thus results in a zero gap. ∎

Note that in real applications a tie-breaking strategy is still needed. If the tie-breaking strategy is random, the above approach should achieve an exposure gap close to zero.
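The distribution matching step can be sketched offline in a few lines. This is a minimal illustration under the equal-group-size assumption stated above; the function name and the choice to keep the first group's scores and overwrite the second group's are assumptions for the example, and the scores are hypothetical.

```python
from typing import List, Sequence


def match_distribution(scores_a: Sequence[float], scores_b: Sequence[float]) -> List[float]:
    """Return replacement scores for the second group.

    Each item in the second group receives the score of the equally ranked
    item in the first group, so both groups end up with the same empirical
    score distribution. Assumes the two groups have the same size.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("this sketch assumes equally sized groups")
    sorted_a = sorted(scores_a)
    # Indices of the second group's items, from lowest to highest score.
    order_b = sorted(range(len(scores_b)), key=lambda i: scores_b[i])
    fixed_b = [0.0] * len(scores_b)
    for position, idx in enumerate(order_b):
        fixed_b[idx] = sorted_a[position]
    return fixed_b


scores_a = [0.9, 0.5, 0.7]
scores_b = [0.2, 0.4, 0.1]
fixed_b = match_distribution(scores_a, scores_b)
print(fixed_b)  # [0.7, 0.9, 0.5]
```

The within-group order of the second group is preserved (its best item is still its best item), while every score in one group now has an identical counterpart in the other. Those exact ties are why a tie-breaking strategy matters in practice.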

4.1.2 Label-Conditioned Distribution Matching

The above approach only recalibrates the predictions by group but does not necessarily align with any labels for the task. As such, for the pairwise fairness metric in Eq. (2), the method as described is not guaranteed to give per-component pairwise fairness. For that, we provide a slight modification of the algorithm above. We consider $A^{-}$ and $A^{+}$ to be the set of examples in $A$ with a negative and positive label, respectively; we similarly define $B^{-}$ and $B^{+}$.

We define a delta term between all pairs from $A^{+}$ and $B^{-}$ and similarly between all pairs from $B^{+}$ and $A^{-}$, i.e.,

$$\Delta_{A > B} = \{ f_k(x_i) - f_k(x_j) : x_i \in A^{+},\, x_j \in B^{-} \}, \qquad \Delta_{B > A} = \{ f_k(x_i) - f_k(x_j) : x_i \in B^{+},\, x_j \in A^{-} \}.$$

In the following we propose a method that exactly matches the empirical distribution between $\Delta_{A > B}$ and $\Delta_{B > A}$, which aligns with the regularization proposed in Beutel et al. (2019a). We also show that it suffices to match $\Delta_{A > B}$ and $\Delta_{B > A}$ to achieve fairness for each component $f_k$.

We define by a sorted vector of scores in , and by the mapping of examples to positions in this sorted list, i.e. and for all ; we similarly define and for examples from .

Again for simplicity, we assume the number of examples from each group is equal. Note that the definition of the pairwise ranking accuracy implies that the number of positive-label examples from one group and negative-label examples from the other group is the same in order to form pairs. Given the above assumption, we essentially have the same number of examples in all four quadrants (each group combined with each label).

Therefore, to exactly match on the delta terms, we keep the scores for the deltas on one pair of the groups (e.g., ), and fix the scores on the other pair of groups (e.g., , which can be achieved by either changing the scores for or ). For example, suppose we only fix the scores for , we define the “fixed” component as follows:

Theorem 3.

as defined by Eq. (4) has a pairwise fairness gap, defined in Eq. (2) of 0.


It is easy to see that the is given by

i.e., it is equal to the percentage of positive deltas in by definition. Similarly, the is

and is equal to the percentage of positive deltas in .

Given we exactly matched the delta terms in and by Eq. (4), and since , we have , i.e., the pairwise fairness gap, defined as in Eq. (2) is . ∎

4.1.3 Distribution Normalization

While the above procedure is provably guaranteed to achieve per-component fairness, under the definitions given previously, in practice we would want to use a regularization on the model for this goal, which will be noisier. As such, we consider a lighter-weight approach: per-group normalization:

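Definition 1 (Per-Group Normalization).

For groups $A$ and $B$, we modify component $f_k$ to incorporate per-group normalization by:

$$\tilde{f}_k(x) = \begin{cases} (f_k(x) - \mu_A) / \sigma_A & \text{if } x \in A \\ (f_k(x) - \mu_B) / \sigma_B & \text{if } x \in B \end{cases} \tag{5}$$

where $\mu_G, \sigma_G$ are the empirical mean and standard deviation of $f_k$ on group $G$, for $G = A$ or $G = B$, respectively. While $\tilde{f}_k$ is not guaranteed to provide even per-component fairness under either definition, we find in practice it too can significantly improve end-to-end fairness.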

4.2 Counterfactual Testing

How can we use the modified functions described above to understand the system’s end-to-end fairness properties? All of the questions given at the beginning of this section are counterfactual questions: what would happen if we succeeded in fixing a component or set of components? With the above methods for simulating a fixed component (without actually changing the model training), we can do this headroom analysis.

Per-Component Effect

As before, we assume we have $K$ components which are multiplied together such that the overall score given to an example $x$ by the system is $f(x) = \prod_{k=1}^{K} f_k(x)$. Improving the fairness of one component is not guaranteed to improve the fairness of the overall system. For example, two components could be equally biased in opposite directions such that improving only one actually worsens the end-to-end fairness metrics.

Therefore, we use the above per-component modifications to test the effect of independently improving individual components. We will use to characterize a modified component as described above, i.e., . With this we can simulate how the system would behave if we improve a given component :

Definition 2 ($k$-Improved System).

Given a system $f$ with components $f_1, \dots, f_K$, and a simulated improved component $\tilde{f}_k$ for component $f_k$, we define the improved system $\tilde{f}^{(k)}$ as:

$$\tilde{f}^{(k)}(x) = \tilde{f}_k(x) \prod_{j \neq k} f_j(x)$$

With this we can measure the fairness of system $\tilde{f}^{(k)}$, using either Eq. 1 or Eq. 2, to understand this counterfactual: if we improved the fairness of component $f_k$, what would be the resulting end-to-end fairness?

Given that we can now answer this counterfactual question concretely, we now ask: which components should I prioritize improving? First, we define the degree to which improving a given component helps the end-to-end fairness:

Definition 3 (Fairness Improvement).

Given a system and a -improved system , we define the fairness improvement by


Finally, we can measure the fairness improvement FI for all components and sort them in decreasing order to find the components for which an improvement would have the largest effect.

Overall System Effect

While the procedure described above is valuable for understanding which components are more important for improving the end-to-end system, it does not tell us how much improving each component independently will ultimately improve the overall system’s fairness. For that, we build on the counterfactual testing above but now across all of the components. That is, we define the per-component improved system as follows:

Definition 4 (All-Components Improved System).

Given a system $f$ with components $f_1, \dots, f_K$, and for each component $f_k$ a simulated improved version $\tilde{f}_k$, we define the improved system $\tilde{f}$ as:

$$\tilde{f}(x) = \prod_{k=1}^{K} \tilde{f}_k(x)$$

Note as before we assume . Finally, with this, we can test how the improved system where each component is fair performs on the fairness metrics.
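The counterfactual testing procedure of this section can be sketched end to end. Everything specific in this example is an assumption made for illustration: the synthetic log-normal scores, the top-half exposure gap with uniform position utility, the use of distribution matching as the per-component fix, and the measure of improvement (reduction in the absolute gap).

```python
import math
import random


def exposure_gap(scores, groups, top_n):
    """Signed gap in group shares among the top_n items (uniform utility)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_n]
    share_a = sum(1 for i in order if groups[i] == "A") / top_n
    return share_a - (1.0 - share_a)


def match_to_group_a(scores, groups):
    """Distribution matching: each B item gets the equally ranked A item's score."""
    idx_a = sorted((i for i, g in enumerate(groups) if g == "A"), key=lambda i: scores[i])
    idx_b = sorted((i for i, g in enumerate(groups) if g == "B"), key=lambda i: scores[i])
    fixed = list(scores)
    for ia, ib in zip(idx_a, idx_b):
        fixed[ib] = scores[ia]
    return fixed


def product(components):
    """Composite score: element-wise product of the component scores."""
    return [math.prod(values) for values in zip(*components)]


def lognormal(rng, mu, count):
    """Positive synthetic scores whose logarithm is Gaussian."""
    return [math.exp(rng.gauss(mu, 0.5)) for _ in range(count)]


rng = random.Random(0)
n = 5000  # items per group (illustrative)
groups = ["A"] * n + ["B"] * n
components = [
    lognormal(rng, 0.0, n) + lognormal(rng, -0.1, n),  # mildly favors A
    lognormal(rng, 0.0, n) + lognormal(rng, -0.5, n),  # strongly favors A
]

baseline_gap = exposure_gap(product(components), groups, top_n=n)
fixed = [match_to_group_a(c, groups) for c in components]

# Fix one component at a time and record how much the absolute gap shrinks.
improvement = {}
for k in range(len(components)):
    system_k = [fixed[j] if j == k else components[j] for j in range(len(components))]
    gap_k = exposure_gap(product(system_k), groups, top_n=n)
    improvement[k] = abs(baseline_gap) - abs(gap_k)

# Components sorted by how much fixing them alone would help.
priority = sorted(improvement, key=improvement.get, reverse=True)

# Fix all components at once.
all_fixed_gap = exposure_gap(product(fixed), groups, top_n=n)

print(baseline_gap, improvement, priority, all_fixed_gap)
```

In this synthetic setup the second component carries most of the bias, so it comes first in the priority list, and fixing both components brings the end-to-end gap close to zero. On real data the components may be correlated, in which case the all-components gap need not vanish, which is exactly what this headroom analysis is meant to reveal.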

5 Experiments

We now use a combination of synthetic and real-world experiments to explore how well fairness composes in different settings.

5.1 Synthetic Data

We begin with presenting experiments on synthetic datasets to demonstrate the relationship between per-component fairness and compositional fairness. Again we assume the system has two components and , and we evaluate the fairness metrics with respect to two groups and .

Dataset with independent Gaussian distributions.


We draw examples from each group. Figure 1 (left) shows the distribution of this synthetic dataset, with x-axis representing the scores from component , and y-axis representing component , . The two different colors show the two groups, respectively.

Table 1 shows the fairness metric (in terms of the ranking exposure, as defined in Section 3.1.1, with a ). We see that in the original data, Group gets significantly more exposure ( more). We start by applying the fix on each component by distribution matching (as defined by Eq. 3). From the table we see that fixing only one component has very limited effect on overall system’s fairness, and the end-to-end fairness can only be achieved by fixing both components. Second, we apply the fix on each component by distribution normalization (as defined by Eq. 5). Compared to the distribution matching approach, this is much more effective in reducing the gap between the two groups while fixing only one component at a time.

| Fixed Component(s) | Group $A$ | Group $B$ | Overall Gap |
|---|---|---|---|
| None (baseline) | 0.7640 | 0.2360 | 0.5281 |
| Distribution Matching | | | |
| Component 1 | 0.7433 | 0.2567 | 0.4865 |
| Component 2 | 0.6856 | 0.3144 | 0.3712 |
| Both | 0.4818 | 0.5182 | -0.0365 [1] |
| Distribution Normalization | | | |
| Component 1 | 0.5472 | 0.4528 | 0.0943 |
| Component 2 | 0.5470 | 0.4530 | 0.0940 |
| Both | 0.4858 | 0.5142 | -0.0285 |

[1] The gap cannot be exactly 0 from discretization effects at the top of the list.

Table 1: Effect on compositional fairness on Synthetic Dataset 1, with independent distributions.

| Fixed Component(s) | Group $A$ | Group $B$ | Overall Gap |
|---|---|---|---|
| None (baseline) | 0.7699 | 0.2301 | 0.5398 |
| Distribution Matching | | | |
| Component 1 | 0.7602 | 0.2398 | 0.5205 |
| Component 2 | 0.7318 | 0.2682 | 0.4636 |
| Both | 0.6262 | 0.3738 | 0.2524 |
| Distribution Normalization | | | |
| Component 1 | 0.6156 | 0.3844 | 0.2312 |
| Component 2 | 0.5765 | 0.4235 | 0.1529 |
| Both | 0.6950 | 0.3050 | 0.3899 |

Table 2: Effect on compositional fairness on Synthetic Dataset 2, with anti-correlated distributions between components.

Figure 1: Data distribution for the two components on synthetic dataset 1 (left) and 2 (right).

Figure 2: Histogram of the final ranking scores by distribution matching (left), and distribution normalization (right), on Synthetic Dataset 1.

Figure 3: Histogram of the final ranking scores by distribution matching (left), and distribution normalization (right), on Synthetic Dataset 2.

Dataset with anti-correlated Gaussian distributions.

In this experiment, we follow exactly the same setting as the previous experiment, except that we change the score distribution of one group (choosing the parameters of the first Gaussian to stay consistent with the first dataset) to create some anti-correlation between the two component scores for that group. Again, examples are sampled for each group. Figure 1 (right) shows the distribution of this synthetic dataset. Compared to the first dataset, the anti-correlation is clearly visible as a very different shape for the affected group.

Table 2 shows the fairness metrics. Compared with Table 1, the anti-correlation makes end-to-end fairness much harder to achieve: even with both components fixed, the gap remains 0.2524 under distribution matching and 0.3899 under distribution normalization. Figures 2 and 3 show the histograms of the final ranking scores after distribution matching (left) and distribution normalization (right), on synthetic datasets 1 and 2, respectively. The scores are matched much better when there is no anti-correlation between the component scores.
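One way to see why anti-correlation hurts is that two groups can have identical per-component marginals and still differ sharply in the product score. The sketch below is an illustration under assumed parameters (the mean, standard deviation, and correlation of -0.8 are hypothetical, as are the group labels): both groups draw each component from the same Gaussian, but one group's components are anti-correlated.

```python
import math
import random


def sample_group(n, rho, rng, mean=3.0, sd=0.5):
    """Draw n (component 1, component 2) score pairs with correlation rho."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        s1 = mean + sd * z1
        s2 = mean + sd * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        pairs.append((s1, s2))
    return pairs


def top_k_share(rho_b, n=1000, k=100, seed=0):
    """Fraction of the top-k (by product score) held by the independent group 'a'."""
    rng = random.Random(seed)
    a = sample_group(n, 0.0, rng)    # independent components
    b = sample_group(n, rho_b, rng)  # same marginals, correlated components
    scored = [(s1 * s2, "a") for s1, s2 in a] + [(s1 * s2, "b") for s1, s2 in b]
    scored.sort(key=lambda t: t[0], reverse=True)
    return sum(1 for _, g in scored[:k] if g == "a") / k


print("both groups independent:     ", top_k_share(0.0))
print("group 'b' anti-correlated:   ", top_k_share(-0.8))
```

With anti-correlated components, a high score on one component tends to be paired with a low score on the other, so the product has a much narrower spread and the group rarely reaches the top of the list, even though each component looks fair in isolation.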

5.2 German Credit Data

| Fixed Component(s) | Male Rep. | Female Rep. | Overall Gap |
|---|---|---|---|
| None (baseline) | 0.6081 | 0.3919 | 0.2162 |
| credit amount | 0.5852 | 0.4148 | 0.1704 |
| age | 0.5865 | 0.4135 | 0.1731 |
| num_credits | 0.5986 | 0.4014 | 0.1972 |
| num_liable | 0.5953 | 0.4047 | 0.1907 |
| credit amount & age | 0.5652 | 0.4348 | 0.1304 |
| credit amount & num_credits | 0.5810 | 0.4190 | 0.1621 |
| credit amount & age & num_credits | 0.5572 | 0.4428 | 0.1145 |
| credit amount & age & num_liable | 0.5392 | 0.4608 | 0.0783 |
| All components | 0.5352 | 0.4648 | 0.0705 |

Table 3: Effect on end-to-end fairness by distribution matching for each component on the German Credit dataset.

In this section, we demonstrate our analytical framework on a public academic dataset, the German Credit data, as another example illustrating the effect of score composition on end-to-end fairness. This dataset provides a set of attributes for each person, including credit history, credit amount, installment rate, personal status, gender, age, etc., and the corresponding credit risk.

We assume the final score for assessing credit risk is composed of the following four attributes: 1) credit amount; 2) age; 3) number of existing credits at this bank (denoted “num_credits” below); and 4) number of people being liable to provide maintenance for (denoted “num_liable”). We consider the problem of ranking all people in this dataset by this score composition, and we take the end-to-end fairness metric to be the ranking exposure, as defined in Section 3.1.1, with respect to gender (male and female, as that is how gender is categorized in the dataset).

In the first setting, we assume the two groups have the same size, which means the top-ranked people within each gender group should receive the same ranking exposure (we restrict the larger group to the same size as the smaller group). In this case the ideal exposure gap is zero. Table 3 shows the effect on end-to-end fairness, in terms of the male and female representation in the final ranking and the gap between them. The method we use for improving the system is distribution matching, as defined in Eq. 3, and we use counterfactual testing (Section 4.2) to test the effect of fixing each component alone and of fixing different combinations of components. To save space, Table 3 reports only a sample of the multi-component combinations.

From Table 3 we can see that distribution matching on each component independently helps compositional fairness (column “Overall Gap”) to some extent: single-component fixes reduce the gap from 0.2162 to between 0.1704 and 0.1972. Fixing multiple components simultaneously helps more, and the overall gap is smallest (0.0705) when all components are fixed. In addition, different combinations of components help to different degrees. For example, fixing “credit amount” plus “age” (0.1304) and fixing “credit amount” plus “age” plus “num_liable” (0.0783) reduce the gap much further than the other two- and three-component fixes shown (0.1621 and 0.1145). This headroom analysis provides guidance on which components should be prioritized for improving end-to-end fairness.
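The counterfactual test over component subsets can be sketched as a simple enumeration. The code below is an illustration only: it uses synthetic stand-in scores rather than the German Credit data, the per-attribute shifts are hypothetical, `match_to_pooled` is a post-hoc quantile mapping assumed here in place of Eq. 3, and the gap is measured as the difference in top-k representation under product scoring.

```python
import random
from itertools import combinations


def match_to_pooled(values, groups):
    """Quantile-map each group's scores onto the pooled score distribution."""
    pooled = sorted(values)
    out = list(values)
    for g in set(groups):
        idx = sorted((i for i in range(len(values)) if groups[i] == g),
                     key=lambda i: values[i])
        for r, i in enumerate(idx):
            out[i] = pooled[int((r + 0.5) / len(idx) * len(pooled))]
    return out


def representation_gap(components, groups, k):
    """Top-k share of group 'm' minus share of group 'f' under product scoring."""
    final = [1.0] * len(groups)
    for scores in components.values():
        final = [f * s for f, s in zip(final, scores)]
    top = sorted(range(len(groups)), key=lambda i: final[i], reverse=True)[:k]
    share_m = sum(1 for i in top if groups[i] == "m") / k
    return share_m - (1 - share_m)


def counterfactual_gaps(components, groups, k, fix=match_to_pooled):
    """Gap after fixing every subset of components (empty subset = baseline)."""
    names = list(components)
    results = {}
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            patched = {name: fix(vals, groups) if name in subset else vals
                       for name, vals in components.items()}
            results[subset] = representation_gap(patched, groups, k)
    return results


# Synthetic stand-in data (NOT the German Credit dataset); shifts are hypothetical.
rng = random.Random(0)
groups = ["m"] * 300 + ["f"] * 300
shift = {"credit_amount": 0.6, "age": 0.4, "num_credits": 0.1, "num_liable": 0.2}
components = {
    name: [rng.gauss(3.0 + (d if g == "m" else 0.0), 0.5) for g in groups]
    for name, d in shift.items()
}
results = counterfactual_gaps(components, groups, k=60)
for subset, gap in sorted(results.items(), key=lambda kv: abs(kv[1])):
    label = " & ".join(subset) if subset else "None (baseline)"
    print(f"{label:50s} gap = {gap:+.3f}")
```

Sorting the subsets by remaining gap gives a rough prioritization: the subsets at the top of the printout are the fixes with the most headroom on this toy data.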

Figure 4: Gap between gender groups with respect to each position, by single-component distribution normalization, on the German Credit Dataset.
Figure 5: Gap between gender groups with respect to each position, by multi-component distribution normalization, on the German Credit Dataset.
| Fixed component(s) | Gap (CTR) | Gap (S1) | Gap (S2) | Group Acc. | Group Acc. | Overall Gap |
|---|---|---|---|---|---|---|
| None (baseline) | 0.1024 | 0.1781 | 0.1078 | 0.5862 | 0.7408 | 0.1546 |
| Matching on Marginal Distributions | | | | | | |
| CTR | 0.0049 | 0.1781 | 0.1078 | 0.6198 | 0.7084 | 0.0886 |
| Satisfaction 1 | 0.1024 | 0.0103 | 0.1078 | 0.6292 | 0.7054 | 0.0762 |
| Satisfaction 2 | 0.1024 | 0.1781 | 0.0202 | 0.6021 | 0.7270 | 0.1248 |
| All | 0.0049 | 0.0103 | 0.0202 | 0.6781 | 0.6546 | 0.0236 |
| Matching on Conditional Distributions | | | | | | |
| CTR | 0.0057 | 0.1781 | 0.1078 | 0.6164 | 0.7048 | 0.0884 |
| Satisfaction 1 | 0.1024 | 0.0092 | 0.1078 | 0.6258 | 0.7039 | 0.0781 |
| Satisfaction 2 | 0.1024 | 0.1781 | 0.0197 | 0.6003 | 0.7258 | 0.1255 |
| All | 0.0057 | 0.0092 | 0.0197 | 0.6697 | 0.6472 | 0.0225 |
| Matching on Delta Distributions | | | | | | |
| CTR | 0.0000 | 0.1781 | 0.1078 | 0.6473 | 0.7408 | 0.0935 |
| Satisfaction 1 | 0.1024 | 0.0000 | 0.1078 | 0.6669 | 0.7408 | 0.0739 |
| Satisfaction 2 | 0.1024 | 0.1781 | 0.0000 | 0.6197 | 0.7408 | 0.1211 |
| All | 0.0000 | 0.0000 | 0.0000 | 0.7630 | 0.7408 | 0.0222 |

Table 4: Effect on end-to-end fairness by distribution matching within each component, on a large-scale real-world recommender system.

In the second setting, we do not assume equal group sizes, and we rank the two groups in proportion to their respective sizes in this dataset. We vary the number of top positions and plot the exposure gap metric with respect to the positions. As a reference, we also plot the exposure gap under a random ordering that ranks each person regardless of gender (denoted “Gap (random)” in the figures and averaged over multiple runs). Figure 4 shows the end-to-end fairness (in terms of exposure gap) after applying distribution normalization to each component, and Figure 5 shows the results of applying distribution normalization to multiple components simultaneously. The title of each sub-figure indicates the components to which distribution normalization was applied. Per-component fixes improve the gap metric in most cases and, similar to the first setting, fixing different combinations of components improves the end-to-end fairness metric to varying degrees.
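The position-wise comparison against a random ordering can be sketched as follows. This is an illustration with hypothetical inputs: the group sizes (600 and 400), the score distributions, the number of runs, and the `1/log2(1 + rank)` discount are assumptions, not values from the dataset or from Section 3.1.1.

```python
import math
import random


def exposure_gap_at(order, groups, k):
    """Exposure share of 'male' minus 'female' over the first k ranked items."""
    share = {"male": 0.0, "female": 0.0}
    for rank, i in enumerate(order[:k], start=1):
        share[groups[i]] += 1.0 / math.log2(1 + rank)  # assumed log discount
    return (share["male"] - share["female"]) / (share["male"] + share["female"])


def gap_curve(scores, groups, ks):
    """Exposure gap at each cutoff k when ranking by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [exposure_gap_at(order, groups, k) for k in ks]


def random_gap_curve(groups, ks, runs=200, seed=0):
    """Average gap under random orderings that ignore group membership."""
    rng = random.Random(seed)
    order = list(range(len(groups)))
    totals = [0.0] * len(ks)
    for _ in range(runs):
        rng.shuffle(order)
        for j, k in enumerate(ks):
            totals[j] += exposure_gap_at(order, groups, k)
    return [t / runs for t in totals]


# Hypothetical group sizes and scores, for illustration only.
groups = ["male"] * 600 + ["female"] * 400
rng = random.Random(1)
scores = [rng.gauss(0.2 if g == "male" else 0.0, 1.0) for g in groups]
ks = [10, 50, 100, 500]
for k, gap, ref in zip(ks, gap_curve(scores, groups, ks), random_gap_curve(groups, ks)):
    print(f"top-{k:<4d} gap = {gap:+.3f}   Gap (random) = {ref:+.3f}")
```

With unequal group sizes, the random reference does not sit at zero: it settles near the difference in group proportions, which is why it is a more useful baseline than zero in this setting.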

5.3 Case Study on A Real Production System

In this section, we describe the results on a large-scale real-world recommender system. On an abstract level, the system mainly consists of three different components, one predicting the probability of click (denoted as “CTR"), and two other components predicting different signals of user satisfaction, denoted as “Satisfaction 1" and “Satisfaction 2".

In the following, we present results by fixing each individual component using distribution matching to

  • Match the marginal distribution of and , as in Eq. (3).

  • Match the conditional distributions of and , as well as the conditional distributions of and , building on Eq. (3).

  • Match the distribution on the delta terms: and , as in Eq. (4).

  • As a reference, to test the fairness on the extreme end, we set a constant value for all the (clicked, un-clicked) pairs from each component. This experiment is to explore what other conditions might help end-to-end fairness when we have per-component fairness.

The results are shown in Table 4. The first column denotes the “fixed” component(s), and columns 2-4 show the Pairwise Ranking Gap for each component (abbreviated “CTR”, “S1”, “S2”). Columns 5-7 show the overall (compositional) Pairwise Ranking Accuracy (as defined in Section 3.2) for each of the two groups, as well as the overall (compositional) Pairwise Ranking Gap. The table supports a few observations:

  • Compared to matching on marginal/conditional distributions, matching on the delta distributions is the only method that achieves exactly zero gap on the per-component gap metric (columns “Gap (CTR)”, “Gap (S1)”, and “Gap (S2)” in Table 4). This is consistent with our theory (Theorem 3).

  • Although marginal/conditional distribution matching does not provably ensure per-component fairness, empirically it still yields a substantial reduction in the per-component gaps (all at most 0.0202 once fixed), and it effectively improves compositional fairness (column “Overall Gap” in Table 4).

  • Compositional fairness is best achieved when all the components are fixed (overall gap between 0.0222 and 0.0236, versus 0.1546 at baseline), and fixing a single component helps to different extents. On this dataset, fixing “CTR” or “Satisfaction 1” has a larger effect on reducing the overall gap, while fixing “Satisfaction 2” has a relatively smaller effect.
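To make the per-component and overall columns concrete, the sketch below computes a pairwise ranking accuracy per group and the resulting gap on a tiny hand-built example. It assumes that accuracy is the fraction of (clicked, un-clicked) pairs in which the clicked item scores higher, that a pair is attributed to the group of its clicked item, and that the overall score is the product of the component scores; the four pairs and the group names `g0` and `g1` are hypothetical.

```python
import math


def pairwise_accuracy(pairs, score):
    """Per-group fraction of (clicked, unclicked) pairs ranked correctly."""
    hits, counts = {}, {}
    for group, clicked, unclicked in pairs:
        counts[group] = counts.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + (score(clicked) > score(unclicked))
    return {g: hits[g] / counts[g] for g in counts}


def gap(acc):
    """Pairwise ranking gap: spread between the best and worst group accuracy."""
    return max(acc.values()) - min(acc.values())


# Each pair: (group of the clicked item, clicked scores, unclicked scores),
# where each tuple holds hypothetical predictions from two components.
pairs = [
    ("g0", (0.6, 0.3), (0.5, 0.9)),
    ("g0", (0.3, 0.6), (0.9, 0.5)),
    ("g1", (0.9, 0.5), (0.3, 0.6)),
    ("g1", (0.5, 0.9), (0.6, 0.3)),
]

per_component = [pairwise_accuracy(pairs, lambda s, j=j: s[j]) for j in range(2)]
overall = pairwise_accuracy(pairs, math.prod)  # product composition

for j, acc in enumerate(per_component):
    print(f"component {j}: accuracy = {acc}, gap = {gap(acc)}")
print(f"overall:     accuracy = {overall}, gap = {gap(overall)}")
```

In this toy example each component has accuracy 0.5 for both groups (gap 0.0), yet the composed score orders every `g1` pair correctly and every `g0` pair incorrectly (gap 1.0), which is the kind of failure of composition the per-component columns alone cannot reveal.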

| Fixed Comp. | Group Acc. | Group Acc. | Overall Gap |
|---|---|---|---|
| None (baseline) | 0.5862 | 0.7408 | 0.1546 |
| First constant value | | | |
| CTR | 0.5050 | 0.6767 | 0.1717 |
| Satisfaction 1 | 0.6411 | 0.7508 | 0.1096 |
| Satisfaction 2 | 0.6111 | 0.7590 | 0.1479 |
| All | 1.0000 | 1.0000 | 0.0000 |
| Second constant value | | | |
| CTR | 0.9702 | 0.9891 | 0.0189 |
| Satisfaction 1 | 0.9690 | 0.9882 | 0.0192 |
| Satisfaction 2 | 0.9439 | 0.9787 | 0.0348 |
| All | 1.0000 | 1.0000 | 0.0000 |
| Third constant value | | | |
| CTR | 0.9993 | 0.9996 | 0.0003 |
| Satisfaction 1 | 0.9991 | 0.9994 | 0.0003 |
| Satisfaction 2 | 0.9964 | 0.9990 | 0.0026 |
| All | 1.0000 | 1.0000 | 0.0000 |

Table 5: Effect on end-to-end fairness by setting values on the (clicked, unclicked) pairs, on a large-scale real-world recommender system.

In addition, for the set of experiments that set a constant value on the (clicked, un-clicked) pairs, Table 5 shows the effect on the overall (compositional) gap metric. Note that any such constant achieves a zero per-component pairwise ranking gap, since all pairs are ordered correctly, but the overall compositional fairness varies with the value of the constant. Whichever value is set for a single component, the pairwise ranking accuracy of that component is 1 for both groups and its pairwise ranking gap is 0, yet choosing a larger value clearly improves the end-to-end fairness (“Overall Gap” in Table 5): with a single component fixed, the overall gap is between 0.1096 and 0.1717 in the first group of rows, but at most 0.0348 and 0.0026 in the second and third. Again, this suggests an interesting interplay between the prediction values, beyond the component’s fairness, and the effect on the overall system’s fairness.
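The dependence on the size of the constant can be illustrated with a toy simulation. How the constant is applied in the production system is not specified here, so the sketch assumes the simplest reading: one component is replaced so that the clicked item scores exactly `c` above the un-clicked item, the other component is left as is, and the final score is the product of the two. All data, the group names, and the `skill` offsets are hypothetical.

```python
import random


def overall_accuracy(pairs, c):
    """Overall pairwise accuracy per group when component 0 uses a constant margin c.

    Each pair is (group, u0, a1, b1): u0 is the un-clicked item's component-0
    score, and a1 / b1 are the clicked / un-clicked component-1 scores.
    """
    hits, counts = {}, {}
    for group, u0, a1, b1 in pairs:
        clicked = (u0 + c) * a1  # component 0 ranks clicked above un-clicked by c
        unclicked = u0 * b1
        counts[group] = counts.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + (clicked > unclicked)
    return {g: hits[g] / counts[g] for g in counts}


rng = random.Random(0)
pairs = []
for group, skill in (("g0", 0.0), ("g1", 0.3)):
    for _ in range(500):
        u0 = rng.uniform(0.1, 1.0)
        b1 = rng.uniform(0.1, 1.0)
        # Component 1 is left unfixed and is more accurate for group g1.
        a1 = min(1.0, max(0.05, b1 + rng.gauss(skill, 0.3)))
        pairs.append((group, u0, a1, b1))

for c in (0.01, 0.1, 1.0, 10.0):
    acc = overall_accuracy(pairs, c)
    print(f"c = {c:<5} acc = {acc}  gap = {abs(acc['g1'] - acc['g0']):.3f}")
```

For every positive `c` the fixed component orders all pairs correctly, so its own gap is zero. The overall accuracy, however, only approaches 1 for both groups once `c` is large enough to outweigh the errors of the unfixed component, which is consistent with the pattern described for Table 5.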

6 Conclusion

In this paper, we study the problem of compositional fairness in ranking: given a multi-component system where the end ranking score is the product of the scores from each component, does making each component fair independently improve the system’s end-to-end fairness? We formalize this problem for two recently proposed fairness metrics for ranking, fairness of exposure and pairwise ranking accuracy gap, and present examples where compositional fairness might not hold, in line with prior work Dwork and Ilvento (2018a).

While this lack of guarantees can be disheartening, we also present theory showing conditions under which end-to-end fairness follows from per-component fairness. Because the theory shows that composition is distribution-dependent, we propose taking a data-driven approach to this problem. We offer an analytical framework for diagnosing which components are most damaging to end-to-end fairness and for measuring how much improving per-component fairness will improve end-to-end fairness. By applying our analytical framework to multiple datasets, including a large real-world recommender system, we are able to identify the signals that lower the end-to-end fairness the most, and we observe that in practice most of the end-to-end exposure or accuracy gaps can be addressed by applying per-component improvements independently.

As most real-world ML systems are composed of many models and tasks, understanding how and when fairness composes is crucially important to enabling the application of fairness principles in practice. Our results highlight that while guarantees don’t hold in the worst-case, there is more nuance over realistic data distributions. As a result, we believe there is a lot of potential in generalizing both the theory and empirical frameworks to different applications, covering different data distributions, compositional functional forms, and fairness metrics.

Acknowledgements: The authors would like to thank Ben Packer for his valuable feedback on this paper.


  • G. Adomavicius and A. Tuzhilin (2005) Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. In TKDE, Cited by: §1.
  • A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. M. Wallach (2018) A reductions approach to fair classification. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 60–69. Cited by: 2nd item.
  • R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong (2009) Diversifying search results. In WSDM, Cited by: §2.
  • A. Beutel, J. Chen, T. Doshi, H. Qian, L. Wei, Y. Wu, L. Heldt, Z. Zhao, L. Hong, E. H. Chi, and C. Goodrow (2019a) Fairness in recommendation ranking through pairwise comparisons. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019., pp. 2212–2220. External Links: Link, Document Cited by: §1, §1, 2nd item, §2, §3.2, §4.1.1, §4.1.2.
  • A. Beutel, J. Chen, T. Doshi, H. Qian, A. Woodruff, C. Luu, P. Kreitmann, J. Bischof, and E. H. Chi (2019b) Putting fairness principles into practice: challenges, metrics, and improvements. arXiv preprint arXiv:1901.04562. Cited by: §1, 2nd item, §4.1.1.
  • A. Beutel, J. Chen, Z. Zhao, and E. H. Chi (2017) Data decisions and theoretical implications when adversarially learning fair representations. In 2017 Workshop on Fairness, Accountability, and Transparency in Machine Learning, Cited by: 2nd item.
  • T. Bolukbasi, K. Chang, J. Zou, V. Saligrama, and A. Kalai (2016) Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain., Cited by: §1, 1st item.
  • D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman (2019) Nuanced metrics for measuring unintended bias with real data for text classification. In Companion Proceedings of The 2019 World Wide Web Conference, pp. 491–500. Cited by: §1, §2, §3.2.
  • R. Burke (2002) Hybrid recommender systems: survey and experiments. In User Modeling and User-Adapted Interaction, Volume 12, Issue 4, pp. 331–370. Cited by: §1.
  • T. Calders, F. Kamiran, and M. Pechenizkiy (2009) Building classifiers with independency constraints. In Proceedings of the 2009 IEEE International Conference on Data Mining Workshops, ICDMW ’09, Washington, DC, USA, pp. 13–18. External Links: ISBN 978-0-7695-3902-7, Link, Document Cited by: §2.
  • G. Capannini, F. M. Nardini, R. Perego, and F. Silvestri (2011) Efficient diversification of web search results. In VLDB, Cited by: §2.
  • J. Carbonell and J. Goldstein (1998) The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR’98, pp. 335–336. Cited by: §2.
  • A. Cotter, M. Friedlander, G. Goh, and M. Gupta (2016) Satisfying real-world goals with dataset constraints. Cited by: 2nd item.
  • A. Cotter, M. Gupta, H. Jiang, N. Srebro, K. Sridharan, S. Wang, B. Woodworth, and S. You (2019) Training well-generalizing classifiers for fairness metrics and other data-dependent constraints. In ICML, Cited by: 2nd item.
  • C. Dwork and C. Ilvento (2018a) Fairness under composition. In arXiv preprint arXiv:1806.06122, Cited by: Practical Compositional Fairness: Understanding Fairness in Multi-Task ML Systems, §1, §1, §2, §6.
  • C. Dwork and C. Ilvento (2018b) Group fairness under composition. In FATML, Cited by: §1.
  • M. D. Ekstrand, M. Tian, M. R. I. Kazi, H. Mehrpouyan, and D. Kluver (2018) Exploring author gender in book rating and recommendation. In RecSys ’18, Proceedings of the 12th ACM Conference on Recommender Systems, pp. 242–250. Cited by: §1.
  • S. C. Geyik, S. Ambler, and K. Kenthapadi (2019) Fairness-aware ranking in search & recommendation systems with application to linkedin talent search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019., pp. 2221–2231. Cited by: §3.1.1.
  • S. Gollapudi and A. Sharma (2009) An axiomatic approach for result diversification. In WWW, Cited by: §2.
  • A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A kernel two-sample test. In The Journal of Machine Learning Research, Volume 13, pp. 723–773. Cited by: §4.1.1.
  • M. Hardt, E. Price, N. Srebro, et al. (2016) Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315–3323. Cited by: 3rd item, §2.
  • X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al. (2014) Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9. Cited by: §1, §1.
  • K. Järvelin and J. Kekäläinen (2002) Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems (TOIS) 20, pp. 422–446. Cited by: §3.1.1.
  • N. Kallus and A. Zhou (2019) The fairness of risk scores beyond classification: bipartite ranking and the xauc metric. arXiv preprint arXiv:1902.05826. Cited by: §1, §2, §3.2.
  • M. P. Kim, A. Ghorbani, and J. Zou (2019) Multiaccuracy: black-box post-processing for fairness in classification. In AIES ’19 Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, Cited by: 3rd item.
  • C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel (2016) The variational fair autoencoder. In ICLR, Cited by: 2nd item.
  • J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi (2018) Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1930–1939. Cited by: §1, §1.
  • D. Madras, E. Creager, T. Pitassi, and R. S. Zemel (2018) Learning adversarially fair and transferable representations. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 3381–3390. Cited by: 2nd item.
  • H. B. Mann and D. R. Whitney (1947) On a test of whether one of two random variables is stochastically larger than the other. In Annals of Mathematical Statistics, Cited by: §2, §3.2.
  • J. McAuley, J. Leskovec, and D. Jurafsky (2012) Learning attitudes and attributes from multi-aspect reviews. In 2012 IEEE 12th International Conference on Data Mining, pp. 1020–1025. Cited by: §1.
  • Y. Mroueh, T. Sercu, and V. Goel (2017) McGan: mean and covariance feature matching gan. In ICML, Cited by: §4.1.1.
  • K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf (2017) Kernel mean embedding of distributions: a review and beyond. Foundations and Trends in Machine Learning 10 (1-2), pp. 1–141. Cited by: §4.1.1.
  • H. Narasimhan, A. Cotter, M. Gupta, and S. Wang (2019) Pairwise fairness for ranking and regression. arXiv preprint arXiv:1906.05330. Cited by: §1, §2.
  • G. Pleiss, M. Raghavan, F. Wu, J. Kleinberg, and K. Q. Weinberger (2017) On fairness and calibration. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Cited by: 3rd item.
  • F. Radlinski, R. Kleinberg, and T. Joachims (2008) Learning diverse rankings with multi-armed bandits. In ICML, Cited by: §2.
  • A. Singh and T. Joachims (2018) Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pp. 2219–2228. External Links: Link, Document Cited by: §1, §1, 3rd item, §2, §3.1.1, §3.1.1.
  • A. Slivkins, F. Radlinski, and S. Gollapudi (2010) Learning optimally diverse rankings over large document collections. In ICML, Cited by: §2.
  • L. Wang, J. Lin, and D. Metzler (2011) A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp. 105–114. Cited by: §1, §1.
  • X. Yi, L. Hong, E. Zhong, N. N. Liu, and S. Rajan (2014) Beyond clicks: dwell time for personalization. In Proceedings of the 8th ACM Conference on Recommender systems, pp. 113–120. Cited by: §1, §1.
  • M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi (2015) Learning fair classifiers. Cited by: §2.
  • M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi (2017) Fairness Constraints: Mechanisms for Fair Classification. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 54, Fort Lauderdale, FL, USA, pp. 962–970. Cited by: 2nd item, §4.1.1.
  • M. Zehlike, F. Bonchi, C. Castillo, S. Hajian, M. Megahed, and R. Baeza-Yates (2017) Fa* ir: a fair top-k ranking algorithm. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1569–1578. Cited by: §2, §3.1.1.
  • B. H. Zhang, B. Lemoine, and M. Mitchell (2018) Mitigating unwanted biases with adversarial learning. In Association for the Advancement of Artificial Intelligence, Cited by: 2nd item.
  • I. Žliobaitė (2015) On the relation between accuracy and fairness in binary classification. ArXiv abs/1505.05723. Cited by: §2.