1 Introduction
Recent research has highlighted that even if two machine learning (ML) models are “fair,” a combination of their predictions can still be “unfair” Dwork and Ilvento (2018a, b). This is known as the compositional fairness problem. The problem has been shown to hold over multiple definitions of fairness. Composing many predictive models in the final product, however, is a pervasive design pattern in real production systems Adomavicius and Tuzhilin (2005); Burke (2002); He et al. (2014); Wang et al. (2011); Ma et al. (2018); Yi et al. (2014).
Most existing literature focuses on achieving fairness in the single-task (non-compositional) setting. Dwork and Ilvento (2018a) has studied a relatively restricted compositional setting, where the binary outputs of multiple classifiers are combined through logical operations to produce a single output. Regarding group fairness, Dwork and Ilvento (2018a) makes the case that classifiers that appear to satisfy group fairness properties may not compose to also satisfy those properties. Dwork and Ilvento (2018a) also raises the concern that composed systems that do satisfy group fairness properties may not be fair for socially meaningful subgroups (i.e., a system “fair” across gender or race may not be fair from the perspective of a specific gender-race subgroup).
Our high-level objective is to know what group fairness goals we can achieve in ranking-type problems. In this case, the end score is a rank derived by composing scores produced by different components. We assume we can only control each individual component independently to create overall system fairness. We use recently proposed ranking fairness metrics Singh and Joachims (2018); Beutel et al. (2019a), each capturing slightly different goals.
More specifically, we study the setting where each component outputs real values; the composition function is the multiplication of these component scores, and so is also real-valued. Mathematically, we frame the components as functions $f_k$, where $k = 1, \dots, K$. The overall system generates scores $f(x) = \prod_{k=1}^{K} f_k(x)$. These scores are then used for ranking. This design is common in recommender systems such as cascading recommenders He et al. (2014); Wang et al. (2011) and multi-task recommenders McAuley et al. (2012); Ma et al. (2018); Yi et al. (2014).
The fairness metric for each component is evaluated on the order (rank) of its scores. The composed system fairness metric is evaluated on the order (rank) of the product of the component scores. Evaluating the fairness metric on rank order aligns well with most real-world multi-model recommender systems, but can also be applied to classification Kallus and Zhou (2019); Borkan et al. (2019); Narasimhan et al. (2019).
A motivating example.
To better concretize the problem, consider the hypothetical example of a large-scale recommendation system for books, like the one described in Ekstrand et al. (2018). The recommender system has the following components:

Component 1: one that predicts the click-through rate on a book

Component 2: one that predicts the star rating given a clicked book
Let the fairness goal be demographic parity in ranking exposure. For example, the ranking of the composite score should not systematically differ between white and nonwhite authors. Each component could be made “fair” with respect to author demographics through recent mitigation methods Bolukbasi et al. (2016); Singh and Joachims (2018); Beutel et al. (2019b, a). What does this mean for demographic parity on the ranked composite scores?
A simple counterexample.
In the example above, it may feel intuitive to assume that if each component gives equal exposure to each group, the overall system should as well. We give a simple example here showing this is not the case. Assume we have the following books in each column:
Component  nonwhite  nonwhite  white  white 
Each component exposes books from each group equally: if we rank the scores of Component 1, we get [nonwhite, white, white, nonwhite]. If we rank the scores of Component 2, we get [nonwhite, white, white, nonwhite]. When the two components are multiplied together to form the composite score, we get the ranking [white, white, nonwhite, nonwhite]. This composite ranking does not have demographic parity, even though all individual component rankings do.
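The effect can be reproduced with a few lines of code. The sketch below uses hypothetical scores (the book names `A` to `D` and all numeric values are assumptions chosen only so that the three rankings match the ones described above; they are not the values from the original table).

```python
# Hypothetical scores, chosen only to reproduce the rankings described in the text.
books = {
    "A": {"group": "nonwhite", "f1": 1.0, "f2": 0.12},
    "B": {"group": "nonwhite", "f1": 0.1, "f2": 1.0},
    "C": {"group": "white", "f1": 0.6, "f2": 0.7},
    "D": {"group": "white", "f1": 0.5, "f2": 0.5},
}


def ranked_groups(score_fn):
    """Return the author groups ordered from highest to lowest score."""
    order = sorted(books, key=lambda name: score_fn(books[name]), reverse=True)
    return [books[name]["group"] for name in order]


rank_f1 = ranked_groups(lambda b: b["f1"])                   # Component 1 alone
rank_f2 = ranked_groups(lambda b: b["f2"])                   # Component 2 alone
rank_composite = ranked_groups(lambda b: b["f1"] * b["f2"])  # product of both

print(rank_f1)
print(rank_f2)
print(rank_composite)
```

Each component places one book from each group in its top two positions, while the product places both white-authored books first.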
In practice, large systems are designed with compositions; there are theoretical fairness risks to such systems. We try to ease this tension between the practical and the theoretical: we describe mathematical conditions under which compositional fairness holds, and we demonstrate how to empirically test a system for compositional fairness. Our contributions in this paper:

Theory: We provide theory showing a set of conditions under which fair components can compose into fair systems.

System Understanding: We provide a framework for both understanding whether a system’s signals can achieve compositional fairness, and diagnosing which of these signals lowers the overall system fairness the most.

Empirical Analysis: Although compositional fairness is theoretically not guaranteed, on multiple datasets, including a large-scale real-world recommender system, we demonstrate that the overall system fairness is largely achievable by improving fairness in individual components.
2 Related Work
Fairness in Classification
The majority of the fairness metric definition literature focuses on classification. Here we cover some examples of fairness metrics in classification. Demographic parity Calders et al. (2009); Žliobaitė (2015); Zafar et al. (2015) is a common way of addressing discrimination against protected attributes. It requires a decision to be independent of the protected attribute.
Equalized odds, proposed by Hardt et al. (2016), is a fairness metric for a specified sensitive attribute in supervised learning. Equalized odds requires a predictor to be independent of the sensitive attribute, conditioned on the true label $Y$. This metric equivalently equalizes the true positive rates as well as the false positive rates across the two demographics, to prevent classification models from performing well only on the majority group. Equal opportunity, also proposed by Hardt et al. (2016), is a relaxation of equalized odds. It focuses only on the “advantaged” outcome $Y = 1$. More recently, metrics have been explored for continuous scores from a classifier: Kallus and Zhou (2019); Borkan et al. (2019); Narasimhan et al. (2019) all break down the AUC of these scores into Mann-Whitney U-tests Mann and Whitney (1947).
Fairness in Ranking
Recently, there have been a few definitions proposed for fairness in the ranking setting as well Zehlike et al. (2017); in our work we focus on two recent framings. Singh and Joachims (2018) proposes measuring the exposure an item or group of items gets depending on what position they fall in a ranking. The work offers multiple fairness goals, such as exposure proportional to relevance, but in our usage we build on this notion of exposure to measure group representation throughout a ranked list (as we ignore any label or relevance, this is philosophically closer to the principles of demographic parity above). Beutel et al. (2019a) focuses on measuring accuracy in a recommender system based on pairwise comparisons. The accuracy of a ranking for a pair of items is defined as the probability that the clicked item is ranked higher than the unclicked item. In this setup, two items from different groups are used to create a pair, and the difference in accuracy for each group is used as a fairness metric. We use this metric to capture fairness in ranking more closely aligned with equal opportunity (as it measures accuracy with respect to a label, namely clicks).
In Section 3, we will formalize the above two definitions within the same framework. For each of the ranking metrics listed, given per-component fairness, we will show conditions where the compositional fairness holds (and counterexamples where it might not hold).
Ranking diversification is a closely related area to ranking fairness. Here the goal is to diversify the ranking results to improve user satisfaction Slivkins et al. (2010); Radlinski et al. (2008); Gollapudi and Sharma (2009); Capannini et al. (2011); Agrawal et al. (2009); Carbonell and Goldstein (1998). In this paper, we focus on the ranking fairness goal with respect to specified demographic groups. In some cases, general-purpose diversification may not align with fairness for certain subgroups.
Compositional Fairness
Dwork and Ilvento (2018a) has studied general constructions for fair composition, and showed that classifiers that are fair in isolation do not necessarily compose into fair systems, for individuals or for groups. Furthermore, systems that are fair to different types of groups may not be fair to intersectional subgroups. The authors studied the “Functional Composition” setting, where the assumption is that the binary outputs of multiple classifiers are combined through logical operations to produce a single output for a single task. Specifically, a notion of “OR Fairness” is proposed and relevant theory is developed in this setting.
Mitigation
Many papers have proposed different approaches for achieving fairness in the single-task (non-compositional) setting. The approaches can be partitioned by when they intervene in the creation of an ML system: preprocessing, training, and postprocessing.

Preprocessing the data, like obfuscation on the protected attributes, or debiasing on the input features like word embeddings Bolukbasi et al. (2016).

Incorporating fairness into the learning objective: Numerous approaches have been proposed to improve fairness metrics during training, including constraints Cotter et al. (2016, 2019); Agarwal et al. (2018), regularization Zafar et al. (2017); Beutel et al. (2019b), and adversarial learning Louizos et al. (2016); Zhang et al. (2018); Beutel et al. (2017); Madras et al. (2018). The regularization approaches are most similar to our analysis, encouraging matching the distribution of predictions Zafar et al. (2017); Beutel et al. (2019b) or representation Beutel et al. (2017); Madras et al. (2018) across groups. Beutel et al. (2019a) built on this for ranking, proposing pairwise regularization to the objective function.

Postprocessing on the predictions, e.g., Pleiss et al. (2017); Kim et al. (2019). For example, Hardt et al. (2016) uses different classification thresholds per group at inference time, and Singh and Joachims (2018) proposes solving a linear program to achieve fair exposure in rankings.
3 Fairness Metrics and Theoretical Analysis
To formalize the problem, we denote $x$ as the item being ranked, and for simplicity assume there are two groups being considered, Group $G_0$ and Group $G_1$.
We formulate the key problem we want to study in this paper as follows. We have a system composed of $K$ components, where each component $f_k$ takes an input $x$ and produces its own score $f_k(x)$; the overall composition function that produces the final ranking score is $f(x) = \prod_{k=1}^{K} f_k(x)$. If all components $f_k$ satisfy some fairness metric by themselves, we ask whether the overall function $f$ satisfies the same fairness metric. Restated, we would like to know whether the system achieves compositional (end-to-end) fairness given that each component has achieved fairness independently.
In the sections below, we consider two commonly used fairness metrics in ranking: ranking exposure, and pairwise ranking accuracy. We describe each metric, and explore how the function composition affects end-to-end fairness for that metric.
3.1 Ranking exposure as the fairness metric
3.1.1 Definition and Examples
In this section, we focus on the cases where the ranking exposure Singh and Joachims (2018) is the fairness metric.
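Formally, we define the ranking exposure for any Group $G$ as:
$$\mathrm{Exposure}(G \mid \pi) = \sum_{x \in G} u(x \mid \pi)$$
under a certain ranking order $\pi$. Here $u$ denotes the utility function, which is usually a monotonically decreasing function with respect to the rank of the item. For example, one common choice is to use an exponent $\alpha$, and
$$u(x \mid \pi) = \frac{1}{\mathrm{rank}(x \mid \pi)^{\alpha}}$$
where $\mathrm{rank}(x \mid \pi)$ is the rank of item $x$ under the ranking order $\pi$. Another common option is $u(x \mid \pi) = 1 / \log(1 + \mathrm{rank}(x \mid \pi))$, similar to the position discount as defined in Discounted Cumulative Gain (DCG) Järvelin and Kekäläinen (2002).
Now the fairness exposure metric between Group $G_0$ and Group $G_1$ under the ranking $\pi$ is defined as:
$$\mathrm{ExposureGap}(G_0, G_1 \mid \pi) = \frac{\left| \mathrm{Exposure}(G_0 \mid \pi) - \mathrm{Exposure}(G_1 \mid \pi) \right|}{\mathrm{Exposure}(G_0 \mid \pi) + \mathrm{Exposure}(G_1 \mid \pi)} \qquad (1)$$
which denotes the normalized gap between the exposure for Group $G_0$ and $G_1$. This gap metric ranges from $0$ to $1$, where $0$ means there is an equal exposure ($0.5$ each) of both groups, and $1$ means one group has all the exposure ($1$) while the other group has no exposure at all.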
Note, when the two groups have the same size, i.e., , the ideal exposure gap should reach zero in order to be fair for the two groups. This is not the case when the two groups have different sizes, for example, if , then one might argue that a reasonable exposure for and could be proportional to their sizes, i.e., , so the exposure gap is . In the following for simplicity we always assume the two groups have the same size.
Intuitively, this metric (here unconditioned on relevance) makes the goal providing a diverse ranking with each group being well represented throughout the ranked list. While we build on Singh and Joachims (2018) for framing, similar intuitions were previously proposed in Zehlike et al. (2017) and used in job search applications Geyik et al. (2019).
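As a rough illustration of the exposure gap, the sketch below computes each group's share of the total exposure and the absolute difference between the two shares. It assumes a rank-based utility of the form 1 / rank^alpha; the function name `exposure_gap`, the optional `top_k` cutoff, the group labels, and the scores are all hypothetical choices made for this example.

```python
def exposure_gap(scores, groups, alpha=0.0, top_k=None):
    """Return (exposure share per group, normalized gap) for exactly two groups.

    Items are ranked by descending score. The item at position `rank`
    contributes 1 / rank**alpha (an assumed utility). If `top_k` is given,
    only the first `top_k` positions receive any exposure.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    exposure = {g: 0.0 for g in sorted(set(groups))}
    for rank, i in enumerate(order, start=1):
        exposure[groups[i]] += 1.0 / rank ** alpha
    total = sum(exposure.values())
    shares = {g: e / total for g, e in exposure.items()}
    first, second = shares  # unpacking fails unless there are exactly two groups
    return shares, abs(shares[first] - shares[second])


# Hypothetical two-component system with two items per group.
groups = ["G0", "G0", "G1", "G1"]
f1 = [1.0, 0.1, 0.6, 0.5]
f2 = [0.12, 1.0, 0.7, 0.5]
composite = [a * b for a, b in zip(f1, f2)]

_, gap_f1 = exposure_gap(f1, groups, alpha=0.0, top_k=2)
_, gap_f2 = exposure_gap(f2, groups, alpha=0.0, top_k=2)
shares, gap_composite = exposure_gap(composite, groups, alpha=0.0, top_k=2)
print(gap_f1, gap_f2, gap_composite, shares)
```

With these scores, each component on its own has a zero gap over the top two positions, while the product gives all of the top-two exposure to one group.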
Counterexample
Here we give a counterexample and show that under the fairness exposure metric, per-component fairness does not always guarantee end-to-end fairness.
Consider the following example with a ranking system composed of two components, two groups , and each group has two items. For any , suppose:
For simplicity we assume , i.e., each rank position contributes equally to the exposure metric, and we consider the exposure for the first two positions. Assume are the rankings produced by each individual component scores , respectively, then for each component independently, we have
because within each component, the two highest-scored items (with scores , ) are from , respectively; in other words, each component by itself is fair. But when combined, denote to be the ranking produced by ordering the items based on the composite score , we have
i.e., Group is always ranked below the Group . Specifically, for the first two positions, and , and .
One might think the magnitude of the scores for each component will play a role here, but the above example also suggests this is not the case, since we can make arbitrarily small to make the magnitude of the score for the two components arbitrarily close to each other, while keeping .
Distribution Normalization
In the above example, if we normalize the distribution of $f_k(x)$ for each group (i.e., $(f_k(x) - \mu) / \sigma$, where $\mu, \sigma$ represent the per-group mean and standard deviation of the scores), then it is easy to show that we can achieve compositional fairness.
However, we present a slightly modified example that shows sometimes distribution normalization might not work:
After normalization, we have:
Again we can see that when the scores are combined , we have for Group , , and for Group , , i.e., Group is always ranked above the Group .
On the other hand, this problem can simply be solved by adding any constant shift to make sure all the scores are positive, e.g., for a shift of :
Now the two items from Group will be ranked in between the two items from Group .
3.1.2 Condition for composition of ranking exposure
Now we present theory showing under what conditions we will achieve end-to-end fairness given we have per-component fairness, using the ranking exposure (§3.1.1) as the fairness metric.
Consider a system with two components and , and two groups . Let represent the random variable defined by , represent the random variable defined by , and similarly we define for . For simplicity we assume the higher half and the lower half of the items receive different exposure values of after sorting over the scores produced by and , respectively, i.e., to achieve per-component fairness, we have , and . We have the following:
Theorem 1.
If are symmetric random variables such that and are also symmetric, then per-component fairness on and means we have compositional fairness for .
Proof.
From percomponent fairness on and we have: and . Hence
(by symmetry of )  
(by linearity of expectation)  
(by symmetry of and )  
(by percomponent fairness)  
(by symmetry of and )  
(by linearity of expectation)  
(by symmetry of )  
By the monotonicity of the function, the above equation gives
i.e., the compositional fairness holds for the entire system. ∎
3.2 Pairwise ranking accuracy as the fairness metric
Recently another fairness metric in ranking has been proposed Beutel et al. (2019a), where the idea is to compute the accuracy of a system ranking a pair of items correctly conditioned on the true feedback information (e.g., one being clicked and another not being clicked). The pair of items is constrained to come from two different groups, and , through randomized experiments.
Formally, the Pairwise Ranking Accuracy is defined as:
Here denotes the observed label for , as either being clicked: , or not clicked: .
The empirical estimate is given by counting the item pairs , where , normalized by the total number of item pairs from that satisfy .
Correspondingly, the Pairwise Ranking Gap is defined as:
Pairwise Ranking Gap  
(2) 
In other words, given a pair of items, one from group and one from group , conditioned on one item being clicked and the other not being clicked, we would like the system to have the same accuracy of ranking this pair of items correctly, regardless of which group the clicked item is from.
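The empirical estimate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the group labels `G0` and `G1`, the encoding of clicked as `1` and not clicked as `0`, and the data are all assumptions made for the example.

```python
def pairwise_accuracy(scores, labels, groups, clicked_group, unclicked_group):
    """Fraction of cross-group (clicked, unclicked) pairs ranked correctly.

    A pair consists of a clicked item (label 1) from `clicked_group` and an
    unclicked item (label 0) from `unclicked_group`; it counts as correct
    when the clicked item has the strictly higher score.
    """
    correct = 0
    total = 0
    for i in range(len(scores)):
        if groups[i] != clicked_group or labels[i] != 1:
            continue
        for j in range(len(scores)):
            if groups[j] != unclicked_group or labels[j] != 0:
                continue
            total += 1
            if scores[i] > scores[j]:
                correct += 1
    return correct / total if total else float("nan")


def pairwise_ranking_gap(scores, labels, groups, group_a="G0", group_b="G1"):
    """Absolute difference between the two directional pairwise accuracies."""
    acc_a = pairwise_accuracy(scores, labels, groups, group_a, group_b)
    acc_b = pairwise_accuracy(scores, labels, groups, group_b, group_a)
    return abs(acc_a - acc_b)


# Hypothetical data: three items per group.
scores = [0.9, 0.4, 0.3, 0.8, 0.5, 0.2]
labels = [1, 1, 0, 1, 0, 0]
groups = ["G0", "G0", "G0", "G1", "G1", "G1"]

acc_g0_clicked = pairwise_accuracy(scores, labels, groups, "G0", "G1")
acc_g1_clicked = pairwise_accuracy(scores, labels, groups, "G1", "G0")
gap = pairwise_ranking_gap(scores, labels, groups)
print(acc_g0_clicked, acc_g1_clicked, gap)
```

In this toy data the system ranks pairs correctly more often when the clicked item comes from `G1`, which shows up as a nonzero gap.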
Let represent the random variable defined by , for ; represent the random variable defined by for , and are defined similarly. We can simplify the Pairwise Ranking Gap metric as
Pairwise Ranking Gap  
This pairwise ranking accuracy has a nice connection with the Mann-Whitney U-test Mann and Whitney (1947), and aligns well with the equality gap metric Borkan et al. (2019) and the xAUC metric Kallus and Zhou (2019) for classification.
Counterexample.
In the following we present a simple example that shows per-component fairness might not lead to compositional fairness, using the pairwise ranking gap as the fairness metric.
Consider the following system with two components, two groups, and two items within each group:
Component  
Pairwise Ranking Acc  0.0  0.5 
For and , i.e., we have two clicked items from group and two unclicked items from group , the Pairwise Ranking Accuracy under the composite function is 0.0, because both clicked items from receive a lower prediction score () than unclicked items from (). On the other hand, for and , i.e., when we have two clicked items from group and two unclicked items from group , the Pairwise Ranking Accuracy is much higher, 0.5. In other words, the predictor does not have equal treatment for ranking the items from and .
4 Analytical Framework
As the theoretical results show, improving the fairness of individual components can affect the fairness of the composite score, but whether we can expect compositional fairness to hold depends on the components and the relationship between them. Therefore we ask: if we have a multi-component system where we observe fairness issues, how much will improving the fairness of each component help the overall system’s fairness?
Taking this data-driven view of the problem, we find there are multiple questions that we can tractably answer:

How much would “fixing” a particular component improve the combined system’s fairness?

Given a system with a fairness issue, improving which components would yield the greatest benefit?

If all components were independently “fixed,” what would be the resulting fairness metrics for the combined system?
We describe below our analytical framework for answering these questions.
4.1 Per-Component Fixes
First we consider which classes of methods for improving the fairness of a model are realistic. For example, while multiplying all model predictions by zero will result in good fairness metrics, it is unrealistic in that it destroys the usefulness of the system. Rather, we consider two methods, which we believe are realistic, as we explain below.
4.1.1 Distribution Matching
A significant amount of academic literature Gretton et al. (2012); Mroueh et al. (2017), as well as publications on what is used in practice Beutel et al. (2019b, a), take the perspective of regularizing the model such that the distributions of predictions from each group (sometimes conditioned on the label) match. Under different formulations this has been done by comparing the covariance Zafar et al. (2017), correlation Beutel et al. (2019b, a), and maximum mean discrepancy Gretton et al. (2012); Muandet et al. (2017) between the distributions. Therefore, we consider whether matching the groups’ distributions of predictions for each model has the desired effect on the combined fairness metrics.
As this is an analytical framework, in contrast to a training framework, we can easily do this offline by directly changing the predictions over our dataset. We consider distribution matching for a component . In order to match the distributions, we sort all examples in each group by their scores . We define by a sorted vector of scores for examples in Group and by the mapping of examples to positions in this sorted list, i.e. and for all ; we similarly define and for examples from Group . For simplicity, we assume the number of examples from each group is equal, . Therefore, when matching the distributions, we define the “fixed” component as follows:
(3)
That is, for examples in , returns the score for the similarly ranked item from such that the empirical distribution over and exactly matches. Note, and are the empirical cumulative distribution function (CDF) for over and respectively, and as such, if the two groups differ in size, then simple interpolation to match the empirical CDFs can be used.
Theorem 2.
Proof.
Given the definition in Eq. (1), it is easy to see that
Because there is an exact one-to-one correspondence of and that gives exactly one pair of , which cancels each other given and thus results in a zero gap. ∎
Note that in real applications a tie-breaking strategy is still needed; assuming the tie-breaking strategy is random, the above approach should achieve an exposure gap close to zero.
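A minimal sketch of rank-based distribution matching for a single component is given below. It assumes equally sized groups and, as an arbitrary choice for this example, keeps the scores of a reference group and rewrites the scores of the other group; the function name, the group labels, and the data are hypothetical.

```python
def match_distribution(scores, groups, reference="G0", target="G1"):
    """Give each target-group item the reference-group score at the same
    within-group rank, so both groups end up with identical score sets."""
    ref_sorted = sorted(s for s, g in zip(scores, groups) if g == reference)
    # Indices of target-group items, ordered from lowest to highest score.
    target_idx = sorted(
        (i for i, g in enumerate(groups) if g == target),
        key=lambda i: scores[i],
    )
    if len(ref_sorted) != len(target_idx):
        raise ValueError("this sketch assumes equally sized groups")
    fixed = list(scores)
    for position, i in enumerate(target_idx):
        fixed[i] = ref_sorted[position]
    return fixed


# Hypothetical component whose scores favor G0.
groups = ["G0", "G0", "G0", "G1", "G1", "G1"]
scores = [0.9, 0.7, 0.5, 0.4, 0.2, 0.1]

fixed = match_distribution(scores, groups)
print(fixed)
```

After the fix, every score value appears once in each group, so the ordering between groups is decided entirely by the tie-breaking strategy mentioned above.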
4.1.2 Label-Conditioned Distribution Matching
The above approach only recalibrates the predictions by group but does not necessarily align with any labels for the task. As such, for the pairwise fairness metric Eq. (2) the method as described is not guaranteed to give per-component pairwise fairness. For that, we provide a slight modification of the algorithm above. We consider and to be the set of examples in with a negative and positive label, respectively; we similarly define and .
We define a delta term between all pairs from and and similarly between all pairs from and , i.e.,
In the following we propose a method that exactly matches the empirical distribution between and , which aligns with the regularization proposed in Beutel et al. (2019a). We also show that it suffices to match and to achieve fairness for each component .
We define by a sorted vector of scores in , and by the mapping of examples to positions in this sorted list, i.e. and for all ; we similarly define and for examples from .
We again for simplicity, assume the number of examples from each group is equal, i.e., . Note the definition of the pairwise ranking accuracy implies that the number of examples from and is the same in order to form pairs, hence , and similarly . Given the above assumption we essentially have the same number of examples for all quadrants, i.e., .
Therefore, to exactly match on the delta terms, we keep the scores for the deltas on one pair of the groups (e.g., ), and fix the scores on the other pair of groups (e.g., , which can be achieved by either changing the scores for or ). For example, suppose we only fix the scores for , we define the “fixed” component as follows:
(4) 
Proof.
It is easy to see that the is given by
i.e., it is equal to the percentage of positive deltas in by definition. Similarly, the is
and is equal to the percentage of positive deltas in .
4.1.3 Distribution Normalization
While the above procedure is provably guaranteed to achieve per-component fairness under the definitions given previously, in practice we would want to use a regularization on the model for this goal, which will be noisier. As such, we consider a lighter-weight approach: per-group normalization.
Definition 1 (Per-Group Normalization).
For groups and , we modify component to incorporate per-group normalization by:
(5)
where is the empirical mean and standard deviation on , for or , respectively. While is not guaranteed to provide even per-component fairness under either definition, we find in practice it too can significantly improve end-to-end fairness.
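Per-group normalization can be sketched as standardizing each score with the mean and standard deviation of its own group. The example below is an illustration under assumptions: it uses the population standard deviation (`statistics.pstdev`), requires at least two distinct scores per group, and the function name and data are hypothetical.

```python
from statistics import mean, pstdev


def normalize_per_group(scores, groups):
    """Standardize every score using the empirical mean and (population)
    standard deviation of the group the item belongs to."""
    stats = {}
    for g in set(groups):
        values = [s for s, item_group in zip(scores, groups) if item_group == g]
        stats[g] = (mean(values), pstdev(values))
    # Division fails if a group has zero spread; this sketch does not handle it.
    return [(s - stats[g][0]) / stats[g][1] for s, g in zip(scores, groups)]


# Hypothetical component whose G1 scores live on a much larger scale.
groups = ["G0", "G0", "G0", "G1", "G1", "G1"]
scores = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]

normalized = normalize_per_group(scores, groups)
print(normalized)
```

Unlike distribution matching, this only aligns the first two moments of each group's scores, which is consistent with the remark above that it does not guarantee per-component fairness.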
4.2 Counterfactual Testing
How can we use the modified functions described above to understand the system’s end-to-end fairness properties? All of the questions given at the beginning of this section are counterfactual questions: what would happen if we succeeded in fixing a component or set of components? With the above methods for simulating a fixed component (without actually changing the model training), we can do this headroom analysis.
Per-Component Effect
As before, we assume we have components which are multiplied together such that the overall score given to an example by the system is . Even when improving the fairness of one component, it is not guaranteed to improve the fairness of the overall system. For example, two components could be equally biased in opposite directions such that improving only one actually worsens the end-to-end fairness metrics.
Therefore, we use the above per-component modifications to test the effect of independently improving individual components. We will use to characterize a modified component as described above, i.e., . With this we can simulate how the system would behave if we improve a given component :
Definition 2 (Improved System).
Given a system with components , and a simulated improved component for component , we define the improved system as:
(6) 
With this we can measure the fairness of system , using either Eq. 1 or Eq. 2, to understand this counterfactual: if we improved the fairness of component , what would be the resulting end-to-end fairness?
Given that we can now answer this counterfactual question concretely, we now ask: which components should we prioritize improving? First, we define the degree to which improving a given component helps the end-to-end fairness:
Definition 3 (Fairness Improvement).
Given a system and an improved system , we define the fairness improvement by
(7) 
Finally, we can measure the fairness improvement FI for all components and sort them in decreasing order to find the components for which an improvement would have the largest effect.
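The per-component counterfactual test can be sketched end to end as follows. This is an illustration under several assumptions: the fairness improvement is taken to be the baseline gap minus the gap of the improved system, the gap is measured as the difference in group counts over the top `top_k` positions, the simulated fix is rank-based distribution matching onto group `G0`, and all names and scores are hypothetical.

```python
from math import prod


def composite(components):
    """Multiply the component scores item by item."""
    return [prod(column) for column in zip(*components)]


def top_k_gap(scores, groups, top_k):
    """Normalized difference in group counts among the top_k ranked items."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    counts = {g: 0 for g in sorted(set(groups))}
    for i in order[:top_k]:
        counts[groups[i]] += 1
    first, second = counts  # exactly two groups expected
    return abs(counts[first] - counts[second]) / top_k


def match_to_reference(scores, groups, reference="G0", target="G1"):
    """Simulated fix: copy reference-group scores onto the target group by rank."""
    ref_sorted = sorted(s for s, g in zip(scores, groups) if g == reference)
    target_idx = sorted(
        (i for i, g in enumerate(groups) if g == target),
        key=lambda i: scores[i],
    )
    fixed = list(scores)
    for position, i in enumerate(target_idx):
        fixed[i] = ref_sorted[position]
    return fixed


def fairness_improvements(components, groups, top_k):
    """Return (component index, improvement) pairs, largest improvement first."""
    baseline = top_k_gap(composite(components), groups, top_k)
    improvements = {}
    for k in range(len(components)):
        trial = list(components)
        trial[k] = match_to_reference(components[k], groups)  # fix only component k
        improvements[k] = baseline - top_k_gap(composite(trial), groups, top_k)
    return sorted(improvements.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical system: component 0 strongly favors G0, component 1 mildly favors G1.
groups = ["G0"] * 3 + ["G1"] * 3
f1 = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
f2 = [0.6, 0.5, 0.4, 0.65, 0.55, 0.45]

ranking = fairness_improvements([f1, f2], groups, top_k=2)
print(ranking)
```

In this toy system, fixing component 0 closes the top-2 gap entirely while fixing component 1 alone changes nothing, so component 0 would be prioritized.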
Overall System Effect
While the procedure described above is valuable for understanding which components are more important for improving the end-to-end system, it does not tell us how much improving each component independently will ultimately improve the overall system’s fairness. For that, we build on the counterfactual testing above but now across all of the components. That is, we define the all-components improved system as follows:
Definition 4 (All-Components Improved System).
Given a system with components , and for each component a simulated improved version , we define the improved system as:
(8) 
Note as before we assume . Finally, with this, we can test how the improved system where each component is fair performs on the fairness metrics.
5 Experiments
We now use a combination of synthetic and real-world experiments to explore how well fairness composes in different settings.
5.1 Synthetic Data
We begin with presenting experiments on synthetic datasets to demonstrate the relationship between per-component fairness and compositional fairness. Again we assume the system has two components and , and we evaluate the fairness metrics with respect to two groups and .
Dataset with independent Gaussian distributions.
Assume
We draw examples from each group. Figure 1 (left) shows the distribution of this synthetic dataset, with the x-axis representing the scores from component , and the y-axis representing component , . The two different colors show the two groups, respectively.
Table 1 shows the fairness metric (in terms of the ranking exposure, as defined in Section 3.1.1, with a ). We see that in the original data, Group gets significantly more exposure ( more). We start by applying the fix on each component by distribution matching (as defined by Eq. 3). From the table we see that fixing only one component has a very limited effect on the overall system’s fairness, and the end-to-end fairness can only be achieved by fixing both components. Second, we apply the fix on each component by distribution normalization (as defined by Eq. 5). Compared to the distribution matching approach, this is much more effective in reducing the gap between the two groups while fixing only one component at a time.
Table 1: Ranking exposure per group and overall exposure gap on the synthetic dataset with independent Gaussian distributions.
Fixed Component(s)  Group $G_0$  Group $G_1$  Overall Gap
None (baseline)  0.7640  0.2360  0.5281
Distribution Matching
Component 1  0.7433  0.2567  0.4865
Component 2  0.6856  0.3144  0.3712
Both  0.4818  0.5182  0.0365 (*)
Distribution Normalization
Component 1  0.5472  0.4528  0.0943
Component 2  0.5470  0.4530  0.0940
Both  0.4858  0.5142  0.0285
(*) The gap cannot be exactly 0 because of discretization effects at the top of the list.
Table 2: Ranking exposure per group and overall exposure gap on the synthetic dataset with anti-correlated Gaussian distributions.
Fixed Component(s)  Group $G_0$  Group $G_1$  Overall Gap
None (baseline)  0.7699  0.2301  0.5398
Distribution Matching
Component 1  0.7602  0.2398  0.5205
Component 2  0.7318  0.2682  0.4636
Both  0.6262  0.3738  0.2524
Distribution Normalization
Component 1  0.6156  0.3844  0.2312
Component 2  0.5765  0.4235  0.1529
Both  0.6950  0.3050  0.3899
Dataset with anti-correlated Gaussian distributions.
In this experiment, we follow the exact same setting as the previous experiment except changing for (we choose of the first Gaussian such that , same as the first dataset) to create some anti-correlation between and for group . Again examples are sampled for each group. Figure 1 (right) shows the distribution of this synthetic dataset. Compared to the first dataset, we can clearly see this anti-correlation showing up as a very different shape for group .
Table 2 shows the fairness metrics. Compared with Table 1, we can see that the anti-correlation makes the end-to-end fairness metric much harder to achieve. Figures 2 and 3 show the histograms of the final ranking scores by distribution matching (left) and distribution normalization (right), on synthetic data 1 and 2, respectively. We can also observe that the scores are matched much better when there is no anti-correlation between the component scores.
5.2 German Credit Data
Fixed Component(s)  Male Rep.  Female Rep.  Overall Gap  
None (baseline)  0.6081  0.3919  0.2162  
credit amount  0.5852  0.4148  0.1704  
age  0.5865  0.4135  0.1731  
num_credits  0.5986  0.4014  0.1972  
num_liable  0.5953  0.4047  0.1907  
credit amount & age  0.5652  0.4348  0.1304  

0.5810  0.4190  0.1621  

0.5572  0.4428  0.1145  

0.5392  0.4608  0.0783  
All components  0.5352  0.4648  0.0705 
In this section, we demonstrate our analytical framework on a public academic dataset: the German Credit data^{2}^{2}2https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data), as another example to illustrate the effect of score composition on the endtoend fairness. This dataset provides a set of attributes for each person, including credit history, credit amount, installment rate, personal status, gender, age, etc., and the corresponding credit risk.
We assume the final score for assessing credit risk is composed by the following four attributes: 1) credit amount; 2) age; 3) Number of existing credits at this bank (denoted as “num_credits" in the following), 4) Number of people being liable to provide maintenance for (denoted as “num_liable"). We consider the problem of ranking all people in this dataset by the above score composition, and we consider the endtoend fairness metric to be the ranking exposure with respect to gender: male, female^{3}^{3}3As that is how gender is categorized in the dataset.. The fairness metric we consider is the ranking exposure, as defined in Section 3.1.1, again with a .
In the first setting, we assume the group size to be the same, i.e., , which means the top people within each gender group should receive the same ranking exposure (we restrict the larger group to be of the same size as the smaller group, ). In this case the ideal exposure gap should reach zero. Table 3 shows the effect on end-to-end fairness, in terms of the percentage of male and female representation in the end ranking, as well as the gap between them. The method we use for improving the system is distribution matching, as defined in Eq. 3, and we use counterfactual testing (Section 4.2) to test the effect of fixing each component alone and the effect of fixing different combinations of the components. To save space, Table 3 reports only a sample of the multi-component combinations.
From Table 3 we can see that distribution matching on each component independently helps compositional fairness (column “Overall Gap”) to some extent: fixing a single component lowers the gap from 0.2162 to between 0.1704 and 0.1972. Fixing multiple components simultaneously helps more, and the overall gap is reduced most when all components are fixed (0.0705). In addition, different combinations of components help compositional fairness to different degrees. For example, fixing “credit amount” plus “age”, and fixing “credit amount” plus “age” plus “num_liable”, reduce the gap much further than the other 2- or 3-component fixes. The headroom analysis can thus guide which components should be prioritized for improving end-to-end fairness.
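The counterfactual test over component subsets can be sketched as follows. This is a minimal illustration on synthetic data, not the implementation used for Table 3. It assumes the final score is the product of the component scores, uses quantile mapping onto the pooled distribution as a stand-in for the distribution matching of Eq. 3, and assumes a logarithmic position discount for exposure. The function names `match_to_pooled`, `exposure_gap`, and `headroom`, and the component names `a` and `b`, are hypothetical.

```python
from itertools import combinations

import numpy as np


def match_to_pooled(scores, groups):
    """Quantile-map each group's scores onto the pooled score distribution.

    Assumption: a simple stand-in for the distribution matching of Eq. 3.
    """
    matched = np.empty(len(scores), dtype=float)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        ranks = scores[idx].argsort().argsort()   # 0 .. n_g - 1 within the group
        quantiles = (ranks + 0.5) / len(idx)      # mid-point quantiles in (0, 1)
        matched[idx] = np.quantile(scores, quantiles)
    return matched


def exposure_gap(final, groups, k):
    """Absolute gap between the two groups' shares of discounted top-k exposure."""
    top = np.argsort(-final)[:k]
    weights = 1.0 / np.log2(np.arange(2, k + 2))  # assumed logarithmic discount
    share0 = weights[groups[top] == 0].sum() / weights.sum()
    return abs(share0 - (1.0 - share0))


def headroom(components, groups, k):
    """Exposure gap after fixing every subset of components (product composition)."""
    names = list(components)
    gaps = {}
    for r in range(len(names) + 1):
        for fixed in combinations(names, r):
            final = np.ones(len(groups))
            for name in names:
                s = components[name]
                final = final * (match_to_pooled(s, groups) if name in fixed else s)
            gaps[fixed] = exposure_gap(final, groups, k)
    return gaps


# Synthetic example: two equally sized groups, group 1 scored lower on both components.
rng = np.random.default_rng(0)
groups = np.repeat([0, 1], 500)
shift = np.where(groups == 0, 0.5, 0.0)
components = {"a": rng.random(1000) + shift, "b": rng.random(1000) + shift}
gaps = headroom(components, groups, k=100)
for fixed, gap in gaps.items():
    print(fixed or "baseline", round(gap, 4))
```

Enumerating all subsets is exponential in the number of components, so with many components one would sample subsets instead, as is done for Table 3.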
Fixed component(s)  Gap (CTR)  Gap (S1)  Gap (S2)  Group Acc.  Group Acc.  Overall Gap 
None (baseline)  0.1024  0.1781  0.1078  0.5862  0.7408  0.1546 
Matching on Marginal Distributions  
CTR  0.0049  0.1781  0.1078  0.6198  0.7084  0.0886 
Satisfaction 1  0.1024  0.0103  0.1078  0.6292  0.7054  0.0762 
Satisfaction 2  0.1024  0.1781  0.0202  0.6021  0.7270  0.1248 
All  0.0049  0.0103  0.0202  0.6781  0.6546  0.0236 
Matching on Conditional Distributions  
CTR  0.0057  0.1781  0.1078  0.6164  0.7048  0.0884 
Satisfaction 1  0.1024  0.0092  0.1078  0.6258  0.7039  0.0781 
Satisfaction 2  0.1024  0.1781  0.0197  0.6003  0.7258  0.1255 
All  0.0057  0.0092  0.0197  0.6697  0.6472  0.0225 
Matching on Delta Distributions  
CTR  0.0000  0.1781  0.1078  0.6473  0.7408  0.0935 
Satisfaction 1  0.1024  0.0000  0.1078  0.6669  0.7408  0.0739 
Satisfaction 2  0.1024  0.1781  0.0000  0.6197  0.7408  0.1211 
All  0.0000  0.0000  0.0000  0.7630  0.7408  0.0222 
In the second setting, we do not assume the same group size, and we rank and proportional to their respective sizes ( for male, for female on this dataset). We vary the number of top positions and plot the exposure gap metric with respect to the positions. As a reference, we also plot the exposure gap under random ordering (denoted as “Gap (random)” in the figures, by averaging over runs), which ranks each person regardless of their gender. Figure 4 shows the end-to-end fairness (in terms of exposure gap) after applying distribution normalization to each component. Figure 5 shows the results of applying distribution normalization to multiple components simultaneously. The title of each subfigure indicates the components to which distribution normalization was applied. Per-component fixes help the gap metric in most cases and, as in the first setting, fixing different combinations of the components improves the end-to-end fairness metric to varying degrees.
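The comparison against a random ordering can be sketched as below. This is a hedged illustration only: the data are synthetic, the group sizes (700 and 300) are hypothetical, and exposure is assumed to be the mean logarithmically discounted exposure per group member over the top k positions, which may differ in detail from the definition in Section 3.1.1. The function names are hypothetical.

```python
import numpy as np


def exposure_gap_at_k(scores, groups, k):
    """Gap in mean discounted exposure between groups 0 and 1 over the top k positions."""
    order = np.argsort(-scores)
    exposure = np.zeros(len(scores))
    exposure[order[:k]] = 1.0 / np.log2(np.arange(2, k + 2))  # assumed discount
    return abs(exposure[groups == 0].mean() - exposure[groups == 1].mean())


def random_gap_at_k(groups, k, runs, seed=0):
    """Average exposure gap when people are ordered at random, ignoring group."""
    rng = np.random.default_rng(seed)
    gaps = [exposure_gap_at_k(rng.random(len(groups)), groups, k) for _ in range(runs)]
    return float(np.mean(gaps))


# Hypothetical data: unequal group sizes, group 1 scored slightly lower.
rng = np.random.default_rng(1)
groups = np.concatenate([np.zeros(700, dtype=int), np.ones(300, dtype=int)])
scores = rng.random(1000) - 0.1 * groups
ks = [10, 50, 100, 200]
model_curve = [exposure_gap_at_k(scores, groups, k) for k in ks]
random_curve = [random_gap_at_k(groups, k, runs=20) for k in ks]
```

Plotting `model_curve` and `random_curve` against `ks` gives the kind of reference comparison described above: a system whose gap stays close to the random curve is treating the groups about as evenly as an ordering that ignores gender.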
5.3 Case Study on A Real Production System
In this section, we describe the results on a large-scale real-world recommender system. On an abstract level, the system mainly consists of three different components, one predicting the probability of click (denoted as “CTR”), and two other components predicting different signals of user satisfaction, denoted as “Satisfaction 1” and “Satisfaction 2”.
In the following, we present results from fixing each individual component using distribution matching to:
1) Match the marginal distribution of and , as in Eq. (3).
2) Match the conditional distributions of and , as well as the conditional distributions of and , building on Eq. (3).
3) Match the distribution on the delta terms: and , as in Eq. (4).
As a reference, to test fairness at the extreme end, we set a constant value for all the (clicked, unclicked) pairs from each component. This experiment explores what other conditions might help end-to-end fairness when we already have per-component fairness.
The results are shown in Table 4. The first column denotes the “fixed” component(s), and columns 2-4 show the Pairwise Ranking Gap for each component (abbreviated to “CTR”, “S1”, “S2”), respectively. Columns 5-7 show the overall (compositional) Pairwise Ranking Accuracy (as defined in Section 3.2) for Group and , as well as the overall (compositional) Pairwise Ranking Gap. The table yields a few interesting observations:

Although marginal/conditional distribution matching does not provably ensure per-component fairness, empirically it still reduces the per-component gaps substantially (all close to zero; for example, the CTR gap drops from 0.1024 to 0.0049 under marginal matching) and effectively helps compositional fairness (column “Overall Gap” in Table 4).

Compositional fairness is best achieved when all the components are fixed, and fixing a single component alone helps compositional fairness to different extents. On this dataset, fixing “CTR” or “Satisfaction 1” has a larger effect on reducing the overall gap (from 0.1546 to 0.0886 and 0.0762, respectively, under marginal matching), while fixing “Satisfaction 2” has a relatively smaller effect (0.1248).
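The compositional metric in columns 5-7 of Table 4 can be illustrated with a small sketch. It assumes the pairwise ranking accuracy of a group is the fraction of (clicked, unclicked) pairs in which the clicked item from that group receives the higher score, and for brevity it forms pairs across all items rather than within a single query or user. The two components, the data, and the function name are hypothetical.

```python
import numpy as np


def pairwise_accuracy(scores, clicked, in_group):
    """Share of (clicked, unclicked) pairs where the clicked item scores higher.

    Only clicked items from `in_group` are counted; unclicked items come from all groups.
    """
    pos = scores[clicked & in_group]
    neg = scores[~clicked]
    return float((pos[:, None] > neg[None, :]).mean())


# Toy data: two hypothetical components composed by multiplication.
comp_a = np.array([0.9, 0.5, 0.6, 0.4, 0.3, 0.2])
comp_b = np.array([0.5, 0.8, 0.5, 0.5, 1.0, 0.5])
clicked = np.array([True, False, True, False, True, False])
group = np.array([0, 0, 0, 1, 1, 1])

final = comp_a * comp_b
acc0 = pairwise_accuracy(final, clicked, group == 0)
acc1 = pairwise_accuracy(final, clicked, group == 1)
overall_gap = abs(acc0 - acc1)
print(acc0, acc1, overall_gap)
```

In this toy example the composed score orders 5 of the 6 pairs correctly for group 0 and 2 of the 3 pairs for group 1, so the overall gap is 1/6.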
Fixed Comp.  Group Acc.  Group Acc.  Overall Gap 
None (baseline)  0.5862  0.7408  0.1546 
CTR  0.5050  0.6767  0.1717 
Satisfaction 1  0.6411  0.7508  0.1096 
Satisfaction 2  0.6111  0.7590  0.1479 
All  1.0000  1.0000  0.0000 
CTR  0.9702  0.9891  0.0189 
Satisfaction 1  0.9690  0.9882  0.0192 
Satisfaction 2  0.9439  0.9787  0.0348 
All  1.0000  1.0000  0.0000 
CTR  0.9993  0.9996  0.0003 
Satisfaction 1  0.9991  0.9994  0.0003 
Satisfaction 2  0.9964  0.9990  0.0026 
All  1.0000  1.0000  0.0000 
In addition, for the set of experiments that set values on the (clicked, unclicked) pairs, Table 5 shows the effect on the overall (compositional) gap metric. Note that any achieves a zero per-component pairwise ranking gap, since all pairs are ordered correctly, but the overall compositional fairness varies with the value of . For example, by setting , and for any single component, the pairwise ranking accuracy for both group and is and the pairwise ranking gap is for that single component, but choosing a larger (e.g., or ) clearly helps the end-to-end fairness (“Overall Gap” in Table 5) much more. Again, this suggests an interesting interplay between the prediction values, beyond the component’s fairness, and the overall system’s fairness.
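A toy simulation can make this interplay concrete. The sketch below is an assumption-laden illustration, not the production experiment: it assumes the constant is an additive margin `delta` between the overwritten component's scores for the clicked and the unclicked item, that the remaining components enter through a positive product, and that all data are synthetic. All names are hypothetical.

```python
import numpy as np


def accuracy_with_margin(base, other_clicked, other_unclicked, delta):
    """Pairwise accuracy of the product score when one component is overwritten.

    The overwritten component scores the unclicked item `base` and the clicked
    item `base + delta`; the remaining components contribute `other_*`.
    """
    return float(np.mean((base + delta) * other_clicked > base * other_unclicked))


# Synthetic pairs: the remaining components favour group A more than group B.
rng = np.random.default_rng(0)
n = 2000
other = {
    "A": (rng.uniform(0.4, 1.0, n), rng.uniform(0.2, 1.0, n)),
    "B": (rng.uniform(0.2, 1.0, n), rng.uniform(0.2, 1.0, n)),
}
base = 0.5
deltas = [0.01, 0.1, 1.0, 10.0]
acc = {g: [accuracy_with_margin(base, c, u, d) for d in deltas] for g, (c, u) in other.items()}
overall_gap = [abs(a - b) for a, b in zip(acc["A"], acc["B"])]
```

For any positive `delta` the overwritten component alone orders every pair correctly, so its own pairwise ranking gap is zero. The composed ordering, however, is correct only when `(base + delta) / base` exceeds the ratio of the other components' scores, which is why a larger margin lets the fixed component dominate the product and drives both group accuracies toward 1 in this toy setting.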
6 Conclusion
In this paper, we study the problem of compositional fairness in ranking, i.e., given a multi-component system where the end ranking score is the product of the scores from each component, does making each component fair independently improve the system’s end-to-end fairness? We formalize this problem under two recently proposed fairness metrics for ranking, fairness of exposure and pairwise ranking accuracy gap, and present examples where compositional fairness might not hold, in line with prior work Dwork and Ilvento (2018a).
While this lack of guarantees can be disheartening, we also present theory showing conditions under which end-to-end fairness follows from per-component fairness. Because the theory shows that composition is distribution-dependent, we propose taking a data-driven approach to this problem. We offer an analytical framework for diagnosing which components are most damaging to end-to-end fairness and for measuring how much improving per-component fairness will improve end-to-end fairness. By applying our analytical framework to multiple datasets, including a large real-world recommender system, we are able to identify the signals that lower end-to-end fairness the most, and we observe that in practice most of the end-to-end exposure or accuracy gaps can be addressed by applying per-component improvements independently.
As most real-world ML systems are composed of many models and tasks, understanding how and when fairness composes is crucially important to enabling the application of fairness principles in practice. Our results highlight that while guarantees don’t hold in the worst case, there is more nuance over realistic data distributions. As a result, we believe there is a lot of potential in generalizing both the theory and empirical frameworks to different applications, covering different data distributions, compositional functional forms, and fairness metrics.
Acknowledgements: The authors would like to thank Ben Packer for his valuable feedback on this paper.
References
Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. In TKDE.
A reductions approach to fair classification. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 60–69.
Diversifying search results. In WSDM.
Fairness in recommendation ranking through pairwise comparisons. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pp. 2212–2220.
Putting fairness principles into practice: challenges, metrics, and improvements. arXiv preprint arXiv:1901.04562.
Data decisions and theoretical implications when adversarially learning fair representations. In 2017 Workshop on Fairness, Accountability, and Transparency in Machine Learning.
Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
Nuanced metrics for measuring unintended bias with real data for text classification. In Companion Proceedings of The 2019 World Wide Web Conference, pp. 491–500.
Hybrid recommender systems: survey and experiments. In User Modeling and User-Adapted Interaction, Volume 12, Issue 4, pp. 331–370.
Building classifiers with independency constraints. In Proceedings of the 2009 IEEE International Conference on Data Mining Workshops, ICDMW ’09, Washington, DC, USA, pp. 13–18. ISBN 9780769539027.
Efficient diversification of web search results. In VLDB.
The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR’98, pp. 335–336.
Satisfying real-world goals with dataset constraints.
Training well-generalizing classifiers for fairness metrics and other data-dependent constraints. In ICML.
Fairness under composition. arXiv preprint arXiv:1806.06122.
Group fairness under composition. In FATML.
Exploring author gender in book rating and recommendation. In RecSys ’18, Proceedings of the 12th ACM Conference on Recommender Systems, pp. 242–250.
Fairness-aware ranking in search & recommendation systems with application to LinkedIn talent search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pp. 2221–2231.
An axiomatic approach for result diversification. In WWW.
A kernel two-sample test. In The Journal of Machine Learning Research, Volume 13, pp. 723–773.
Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pp. 3315–3323.
Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9.
Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, pp. 422–446.
The fairness of risk scores beyond classification: bipartite ranking and the xAUC metric. arXiv preprint arXiv:1902.05826.
Multiaccuracy: black-box post-processing for fairness in classification. In AIES ’19, Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society.
The variational fair autoencoder. In ICLR.
Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1930–1939.
Learning adversarially fair and transferable representations. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 3381–3390.
On a test of whether one of two random variables is stochastically larger than the other. In Annals of Mathematical Statistics.
Learning attitudes and attributes from multi-aspect reviews. In 2012 IEEE 12th International Conference on Data Mining, pp. 1020–1025.
McGan: mean and covariance feature matching GAN. In ICML.
Kernel mean embedding of distributions: a review and beyond. Foundations and Trends in Machine Learning 10 (1-2), pp. 1–141.
Pairwise fairness for ranking and regression. arXiv preprint arXiv:1906.05330.
On fairness and calibration. In 31st Conference on Neural Information Processing Systems (NIPS 2017).
Learning diverse rankings with multi-armed bandits. In ICML.
Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pp. 2219–2228.
Learning optimally diverse rankings over large document collections. In ICML.
A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 105–114.
Beyond clicks: dwell time for personalization. In Proceedings of the 8th ACM Conference on Recommender Systems, pp. 113–120.
Learning fair classifiers.
Fairness Constraints: Mechanisms for Fair Classification. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 54, Fort Lauderdale, FL, USA, pp. 962–970.
FA*IR: a fair top-k ranking algorithm. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1569–1578.
Mitigating unwanted biases with adversarial learning. In Association for the Advancement of Artificial Intelligence.
On the relation between accuracy and fairness in binary classification. ArXiv abs/1505.05723.