Recommendation or Discrimination?: Quantifying Distribution Parity in Information Retrieval Systems

by Rinat Khaziev, et al.
True Fit Corporation

Information retrieval (IR) systems often leverage query data to suggest relevant items to users. This introduces the possibility of unfairness if the query (i.e., input) and the resulting recommendations unintentionally correlate with latent factors that are protected variables (e.g., race, gender, and age). For instance, a visual search system for fashion recommendations may pick up on features of the human models rather than fashion garments when generating recommendations. In this work, we introduce a statistical test for "distribution parity" in the top-K IR results, which assesses whether a given set of recommendations is fair with respect to a specific protected variable. We evaluate our test using both simulated and empirical results. First, using artificially biased recommendations, we demonstrate the trade-off between statistically detectable bias and the size of the search catalog. Second, we apply our test to a visual search system for fashion garments, specifically testing for recommendation bias based on the skin tone of fashion models. Our distribution parity test can help ensure that IR systems' results are fair and produce a good experience for all users.



Grep-BiasIR: A Dataset for Investigating Gender Representation-Bias in Information Retrieval Results

The provided contents by information retrieval (IR) systems can reflect ...

Learning Fair Representations via an Adversarial Framework

Fairness has become a central issue for our research community as classi...

Multi-Perspective Semantic Information Retrieval

Information Retrieval (IR) is the task of obtaining pieces of data (such...

User Acceptance of Gender Stereotypes in Automated Career Recommendations

Currently, there is a surge of interest in fair Artificial Intelligence ...

Fair Distributions from Biased Samples: A Maximum Entropy Optimization Framework

One reason for the emergence of bias in AI systems is biased data -- dat...

Fair Bayes-Optimal Classifiers Under Predictive Parity

Increasing concerns about disparate effects of AI have motivated a great...

Large Scale Visual Recommendations From Street Fashion Images

We describe a completely automated large scale visual recommendation sys...
1. Introduction

Information retrieval (IR) systems, such as search engines and recommender systems (RS), are some of the most widely used machine learning systems today and are used to suggest a list of items or recommendations that are most relevant to users. Given the widespread use of IR systems and RS, ensuring that all users receive the same, high quality recommendation experience is critically important. Unlike other applications of machine learning, IR systems make recommendations based on a query or input. If the algorithms used to produce recommendations have underlying biases, these biases will likely propagate to the results. More specifically, if algorithms used in IR systems have encoded information associated with latent protected variables such as gender, race, or age, the presence of a protected characteristic (e.g., being a woman) in the query can lead to recommendation results also reflecting this protected characteristic. In most domains of application, protected variables are not related to the features that matter most for recommendations. For example, a user’s race is not relevant to whether or not two handbags are similar. Yet if race were partially encoded in the query handbag, for example through characteristics of a human model in an image as well as in the IR algorithm, recommendations may also reflect this bias. Being able to quantify the extent to which recommendations reflect irrelevant, protected variables is a necessary first step in ensuring that all users receive fair and useful recommendations.

The risk of generating unfair recommendations is especially acute for applications like visual search that use features extracted from images to generate recommendations. In many modern visual search systems, deep neural networks learn the visual features used to identify relevant recommendations. However, the standard datasets used to train deep neural networks have implicit gender and racial biases

(Handa, 2019). These biases become encoded in the recommendation models themselves, leading to discriminatory and biased results (Stock and Cisse, 2018). In fashion visual search applications, human fashion models often appear in the query image and result images along with the fashion garments being recommended. Therefore, using a biased computer vision model can lead to recommendations that have more in common with the people modeling fashion garments, such as the dress images in the top row of Figure 1, than the fashion garments themselves, such as the dress images in the bottom row of Figure 1.

In recent years, fairness in machine learning has received increased attention both publicly and in the scientific community, leading to some convergence in operational definitions of fairness. Definitions of fairness typically fall into two categories: individual fairness and group fairness (Dwork et al., 2012; Yang and Stoyanovich, 2017; Zehlike et al., 2017; Karako and Manggala, 2018; Gajane and Pechenizkiy, 2017). Individual fairness is achieved when everyone is treated consistently regardless of their association with a protected group or protected variable. Group fairness, also known as statistical parity or demographic parity, requires that a group receiving a positive or negative outcome is treated equally to all other groups. That is, group fairness is achieved when outcomes are equalized across all groups. These definitions have provided a conceptual framework for researching fairness in many machine learning systems; however, they are not directly applicable in all contexts. For example, in IR, the output of a model is not a single, categorical determination, but rather a list of recommendations. For IR to be fair, lists of recommendations must be independent of protected variables.

In this paper, we introduce a new definition of fairness, “distribution parity,” to assess the fairness of IR systems. An IR system exhibits distribution parity when the distribution of values of a protected variable in the top-K recommendations matches the distribution of values in the dataset regardless of the value of the protected variable in the query. In contrast, an IR system lacks distribution parity when the value of the protected variable in the input significantly biases the distribution of values of the protected variable in the top-K recommendations relative to the dataset. For example, if an IR system returns more images of women relative to the dataset when the query image contains a woman, the system would fail to satisfy distribution parity.

Using the concept of distribution parity, we develop an approach to determine whether an IR system’s recommendations are biased and evaluate our approach in the context of fashion recommendations. We first describe our statistical test for distribution parity. We then use Monte Carlo simulations to investigate the relationship between the sample size, bias, and statistical power of our distribution parity test. Lastly, we apply our test to a fashion visual search system to elucidate how the test could function in a real-world context. Specifically, we use a neural network based visual search system that uses image embeddings to retrieve similar clothing items. For images, we use the DeepFashion In-Shop Clothes Retrieval (Liu et al., 2016a) dataset. Using image segmentation, we extract the skin tone of the fashion models depicted in the DeepFashion images and apply our distribution parity test to determine if, given an image with a fashion model with a particular skin tone, the resulting similar images are significantly more or less likely to contain fashion models with the same skin tone.

2. Related Work on Fairness in Machine Learning

Much of the machine learning bias research focuses on discrimination in classification problems; Verma and Rubin (2018) provide a good review of what has been done. Our statistical test, however, focuses on bias in recommendation problems. Only a handful of papers look at bias in recommendation problems (Yang and Stoyanovich, 2017; Karako and Manggala, 2018; Zehlike et al., 2017). Of the recommendation-focused papers, Yang and Stoyanovich (2017) explicitly measure bias in RS. They define fairness as statistical parity, where the proportion of the protected group in the top-K recommendations is the same as that of the non-protected group, and use modified ranking metrics to detect fairness in ranked lists. To test for statistical parity with multiple protected groups, they use a modified KL-divergence, the normalized discounted KL-divergence (rKL). A major drawback of rKL is that it does not give a clear answer for when a ranking algorithm returns biased recommendations. Other papers that have investigated biases in RS have focused on inclusivity in results, but do not give explicit measures. Karako and Manggala (2018) define fairness as uniformity of the protected variable in the top-K recommendations. Such an approach yields a recommender system that is perceptually equal; however, it can over-recommend items associated with minority labels regardless of their relevance to the user. Zehlike et al. (2017) utilize statistical parity with the criterion that the proportion of a protected group is above a minimum threshold set by the practitioner. Previous methods for defining fairness in recommendations have primarily focused only on the recommendation results, yet results in many IR systems are contingent upon some input, such as a query image in a visual search system.
In the current work, our test for distribution parity addresses this problem by defining fairness in results with respect to an input.

3. Approach

Our approach consists of three parts: 1) a statistical test for distribution parity; 2) an evaluation of the statistical power of the test using simulated data; and 3) an application of the test to a real-world dataset using visual search for fashion recommendations.

3.1. Statistical Test for Distribution Parity

3.1.1. Definitions

In this work, we consider recommendations unbiased if the query-conditional distribution exhibits distribution parity. In a dataset, there is a protected variable (e.g., skin tone) with a set of possible values $\mathcal{A} = \{a_1, \dots, a_M\}$, where $a$ is a given value of the protected variable (e.g., the ST3 skin tone). An individual observation in the dataset can only be associated with one value of the protected variable. The distribution of the values of the protected variable, conditional on the query input’s value $a_q$, is $p(a \mid a_q)$. Generally, to achieve distribution parity, the distribution of values of the protected variable in the recommendations should on average match the distribution of the protected variable values in the dataset, $p(a)$.

We can concretely enforce distribution parity in two ways, either for each of the top-K ranks

$$P(A_k = a \mid A_q = a_q) = p(a) \quad \forall\, a, a_q \in \mathcal{A},\ k = 1, \dots, K, \tag{1}$$

or for a set of top-K ranks

$$\frac{1}{K}\sum_{k=1}^{K} P(A_k = a \mid A_q = a_q) = p(a) \quad \forall\, a, a_q \in \mathcal{A}, \tag{2}$$

where $A_k$ and $A_{1:K}$ denote the random variables associated with the evaluation of the protected variable over position $k$ and the top-$K$ recommendations, respectively.

We refer to equations (1) and (2) as the strong and weak fairness conditions. Detecting strong fairness is preferred when end users are exposed to a large set of ranked outputs. Enforcing the weak fairness condition is usually sufficient, especially for fashion IR systems, since they typically do not display more than a handful of items to each user and deal with relatively small search catalogs.

If recommendations lack distribution parity, the distribution of values of the protected variable in the set of recommendations will be significantly different than the distribution of the values of the protected variable in the dataset. Such a definition of fairness is particularly applicable when the protected variable’s distribution of values is highly imbalanced (e.g., skin tone of human models in fashion catalogs), ensuring that underrepresented protected variable values are not diluted during evaluation of recommendations.

3.1.2. Testing Fairness using a Categorical Protected Variable

Query value    A      B      C
A              300    50     250
B              40     600    260
C              80     150    1800
catalog        100    150    350

Table 1. Example of an omnibus contingency table for detecting bias in the recommendations for a catalog with three values (A, B, and C) of a protected variable. Each cell counts the top-K recommendations by their protected-variable value.

We verify the statistical validity of condition (2) by performing a test of independence on a contingency table (see Table 1) that is generated by aggregating the top-K recommendations for each value of the protected variable. Each row of the table encodes the protected variable’s distribution of values (e.g., skin tones), $p(a \mid a_q)$, in the top-K recommendations for a given value $a_q$. To test recommendations for distribution parity, we include the protected variable’s distribution of values in the search catalog as a row of the contingency table. We test for the independence (i.e., distribution parity) of the protected variable using a $\chi^2$-test, which is performed on the full contingency table for a given significance level $\alpha$. We refer to the test using the full contingency table as the “omnibus test”. The null hypothesis $H_0$ of the omnibus test is formulated as follows:

$$H_0: p(a \mid a_q) = p(a) \quad \forall\, a, a_q \in \mathcal{A}. \tag{3}$$
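The omnibus test can be sketched as a plain Pearson chi-squared test of independence on the Table 1 counts. The helper below is a minimal standard-library implementation; the function name and the hard-coded critical value (the 0.95 quantile of the chi-squared distribution with $(4-1)(3-1) = 6$ degrees of freedom) are illustrative, not from the paper's code.

```python
def chi2_statistic(table):
    """Pearson chi-squared statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Counts from Table 1: one row per query value (A, B, C) plus the catalog row.
table = [[300, 50, 250], [40, 600, 260], [80, 150, 1800], [100, 150, 350]]

stat = chi2_statistic(table)
CRITICAL_0_05_DOF_6 = 12.592  # chi-squared critical value, alpha = 0.05, dof = 6
biased = stat > CRITICAL_0_05_DOF_6  # reject H0 (distribution parity) if True
```

A table whose rows all share the same proportions yields a statistic of zero, i.e., no evidence against distribution parity.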
If the omnibus test detects a statistically significant effect, it may be of interest to determine whether there is a significant effect for specific values of the protected variable (e.g., for the skin tone ST1 or ST3). We refer to these follow-up tests, focused on a single value of the protected variable, as “contrast tests”. The contrast test is performed by setting up a contingency table for a single value $a_q$ along with the protected variable’s distribution of values in the search catalog (see Table 2). The null hypothesis of a contrast test is defined for every value $a$ of the protected variable as

$$H_0^{(a)}: p(a \mid a_q = a) = p(a). \tag{4}$$
In the contrast test, we use a 2x2 aggregation of the full contingency table for each value of the protected variable. Operating on a subset of the full table for the value of interest, the contrast test (4) has less power than the omnibus test (3), and as a result, there is a risk of not detecting biases on small datasets.

Query value    A      other values
A              300    300
catalog        100    600

Table 2. Example of a contrast contingency table for detecting bias in recommendations for the protected-variable value $a_q = A$.

We measure the bias in recommendations for each skin tone using the risk ratio

$$RR_a = \frac{p(a \mid a_q = a)}{p(a)}. \tag{5}$$

Using the risk ratio is preferable due to its ease of interpretation. Values of the risk ratio $RR_a$ greater than one indicate the algorithm is over-representing a protected variable in the search results, and values less than one indicate that a protected variable is under-represented in the search results. In addition to the point estimate for the risk ratio, we can calculate a confidence interval using standard practices.

It is important to note that a bias can manifest as either an under-representation or an over-representation of a given value of a protected variable. To quantify a bias in either direction on a common scale, we use a normalized risk ratio

$$\widetilde{RR}_a = \min\!\left(RR_a, \frac{1}{RR_a}\right), \tag{6}$$

which takes values between 0 and 1. The normalized risk ratio (6) can be linked to the “80% rule,” which is used in the American legal system to define a threshold for discriminatory policies. The 80% rule specifies that an algorithm can be considered discriminatory if the rate at which individuals belonging to a protected group (e.g., having a disability) are assigned to a positive outcome (e.g., being hired) is less than 80% of the rate at which individuals not belonging to that group (e.g., not having a disability) are assigned to the positive outcome (Zafar et al., 2015). Under the 80% rule, an algorithm is considered to be fair if the normalized risk ratio belongs to the range from 0.8 to 1 (or the risk ratio belongs to the range from 0.8 to 1.25).
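As a sketch, the risk ratio (5), its normalized form (6), and the 80% rule check can be computed directly from contingency-table proportions. The function names are illustrative; the example reads Table 2 as value A versus all other values (300 of 600 recommendations for A-queries, 100 of 700 catalog items).

```python
def risk_ratio(rec_share, catalog_share):
    """RR_a: share of value a in recommendations over its share in the catalog."""
    return rec_share / catalog_share

def normalized_risk_ratio(rr):
    # min(RR, 1/RR) maps over- and under-representation onto a common (0, 1] scale.
    return min(rr, 1.0 / rr)

def passes_80_percent_rule(rr):
    """Fair under the 80% rule when the normalized risk ratio is at least 0.8."""
    return normalized_risk_ratio(rr) >= 0.8

# Table 2 counts read as A vs. all other values.
rr = risk_ratio(300 / 600, 100 / 700)
fair = passes_80_percent_rule(rr)
```

Note that a risk ratio of 1.25 is still fair under this check, since min(1.25, 1/1.25) = 0.8.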

3.2. Evaluation of Statistical Power with Simulated Data

Whether a deviation from distribution parity is statistically detectable depends on both the risk ratio (i.e., bias size) and the sample size (catalog size). To provide an understanding of how the risk ratio and catalog size impact the likelihood of our omnibus and contrast tests detecting a bias, we perform an analysis using Monte Carlo simulated data.

In order to understand the statistical properties of our tests under different conditions, it is necessary to generate synthetic datasets with a known level of bias. Specifically, we generate synthetic datasets with a given distribution of values of the protected variable, $p(a)$, artificially manipulating the risk ratio $RR$ and catalog size $N$. When generating synthetic recommendations, we randomly sample query items from $p(a)$, and then generate the top-K recommendations for each item from a biased distribution of the protected variable controlled by the risk ratio $RR$. The aforementioned quantities are used as input variables to our Monte Carlo sampling algorithm, Algorithm 1.

Algorithm 1 includes four steps. First, we sample the protected variable values $a_q$ for the query items from the protected variable’s distribution of values $p(a)$. Second, for each query item $q$, we skew the protected variable’s distribution of values by multiplying $p(a_q)$ by the risk ratio $RR$, ensuring that the skewed probability $\tilde{p}(a_q) = \min(RR \cdot p(a_q), 1)$ is no greater than 1. Third, the rest of the probabilities $\tilde{p}(a),\, a \neq a_q$, are scaled down uniformly to make sure that $\sum_a \tilde{p}(a) = 1$ and $\tilde{p}(a) \geq 0$. Finally, the top-K recommendations are sampled from $\tilde{p}(a)$ and assigned to the query item $q$.

input: catalog size $N$, risk ratio $RR$, protected variable's distribution $p(a)$, number of recommendations $K$
output: $N$ query items, set of recommendations
for $q = 1, \dots, N$ do
       sample the query value $a_q \sim p(a)$
       build the skewed distribution $\tilde{p}(a)$ with $\tilde{p}(a_q) = \min(RR \cdot p(a_q), 1)$ and the remaining probabilities rescaled uniformly
       for $k = 1, \dots, K$ do
              sample the recommendation $r_{q,k} \sim \tilde{p}(a)$
       end for
end for
return $\{a_q\}$, $\{r_{q,k}\}$;
Algorithm 1 Monte Carlo sampling algorithm for generating biased recommendations as a function of catalog size $N$, risk ratio $RR$, protected variable's distribution of values $p(a)$, and number of recommendations $K$.
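A minimal Python sketch of Algorithm 1, under the skew-and-renormalize construction described above; the function names are illustrative, and the example distribution is the one from Table 4.

```python
import random

def skew(p, a_q, rr):
    """Return p with p[a_q] scaled by rr (capped at 1) and the rest rescaled."""
    q = dict(p)
    q[a_q] = min(rr * p[a_q], 1.0)
    rest = sum(v for k, v in p.items() if k != a_q)
    scale = (1.0 - q[a_q]) / rest if rest > 0 else 0.0
    for k in p:
        if k != a_q:
            q[k] = p[k] * scale
    return q

def sample_recommendations(p, rr, n_queries, top_k, rng=random):
    """Sample query values from p(a) and top-K recommendations from the skewed p."""
    values, probs = zip(*sorted(p.items()))
    queries = rng.choices(values, weights=probs, k=n_queries)
    recs = []
    for a_q in queries:
        skewed = skew(p, a_q, rr)
        recs.append(rng.choices(values, weights=[skewed[v] for v in values], k=top_k))
    return queries, recs

# Protected variable's distribution of skin tones from Table 4.
p = {"ST1": 0.05, "ST2": 0.15, "ST3": 0.15, "ST4": 0.25, "ST5": 0.30, "ST6": 0.10}
queries, recs = sample_recommendations(p, rr=2.0, n_queries=250, top_k=6)
```

Repeating this sampling over many trials, and tabulating how often the omnibus and contrast tests reject, gives the empirical power estimates described in the next paragraph.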

For each set of simulation parameters, the generated values of the protected variables for the queries and recommendations are used to build the omnibus and contrast contingency tables to evaluate the hypotheses (3) and (4) for each $a \in \mathcal{A}$. Knowing that the alternative hypothesis is true in all of those cases except $RR = 1$, we empirically evaluate our test’s power, the probability of correctly rejecting $H_0$, as the fraction of tests that were rejected out of the Monte Carlo trials.

Note that the statistical properties of the hypothesis tests that are generated using this approach are valid only for the protected variable’s distribution of values, $p(a)$, specific to a particular search catalog. Thus, a practitioner would have to recompute the statistical properties of the tests every time the protected variable’s distribution of values significantly changes in the search catalog. Nevertheless, our simulation analysis can provide general guidance to practitioners for the conditions under which the distribution parity test is appropriate.

3.3. Application to Visual Search for Fashion Recommendations

To provide an example of how our approach would function within a real-world information retrieval system, we apply our test for distribution parity within the context of a fashion visual search system. In this analysis, we focus on evaluating distribution parity in the skin tone of fashion models depicted in recommendations. Applying our test in a fashion visual search setting involves several steps. First, we generate recommendations for each image in the dataset using image embeddings learned from a convolutional neural network (CNN) (see Section 3.3.1). Second, we extract the skin tone of the fashion models in each image, which occurs in two stages. In the first stage, we use supervised image segmentation to localize the models’ skin within each image (Section 3.3.2). In the second stage, we take each skin cutout and extract the skin tone for each image using a tailored color mapping (see Section 3.3.3). Lastly, we apply our test for distributional parity to determine whether there is a statistically significant bias in the recommendations using the omnibus test and perform follow-up contrasts to elucidate the specific nature of any significant bias detected by the omnibus test.

Importantly, distributional parity in skin tone is not equivalent to parity in race. We focus specifically on skin tone for several reasons. In many datasets used for classification problems, individuals' self-reported race is known. For fashion images, we do not know how fashion models would describe their own race. Without these self-identifications, we cannot determine the race of fashion models because race is a social construct without a fixed meaning (Michael Omi, 2014). Although race, for the purposes of scientific inquiry, cannot be precisely defined based on objectively quantifiable characteristics (Lee, 2009), psychological research has shown that humans use perceptual features such as a person's skin tone to make racial categorizations (Dunham et al., 2015; Stepanova and Strube, 2009). In addition to being used as a perceptual proxy for race, skin tone can also account for biases that extend beyond race in contexts such as electoral decision making (Weaver, 2012), implicit attitudes (Nosek et al., 2007), and the marriage marketplace (Jha and Adelman, 2009). Therefore, although skin tone is not equivalent to race, skin tone is still an important avenue of inquiry for understanding both intra- and inter-racial bias.

3.3.1. Visual Search

Visual search systems commonly rely on CNNs to automatically learn and generate features in a high-dimensional space (Krizhevsky et al., 2012). The generated features, referred to as image embeddings, are fixed-length vector image representations from the CNN’s hidden layers. However, the image embeddings are not human interpretable. Therefore, protected variables are not easily detectable if CNNs implicitly learn them, hence the need for our test.

Following (Jing et al., 2015; Shankar et al., 2017; Yang et al., 2017), we build a visual search system that retrieves similar items using image embeddings. The most similar images are determined using a k-nearest-neighbor search with the Minkowski distance metric. We utilize a ResNet model (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) to generate image embeddings for fashion images.
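The retrieval step can be sketched as a brute-force k-nearest-neighbor search under a Minkowski distance. In the paper the embeddings come from an ImageNet-pretrained ResNet; here random vectors stand in for them, and all names are illustrative.

```python
import random

def minkowski(u, v, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def top_k_similar(query, catalog, k=6, p=2):
    """Return indices of the k catalog embeddings closest to the query."""
    ranked = sorted(range(len(catalog)), key=lambda i: minkowski(query, catalog[i], p))
    return ranked[:k]

# Random 8-dimensional vectors stand in for ResNet image embeddings.
rng = random.Random(0)
catalog = [[rng.random() for _ in range(8)] for _ in range(100)]
neighbors = top_k_similar(catalog[0], catalog, k=6)
```

In practice the exhaustive scan would be replaced by an approximate nearest-neighbor index, but the distance metric and top-K semantics are the same.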

3.3.2. Image Segmentation for Skin Detection

Figure 2. Examples of the skin segmentation and ITA predictions on DeepFashion dataset (Liu et al., 2016a). Top row, the original images with ITA values and ITA category listed above. Bottom row, pixel-wise skin detection on images where yellow indicates the areas of the image labeled as skin.

To determine the skin tone of human models in fashion images, we must first extract the models' skin from the rest of the image. To this end, we train a binary supervised image segmentation model that can distinguish skin from the rest of the image. We use a CNN to perform supervised image segmentation, a task of pixel-wise image classification. Supervised image segmentation is an active field of study (Chen et al., [n. d.], 2018) and has many fashion applications (Yang et al., 2014; Liu et al., 2014; Liu et al., 2016b; Tangseng et al., 2017). In this work, we use the DeepLab V3+ architecture with the Xception feature extractor and an output stride of 16 (Chen et al., 2018).

We trained our image segmentation model on a proprietary dataset that includes images that are representative of online fashion catalogs. Although there are publicly available datasets for training image segmentation models on fashion images, they do not provide labels that are sufficiently accurate to train a model for skin tone extraction. The largest publicly-available fashion segmentation dataset, ModaNet (Zheng et al., 2018), does not have skin labels at all. The Fashionista (Yang et al., 2014) and Clothing Co-parsing (Tangseng et al., 2017) datasets provide skin labels; however, they were generated using super-pixel labeling techniques and as a result have noisy label boundaries with a large number of false positive pixels.

3.3.3. Skin Tone Classification

Given a trained image segmentation model, we are able to extract skin pixels; however, we need a framework for classifying the skin tone of skin pixels to perform our test. Extracting the skin tone using raw RGB values (16M colors) or color histograms results in feature spaces that do not map well to human perceptions of skin tone. In dermatology research, skin tones are often characterized using the Fitzpatrick skin type (Fitzpatrick, 1975, 1988) and, more recently, the Individual Typology Angle (ITA) (Chardon et al., 1991).

The Fitzpatrick skin type is determined based on a self-reported questionnaire and, consequently, cannot be used for automated skin tone extraction at scale. In contrast, ITA can be used to automatically classify skin tone. ITA is a mathematical transformation of skin color in the CIELAB color space, which encodes color as a combination of three values: lightness $L^*$, red-green scale $a^*$, and blue-yellow scale $b^*$. The transformations from RGB to CIELAB color space are well known, and it is recommended (Del Bino and Bernerd, 2013) to use the D65 illumination function in the ITA calculations. Following the transformation of the mean skin tone color from RGB to CIELAB, the ITA value is determined as

$$ITA = \arctan\!\left(\frac{L^* - 50}{b^*}\right) \cdot \frac{180}{\pi}. \tag{7}$$
To perform our test for distributional parity, the continuous ITA scores must be mapped to discrete categories. Although mappings from continuous ITA values to category labels have been proposed by (Del Bino and Bernerd, 2013) as well as (Saint-Léger, 2015), both systems use color words (e.g., “intermediate”, “golden”) that are neither precise nor strictly accurate. Moreover, color terms as applied to skin tone are often imbued with sociocultural meanings that are not related to objective quantification of skin tone (Saint-Léger, 2015). To avoid introducing such subjectivity into our category labels, we label the ITA categories as ST1 (i.e., skin tone 1) through ST6, where lower values indicate darker skin tone. Our mappings between ITA values and skin tone categories are presented in Table 3.

ITA range              Skin Tone
ITA <= -30             ST1
-30 < ITA <= 10        ST2
10 < ITA <= 28         ST3
28 < ITA <= 41         ST4
41 < ITA <= 55         ST5
ITA > 55               ST6

Table 3. Mapping of the ITA values to six categorical skin tone labels (Del Bino and Bernerd, 2013)

In any image, the ITA values vary across skin pixels. To determine the skin tone category of a model in an image, we use the median ITA value in the image and map that value to its associated skin tone category. Figure 2 displays examples of skin tone predictions and image segmentation predictions on images from the DeepFashion In-Shop Clothes Retrieval dataset (Liu et al., 2016a). The top row of the figure displays the original image and predicted skin tone for fashion models with different ITAs. The values of the ITA are reported above the original images, ranging from -33 to 53. The bottom row shows the image segmentation results for the image above, where yellow corresponds to pixels labeled as skin.
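The ITA computation (7) and the ST1-ST6 mapping can be sketched as follows. The threshold values here follow the standard Del Bino and Bernerd category boundaries and should be checked against Table 3 before use; function names are illustrative.

```python
import math

def ita(l_star, b_star):
    """Individual Typology Angle (degrees) from CIELAB lightness L* and b*."""
    return math.degrees(math.atan((l_star - 50.0) / b_star))

def skin_tone_category(ita_value):
    """Map a continuous ITA value to the ST1 (darkest) - ST6 (lightest) labels."""
    thresholds = [(-30, "ST1"), (10, "ST2"), (28, "ST3"), (41, "ST4"), (55, "ST5")]
    for upper, label in thresholds:
        if ita_value <= upper:
            return label
    return "ST6"
```

The per-image label would apply `skin_tone_category` to the median ITA over the pixels the segmentation model marks as skin.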

4. Results

We demonstrate the statistical properties of our test using Monte Carlo simulated recommendations and its empirical applicability using visual search on the DeepFashion In-Shop Clothes Retrieval (Liu et al., 2016a) dataset. This dataset is similar to online clothing catalogs as it contains styled images of garments from multiple viewpoints sorted by gender (i.e., Men’s vs. Women’s) and garment type (e.g., jeans, dresses, etc.). We perform our test for distributional parity globally (across the whole dataset), as well as within a garment type and viewpoint, e.g., front-facing Women’s dresses.

4.1. Statistical Power with Simulated Data

Figure 3. (a) The power of the omnibus hypothesis test as a function of the risk ratio $RR$ for search catalog sizes of 100, 250, and 1000. (b) The detectable risk ratio ($RR < 1$) for the omnibus and contrast tests at 80% power as a function of the catalog size. (c) The detectable risk ratio ($RR > 1$) for the omnibus and contrast tests at 80% power as a function of the catalog size. The significance level $\alpha$ is used when executing tests and calculating the detectable risk ratio.
ITA Skin Tone Frequency, %
ST1 5
ST2 15
ST3 15
ST4 25
ST5 30
ST6 10
Table 4. Protected variable’s distribution of skin tones in Monte Carlo trials

The Monte Carlo trials were generated using Algorithm 1 and the protected variable’s distribution of skin tones in Table 4. When evaluating the hypothesis tests, we utilize a fixed significance level $\alpha$ unless otherwise specified; the reported trends hold for other values of $\alpha$. For each set of parameters, 1000 Monte Carlo simulations were run to gather sufficient data for our analysis.

The power of the omnibus hypothesis test (see Section 3.1.2) as a function of the risk ratio is displayed in Figure 3(a). We report the power plots for three catalog sizes: 100, 250, and 1000 observations. When the risk ratio is close to one and the distribution of skin tones in the recommendations is close to the catalog’s distribution of skin tones, the test has the least power, achieving its minimum at $RR = 1$. The test easily detects large biases for small samples, which is indicated by the value of the test’s power being close to 1. The power curve displays an inverted bell shape, becoming narrower for larger catalog sizes. A narrow power curve indicates that the hypothesis test can reliably detect smaller biases, i.e., risk ratios closer to 1. The power curves for the contrast tests show similar characteristics to the omnibus test curves shown in Figure 3(a).

Figures 3(b) and 3(c) display the detectable bias as a function of the catalog size given 80% power at the significance level $\alpha$ for $RR < 1$ and $RR > 1$, respectively. For the omnibus test, both figures indicate that the detectable bias decreases toward 1 as a function of the catalog size, reaching the 80% rule threshold at approximately 400-600 samples in the dataset.

Looking at the contrast tests, it is clear that if a protected value is less represented in the dataset, the number of samples needed to detect a given bias increases. There is a clear difference between the curves in both figures. When $RR < 1$, the curves drop off more drastically for small catalog sizes than when $RR > 1$. This is due to $RR_a$ having an upper bound that varies with $p(a)$; since $p(a \mid a_q) \leq 1$, the upper bound is $1/p(a)$. Therefore, the normalized risk ratio for $RR > 1$ will only approach 0 when $p(a)$ approaches 0. Also, the detectable bias depends on a skin tone’s proportion in the catalog. More specifically, the smaller the proportion a skin tone makes up in a catalog, the larger the bias needed to detect one. This can be seen in Figures 3(b) and 3(c), with smaller-proportion skin tones having smaller detectable normalized risk ratios at the same catalog size.

In the case where the protected variable’s distribution of values in the catalog is different from Table 4, the methods described in this section can be used to infer the power curves, the detectable risk ratio given the catalog size, and the applicability of the combined omnibus and contrast tests.

4.2. Visual Search for Fashion Recommendations

The testing methodology described in Section 3 is empirically evaluated on a fashion visual search system’s recommendations. We perform a search on two sets of images (catalogs) of different sizes, built as subsets of the DeepFashion In-Shop Clothes Retrieval dataset (Liu et al., 2016a). The first catalog consists of the 51,740 images available in the DeepFashion dataset (i.e., the “full” dataset) which have non-null ITA values, and includes multiple views of the same garments worn on the model. The second catalog is constructed from the frontal views of Women’s Dresses (i.e., the “dresses” dataset), comprising 1,812 images.

The distribution of skin tones in the dresses catalog closely matches that in the full catalog (see Table 5). In both cases, the ST4, ST5, and ST6 skin tones make up the majority of the images, accounting for 83% and 88% of the full and dresses catalogs, respectively. The ST1, ST2, and ST3 skin tones are underrepresented in both, with merely 234 and 2 images carrying the ST1 label in the full and dresses catalogs, respectively.

ITA Skin Tone   Full Catalog   Dresses Catalog
ST1             0.4            0.1
ST2             3.7            1.7
ST3             11             10
ST4             25             26
ST5             40             46
ST6             18             16
Table 5. Distribution of the ITA skin tone labels (frequency, %) extracted from the DeepFashion In-Shop Clothes Retrieval dataset (Liu et al., 2016a), reported for the full catalog and the subset restricted to Women's Dresses.

A bias is detected in the fashion recommendations generated using the approach described in Section 3.3.1 with K=6 nearest neighbors. The omnibus test detects a bias on both the full and dresses catalogs. The contrast test detects a bias on the full catalog but only for a limited set of the protected values on the dresses catalog (see Table 7). On the women's dresses search, the smaller catalog, the contrast test rejects only for ST3. On the full-catalog search, a bias is detected for all values of the query skin tone, highlighting that a large number of samples is required to detect smaller biases on smaller catalogs. For example, the risk ratio for ST2 on the dresses catalog is estimated at 2.01; however, the bias is not detected given that there are only 30 query items, which is not sufficient to detect a bias in their recommendations.

The strongest bias is estimated for the fashion images associated with models of the ST1 (i.e., darkest) skin tone, registering a risk ratio of 8.41 on the full catalog. Notably, a bias for the same value of the protected variable is detected by a test on the dresses search; however, the confidence interval of the risk ratio contains 1. With only 2 query images featuring a model with an ST1 skin tone, the disagreement between the risk-ratio confidence interval and the test result indicates a lack of statistical power to detect bias. The risk ratios estimated from the searches performed on the full and dresses catalogs are closely aligned. Investigating whether the bias is an intrinsic property of the search algorithm used is outside the scope of this paper and requires further exploration.
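For intuition, a risk ratio and an approximate confidence interval of the kind reported in Table 7 can be computed as follows. This is a generic Wald interval on the log scale that treats the catalog proportion as fixed; it is a sketch rather than necessarily the estimator used in the paper, and the function name is ours.

```python
import math
from statistics import NormalDist


def risk_ratio_ci(k, n, p0, conf=0.95):
    """Risk ratio of observing a protected value in k of n recommended
    items relative to its catalog proportion p0, with a Wald confidence
    interval on the log scale (catalog proportion treated as fixed)."""
    p1 = k / n
    rr = p1 / p0
    if k == 0:
        # Degenerate case, as for ST1 on the dresses catalog.
        return rr, (0.0, math.inf)
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    se = math.sqrt((1 - p1) / k)   # approximate SE of log(rr)
    return rr, (rr * math.exp(-z * se), rr * math.exp(z * se))
```

For example, observing a protected value in 30 of 100 recommendations against a catalog proportion of 0.15 yields a risk ratio of 2.0 with an interval excluding 1, whereas 0 hits give the degenerate interval seen in Table 7.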

Catalog   Size
Full      51,740
Dresses   1,812
* ; ** ; *** ;
Table 6. Results of the omnibus test for distribution parity in skin tone from recommendations made with K=6 nearest neighbors.
Protected variable in the query   Catalog   Size     Risk ratio   CI
ST1                               Full      234      8.41         [6.54, 10.8]
ST1                               Dresses   2        0            [0, ]
ST2                               Full      1963     2.79         [2.53, 2.87]
ST2                               Dresses   30       2.01         [0.97, 4.19]
ST3                               Full      5904     1.54         [1.49, 1.58]
ST3                               Dresses   182      1.51         [1.27, 1.81]
ST4                               Full      13121    1.15         [1.13, 1.17]
ST4                               Dresses   469      1.10         [1.00, 1.20]
ST5                               Full      21123    1.10         [1.09, 1.12]
ST5                               Dresses   821      1.08         [1.02, 1.14]
ST6                               Full      9395     1.38         [1.35, 1.41]
ST6                               Dresses   298      1.22         [1.07, 1.39]
* ; ** ; *** ;
Table 7. Results of contrast tests for distribution parity in skin tones from recommendations made with K=6 nearest neighbors.

5. Limitations

Several technical and practical limitations of our approach are worth noting. First, our categorization of skin tone using ITA measures skin tone within the context of a particular image rather than providing a fixed, objective measure. The visual appearance of colors in an image depends on lighting, shadows, make-up, etc., which can change the ITA values computed for the same fashion model. Figure 4 shows the ITA values generated for six images of the same model, ranging from -33 to 13, which spans the ST1, ST2, and ST3 skin tones. The standard deviation of the ITA values for a single model can be estimated from multiple views of the same product, assuming that only one model is present in the set of product images. This estimate is conservative because a sample of images known to contain the same model is not readily available in the DeepFashion dataset. Thus, the ITA values and generated labels should be interpreted as color features that capture visually-apparent skin tone rather than the true skin tone of a model.
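The ITA computation itself is simple enough to state concretely. The sketch below uses the standard formula of Chardon et al. (1991) over CIELAB L* and b*; the category cut-offs follow the commonly used Del Bino and Bernerd (2013) bands, and the mapping onto the ST1 (darkest) through ST6 (lightest) labels is our assumption, consistent with the -33 to 13 range above spanning ST1-ST3.

```python
import math


def ita(L, b):
    """Individual Typology Angle in degrees from CIELAB L* and b*
    (Chardon et al., 1991): ITA = arctan((L* - 50) / b*) * 180 / pi.
    atan2 is used for numerical safety; skin b* is normally positive."""
    return math.degrees(math.atan2(L - 50.0, b))


# Upper ITA bound per category; ST1 = darkest, ST6 = lightest
# (assumed mapping onto the Del Bino & Bernerd cut-offs).
ITA_BINS = [(-30.0, "ST1"), (10.0, "ST2"), (28.0, "ST3"),
            (41.0, "ST4"), (55.0, "ST5"), (math.inf, "ST6")]


def skin_tone(ita_deg):
    """Map an ITA value in degrees to its skin tone label."""
    for upper, label in ITA_BINS:
        if ita_deg <= upper:
            return label
```

Under this mapping, the ITA values -33 and 13 quoted above fall into ST1 and ST3, respectively.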

Figure 4. Variation in ITA values for the same model. Estimated standard deviation of .

Second, in our power analysis, we leverage the 80% rule, which represents a legal standard for whether or not an algorithm’s output is biased (Zafar et al., 2015). Having a single, static threshold for bias is practically useful. With a static threshold, we can build automated tests to determine if recommendations merit manual review. However, using the 80% rule to set a threshold is problematic for functional as well as statistical reasons. Within the recommendation domain, there is no evidence to suggest that the level at which humans perceive skin tone and/or racial bias in algorithms’ recommendations is consistent with the thresholds that correspond to the 80% rule. Moreover, the perceptual threshold for perceiving skin tone or racial bias in recommendations may depend both on the domain of application (e.g., fashion vs. cosmetics) as well as characteristics of users (e.g., race, gender). For example, being a racial minority and having previously experienced subtle forms of racial bias is associated with an increased likelihood of perceiving racially-charged internet memes as offensive (Williams et al., 2016). Our power analysis demonstrates the statistical complexities of using a fixed point estimate as a threshold—whether a bias can be statistically detected depends on properties of the sample. A more valid method for setting a functionally relevant threshold would involve conducting user testing in the specific domain of application within multiple user groups to determine the level at which users detect bias.
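As a concrete reading of the 80% rule in risk-ratio terms: a ratio of rates below four fifths in either direction is flagged, which is equivalent to the risk ratio falling outside [0.8, 1.25]. A minimal check (our own formulation, not code from the paper):

```python
def violates_80_rule(rr, threshold=0.8):
    """Four-fifths rule as a symmetric bound on the risk ratio:
    flag when min(rr, 1/rr) < threshold, i.e. rr outside [0.8, 1.25]."""
    if rr == 0:
        return True
    return min(rr, 1.0 / rr) < threshold
```

Such a static threshold makes it easy to build automated gates that route recommendations to manual review, but, as argued above, it should not be read as a perceptual threshold for users.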

Also, the test we propose here involves performing multiple statistical tests; therefore, the chance of committing a Type I error grows with each test we conduct. Yet, we do not propose a specific multiple comparisons correction (MCP) here, for several reasons. The choice of MCP depends on whether the tests are planned or post-hoc, whether the comparisons are simple or complex, and whether there are many or only a few of them. While a Bonferroni correction may be reasonable for an analysis with sufficiently few follow-up comparisons, it becomes overly conservative when sufficiently many comparisons are performed (Field et al., 2012). The choice of MCP also depends on whether the practitioner is more concerned with controlling the Type I or Type II error rate. If our test is used to identify potentially biased recommendation results for manual review, numerous Type I errors would create unnecessary manual labor; however, by stringently controlling the Type I error rate, we increase our chances of committing a Type II error. That is, we are more likely to fail to flag some biased results. Practitioners who are more interested in reducing the chance of serving biased results to users may do well to consider a less conservative MCP, such as procedures based on the false discovery rate.
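To make the trade-off concrete, here is a sketch of the two families of corrections discussed above: Bonferroni (family-wise error rate) and the Benjamini-Hochberg step-up procedure (false discovery rate). Both are standard procedures; neither is prescribed by the paper.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls family-wise error)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls false discovery
    rate): find the largest rank k with p_(k) <= (k / m) * alpha, then
    reject every hypothesis whose p-value is at or below that cut-off."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0.0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff = pvals[i]
    return [p <= cutoff for p in pvals]
```

On the same set of p-values, Benjamini-Hochberg typically rejects at least as many hypotheses as Bonferroni, illustrating why it suits practitioners who prioritize flagging biased results over avoiding false alarms.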

Furthermore, our proposed method is focused on identifying rather than correcting deviations from distribution parity. Identifying bias is not equivalent to providing unbiased results. That is, our method does not specify what to do in the event that a bias is detected. Lipton et al. (Lipton et al., 2018) demonstrate that well-intentioned attempts to render algorithmic outcomes fair can sometimes result in harm to particular individuals within a disadvantaged group. Therefore, proposed methodologies for correcting recommendations with a bias should carefully consider any potential negative consequences of the correction. Although proposing specific guidelines for correcting bias is outside the scope of the current paper, we have observed that we can significantly reduce the extent to which visual search recommendations are biased by skin tone. One approach is to use segmentation models trained on images featuring more diverse fashion models and to use these segmentations to remove skin pixels from images before performing similar-item retrieval. Another is to use embeddings generated by fashion-specific classifiers, likewise trained on diverse datasets of fashion models. Future work should examine these and other methodologies for providing search results that are independent of protected variables.

Lastly, our empirical analysis is limited in its usefulness due to issues of representation in the dataset used. First, although our effect sizes for ST1 (i.e., the darkest skin tone) were the largest observed, our ability to detect these effects is severely constrained by the low number of fashion models with an ST1 skin tone. The low number of fashion models with darker skin tones is not limited to the DeepFashion dataset. In an analysis of cover models in Vogue Magazine, Handa (Handa, 2019) demonstrates that although the magazine has featured more models of color in recent years, still very few have a darker skin tone. As a result, our test is least likely to detect bias in the most marginalized group of fashion models. Recently, the fashion industry has responded to calls for increased inclusivity by featuring fashion models with more racial, cultural, age, and body-shape diversity (Day, 2018). However, without greater diversity of human models in fashion images, attempts to detect bias along any single protected variable, let alone intersecting protected variables, will be limited.

6. Discussion

The goal of the current work was to develop a test for assessing fairness in IR systems. Here we describe our test for distribution parity, which determines whether the presence of a protected variable value in a query affects the likelihood that the resulting recommendations will also share that value. Although the distribution parity test could be used for a range of protected variables with categorical values, we chose to evaluate it using skin tone bias in an image-based fashion IR system as an example use case. To demonstrate the utility of the test within this context, we performed an evaluation on a publicly available dataset, using the 80% rule to set our bias threshold. In the DeepFashion full dataset, which has a sufficient number of samples per skin tone category, our test reveals a statistically significant bias in recommendations. The DeepFashion dresses dataset also showed a statistically significant bias overall, but an insufficient number of images made it impossible to draw conclusions for certain skin tones. Through these results, we have shown that our method for detecting bias can be a powerful tool for ensuring that users are given high-quality, unbiased results, but also that the test cannot find statistically significant bias when some skin tone categories have limited representation in a dataset.

As recommender and IR systems become more prevalent, it will be increasingly important to develop methods for determining if a system's outputs are biased. Bias in algorithms informing high-stakes decision making is straightforwardly damaging to some user segments. For example, ProPublica (Angwin et al., 2016) showed that one widely employed recidivism prediction system falsely predicted higher rates of recidivism among black than white defendants. Yet, bias in an algorithm's predictions in seemingly benign contexts such as fashion still merits investigation. If factors independent of the recommendation domain bias an algorithm's results (i.e., if skin tone influences fashion recommendations), the resulting recommendations will necessarily provide a poor experience for users. Beyond contributing to bad user experience, biased results could also have deleterious effects on marginalized individuals, who already regularly experience bias in their day-to-day lives. For example, experiencing racially biased results from a recommender system is conceptually similar to experiencing other racial microaggressions, defined as subtle, daily experiences that intentionally or unintentionally insult, degrade, or invalidate racial minorities (Wong et al., 2014). Among racial minorities, individuals who report having experienced microaggressions also report poorer physical, mental, and occupational outcomes (Wong et al., 2014). Therefore, providing recommendations that manifest skin tone and/or racial biases could contribute to the constellation of negative experiences marginalized people frequently have. Similarly, if IR systems are not fair with regard to other dimensions of users' identities, such as gender (see (Barthelemy et al., 2016; Basford et al., 2014)) or intersecting identities (see (Nadal et al., 2015)), many users may be especially impacted by IR systems in negative ways.

The method we propose here offers an avenue for understanding bias within the recommendation domain. Although there are some caveats for its application, our approach can help ensure that all users are provided with a high quality recommendation experience.


  • Angwin et al. (2016) Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine Bias. ProPublica (May 2016).
  • Barthelemy et al. (2016) Ramón S Barthelemy, Melinda McCormick, and Charles Henderson. 2016. Gender discrimination in physics and astronomy: Graduate student experiences of sexism and gender microaggressions. Physical Review Physics Education Research 12, 2 (2016), 020119.
  • Basford et al. (2014) Tessa E Basford, Lynn R Offermann, and Tara S Behrend. 2014. Do you see what I see? Perceptions of gender microaggressions in the workplace. Psychology of Women Quarterly 38, 3 (2014), 340–349.
  • Chardon et al. (1991) A Chardon, I Cretois, and C Hourseau. 1991. Skin colour typology and suntanning pathways. International journal of cosmetic science 13, 4 (1991), 191–208.
  • Chen et al. (2018) Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. Lecture Notes in Computer Science 11211 LNCS (2018), 833–851. arXiv:1802.02611v3
  • Day (2018) Emma Day. 2018. 9 Ways The Fashion Industry Embraced Inclusivity in 2018. Vogue Arabia (jul 2018).
  • Del Bino and Bernerd (2013) S. Del Bino and F. Bernerd. 2013. Variations in skin colour and the biological consequences of ultraviolet radiation exposure. British Journal of Dermatology 169, SUPPL. 3 (oct 2013), 33–40.
  • Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
  • Dunham et al. (2015) Yarrow Dunham, Elena V. Stepanova, Ron Dotsch, and Alexander Todorov. 2015. The development of race-based perceptual categorization skin color dominates early category judgments. Developmental Science 18, 3 (2015), 469–483.
  • Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. Proceedings of the 3rd Innovations in Theoretical Computer Science Conference on - ITCS ’12 (2012), 214–226. arXiv:arXiv:1104.3913v2
  • Field et al. (2012) Andy Field, Jeremy Miles, and Zoë Field. 2012. Discovering statistics using R. Sage publications.
  • Fitzpatrick (1975) Thomas B. Fitzpatrick. 1975. Soleil et peau. J Med Esthet (1975).
  • Fitzpatrick (1988) Thomas B. Fitzpatrick. 1988. The Validity and Practicality of Sun-Reactive Skin Types I Through VI. Archives of Dermatology 124, 6 (jun 1988), 869.
  • Gajane and Pechenizkiy (2017) Pratik Gajane and Mykola Pechenizkiy. 2017. On Formalizing Fairness in Prediction with Machine Learning. (2017). arXiv:1710.03184
  • Handa (2019) Malaika Handa. 2019. Colorism in High Fashion. The Pudding (Apr 2019).
  • Jha and Adelman (2009) Sonora Jha and Mara Adelman. 2009. Looking for love in all the white places: a study of skin color preferences on Indian matrimonial and mate-seeking websites. Studies in South Asian Film & Media 1, 1 (2009), 65–83.
  • Jing et al. (2015) Yushi Jing, David Liu, Dmitry Kislyuk, Andrew Zhai, Jiajing Xu, Jeff Donahue, and Sarah Tavel. 2015. Visual Search at Pinterest. (2015). arXiv:1505.07647
  • Kaiming et al. (2016) He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. 2016. Identity Mappings in Deep Residual Networks, In Leibe B., Matas J., Sebe N., Welling M. (eds) Computer Vision – ECCV 2016. Lecture Notes in Computer Science, vol 9908. Springer, Cham.
  • Karako and Manggala (2018) Chen Karako and Putra Manggala. 2018. Using Image Fairness Representations in Diversity-Based Re-ranking for Recommendations. In Adjunct Publication of the 26th Conference on User Modeling, Adaptation and Personalization - UMAP ’18. ACM Press, New York, New York, USA, 23–28.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 1097–1105.
  • Lee (2009) Catherine Lee. 2009. “Race” and “ethnicity” in biomedical research: how do scientists construct and explain differences in health? Social Science & Medicine 68, 6 (2009), 1183–1190.
  • Lipton et al. (2018) Zachary Lipton, Julian McAuley, and Alexandra Chouldechova. 2018. Does mitigating ML’s impact disparity require treatment disparity?. In Advances in Neural Information Processing Systems. 8125–8135.
  • Liu et al. (2014) Si Liu, Jiashi Feng, Csaba Domokos, Hui Xu, Junshi Huang, Zhenzhen Hu, and Shuicheng Yan. 2014. Fashion parsing with weak color-category labels. IEEE Transactions on Multimedia 16, 1 (2014), 253–265. DOI:10.1109/TMM.2013.2285526
  • Liu et al. (2016a) Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016a. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Liu et al. (2016b) Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016b. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1 (2016), 1096–1104. arXiv:1409.1556
  • Omi and Winant (2014) Michael Omi and Howard Winant. 2014. Racial Formation in the United States. Routledge.
  • Nadal et al. (2015) Kevin L Nadal, Kristin C Davidoff, Lindsey S Davis, Yinglee Wong, David Marshall, and Victoria McKenzie. 2015. A qualitative approach to intersectional microaggressions: Understanding influences of race, ethnicity, gender, sexuality, and religion. Qualitative Psychology 2, 2 (2015), 147.
  • Nosek et al. (2007) Brian A. Nosek, Frederick L. Smyth, Jeffrey J. Hansen, Thierry Devos, Nicole M. Lindner, Kate A. Ranganath, Colin Tucker Smith, Kristina R. Olson, Dolly Chugh, Anthony G. Greenwald, and Mahzarin R. Banaji. 2007. Pervasiveness and correlates of implicit attitudes and stereotypes. European Review of Social Psychology 18, 1 (2007), 36–88.
  • Saint-Léger (2015) D Saint-Léger. 2015. The colour of the human skin: fruitful science, unsuitable wordings. International journal of cosmetic science 37, 3 (2015), 259–265.
  • Shankar et al. (2017) Devashish Shankar, Sujay Narumanchi, H A Ananya, Pramod Kompalli, and Krishnendu Chaudhury. 2017. Deep Learning based Large Scale Visual Recommendation and Search for E-Commerce. (2017). arXiv:1703.02344
  • Stepanova and Strube (2009) Elena Stepanova and Michael Strube. 2009. Making of a face: Role of facial physiognomy, skin tone, and color presentation mode in evaluations of racial typicality. Journal of Social Psychology 149, 1 (2009), 66–81.
  • Stock and Cisse (2018) Pierre Stock and Moustapha Cisse. 2018. Convnets and imagenet beyond accuracy: Understanding mistakes and uncovering biases. In Proceedings of the European Conference on Computer Vision (ECCV). 498–512.
  • Tangseng et al. (2017) Pongsate Tangseng, Zhipeng Wu, and Kota Yamaguchi. 2017. Looking at Outfit to Parse Clothing. (2017). arXiv:1703.01386 [cs.CV]
  • Verma and Rubin (2018) Sahil Verma and Julia Rubin. 2018. Fairness definitions explained. (2018), 1–7.
  • Weaver (2012) Vesla M Weaver. 2012. The electoral consequences of skin color: The “hidden” side of race in politics. Political Behavior 34, 1 (2012), 159–192.
  • Williams et al. (2016) Amanda Williams, Clio Oliver, Katherine Aumer, and Chanel Meyers. 2016. Racial microaggressions and perceptions of Internet memes. Computers in Human Behavior 63 (2016), 424–432.
  • Wong et al. (2014) Gloria Wong, Annie O. Derthick, E. J.R. David, Anne Saw, and Sumie Okazaki. 2014. The What, the Why, and the How: A Review of Racial Microaggressions Research in Psychology. Race and Social Problems 6, 2 (2014), 181–200.
  • Yang et al. (2017) Fan Yang, Ajinkya Kale, Yury Bubnov, Leon Stein, Qiaosong Wang, Hadi Kiapour, and Robinson Piramuthu. 2017. Visual Search at eBay. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’17. ACM Press, New York, New York, USA, 2101–2110. arXiv:1706.03154
  • Yang and Stoyanovich (2017) Ke Yang and Julia Stoyanovich. 2017. Measuring fairness in ranked outputs. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management. ACM, 22.
  • Yang et al. (2014) Wei Yang, Ping Luo, and Liang Lin. 2014. Clothing co-parsing by joint image segmentation and labeling. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2013 (2014), 3182–3189. arXiv:1502.00739
  • Zafar et al. (2015) Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. 2015. Fairness constraints: Mechanisms for fair classification. arXiv preprint arXiv:1507.05259 (2015).
  • Zehlike et al. (2017) Meike Zehlike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, and Ricardo Baeza-Yates. 2017. FA*IR: A Fair Top-k Ranking Algorithm. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management - CIKM ’17. ACM Press, New York, New York, USA, 1569–1578. arXiv:1706.06368
  • Zheng et al. (2018) Shuai Zheng, Fan Yang, M. Hadi Kiapour, and Robinson Piramuthu. 2018. ModaNet: A Large-Scale Street Fashion Dataset with Polygon Annotations. October (2018), 22–26. arXiv:1807.01394