Diversity and Inclusion Metrics in Subset Selection

by   Margaret Mitchell, et al.

The ethical concept of fairness has recently been applied in machine learning (ML) settings to describe a wide range of constraints and objectives. When considering the relevance of ethical concepts to subset selection problems, the concepts of diversity and inclusion are additionally applicable in order to create outputs that account for social power and access differentials. We introduce metrics based on these concepts, which can be applied together, separately, and in tandem with additional fairness constraints. Results from human subject experiments lend support to the proposed criteria. Social choice methods can additionally be leveraged to aggregate and choose preferable sets, and we detail how these may be applied.




In human resource settings, it is said that diversity is being invited to the party; inclusion is being asked to dance (Paradiso, 2017). Although difficult to define, such fundamentally human concepts are critical in algorithmic contexts that involve humans. Historical inequities have created over-representation of some characteristics and under-representation of others in the datasets and knowledge bases that power machine learning (ML) systems. System outputs can then amplify stereotypes, alienate users, and further entrench rigid social expectations. Approximating diversity and inclusion concepts within an algorithmic system can create outputs that are informed by the social context in which they occur.

In management and organization science, diversity focuses on organizational demography; organizations that are diverse have plentiful representation within race, sexual orientation, gender, age, ability, and other identity aspects. Inclusion refers to a sense of belonging and ability to function to one’s fullest ability within organizations (Mor-Barak and Cherin, 1998; Shore et al., 2011; Roberson, 2006; Pelled et al., 1999). In sociology, one strain of research assesses the efficacy of diversity programs within firms, studying how well particular human resources interventions – such as mentoring, anti-bias training, and shared organizational responsibility practices – improve employee diversity (Kalev et al., 2006; Dobbin and Kalev, 2016). Another strain is skeptical of the concept of diversity and the discursive work that it performs more broadly within firms and social life. Managers will often use the language of diversity without making corresponding changes to promote diverse and inclusive teams (Bell and Hartmann, 2007; Embrick, 2011; Berrey, 2015).

An example of diversity is when people with different genders, races, and/or ability statuses work together at a job. In this context, the people belong to different identity groups. These identity groups are salient insofar as they correspond to systems which afford them differential access to power, such as institutional racism, sexism, and ableism. An example of inclusion is when wheelchair-accessible options are available for wheelchair users in a building. Here, the wheelchair attribute is represented in the design of the building such that wheelchair users are given similar movement options to those without wheelchairs. Inclusion, in this case, refers to the ability of individuals to feel a sense of both belonging and uniqueness for what their perspective and abilities bring to a team (Shore et al., 2011).

Building on these concepts, we introduce metrics for diversity and inclusion based on quantifiable criteria that may be applied in subset selection problems – selecting a set of instances from a larger pool. Subset selection is a common problem in ML applications that return a set of results for a query, such as in ranking and recommendation systems. While there are many burgeoning sets of mathematical formalisms for the related concept of fairness, much of the work has focused on formalizing anti-discrimination in the context of classification systems. This has given rise to fairness criteria that call for parity across various classification error metrics for pre-defined groups (Barocas and Selbst, 2016). Such constraints are generally referred to as “group” fairness, as they request that the treatment of each group is similar in some measure. In contrast to group fairness, notions of individual fairness (Dwork et al., 2012) ask that individuals similar for a task be treated similarly throughout that task.

Some notions of fairness proposed in the ranking and subset selection literature include considerations that are closely related to the idea of diversity discussed here (Celis et al., 2016; Asudeh et al., 2019; Drosou et al., 2017; Yang and Stoyanovich, 2017; Singh and Joachims, 2017). However, this literature has often conflated fairness and diversity as they are referred to in other fields such as biology (Baselga et al., 2007; Jost and others, 2009; MacArthur, 1965) and ecology (Tuomisto, 2011; Whittaker, 1960; Legendre et al., 2008). Geometric or distance-based measures of diversity have also been explored within the sciences, measuring the diversity of a dataset by the dataset’s volume (Celis et al., 2016; Kulesza et al., 2012; Gong et al., 2014; Lin and Bilmes, 2012; Zhou et al., 2010; Anari et al., 2016; Deshpande and Rademacher, 2010), variance as in PCA (Samadi et al., 2018), or other measures of spread. The notion of heterogeneity more closely matches such proposals, as they do not explicitly refer to features with societal import and context.

Our work intentionally differentiates the concept of diversity from variety or heterogeneity that may hold of a set, where diversity focuses on individual attributes of social concern (see the background section), and heterogeneity is agnostic to specific social groups. As we discuss in this work, a diversity metric can prioritize that as many identity characteristics as possible be represented in a subset, subject to a target distribution. If the target distribution is uniform (i.e., equal representation), this is similar to demographic parity in fairness literature (Dwork et al., 2012), where similar groups have similar treatment. Although group-based fairness constraints may apply in this setting, such constraints would be asking that all groups be represented equally. The proposed diversity metrics allow for more control over the specification of the distribution of groups. Contrasted with the numerous definitions of diversity and fairness, measurements of inclusion have received relatively little consideration within computer science. We define a metric for inclusion, taking inspiration from works in organization science and notions of individual fairness. To summarize our contributions:

  1. We propose metrics for diversity and inclusion, relating these concepts to their corresponding social notions.

  2. We focus on the general problem of selecting a set of instances from a larger set, formalizing how each set may be scored for diversity and inclusion.

  3. We demonstrate how methods from social choice theory can be used to aggregate and choose preferable sets.

Results from human subject experiments suggest that the proposed metrics are consistent with social notions of these concepts.

Background and Notation

Subset selection is a fundamental task in many algorithmic systems, underpinning retrieval, ranking, and recommendation problems. We formalize the family of diversity and inclusion metrics within this task. Fix a query $q$ and a set of instances in the domain of relevance $D_q$. (We intentionally conflate queries and query intents in this work, and assume that queries closely capture a user’s intent.)

Given a set of instances $X_q$ and instances $x_q \in X_q$, each instance may have multiple objects or items relevant to the query, e.g., people or shoes. We denote these relevant objects by $i \in x_q$. All proposed metrics can act upon instances $x_q$ or sets $X_q$.

Let $a$ refer to an attribute of a person or item indexing a corresponding group type, such as age:young. Here, the attribute young indexes its corresponding group type age. $A$ defines the set of attributes to measure for a given instance or set. With some abuse of notation, we define $a(p)$ as a function that indicates whether individual $p$ has attribute $a$. For example, this might take the form of an indicator function. We define $a(x_q)$ as a function that indicates the relevance of attribute $a$ within $x_q$. For example, this might take the form of an indicator function for whether the instance contains an item which refers to the attribute. Similarly, we define $a(X_q)$ as a function of $a$ within $X_q$, such as the proportion of instances that contain $a$. This allows us to quantify the following concepts for instances or sets:

  • Heterogeneity: Variety within an instance or set of instances. $A$ may be any kind of characteristic, where greater heterogeneity corresponds to as many attributes in $A$ as possible.

  • Diversity: Variety in the representation of individuals in an instance or set of instances, with respect to sociopolitical power differentials (gender, race, etc.). Greater diversity means a closer match to a target distribution over socially relevant characteristics.

  • Inclusion: Representation of an individual user within an instance or a set of instances, where greater inclusion corresponds to better alignment between a user and the options relevant to them in an instance or set.

Throughout, we define $p$ as a set of attributes for an individual, but note that $p$ does not have to correspond to a specific person; it may simply be a set of attributes for a system to be inclusive towards. Critically, for the family of diversity and inclusion metrics introduced below, $A$ is defined in light of human attributes involved in social power differentials, such as gender, race, color, or creed. Power differentials are significant insofar as greater representation and presence of individuals with marginalized identities can result in greater feelings of belonging and acceptance and more successful teams. For example, if $A$ represents the Gender concept, an attribute $a$ may be one of Gender:female, Gender:male, or Gender:nonbinary. $A$ may also be a collection of attributes from multiple different demographic subgroups, such as {Skin:Fitzpatrick Type 6, Gender:Female}. Further details are provided in the following section.

Quantifying Diversity

Recall the domain of relevance $D_q$ for a query $q$, and the aim to quantify the diversity of a set $X_q$. The more diverse a set $X_q$ is in a domain $D_q$, the greater the presence in the set of attributes relevant to social structures of power and influence.

Consider a set of attributes $A$ where each $a \in A$ has target lower and upper bounds $l_a$ and $u_a$ on its presence in $x_q$, and let $a(x_q)$ quantify the presence of $a$ within $x_q$. The measurement of $a(x_q)$ as well as the bounds $l_a$ and $u_a$ are design parameters of our family of diversity metrics. Selecting values for each induces a particular metric in this family. The lower bound $l_a$ might be defined to implement the rule, or require at least population-level frequency of attribute $a$ within $x_q$. Many literatures have adopted their own notions of diversity (see the introduction). Our formulation bears some resemblance to that of Celis et al. (2017), who discuss ranking objects subject to upper and lower bounds. Our work departs from theirs in that, under the different choices outlined below, these need not be hard constraints on the presence of an attribute, and presence need not be a simple count.

Presence Score

Recall that an instance (e.g., a recommended movie in a set of movie recommendations) is composed of one or more items (e.g., actors, objects, and settings in the movie). Each item reflects or indexes different attributes. For example, the actors reflect attributes such as their gender, age and race; objects similarly index such attributes, for example, high heels may index the woman attribute. We define the presence score of an attribute $a$ as a function quantifying how close the presence $a(x_q)$ is to the target and upper and lower bounds on the attribute’s presence:

$$\mathrm{Presence}_a(x_q) = f(l_a, u_a, a(x_q))$$

with higher values meaning $a$ is more present in $x_q$.

One natural quantification of the presence $a(x_q)$ of $a$ in $x_q$ is the proportion of items within $x_q$ reflecting the attribute $a$. Similarly, one of the simplest forms that $f$ can take is as an indicator function that returns a value of 1 when the proportion of $a$ in $x_q$ is at least $l_a$. This approach is equivalent to: $\mathrm{Presence}_a(x_q) = \mathbb{1}[a(x_q) \geq l_a]$. $f$ may also be instantiated as a more complex function, for example, capturing the distance between $a(x_q)$ and the bounds. There also may be settings where the lower and upper bounds are not hard constraints: some choices of $f$ can return nonzero values for $a(x_q)$ outside $[l_a, u_a]$, such as when there is an increasing penalty for going beyond the specified upper bound.
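As an illustration of the choices described above, the following is a minimal sketch of a proportion-based presence measure with a hard (indicator) and a soft scoring variant. The function names, the representation of items as sets of attribute strings, and the linear penalty in the soft variant are all assumptions made for this example, not definitions from the paper.

```python
def proportion(items, attribute):
    """Share of items in an instance that reflect `attribute`."""
    if not items:
        return 0.0
    return sum(attribute in item for item in items) / len(items)


def indicator_presence(items, attribute, lower):
    """Hard variant: 1.0 once the share reaches the lower bound, else 0.0."""
    return 1.0 if proportion(items, attribute) >= lower else 0.0


def soft_presence(items, attribute, lower, upper):
    """Soft variant (assumed form): 1.0 inside [lower, upper],
    shrinking linearly the further the share falls outside the bounds."""
    share = proportion(items, attribute)
    if share < lower:
        return share / lower  # lower > 0 here because share >= 0
    if share > upper:
        return max(0.0, 1.0 - (share - upper))
    return 1.0


# Hypothetical instance: each item is the set of attributes it indexes.
instance = [
    {"gender:woman", "age:young"},
    {"gender:man"},
    {"gender:woman"},
    {"gender:man", "age:old"},
]
print(indicator_presence(instance, "gender:woman", lower=0.5))
print(soft_presence(instance, "gender:woman", lower=0.25, upper=0.4))
```

The soft variant shows how bounds can act as targets rather than hard constraints; any other monotone penalty would serve equally well.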

The presence formulation provides information about the contribution of a single attribute to an instance. For each $a$, the form of $a(x_q)$, as well as $l_a$ and $u_a$, must be specified to define a metric. Different choices for these values give rise to metrics with different meaning; what is appropriate for a given task should be considered carefully by domain experts and a broad set of individuals who use the technology relying on the set selection.

Using target distributions for scoring sets and instances provides for additional considerations beyond the parity often afforded by fairness metrics, such as sets that are closer to real-world distributions. This also potentially allows for more fluid/nuanced treatment of group membership, where multiple overlapping group memberships within one instance can be accommodated.

Diversity Score

With the presence score defined, we can now define the diversity of an instance as an aggregate statistic of the attributes in the instance:

$$\mathrm{Diversity}_A(x_q) = g(\mathrm{Presence}_a(x_q))$$

across $a \in A$, where $g(\cdot)$ can return the minimum, maximum, or average presence value of the attributes. These standard choices of cumulation functions are borrowed from social choice theory in economics, and similar economics-based metrics may be applied to combine presence scores of many attributes into the single diversity score. For example, a function such as maximin (Rawls, 1974) reduces to the presence score of the lowest-scoring attribute in $A$ (see the section on Social Choice Theory below).
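The aggregation step can be sketched in a few lines. This is only an illustration under assumed names: `diversity` and the dictionary of presence scores are hypothetical, and the three aggregators are the minimum, maximum, and average mentioned in the text.

```python
from statistics import mean

# The three cumulation functions named in the text.
AGGREGATORS = {"min": min, "max": max, "mean": mean}


def diversity(presence_by_attribute, how="min"):
    """Combine per-attribute presence scores into one diversity score.

    `presence_by_attribute` maps an attribute name to its presence score
    and must be non-empty.
    """
    scores = list(presence_by_attribute.values())
    return AGGREGATORS[how](scores)


# Hypothetical presence scores for one instance.
presence = {"gender:woman": 1.0, "gender:man": 1.0, "gender:nonbinary": 0.0}
for how in ("min", "max", "mean"):
    print(how, diversity(presence, how))
```

With the minimum, a single absent attribute drives the score to zero, which is the maximin behaviour; the maximum and average are progressively more forgiving. The same function could be reused to aggregate instance-level diversity scores over a set.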

The Diversity family of metrics can highlight or prioritize diversity with respect to relevant social groups. For example:

  • Racial Diversity: many race groups present.

  • Gender Diversity: many gender groups present.

  • Age Diversity: many age groups present.

Figure 1. Gender diversity, without inclusion for women, “construction worker” image domain: Although several genders and colors are represented (diversity of people), male-presenting individuals are shown in realistic, modern construction worker situations, while women and other genders are depicted as historic nostalgia, toys, clipart, or passive in the event.

Set Diversity

The formulation for an instance giving rise to a diversity score naturally extends to a set of instances giving rise to a diversity score. An example set of images that is Gender Diverse is shown in Figure 1. We define the cumulative diversity score of a set $X_q$ as a function of $\mathrm{Diversity}_A(x_q)$ across $x_q \in X_q$. As before, this can be scored following the social choice theory functions further detailed below.

Quantifying Inclusion

We now move towards proposing a family of metrics to measure inclusion for subset selection. Our proposed inclusion metric captures the degree to which an individual is well represented by the returned set. As an example, an individual looking for hair style inspiration might query ‘best hairstyles 2019’. In the absence of additional qualifiers, e.g., those that narrow the query by explicitly specifying demographic information, an inclusive image set would be one where the individual sees people with similar hair textures to theirs in the selected set. We measure the inclusion of a person $p$ (or set of attributes) along attribute $a$ when selecting $X_q$ from $D_q$. We begin by introducing instance inclusion, a measure of how well an instance $x_q$ represents $p$, and then extend to set inclusion.

Instance Inclusion.

As above, we assume an instance $x_q$ (e.g., an image) is composed of one or more items $i$ (e.g., different components of the image). Each item has some relevance to a query $q$ and may be a better or a worse fit for an individual $p$ along some attribute $a$. The inclusion of an instance aggregates the relevance and fit of all items in $x_q$ and produces a single measure of that instance’s ability to reflect $p$ or to meet $p$’s goals.

Continuing with the example above, an instance $x_q$ can refer to an image with several subjects, and each subject corresponds to an item $i$. A person $p$ may find $i$ to be a good fit along the hair type attribute if their hair type is similar to $i$’s. Then, the instance’s inclusion for $p$ along this attribute combines the fit of all the subjects in the instance.

Relevance of an item.

Formally, let $\mathrm{rel}(i, q)$ measure the relevance of an item $i$ to query $q$. The relevance score is an exogenous measure of how well an item answers a query, that is, it is the assumed system metric for the subset selection task at hand.

Representativeness of an item.

Let $\mathrm{rep}_a(i, p, q)$ measure the representativeness of an item $i$ for $p$ and query $q$ along attribute $a$. Representativeness measures how well an item aligns with a user $p$’s attribute $a$ (e.g., if $i$ has similar hair texture to $p$ and $q$ refers to hair styles). We allow for representativeness to be both positive and negative, to capture the idea that an item might be a positive or negative representation of $p$, and that this polarity might depend on $q$ as well as the attribute $a$.

There are many natural choices for the representativeness function. For example, if items correspond to people, then a candidate representativeness function could indicate whether the attribute $a$ is the same for both $p$ and an item:

$$\mathrm{rep}_a(i, p, q) = \mathbb{1}[a(i) = a(p)]$$

One could also choose some more complex measure of the match of $i$ to $p$ along $a$. One can similarly define a notion of representativeness for items that are not individuals, if individuals find some of those items as being well-aligned with their identity along $a$.

This can express that “similar” individuals along $a$ may make $p$ feel more included, even if similar values to $a(p)$ would not increase the diversity score for $x_q$. This captures the idea that the diversity score measures an abstracted and simplified summary, while the inclusion score affords a more fluid contextual understanding of identities.

An instance’s set of items, their relevance, and their representativeness together may be represented as:

$$\{(i, \mathrm{rel}(i, q), \mathrm{rep}_a(i, p, q)) : i \in x_q\}$$

We can then define the inclusion of an instance as an aggregate statistic of the set of items in the instance, their relevance to the query, and the items’ alignment or match to individual $p$ along $a$:

$$\mathrm{Inclusion}_a(x_q, p, q) = f(\{(i, \mathrm{rel}(i, q), \mathrm{rep}_a(i, p, q)) : i \in x_q\})$$

In the simplest case, each instance may contain only one item (or one relevant item), in which case $f$ might simply report the representativeness of the single (relevant) item. In the case where many items in an instance are relevant, $f$ might measure the median representativeness of the high-relevance items in $x_q$, or the maximum representativeness of some item in the instance.

An inclusion score near $-1$ indicates $p$ finds the instance stereotypical; this is similar to the notion of negative stereotypes in representation (Cheryan et al., 2013) or tokenism (Snell, 2017). A score near $1$ refers to $p$’s known attribute $a$ being well aligned in $x_q$. A score near $0$ corresponds to $p$ finding few or no attribute alignments in $x_q$.
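A small sketch may help make instance inclusion concrete. It assumes a match-based representativeness function and aggregates over high-relevance items with either the median or the maximum, as suggested above. All names here (`match_rep`, `instance_inclusion`, the `min_relevance` threshold, the dictionary layout of items) are hypothetical choices for illustration only.

```python
from statistics import median


def match_rep(item_attrs, person_attrs, group):
    """Assumed representativeness: 1.0 if item and person share the value
    for `group` (e.g. "hair"), else 0.0. A richer function could also
    return negative values for stereotyped depictions."""
    return 1.0 if item_attrs.get(group) == person_attrs.get(group) else 0.0


def instance_inclusion(items, person, group, min_relevance=0.5, how="median"):
    """Aggregate representativeness over the items relevant to the query."""
    relevant = [it for it in items if it["relevance"] >= min_relevance]
    if not relevant:
        return 0.0  # no relevant items: no attribute alignment found
    reps = [match_rep(it["attrs"], person, group) for it in relevant]
    return median(reps) if how == "median" else max(reps)


# Hypothetical image with three subjects, one of them barely relevant.
person = {"hair": "curly"}
image = [
    {"attrs": {"hair": "curly"}, "relevance": 0.9},
    {"attrs": {"hair": "straight"}, "relevance": 0.8},
    {"attrs": {"hair": "curly"}, "relevance": 0.1},
]
print(instance_inclusion(image, person, "hair", how="median"))
print(instance_inclusion(image, person, "hair", how="max"))
```

The maximum rewards an instance if any one relevant item matches the person, while the median asks that a typical relevant item does.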

Set Inclusion.

An instance giving rise to an inclusion score for $p$ along an attribute $a$ for query $q$ naturally extends to scoring the inclusion of a set of instances. The cumulative inclusion score of a set $X_q$ is a function of $\mathrm{Inclusion}_a(x_q, p, q)$ across the instances in the set: $\mathrm{Inclusion}_a(X_q, p, q) = g(\mathrm{Inclusion}_a(x_q, p, q))$ across $x_q \in X_q$. In this formulation, the inclusion score of an instance is composed of the representativeness and relevance of items within it, and the inclusion score of a set is made up of the instances within the set.

Multiple Attribute Inclusion.

Another type of cumulative inclusion score ranges over the set of attributes $A$ known about $p$, capturing a holistic sense of inclusion for $p$ rather than one according to a single attribute. Just as in set inclusion, many natural definitions of multiple attribute inclusion arise from defining a cumulative function $g$.

Both instance-based and attribute-based cumulative functions for Inclusion can leverage social choice theory to return the final score, as detailed in the Social Choice section below. For example, in a Nash Welfare Inclusivity approach for Set Inclusion, $g$ would return the geometric mean over $\mathrm{Inclusion}_a(x_q, p, q)$ for $x_q \in X_q$. In a Nash Welfare Inclusivity approach for Multiple Attribute Inclusion, $g$ would return the geometric mean over $\mathrm{Inclusion}_a(x_q, p, q)$ for $a \in A$.

Inclusion Metrics Discussion

The relevance function $\mathrm{rel}$.

We now reflect on the relevance function $\mathrm{rel}$ in the description of inclusion above. We mention above that the relevance function measures how well an item $i$ corresponds to a query string $q$. The objective function for many subset selection algorithms often measures exactly such a quantity, independent of inclusion or diversity concerns, though this may only be measured for an instance rather than items in the instance.

However, the ground-truth relevance score of an instance or set of instances with respect to some $q$ may never be measurable or even directly defined, and for this reason some simpler proxies are often used in place of a ground truth relevance score. If one uses this same proxy score function to define inclusion, this choice may affect inclusion scores for certain parties more than others due to unequal measurement error across the space of items and instances.

Individual $p$: skin:type_6, age:70
Query $q$: Scientist

Inclusion scores per image (attribute = score):

              Image x_1        Image x_2        Image x_3
Gender        woman = 1.00     woman = 1.00     woman = 1.00
Skin          type_5 = 0.83    type_4 = 0.67    type_3 = 0.50
Age           31 = 0.61        23 = 0.53        47 = 0.77

Cumulative scores:

              Multiple Attribute         Set (Pair)
              x_1     x_2     x_3        x_1,x_2   x_1,x_3   x_2,x_3
Utilitarian   0.81    0.73    0.76       0.77      0.79      0.75
Egalitarian   0.61    0.53    0.50       0.53      0.50      0.50
Nash          0.79    0.71    0.73       0.75      0.76      0.72

Figure 2. Worked example of Inclusion scores for attributes $a$ of each instance $x_q$, given a user $p$ and a query $q$. Below each image are the associated attributes (left) and the Inclusion scores for $p$ on this attribute (right). In this example, we must select two images out of the three. Three different methods for aggregating the inclusion scores for attributes are illustrated. The first, motivated by utilitarianism, takes the average inclusion score for the image pair. The highest-scoring pair is then images $x_1$ and $x_3$. The second, motivated by egalitarianism, takes the minimum inclusion score of the pair. The highest-scoring pair is then images $x_1$ and $x_2$. Finally, Nash inclusivity chooses the pair with the highest geometric mean, in this case the same images as in utilitarianism, $x_1$ and $x_3$.

Comparing Subset Inclusivity: Approaches from Social Choice Theory

We have defined Diversity and Inclusion criteria for single attributes in single instances, and have briefly discussed how these can be extended to sets of instances or to sets of attributes. Extending to such sets requires a cumulation mechanism, which produces a single score from a set of scores. Here, we can build from social choice theory, which has well-developed mechanisms for determining a final score from a set of scored items based on the ethical goals defined for a system. For example, an egalitarian mechanism (Rawls, 1974) can be used to favor under-served individuals that share an attribute. A utilitarian mechanism (Mill, 2016) can be used to treat all attributes as equally important, producing an arithmetic average over items. Such methods may also be used to compare scores across sets. We detail three such relevant mechanisms for subset scoring below, and illustrate these concepts using scores in Figure 2.

Egalitarian (maximin) inclusivity. Set $X$ may be said to be more inclusive than set $X'$ if the lowest inclusion score in $X$ is higher than the lowest inclusion score in $X'$, i.e.,

$$\min_{x \in X} \mathrm{Inclusion}_a(x, p, q) > \min_{x' \in X'} \mathrm{Inclusion}_a(x', p, q)$$

If the two minima are equal, then repeat for the second lowest scores, third, and so on. If all of these comparisons are equal, we are indifferent between $X$ and $X'$.

Utilitarian inclusivity: This corresponds to an arithmetic average over the inclusion scores for all items in the set, where a set $X$ is more inclusive than $X'$ if the average of its inclusion metric scores is greater, i.e.,

$$\frac{1}{|X|} \sum_{x \in X} \mathrm{Inclusion}_a(x, p, q) > \frac{1}{|X'|} \sum_{x' \in X'} \mathrm{Inclusion}_a(x', p, q)$$

Nash inclusivity: This corresponds to the geometric mean over the inclusion scores for all items in the set. Set $X$ is more inclusive than $X'$ if the product of its inclusion metric scores is greater, i.e.,

$$\prod_{x \in X} \mathrm{Inclusion}_a(x, p, q) > \prod_{x' \in X'} \mathrm{Inclusion}_a(x', p, q)$$

Nash inclusivity can be seen as a mix of utilitarian and egalitarian, as it monotonically increases with both of these measures (Caragiannis et al., 2019).
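The three mechanisms can be compared side by side with a short sketch. It assumes the two candidate sets have the same size and that inclusion scores are non-negative (needed for the geometric mean); the function names are hypothetical. The egalitarian comparison is implemented as a leximin ordering, that is, sorted scores compared from the lowest upward, which matches the tie-breaking rule described above.

```python
from math import prod
from statistics import mean


def utilitarian(scores):
    """Arithmetic average of the inclusion scores."""
    return mean(scores)


def nash(scores):
    """Geometric mean; assumes all scores are non-negative."""
    return prod(scores) ** (1.0 / len(scores))


def leximin_key(scores):
    """Ascending sort, so list comparison checks the lowest score first,
    then the second lowest, and so on."""
    return sorted(scores)


MECHANISMS = {"egalitarian": leximin_key, "utilitarian": utilitarian, "nash": nash}


def more_inclusive(set_a, set_b, mechanism="egalitarian"):
    """Return 'a', 'b', or 'tie' for two equal-sized lists of scores."""
    key = MECHANISMS[mechanism]
    key_a, key_b = key(set_a), key(set_b)
    if key_a > key_b:
        return "a"
    if key_b > key_a:
        return "b"
    return "tie"


# Hypothetical scores where the mechanisms disagree.
a, b = [0.2, 1.0], [0.5, 0.6]
for name in MECHANISMS:
    print(name, more_inclusive(a, b, name))
```

In this example the utilitarian mechanism prefers the set with the higher average even though one instance scores poorly, while the egalitarian and Nash mechanisms prefer the more even set.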

Metrics In Practice

We assume that $X_q$ is a set of instances relevant to the domain of interest $D_q$, such that instances within each selected subset are relevant according to $\mathrm{rel}$, where a score of 1.0 means that an instance is relevant to the query.

Prompt Polarity.

When applying Diversity and Inclusion metrics in a domain where the query is not only neutral, but may also be negative (e.g., “jerks”), it is necessary to incorporate a value into the score to tease out the ‘negative’ meaning and values of the inclusion score, as may be provided by a sentiment model. For example:


Note that the subset for a given $p$, $q$ pair can increase the diversity score by producing diverse stereotypes unless $q$ and $D_q$ are well defined. (Examples of stereotypes are intentionally omitted throughout the paper in order to minimize further stereotype propagation.) The domain of relevance is crucial for understanding whether a set of results might stereotype by a particular attribute. For example, if $q$ is “work clothing”, and the set contains only pink women’s workwear but a variety of colors for men’s workwear, this set could be said to uphold the stereotype about women and their color preferences, even if the set is diverse and inclusive for a man. On the other hand, if $q$ is “pink women’s work clothing”, the same set of women’s clothing reflects the query and domain, while in the former case the results overconcentrate a specific color in the results relevant to women. Stereotyping here refers to homogeneity across results for attribute $a$.

The person perceiving a set of results is the arbiter of whether the results stereotype them. Suppose the person searching for clothing in the previous example is a woman. If she likes pink workwear, she might feel as though the instances of women’s workwear being pink suit her goals and needs; if she does not particularly like pink, even if a majority of women generally like pink, the results of a search containing only pink women’s clothing do not meet her goals, but do reinforce a standard assumption about women’s clothing.


Crossing demographics-based attributes such as those based on Gender and Race yields intersectional attributes that can be applied in the same manner as unitary attributes. Without accounting for intersectionality, it is possible for a set of instances to receive high diversity and inclusion scores without reflecting the unique characteristics of the individual. For example, if a black woman is searching for movie recommendations, and the set returned is half movies starring black men and half movies starring white women, the selection may be diverse and aligned somewhat with her social identities while still creating a sense of exclusion.

Inclusion within Instances.

The focus of the family of inclusion metrics introduced in this paper is inclusion towards the individual presented with the set. Another aspect of inclusion concerns the individuals represented in the instances. For example, if $x_q$ contains people of different ethnicities, all stereotyped except for the one that authentically represents the ethnicity of the individual, the proposed metrics will not capture this effect. It may be desirable to apply the Inclusion metric not only to the individual creating the query, but also to those who may be represented.

Worked Example

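We begin with the context and person creating the query. The person may be seeking a selection of stock images to use for a presentation to an unknown-to-them audience. The person has a token in the system where they permit information to be stored, such as their gender and hair color. Assume a specific $p$ whose attributes include skin:6 (Fitzpatrick Skin Type 6; Fitzpatrick, 1988) and age:70, as listed in Figure 2. A generalization is a list of attributes most at risk for disproportionately unfair experiences, without requiring correspondence to a specific individual. Inclusion scores are shown in Figure 2. Each image has one item $i$, and for simplicity we assume the given relevance score $\mathrm{rel}(i, q) = 1$ for all images, that is, all images are equally relevant to the query. The Inclusion score is then:

$$\mathrm{Inclusion}_a(x_q, p, q) = \mathrm{rep}_a(i, p, q)$$

Inclusion is here equal to the representativeness score for each group type (skin, age, hair). Basic instantiations of the metric may be measures of distance or match. For instance, the skin and age scores in Figure 2 are consistent with the normalized distances

$$1 - \frac{|\mathrm{skin}(p) - \mathrm{skin}(i)|}{6} \quad \text{and} \quad 1 - \frac{|\mathrm{age}(p) - \mathrm{age}(i)|}{100},$$

so that skin type 5 scores $1 - 1/6 \approx 0.83$ and age 31 scores $1 - 39/100 = 0.61$.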

Figure 2 details inclusion scores for a set of images given the person described above. Applying the Diversity criteria above, with Presence scored by an indicator function, each image has a Diversity score of 0, because each attribute has only one form in each image (e.g., a single person is present). The image set is also not Gender Diverse.
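The scores in Figure 2 can be reproduced with a short script. This is a sketch under stated assumptions: the distance normalizers (6 skin types, a 100-year age range) and the person's gender attribute are inferred from the numbers shown in Figure 2 rather than stated in the text, and the helper names are hypothetical.

```python
from itertools import combinations
from math import prod
from statistics import mean

# Person and image subjects as in Figure 2 (gender of p is an assumption).
person = {"gender": "woman", "skin": 6, "age": 70}
images = {
    "x1": {"gender": "woman", "skin": 5, "age": 31},
    "x2": {"gender": "woman", "skin": 4, "age": 23},
    "x3": {"gender": "woman", "skin": 3, "age": 47},
}


def attribute_scores(subject):
    """Per-attribute inclusion scores: exact match for gender,
    normalized distance for skin type and age (assumed normalizers)."""
    return [
        1.0 if subject["gender"] == person["gender"] else 0.0,
        1.0 - abs(person["skin"] - subject["skin"]) / 6,
        1.0 - abs(person["age"] - subject["age"]) / 100,
    ]


def geometric_mean(scores):
    return prod(scores) ** (1.0 / len(scores))


def best_pair(aggregate):
    """Aggregate attributes per image, then aggregate again per pair."""
    per_image = {name: aggregate(attribute_scores(s)) for name, s in images.items()}
    return max(
        combinations(sorted(per_image), 2),
        key=lambda pair: aggregate([per_image[name] for name in pair]),
    )


for label, aggregate in [("utilitarian", mean), ("egalitarian", min), ("nash", geometric_mean)]:
    print(label, best_pair(aggregate))
```

Under these assumptions the utilitarian and Nash mechanisms select the same pair, while the egalitarian mechanism selects a different one, in line with the caption of Figure 2.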

Image Set Perception Study


To evaluate the viability of our proposed metrics, we conducted surveys on Amazon’s Mechanical Turk platform, asking respondents to compare the relative diversity and inclusiveness of sets of images with respect to gender and skin tone.

To do this, we curated several stock image sets containing people depicting specific occupations, listed in Table 1. These sets were designed to be diverse and/or inclusive as outlined in this paper. Specifically, we curated four sets of images: a set that was diverse but not inclusive (D+I-), inclusive but not diverse (D-I+), both inclusive and diverse (D+I+), and neither inclusive nor diverse (D-I-).

Respondents were presented with pairs of image sets from a given occupation and asked to select which was more inclusive or diverse with respect to a specified demographic (gender or skin tone), with an option to indicate that both were approximately the same. At the end of the survey, we also collected information on rater age and gender. (Our interface also allowed us to collect genders beyond the man/woman binary. However, due to the small sample size, they are excluded from our analysis.) We scored image sets by simply calculating the percentage of all comparisons where the image set “won” (i.e., was selected as the more diverse or inclusive set).
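The win-rate scoring described above can be sketched as follows. The data layout (tuples of two set labels and the respondent's choice) is hypothetical, and counting a “same” answer as a win for neither set is an assumption, since the text does not say how such answers were handled.

```python
from collections import Counter


def win_rates(comparisons):
    """Fraction of comparisons each image set won.

    `comparisons` holds (set_a, set_b, choice) tuples, where `choice`
    is one of the two set labels or the string "same".
    """
    appearances = Counter()
    wins = Counter()
    for set_a, set_b, choice in comparisons:
        appearances[set_a] += 1
        appearances[set_b] += 1
        if choice in (set_a, set_b):
            wins[choice] += 1  # "same" counts as a win for neither set
    return {label: wins[label] / appearances[label] for label in appearances}


# Hypothetical responses for one occupation and one task.
responses = [
    ("D+I+", "D-I-", "D+I+"),
    ("D+I+", "D+I-", "same"),
    ("D+I-", "D-I-", "D+I-"),
]
print(win_rates(responses))
```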

computer programmer scientist doctor nurse
salesperson janitor lawyer dancer
Table 1. Occupations in study


As shown in Figure 3, we found that aggregating across occupations, D+I+ image sets had the highest average scores for both the diversity and inclusion comparison tasks, with D+I+ sets receiving higher diversity and inclusion ratings than the other three conditions (D+I-, D-I+, and D-I-). D-I- sets received the lowest diversity and inclusion ratings. This suggests, perhaps unsurprisingly, that there is some overlap in the concepts of diversity and inclusion: inclusivity adds to the perception of diversity, and vice versa.

Although there is overlap in the perception of the two concepts, our results also suggest that respondents differentiated between our metrics of inclusivity and diversity. Specifically, D-I+ stimuli were labeled as more inclusive than diverse, aligning with the intended diversity and inclusion of the sets. Interestingly, D+I- stimuli were also labeled as more inclusive than diverse, although the gap between inclusion and diversity ratings is smaller. These results indicate that respondents perceive sets with more diversity and inclusion over a baseline as more inclusive than diverse.

When split by respondents’ self-identified gender, men tended to rate D+I- conditions as more inclusive than diverse, while women tended to rate these conditions as equally inclusive and diverse. Women also found the D-I+ sets substantially more inclusive than diverse, with much less of a difference between diversity and inclusion scores for the remainder of the sets. This discrepancy underscores the relevance of the user: the identity of the respondents impacts perceptions of diversity and inclusion in image sets.

Figure 3. Fraction of comparison tasks “won” for each image group and task.

Data Quality

We screened for low-quality responses using three approaches: duplicate “confirmation questions”, the use of free-response fields on a multiple-choice question, and reCAPTCHA. First, each set of comparisons contained two “confirmation questions”, which were simply duplicates of earlier questions with images shuffled and the comparison presented in reverse order. Second, while the survey had only three available options (“Set A”, “Set B”, and “Same”), respondents were given a free-response answer box to type their answer. This allowed us to filter for automated responses, as we found that a small fraction of the responses were nonsensical (e.g., “No”, or “Very good”). Finally, respondents had to fill out a reCAPTCHA form before submitting. Answers with a reCAPTCHA score below a cutoff threshold, those whose confirmation questions did not agree, and free-response answers that could not be resolved into a valid response were removed. After filtering, we had 491 valid responses, which contained comparisons between all image sets for each occupation.
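The three screening steps can be sketched as a single filter. This is an illustrative reconstruction under stated assumptions, not the authors' pipeline: the response layout (`recaptcha_score`, `answers`, `confirmation_pairs`), the helper names, and the threshold parameter are hypothetical. It assumes that a consistent answer to a confirmation question is the mirror of the original answer, because the comparison is shown in reverse order, and that a response is dropped entirely if any of its answers fails a check.

```python
# Hypothetical mapping from typed free-response text to the three options.
VALID = {"set a": "A", "set b": "B", "same": "Same"}
# Assumption: a confirmation question swaps the two sets, so a consistent
# respondent gives the mirrored answer.
FLIP = {"A": "B", "B": "A", "Same": "Same"}


def resolve_answer(text):
    """Map a free-response answer to "A", "B", or "Same"; None if unresolvable."""
    return VALID.get(text.strip().lower())


def keep_response(response, recaptcha_threshold):
    """Return True if a response passes all three screening checks."""
    # 1. reCAPTCHA score must reach the caller-supplied threshold.
    if response["recaptcha_score"] < recaptcha_threshold:
        return False
    # 2. Every typed answer must resolve to one of the three valid options.
    answers = {q: resolve_answer(t) for q, t in response["answers"].items()}
    if any(a is None for a in answers.values()):
        return False
    # 3. Each confirmation question must agree with its original question.
    for original_q, duplicate_q in response["confirmation_pairs"]:
        if answers[duplicate_q] != FLIP[answers[original_q]]:
            return False
    return True


good = {
    "recaptcha_score": 0.9,
    "answers": {"q1": "Set A", "q2": " same ", "c1": "set b"},
    "confirmation_pairs": [("q1", "c1")],
}
nonsense = dict(good, answers={"q1": "Very good", "q2": "Same", "c1": "Set B"})
inconsistent = dict(good, answers={"q1": "Set A", "q2": "Same", "c1": "Set A"})
```

The threshold is passed in by the caller rather than hard-coded, since the value is a property of the study design and not something this sketch can supply.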


We have distinguished between notions of diversity and inclusion and detailed how they may be formalized and applied to the general problem of scoring instances or sets. This may be useful in subset selection problems that seek to reflect individuals with attributes that are disproportionately marginalized, such as when selecting images of people in a stock photo selection task. Our worked example demonstrates how social choice theory can be applied to compare diversity and inclusion scores across different sets.

Thank you to Andrew Zaldivar, Ben Packer, and Tulsee Doshi for the insightful discussions and suggestions.


  • N. Anari, S. O. Gharan, and A. Rezaei (2016) Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In Conference on Learning Theory, pp. 103–115. Cited by: Introduction.
  • A. Asudeh, Z. Jin, and H. Jagadish (2019) Assessing and remedying coverage for a given dataset. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 554–565. Cited by: Introduction.
  • S. Barocas and A. D. Selbst (2016) Big data’s disparate impact. Calif. L. Rev. 104, pp. 671. Cited by: Introduction.
  • A. Baselga, A. Jiménez-Valverde, and G. Niccolini (2007) A multiple-site similarity measure independent of richness. Biology Letters 3 (6), pp. 642–645. Cited by: Introduction.
  • J. M. Bell and D. Hartmann (2007) Diversity in everyday discourse: the cultural ambiguities and consequences of “happy talk”. American Sociological Review 72 (6), pp. 895–914. Cited by: Introduction.
  • E. Berrey (2015) The enigma of diversity: the language of race and the limits of racial justice. University of Chicago Press. Cited by: Introduction.
  • I. Caragiannis, D. Kurokawa, H. Moulin, A. D. Procaccia, N. Shah, and J. Wang (2019) The unreasonable fairness of maximum Nash welfare. ACM Transactions on Economics and Computation (TEAC) 7 (3), pp. 12. Cited by: Comparing Subset Inclusivity: Approaches from Social Choice Theory.
  • L. E. Celis, A. Deshpande, T. Kathuria, and N. K. Vishnoi (2016) How to be fair and diverse?. arXiv preprint arXiv:1610.07183. Cited by: Introduction.
  • L. E. Celis, D. Straszak, and N. K. Vishnoi (2017) Ranking with fairness constraints. CoRR abs/1704.06840. Cited by: Quantifying Diversity.
  • S. Cheryan, V. C. Plaut, C. Handron, and L. Hudson (2013) The stereotypical computer scientist: gendered media representations as a barrier to inclusion for women. Sex roles 69 (1-2), pp. 58–71. Cited by: Representativeness of an item..
  • A. Deshpande and L. Rademacher (2010) Efficient volume sampling for row/column subset selection. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 329–338. Cited by: Introduction.
  • F. Dobbin and A. Kalev (2016) Why diversity programs fail and what works better. Harvard Business Review 94 (7-8), pp. 52–60. Cited by: Introduction.
  • M. Drosou, H. Jagadish, E. Pitoura, and J. Stoyanovich (2017) Diversity in big data: a review. Big data 5 (2), pp. 73–84. Cited by: Introduction.
  • C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012) Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pp. 214–226. Cited by: Introduction, Introduction.
  • D. G. Embrick (2011) The diversity ideology in the business world: a new oppression for a new age. Critical sociology 37 (5), pp. 541–556. Cited by: Introduction.
  • T. B. Fitzpatrick (1988) The Validity and Practicality of Sun-Reactive Skin Types I Through VI. JAMA Dermatology 124 (6), pp. 869–871. ISSN 2168-6068. Cited by: footnote 3.
  • B. Gong, W. Chao, K. Grauman, and F. Sha (2014) Diverse sequential subset selection for supervised video summarization. In Advances in Neural Information Processing Systems, pp. 2069–2077. Cited by: Introduction.
  • L. Jost et al. (2009) Mismeasuring biological diversity: response to Hoffmann and Hoffmann (2008). Ecological Economics 68 (4), pp. 925–928. Cited by: Introduction.
  • A. Kalev, F. Dobbin, and E. Kelly (2006) Best practices or best guesses? assessing the efficacy of corporate affirmative action and diversity policies. American sociological review 71 (4), pp. 589–617. Cited by: Introduction.
  • A. Kulesza, B. Taskar, et al. (2012) Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning 5 (2–3), pp. 123–286. Cited by: Introduction.
  • P. Legendre, D. Borcard, and P. R. Peres-Neto (2008) Analyzing or explaining beta diversity? comment. Ecology 89 (11), pp. 3238–3244. Cited by: Introduction.
  • H. Lin and J. A. Bilmes (2012) Learning mixtures of submodular shells with application to document summarization. arXiv preprint arXiv:1210.4871. Cited by: Introduction.
  • R. H. MacArthur (1965) Patterns of species diversity. Biological reviews 40 (4), pp. 510–533. Cited by: Introduction.
  • J. S. Mill (2016) Utilitarianism. In Seven masterpieces of philosophy, pp. 337–383. Cited by: Comparing Subset Inclusivity: Approaches from Social Choice Theory.
  • M. E. Mor-Barak and D. A. Cherin (1998) A tool to expand organizational understanding of workforce diversity: exploring a measure of inclusion-exclusion. Administration in Social Work 22 (1), pp. 47–64. Cited by: Introduction.
  • A. Paradiso (2017) Diversity is being asked to the party. Inclusion is being asked to dance. #shrmdiv. The Society for Human Resource Management (SHRM) Blog. Cited by: Introduction.
  • L. H. Pelled, G. E. Ledford, and S. A. Mohrman (1999) Demographic dissimilarity and workplace inclusion. Journal of Management studies 36 (7), pp. 1013–1031. Cited by: Introduction.
  • J. Rawls (1974) Some reasons for the maximin criterion. The American Economic Review 64 (2), pp. 141–146. Cited by: Diversity Score, Comparing Subset Inclusivity: Approaches from Social Choice Theory.
  • Q. M. Roberson (2006) Disentangling the meanings of diversity and inclusion in organizations. Group & Organization Management 31 (2), pp. 212–236. Cited by: Introduction.
  • S. Samadi, U. Tantipongpipat, J. H. Morgenstern, M. Singh, and S. Vempala (2018) The price of fair PCA: one extra dimension. In Advances in Neural Information Processing Systems, pp. 10976–10987. Cited by: Introduction.
  • L. M. Shore, A. E. Randel, B. G. Chung, M. A. Dean, K. Holcombe Ehrhart, and G. Singh (2011) Inclusion and diversity in work groups: a review and model for future research. Journal of management 37 (4), pp. 1262–1289. Cited by: Introduction, Introduction.
  • A. Singh and T. Joachims (2017) Equality of opportunity in rankings. In Workshop on Prioritizing Online Content (WPOC) at NIPS, Cited by: Introduction.
  • T. Snell (2017) Tokenism: the result of diversity without inclusion. Medium. Cited by: Representativeness of an item.
  • H. Tuomisto (2011) Commentary: do we have a consistent terminology for species diversity? yes, if we choose to use it. Oecologia 167 (4), pp. 903–911. Cited by: Introduction.
  • R. H. Whittaker (1960) Vegetation of the Siskiyou Mountains, Oregon and California. Ecological monographs 30 (3), pp. 279–338. Cited by: Introduction.
  • K. Yang and J. Stoyanovich (2017) Measuring fairness in ranked outputs. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management, pp. 22. Cited by: Introduction.
  • T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang (2010) Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences 107 (10), pp. 4511–4515. Cited by: Introduction.