An Ethical Highlighter for People-Centric Dataset Creation

11/27/2020 ∙ by Margot Hanley, et al. ∙ Cornell University

Important ethical concerns arising from computer vision datasets of people have been receiving significant attention, and a number of datasets have been withdrawn as a result. To meet the academic need for people-centric datasets, we propose an analytical framework to guide ethical evaluation of existing datasets and to serve future dataset creators in avoiding missteps. Our work is informed by a review and analysis of prior works and highlights where such ethical challenges arise.







1 Introduction

In recent years, face datasets from the computer vision community have seen significant criticism, and many have subsequently been withdrawn by their creators. Why? In prioritizing properties like size and downstream task utility, other principles and factors have often taken a backseat, such as data curation practices, privacy violations, offensive labels, fair representation, and undesirable uses. Although datasets are often created with the positive aim of furthering scientific research, the ethical challenges prompting these takedowns reveal problematic (and often unintended) dataset properties and consequences. These issues are not limited to face datasets—e.g., the ImageNet dataset Deng et al. (2009) has been charged with offensive labels (inherited from WordNet Crawford and Paglen (2019); Yang et al. (2020)) and has faced other issues regarding its “Person” category.

Although the dataset takedowns have exposed complex ethical challenges, the need for publicly available, large-scale datasets remains critical to open academic research. These datasets may serve common and positive needs, such as synthesizing people for privacy-enhancing applications, image-to-text descriptions in support of accessibility, and computer vision–driven methods for revealing demographic biases in media Google (2017); Hong et al. (2020). Hence, the question remains—how can the community promote pro-social academic research through the creation and evaluation of critical data resources that can stand the test of time, given the serious ethical challenges faced by prior datasets?

Our work addresses this question by proposing an analytic framework for evaluating image datasets of faces, people, and other scenes featuring people—hereafter, people-centric datasets—and by providing guidance to dataset creators seeking to address ethical standards alongside important technical considerations. The framework, which we refer to as an “ethical highlighter”, comprises four components, each encapsulating a distinctive aspect of dataset construction where ethical challenges can arise: creation, composition, distribution, and purpose.

Our work shares common ground with prior efforts such as Datasheets for Datasets Gebru et al. (2018), a significant milestone in drawing attention to transparency and accountability in dataset construction. Building on these efforts, our framework crystallizes distinct aspects of dataset development and draws explicit threads between these aspects and ethical issues, thereby offering computer vision researchers the means for not only committing to ethical standards in principle, but also meeting them in practice.

2 Framework

Methodology. After reviewing many people-centric datasets, we selected 14 as an initial sample set—these datasets have either been widely used or have drawn prominent public criticism on ethical grounds. While all had been publicly available at some point, 11 have since been taken down and only three remain available. Through an inductive analysis of academic manuscripts and popular press Harvey (2019), we produced a typology of critiques. While this was not an exhaustive review, our team collected this work to a point of conceptual saturation Glaser et al. (1968).

From our inductive analysis, we extracted two key dimensions around which the implicit and explicit themes we encountered could be organized. We identified components of dataset development as Creation, Composition, Distribution, and Purpose. And, we identified a non-exhaustive list of types of ethical issues which emerged through our inductive analysis: fairness, privacy, subject autonomy, safety/security, property rights, representational harm, offense, transparency/explainability, and accountability/responsibility. In some instances, we framed these concerns according to traditional ethical concepts, which are commonly cited in broader discussions of algorithmic and AI systems. We sought to understand how the ethical issues were dispersed across and manifested in our framework’s components. Due to space limitations in this short article, we briefly describe the components and illustrate each with one ethical issue. However, a fuller version in the future will provide a more comprehensive analysis, as well as critical reflections that can be used to guide dataset creators.
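The typology above is a grid: four components of dataset development crossed with a non-exhaustive list of ethical issues. As a minimal sketch of how a dataset creator might use the highlighter as a review checklist, the following illustrative snippet (the names and the helper function are our own hypothetical constructs, not an artifact of the framework) enumerates the component–issue cells that have not yet been examined:

```python
# Hypothetical encoding of the "ethical highlighter" typology:
# the four components of dataset development crossed with the
# (non-exhaustive) ethical issues named in the text.
COMPONENTS = ["creation", "composition", "distribution", "purpose"]
ISSUES = [
    "fairness", "privacy", "subject autonomy", "safety/security",
    "property rights", "representational harm", "offense",
    "transparency/explainability", "accountability/responsibility",
]

def unreviewed_cells(reviewed):
    """Return the (component, issue) pairs not yet examined.

    `reviewed` is a set of (component, issue) pairs that a dataset
    creator has already considered during their review.
    """
    return [(c, i) for c in COMPONENTS for i in ISSUES
            if (c, i) not in reviewed]

# Example: after reviewing privacy within the creation component,
# 35 of the 4 x 9 = 36 cells remain to be examined.
remaining = unreviewed_cells({("creation", "privacy")})
```

The point of the sketch is only that the framework is systematic: a review is complete when every component has been examined against every issue, not when a few salient cells have been checked.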

2.1 Creation

Creation encompasses the activities involved in producing a dataset, including sourcing, assembling, and cleaning data, as well as assigning labels. Typically, image datasets comprise images and their associated labels. These labels could be names of the people pictured, attributes (e.g., age), or potentially any other descriptions. More generally, we can think of labels as data in a textual modality associated with data in a visual modality (for now we exclude annotations like segmentation or depth maps). Our inductive analysis of this creation component revealed ethical concerns over violations of privacy and property rights, offensive data content, and subject autonomy.

Privacy. An early decision that dataset creators make is where to source the data. Broadly, this choice can be characterized as capturing images of subjects directly or sourcing images secondhand (e.g. scraping them from social media sites or search engines).

In order to amass large numbers of images of people, some dataset creators have previously captured photos and videos without subjects’ knowledge or consent, two important aspects of privacy. For example, the creators of Duke MTMC Ristani et al. (2016) and Brainwash Stewart et al. (2016) were criticized for assembling datasets without notice or consent. Duke MTMC, created in 2014, comprised live footage of students on campus. In the same year, the Brainwash dataset used a webcam to capture images of customers in the San Francisco Brainwash cafe. In response to these criticisms, both datasets were ultimately taken down by their authors, respectively in May and June of 2019. The authors of Duke MTMC acknowledged that they had deviated from IRB-approved protocols by filming outdoors and releasing data without protections Tomasi (2019).

Even if dataset creators aim to capture faces in natural environments, they must grapple with the question of consent and determine methods to mitigate other privacy concerns. The creators of UnConstrained College Students (another dataset created on a college campus) suggested that some of its value derived from featuring subjects who were “photographed [at long-range] without [students’] knowledge”, rather than being “posed” [1]. Images collected “in the wild” are considered to be particularly useful across a variety of naturalistic application domains. From the perspective of these creators, a lack of subject awareness (leading to a lack of consent) has been a feature, rather than a bug. It is also important to keep in mind that even when subjects give their consent to be included, they do not necessarily consent to all possible uses of a dataset—creators should be careful to specify the scope of the dataset’s purposes to subjects.

Another common method for gathering raw image data is to find and download images of people from the internet. For this method, it seems to be common practice to assume that “public persons” cede the right to privacy. For example, the creators of MS-Celeb-1M Guo et al. (2016) started by assembling a list of 1M ostensible celebrities, selecting a subset of 100K identities and scraping their images from search engines. However, the dataset creators’ definition of “celebrity” was very broad—the original list was not limited to individuals who consider themselves as public figures, but was rather one million notable people on the internet, including a number of private persons. Consequently, when the dataset was released, it emerged that many faces were not those of celebrities, but instead of non-public persons (including vocal privacy advocates and journalists) who had not consented to having their faces included in the dataset.

2.2 Composition

By composition, we refer to properties of the dataset, spanning content (e.g., data units or elements comprising the dataset) such as visual images and text-based labels, mappings among elements expressed in different modalities (e.g., labels to images), and higher-order, macro attributes of the dataset such as demographic representativeness. Our analysis revealed that the composition of a dataset may be a source of ethical harms through, e.g., bias and unfairness, representational harm, and offensiveness.


Offensive associations can be latent in popular machine learning datasets. A notable example where such associations were made visible was a web-based demonstration called “ImageNet Roulette”, created by researcher-artists Kate Crawford and Trevor Paglen by training on the full ImageNet dataset Deng et al. (2009). By allowing users to upload images of themselves and publishing the resulting classifications, the project exposed to a general audience shocking labels attributed to ImageNet imagery (which, unlike the subset used in the well-known ILSVRC challenge, contains many people categories). For instance, labels included “rape suspect”, “pipe smoker”, “alcoholic”, and “bitch”. People of color could potentially be labeled with racial slurs. Just one week following the launch of this project, the ImageNet team took down the “Person” category for maintenance.

Sources of offensive associations have been traced to labels used to generate ImageNet—namely, a database of words and semantic relations called WordNet Fellbaum (2012). In their analysis of the ImageNet subtree, Yang et al. (including ImageNet team members) found that of the 2,832 people categories within the subtree, 1,593 were potentially offensive categories Yang et al. (2020). Databases like WordNet are often used to build datasets and are adopted across the industry. Other datasets built on WordNet, such as 80 Million Tiny Images Torralba et al. (2008), have similarly inherited offensive associations from WordNet. In this case, however, the images were too small for humans to perceive and audit manually Prabhu and Birhane (2020). Indeed, datasets like ImageNet have included images that are offensive or portray certain sub-populations in pejorative ways. In order to prevent such offensive images and associations, there is a need for more thorough auditing of both labels and imagery Kyriakou et al. (2019); Barlas et al. (2019).

2.3 Distribution

Distribution is concerned with how creators make a dataset available, as well as that dataset’s terms of use and disclaimers. Our analysis revealed that the distribution of a dataset presents a source of ethical harms when it impedes accountability and violates subject autonomy.

Accountability (responsibility). Even if creators have the best intentions for their dataset, they must prepare for the possibility that users will not use it for its intended purpose. As part of the provisions of access, many datasets request that users use data only for non-commercial research purposes, but creators are unable to enforce this restriction once third parties obtain the dataset. For example, although 69% of images in MegaFace had Creative Commons licenses prohibiting commercial use, it is evident from citations of that paper that the dataset was obtained by companies, where there is no way to readily enforce research-only usage. Furthermore, although the MS-Celeb-1M dataset has been taken down by its authors, the data has become “runaway” data Harvey (2019)—the dataset itself remains available on Academic Torrents, where it continues to be downloaded. And, other researchers have created derivatives of this dataset, which also remain openly accessible online.

This question of enforcement is important; creators need to consider the potential for misuse of their dataset and what they are in a position to enforce. As illustrated above, it is clear that disclaimers are important to have, but are not enforceable alone. Therefore, in the case of potentially sensitive data, dataset creators should consider reviewing requests on a case-by-case basis, putting forth a good faith effort to reject requests whose intent does not match the purpose of the dataset.

Many datasets do not clearly communicate their limitations nor how they should be used. We have seen some efforts to standardize documentation and increase dataset transparency, notably Datasheets for Datasets Gebru et al. (2018); Holland et al. (2018), as well as an emerging culture of reflective practice to this end within academia and industry. In the past year, we have seen the addition of disclaimers to two existing datasets: the Labeled Faces in the Wild website was updated with a disclaimer about its potential lack of representation, and the VGGFace2 website similarly cautioned that their “distribution of identities… may not be representative of the global human population” and that users should “be careful of unintended societal, gender, racial and other biases”. While these examples indicate a shift towards greater transparency and acknowledgement of limitations in datasets, they also show that more work remains. Is it enough for dataset owners to qualify the use of datasets, or is it necessary to actively restrict access to and enforce the responsible use of datasets? What role do reflective and interrogatory processes (such as the framework we provide) have in leading to more responsible dataset practices?

2.4 Purpose

Purpose answers the question, “Why?” Philosophers sometimes refer to this component as “teleology,” which involves explaining a phenomenon not in terms of what caused it but what motivated it; teleology covers a range of questions, e.g., what is the dataset for; what are its intended uses; for what purposes is it optimized?

Purpose is an incredibly rich source of ethical concern. A direct challenge to the moral legitimacy of a dataset’s own purpose is one such concern. For example, a detractor may assert that creating a face dataset for the purpose of distinguishing gay individuals does not meet a moral threshold Wang and Kosinski (2018). Another way that purpose may stir up ethical concerns is in its relation to other characteristics of a dataset. One cannot overemphasize the potential for ethical discord that may follow. A typical instance is a face dataset that is not specifically optimized for facial recognition but is used for that purpose. Though it may serve well in some capacities (e.g., face synthesis), it may result in bias, or representational harm to certain minority groups, when used to train a recognition algorithm Buolamwini and Gebru (2018); Benjamin (2019); Scheuerman et al. (2019). Similarly, creators of face datasets considering policies for distributing or providing access to them would want to understand how these policies would apply in relation to certain purposes (e.g., as tools for surveilling subpopulations [2]).

Setting aside cases where a purpose itself is deemed morally reprehensible, purpose (or teleology) stands apart from the other components in that it is frequently relational in nature. By this, we mean that ethical issues emerge from mappings—between purpose and properties of creation, distribution, or composition. By implication, fastidious users of our framework will not consider their work complete until they perform a systematic pass of the other three components with a clear sense of a dataset’s teleology.

3 Conclusion

Our analytic framework extends beyond prior work, bringing into focus different components of dataset creation in which ethical issues may arise. It crystallizes these components, draws explicit threads between them and traditional ethical issues, and demonstrates how those issues manifest. We see our work as part of a broader agenda that strives to reflect critically on the role of academic research in the pursuit and oversight of ethical AI.

The work reported here is the beginning of a longer term effort. In the future, a pressing need is to extract a heuristic from the framework, providing concrete, practical guidance for identifying and mitigating ethical hazards. Furthermore, we will continue refining our framework through collaborations with computer vision researchers developing image datasets. Finally, we will investigate the extensibility of the framework to different modalities and other types of data.

We believe there must be a larger cultural shift within communities of academics and practitioners to acknowledge and address issues in datasets. At the same time, creators should be held accountable for dataset maintenance as limitations are revealed by users and analysts over time. What ultimately is at stake is not only promoting societal values but also maintaining society’s confidence in the work product of the research community in computer vision.


  • [1] (2018) 2nd unconstrained face detection and open set recognition challenge. External Links: Link Cited by: §2.1.
  • [2] (2017) Against black inclusion in facial recognition. External Links: Link Cited by: §2.4.
  • [3] P. Barlas, K. Kyriakou, S. Kleanthous, and J. Otterbacher (2019) Social b(eye)as: human and machine descriptions of people images. In Proc. Int. AAAI Conf. on Web and Social Media, Cited by: §2.2.
  • [4] R. Benjamin (2019) Race after technology: abolitionist tools for the new jim code. Social Forces. Cited by: §2.4.
  • [5] J. Buolamwini and T. Gebru (2018) Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pp. 77–91. Cited by: §2.4.
  • [6] K. Crawford and T. Paglen (2019) Excavating ai: the politics of images in machine learning training sets. External Links: Link Cited by: §1.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In Proc. Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §1, §2.2.
  • [8] C. Fellbaum (2012) WordNet. The encyclopedia of applied linguistics. Cited by: §2.2.
  • [9] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. D. III, and K. Crawford (2018) Datasheets for datasets. External Links: 1803.09010 Cited by: §1, §2.3.
  • [10] B. G. Glaser, A. L. Strauss, and E. Strutzel (1968) The discovery of grounded theory; strategies for qualitative research. Nursing research 17 (4), pp. 364. Cited by: §2.
  • [11] Google (2017) The women missing from the silver screen and the technology used to find them. External Links: Link Cited by: §1.
  • [12] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao (2016) MS-Celeb-1M: a dataset and benchmark for large scale face recognition. In Proc. European Conf. on Computer Vision (ECCV), Cited by: §2.1.
  • [13] A. Harvey and J. LaPlace (2019) MegaPixels: origins, ethics, and privacy implications of publicly available face recognition image datasets. External Links: Link Cited by: §2.3, §2.
  • [14] S. Holland, A. Hosny, S. Newman, J. Joseph, and K. Chmielinski (2018) The dataset nutrition label: a framework to drive higher data quality standards. arXiv preprint arXiv:1805.03677. Cited by: §2.3.
  • [15] J. Hong, W. Crichton, H. Zhang, D. Y. Fu, J. Ritchie, J. Barenholtz, B. Hannel, X. Yao, M. Murray, G. Moriba, M. Agrawala, and K. Fatahalian (2020) Analyzing who and what appears in a decade of us cable tv news. External Links: 2008.06007 Cited by: §1.
  • [16] K. Kyriakou, P. Barlas, S. Kleanthous, and J. Otterbacher (2019) Fairness in proprietary image tagging algorithms: a cross-platform audit on people images. In Proc. Int. AAAI Conf. on Web and Social Media, Cited by: §2.2.
  • [17] V. U. Prabhu and A. Birhane (2020) Large image datasets: a pyrrhic win for computer vision?. arXiv preprint arXiv:2006.16923. Cited by: §2.2.
  • [18] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In Proc. European Conf. on Computer Vision (ECCV), pp. 17–35. Cited by: §2.1.
  • [19] M. K. Scheuerman, J. M. Paul, and J. R. Brubaker (2019) How computers see gender: an evaluation of gender classification in commercial facial analysis services. Proceedings of the ACM on Human-Computer Interaction 3 (CSCW), pp. 1–33. Cited by: §2.4.
  • [20] R. Stewart, M. Andriluka, and A. Y. Ng (2016) End-to-end people detection in crowded scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), pp. 2325–2333. Cited by: §2.1.
  • [21] C. Tomasi (2019) Letter: video analysis research at duke. External Links: Link Cited by: §2.1.
  • [22] A. Torralba, R. Fergus, and W. T. Freeman (2008) 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine intelligence 30 (11), pp. 1958–1970. Cited by: §2.2.
  • [23] Y. Wang and M. Kosinski (2018) Deep neural networks are more accurate than humans at detecting sexual orientation from facial images. Journal of personality and social psychology 114 (2), pp. 246. Cited by: §2.4.
  • [24] K. Yang, K. Qinami, L. Fei-Fei, J. Deng, and O. Russakovsky (2020) Towards fairer datasets: filtering and balancing the distribution of the people subtree in the imagenet hierarchy. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 547–558. Cited by: §1, §2.2.