Machine Learning Course (INFO 697-03) - Fall 2020
In this paper we investigate problematic practices and consequences of large scale vision datasets. We examine broad issues such as the question of consent and justice as well as specific concerns such as the inclusion of verifiably pornographic images in datasets. Taking the ImageNet-ILSVRC-2012 dataset as an example, we perform a cross-sectional model-based quantitative census covering factors such as age, gender, NSFW content scoring, class-wise accuracy, human-cardinality-analysis, and the semanticity of the image class information in order to statistically investigate the extent and subtleties of ethical transgressions. We then use the census to help hand-curate a look-up-table of images in the ImageNet-ILSVRC-2012 dataset that fall into the categories of verifiably pornographic, shot in a non-consensual setting (up-skirt), beach voyeuristic, and exposed private parts. We survey the landscape of harm and threats that both society broadly and individuals face due to uncritical and ill-considered dataset curation practices. We then propose possible courses of correction and critique the pros and cons of these. We have duly open-sourced all of the code and the census meta-datasets generated in this endeavor for the computer vision community to build on. By unveiling the severity of the threats, our hope is to motivate the constitution of mandatory Institutional Review Boards (IRB) for large scale dataset curation processes.
Born from World War II and the haunting and despicable practices of Nazi-era experimentation, the 1947 Nuremberg Code and the subsequent 1964 Helsinki Declaration helped establish the doctrine of informed consent, which builds on the fundamental notions of human dignity and agency to control the dissemination of information about oneself. This has shepherded data collection endeavors in the medical and psychological sciences concerning human subjects, including photographic data [71, 8], for the past several decades. A less stringent version of informed consent, broad consent, proposed in 45 CFR 46.116(d) of the Revised Common Rule, has recently been introduced; it still affords the basic safeguards for protecting one’s identity in large scale databases. However, in the age of Big Data, the fundamentals of informed consent, privacy, and agency of the individual have gradually been eroded. Institutions, academia, and industry alike amass millions of images of people without consent and often for unstated purposes under the guise of anonymization. These claims are misleading given that there is weak anonymity and privacy in aggregate data in general and, more crucially, images of faces are not the type of data that can be aggregated. As can be seen in Table 1, several tens of millions of images of people are found in peer-reviewed literature. These images are obtained without the consent or awareness of the individuals, and without IRB approval for collection. In Section 5-B of , for instance, the authors nonchalantly state “As many images on the web contain pictures of people, a large fraction (23%) of the 79 million images in our dataset have people in them”. With this background, we now focus on one of the most celebrated and canonical large scale image datasets: the ImageNet dataset.
From the questionable ways images were sourced, to troublesome labeling of people in images, to the downstream effects of training AI models using such images, ImageNet and large scale vision datasets (LSVD) in general constitute a Pyrrhic win for computer vision. We argue that this win has come at the expense of harm to minoritized groups and has further aided the gradual erosion of privacy, consent, and agency of both the individual and the collective.
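|Dataset||Number of images (in millions)||Number of categories (in thousands)||Number of consensual images|
|Open Images ()||9||20||0|
|ImageNet-(21k, 11k, 1k) ()||(14, 12, 1)||(22, 11, 1)||0|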
The emergence of the ImageNet dataset is widely considered a pivotal moment [Footnote 1: “The data that transformed AI research—and possibly the world”: https://bit.ly/2VRxx3L] in the Deep Learning revolution that transformed Computer Vision (CV), and Artificial Intelligence (AI) in general. Prior to ImageNet, computer vision and image processing researchers trained image classification models on small datasets such as CalTech101 (9k images), PASCAL-VOC (30k images), LabelMe (37k images), and the SUN (131k images) dataset (see slide-37 in). ImageNet, with over 14 million images spread across 21,841 synsets, replete with 1,034,908 bounding box annotations, brought in an aspect of scale that was previously missing. A subset of 1.2 million images across 1000 classes was carved out from this dataset to form the ImageNet-1k dataset (popularly called ILSVRC-2012) which formed the basis for the Task-1: classification challenge in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). This soon became widely touted as the Computer Vision Olympics [Footnote 2: https://engineering.missouri.edu/2014/01/team-takes-top-rankings-in-computer-vision-olympics/]. The vastness of this dataset allowed a Convolutional Neural Network (CNN) with 60 million parameters trained by the SuperVision team from University of Toronto to usher in the rebirth of the CNN-era (see ), which is now widely dubbed the AlexNet moment in AI.
Although ImageNet was created over a decade ago, it remains one of the most influential and powerful image databases available today. Its power and magnitude are matched by its unprecedented societal impact. Although an a posteriori audit might seem redundant a decade after its creation, ImageNet’s continued significance and the culture it has fostered for other large scale datasets warrant an ongoing critical dialogue.
The rest of this paper is structured as follows. In section 2, we cover related work that has explored the ethical dimensions that arise with LSVD. In section 3, we describe the landscape of both the immediate and long term threats individuals and society as a whole encounter due to ill-considered LSVD curation. In Section 4, we propose a set of solutions which might assuage some of the concerns raised in section 3. In Section 5, we present a template quantitative auditing procedure using the ILSVRC2012 dataset as an example and describe the data assets we have curated for the computer vision community to build on. We conclude with broad reflections on LSVDs, society, ethics, and justice.
The very declaration of a taxonomy brings some things into existence while rendering others invisible . A gender classification system that conforms to essentialist binaries, for example, operationalizes gender in a cis-centric way resulting in exclusion of non-binary and transgender people . Categories simplify and freeze nuanced and complex narratives, obscuring political and moral reasoning behind a category. Over time, messy and contingent histories hidden behind a category are forgotten and trivialized . With the adoption of taxonomy sources, image datasets inherit seemingly invisible yet profoundly consequential shortcomings. The dataset creation process, its implication for ML systems, and subsequently, the societal impact of these systems has attracted a substantial body of critique. We categorize such body of work into two groups that complement one another. While the first group can be seen as concerned with the broad downstream effects, the other concentrates mainly on the dataset creation process itself.
The absence of critical engagement with canonical datasets disproportionately negatively impacts women, racial and ethnic minorities, and vulnerable individuals and communities at the margins of society. For example, image search results both exaggerate stereotypes and systematically under-represent women in search results for occupations; object detection systems designed to detect pedestrians display higher error rates for recognition of demographic groups with dark skin tones; and gender classification systems show disparities in image classification accuracy where lighter-skin males are classified with the highest accuracy while darker-skin females suffer the most misclassification. Gender classification systems that lean on binary and cis-genderist constructs operationalize gender in a trans-exclusive way resulting in tangible harm to trans people [61, 93]. With a persistent trend where minoritized and vulnerable individuals and communities often disproportionately suffer the negative outcomes of ML systems, D’Ignazio and Klein have called for a shift in rethinking ethics not just as a fairness metric to mitigate the narrow concept of bias but as a practice that results in justice for the most negatively impacted. Similarly, Kasy and Abebe contend that perspectives that acknowledge existing inequality and aim to redistribute power are pertinent as opposed to fairness-based perspectives. Such an understanding of ethics as justice then requires a focus beyond ‘bias’ and ‘fairness’ in LSVDs and requires questioning how images are sourced and labelled, and what it means for models to be trained on them. One of the most thorough investigations in this regard can be found in . In this recent work, Crawford and Paglen present an in-depth critical examination of ImageNet including the dark and troubling results of classifying people as if they are objects. Offensive and derogatory labels that perpetuate historical and current prejudices are assigned to people’s actual images. The authors emphasise that not only are images that were scraped across the web appropriated as data for computer vision tasks, but also the very act of assigning labels to people based on physical features raises fundamental concerns around reviving long-discredited pseudo-scientific ideologies of physiognomy.
Within the dataset creation process, taxonomy sources pass on their limitations and problematic underlying assumptions. The adoption of underlying structures presents a challenge where, without critical examination of the architecture, ethically dubious taxonomies are inherited. This has been one of the main challenges for ImageNet given that the dataset is built on the backbone of WordNet’s structure. Acknowledging some of the problems, the authors from the ImageNet team did recently attempt to address  the stagnant concept vocabulary of WordNet. They admitted that only 158 out of the 2,832 existing synsets should remain in the person sub-tree [Footnote 3: In order to prune all the nodes. They also took into account the imageability of the synsets and the skewed representation in the images pertaining to the image retrieval phase.]. Nonetheless, some serious problems remain untouched. This motivates us to address in greater depth the overbearing presence of the WordNet effect on image datasets.
ImageNet is not the only large scale vision dataset that has inherited the shortcomings of the WordNet taxonomy. The 80 million Tiny Images dataset, which grandfathered the CIFAR-10/100 datasets, and the Tencent ML-images dataset also used the same path. Unlike ImageNet, these datasets have never been audited [Footnote 4: In response to the mainstream media covering a pre-print of this work, we were informed that the curators of the dataset have withdrawn the dataset with a note accessible here: https://groups.csail.mit.edu/vision/TinyImages/] or scrutinized, and some of the sordid results from inclusion of ethnophaulisms in the Tiny-Images dataset’s label space are displayed in Figure 3. The figure demonstrates both the number of images in a subset of the offensive classes (sub-figure (a)) and the exemplar images (sub-figure (b)) that show the images in the noun-class labelled n****r [Footnote 5: Due to its offensiveness, we have censored this word (and other words throughout the paper); however, it remains uncensored on the website at the time of writing.], a fact that serves as a stark reminder that a great deal of work remains to be done by the ML community at large.
Similarly, we found at least 315 classes [Footnote 6: See https://bit.ly/30DybmF] of the potentially 1593 classes deemed to be non-imageable by the ImageNet curators in  still retained in the Tencent-ML-Images dataset, which includes image classes such as [transvestite, bad person, fornicatress, orphan, mamma’s boy, and enchantress].
Finally, the labeling and validation of the curation process also present ethical challenges. Recent work such as  has explored the intentionally hidden labour, which they have termed as Ghost Work, behind such tasks. Image labeling and validation requires the use of crowd-sourced platforms such as MTurk, often contributing to the exploitation of underpaid and undervalued gig workers. Within the topic of image labeling but with a different dimension and focus, recent work such as  and  has focused on the shortcomings of human-annotation procedures used during the ImageNet dataset curation. These shortcomings, the authors point out, include single label per-image procedure that causes problems given that real-world images often contain multiple objects, and inaccuracies due to “overly restrictive label proposals”.
In this section, we survey the landscape of harm and threats, both immediate and long term, that emerge with dataset curation practices in the absence of careful ethical considerations and anticipation for negative societal consequences. Our goal here is to bring awareness to the ML and AI community regarding the severity of the threats and to motivate a sense of urgency to act on them. We hope this will result in practices such as the mandatory constitution of Institutional Review Boards (IRB) for large scale dataset curation processes.
Large image datasets, when built without careful consideration of societal implications, pose a threat to the welfare and well-being of individuals. Most often, vulnerable people and marginalised populations pay a disproportionately high price. Reverse image search engines that allow face search such as  have gotten remarkably and worryingly efficient in the past year. For a small fee, anyone can use their portal or their API [Footnote 7: Please refer to the supplementary material in Appendix A for the screenshots.] to run an automated process to uncover the “real-world” identities of the humans of the ImageNet dataset. In societies where sex work is socially condemned or legally criminalized, for example, re-identification of a sex worker through image search bears a real danger for the individual victim. Harmful discourse such as revenge porn is part of a broader continuum of image-based sexual abuse. To further emphasize this specific point, many of the images in classes such as maillot, brassiere, and bikini contain images of beach voyeurism and other non-consensual cases of digital image gathering (covered in detail in Section 5). We were able to (unfortunately) easily map the victims, most of whom are women, in the pictures to “real-world” identities of people belonging to a myriad of backgrounds including teachers, medical professionals, and academic professors using reverse image search engines such as . Paying heed to the possibility of the Streisand effect [Footnote 8: The Streisand effect “is a social phenomenon that occurs when an attempt to hide, remove, or censor information has the unintended consequence of further publicizing that information, often via the Internet”.], we took the decision not to divulge any further quantitative or qualitative details on the extent or the location of such images in the dataset besides alerting the curators of the dataset(s) and making a passionate plea to the community not to underestimate the severity of this particular threat vector.
The attempt to build computer vision has been gradual and can be traced as far back as 1966 to Papert’s The Summer Vision Project, if not earlier. However, ImageNet, with its vast amounts of data, has not only erected a canonical landmark in the history of AI, it has also paved the way for even bigger, more powerful, and suspiciously opaque datasets. The lack of scrutiny of the ImageNet dataset by the wider computer vision community has only served to embolden institutions, both academic and commercial, to build far bigger datasets without scrutiny (see Table 1). Various highly cited and celebrated papers in recent years [54, 16, 11, 100], for example, have used the unspoken unicorn amongst large scale vision datasets, that is, the JFT-300M dataset [?] [Footnote 9: We have decided to purposefully leave the ’?’ in place and plan to revisit it only after the dataset’s creator(s) publish the details of its curation.]. This dataset is inscrutable and operates in the dark, to the extent that there has not even been official communication as to what JFT-300M stands for. All that the ML community knows is that it purportedly boasts more than 300M images spread across 18k categories. The open source variant(s) of this, the Open Images V4-5-6, contain a subset of 30.1M images covering 20k categories (and also have an extension dataset with 478k crowd-sourced images across more than 6000 categories). While parsing through some of the images, we found verifiably [Footnote 10: See https://bit.ly/2y1sC7i. We performed verification with the uploader of the image via the Flickr link shared.] non-consensual images of children that were siphoned off of Flickr, hinting towards the prevalence of similar issues for JFT-300M from which this was sourced.
Besides the other large datasets in Table 1, we have cases such as the CelebA-HQ dataset, which is actually a heavily processed dataset whose grey-box curation process only appears in Appendix-C of  where no clarification is provided on this "frequency based visual quality metric" used to sort the images based on quality. Benchmarking any downstream algorithm of such an opaque, biased and (semi-)synthetic dataset will only result in controversial scenarios such as , where the authors had to hurriedly incorporate addendums admitting biased results. Hence, it is important to reemphasize that the existence and use of such datasets bear direct and indirect impact on people, given that decision making on social outcomes increasingly leans on ubiquitously integrated AI systems trained and validated on such datasets. Yet, despite such profound consequences, critical questions such as where the data comes from or whether the images were obtained consensually are hardly considered part of the LSVD curation process.
The more nuanced and perhaps indirect impact of ImageNet is the culture that it has cultivated within the broader AI community; a culture where the appropriation of images of real people as raw material free for the taking has come to be perceived as the norm. Such a norm and the lack of scrutiny have played a role in the creation of monstrous and secretive datasets without much resistance, prompting further questions such as ‘what other secretive datasets currently exist hidden and guarded under the guise of proprietary assets?’.
Current work that has sprung out of secretive datasets, such as Clearview AI [Footnote 11: Clearview AI is a US based privately owned technology company that provides facial recognition services to various customers including North American law enforcement agencies. With more than 3 billion photos scraped from the web, the company operated in the dark until its services to law enforcement were reported in late 2019.], points to a deeply worrying and insidious threat not only to vulnerable groups but also to the very meaning of privacy as we know it.
In May 2007 the iconic case of Chang versus Virgin mobile: The school girl, the billboard, and virgin  unraveled in front of the world, leading to widespread debate on the uneasy relationship between personal privacy, consent, and image copyright, initiating a substantial corpus of academic debate (see [20, 21, 52, 15]). A Creative Commons license addresses only copyright issues – not privacy rights or consent to use images for training. Yet, many of the efforts beyond ImageNet, including the Open Images dataset , have been built on top of the Creative commons loophole that large scale dataset curation agencies interpret as a free for all, consent-included green flag. This, we argue, is fundamentally fallacious as is evinced in the views presented in  by the Creative commons organization that reads: “CC licenses were designed to address a specific constraint, which they do very well: unlocking restrictive copyright. But copyright is not a good tool to protect individual privacy, to address research ethics in AI development, or to regulate the use of surveillance tools employed online.”. Datasets culpable of this CC-BY heist such as MegaFace and IBM’s Diversity in Faces have now been deleted in response to the investigations (see  for a survey) lending further support to the Creative Commons fallacy.
Akin to the ivory carving-illegal poaching, and diamond jewelry art-blood diamond nexuses, we posit that there is a similar moral conundrum at play here that affects all downstream applications entailing models trained using a tainted dataset. Often, these transgressions may be rather subtle. In this regard, we pick an exemplar field of application that on the surface appears to be a low risk application area: neural generative art. Neural generative art created using tools such as BigGAN and Art-breeder, which in turn use pre-trained deep-learning models trained on ethically dubious datasets, bears the downstream burden [Footnote 12: Please refer to the appendix (Section B.5) where we demonstrate one such real-world experiment entailing unethically generated neural art replete with responses obtained from human critiques as to what they felt about the imagery being displayed.] of the problematic residues from non-consensual image siphoning, thus running afoul of the Wittgensteinian edict of ethics and aesthetics being one and the same. We also note that there is a privacy-leakage facet to this downstream burden. In the context of face recognition, works such as  have demonstrated that CNNs with high predictive power unwittingly accommodate accurate extraction of subsets of the facial images that they were trained on, thus abetting dataset leakage [Footnote 13: We’d like to especially highlight the megapixel.cc project for the ground-breaking work on datasets to train such facial recognition systems.].
Finally, zooming out and taking a broad perspective allows us to see that the very practice of embarking on a classification, taxonomization, and labeling task endows the classifier with the power to decide what is a legitimate, normal, or correct way of being, acting, and behaving in the social world. For any given society, what comes to be perceived as normal or acceptable is often dictated by dominant ideologies. Systems of classification, which operate within a power asymmetrical social hierarchy, necessarily embed and amplify historical and cultural prejudices, injustices, and biases. In western societies, “desirable”, “positive”, and “normal” characteristics and ways of being are constructed and maintained in ways that align with the dominant narrative, giving advantage to those that fit the status quo. Groups and individuals on the margins, on the other hand, are often perceived as the “outlier” and the “deviant”. Image classification and labelling practices, without the necessary precautions and awareness of these problematic histories, pick up these stereotypes and prejudices and perpetuate them [74, 73, 35]. AI systems trained on such data amplify and normalize these stereotypes, inflicting unprecedented harm on those that are already on the margins of society. While the ImageNet team did initiate strong efforts towards course-correction, the Tiny Images dataset still contains harmful slurs and offensive labels. And worse, we remain in the dark regarding the secretive and opaque LSVDs.
Decades of work within the fields of Science and Technology Studies (STS) and the Social Sciences show that there is no single straightforward solution to most of the wider social and ethical challenges that we have discussed [99, 5, 28]. These challenges are deeply rooted in social and cultural structures and form part of the fundamental social fabric. Feeding AI systems on the world’s beauty, ugliness, and cruelty, but expecting it to reflect only the beauty is a fantasy . These challenges and tensions will exist as long as humanity continues to operate. Given the breadth of the challenges that we have faced, any attempt for a quick fix risks concealing the problem and providing a false sense of solution. The idea of a complete removal of biases, for example, might in reality be simply hiding them out of sight . Furthermore, many of the challenges (bias, discrimination, injustice) vary with context, history, and place, and are concepts that continually shift and change constituting a moving target . The pursuit of panacea in this context, therefore, is not only unattainable but also misguided. Having said that, there are remedies that can be applied to overcome the specific harms that we have discussed in this paper, which eventually potentially play constituent roles in improving the wider and bigger social and structural issues in the long run.
In , the authors concluded that within the person sub-tree of the ImageNet dataset, 1593 of the 2832 people categories were potentially offensive labels and planned to "remove all of these from ImageNet.". We strongly advocate a similar path for the offensive noun classes in the Tiny Images dataset that we have identified in section 2.1, as well as images that fall into the categories of verifiably [Footnote 14: We use the term verifiably to denote only those NSFW images that were hand-annotated by the volunteers indicating that they also contained the textual context that was of pornographic phraseology. We have an example grid of these images in the Appendix.] pornographic, shot in a non-consensual setting (up-skirt), beach voyeuristic, and exposed genitalia in the ImageNet-ILSVRC-2012 dataset. In cases where the image category is retained but the images are not, the option of replacement with consensually shot, financially compensated images arises. It is possible that some of the people in these images might come forward to consent and contribute their images in exchange for fair financial compensation, credit, or out of sheer altruism. We re-emphasize that our consternation focuses on the non-consensual aspect of the images and not on the category-class and the ensuing content of the images in it. This solution, however, brings forth further questions: does this make image datasets accessible only to those who can afford it? Will we end up with a pool of images with predominantly financially disadvantaged participants?
Science is self-correcting so long as it is accessible and open to critical engagement. We have tried to engage critically and map actionable ways forward given what we know of these LSVDs. The secretive and opaque LSVDs, however, tread a dangerous territory, given that they directly or indirectly impact society but remain hidden and inaccessible. Although the net benefit of the open science movement remains controversial, we strongly contend that making LSVDs open and accessible allows audits of these datasets, which is a first step towards a responsible scientific endeavour.
We found that some of the reverse image search engines do allow users to "remove particular image from our [sic] index" via their "Report abuse" portals [Footnote 15: See https://pimeyes.com/en/faq/remove-from-database]. This allows dataset auditors to list images found in their dataset(s) containing identifiable individuals and direct them towards a guided image removal process from the reverse image search engine(s), in order to mitigate some aspects of immediate harm.
This path entails harnessing techniques such as DP-Blur, with quantifiable privacy guarantees, to obfuscate the identity of the humans in the image. The Inclusive images challenge, for example, already incorporated blurring during dataset curation [Footnote 16: https://www.kaggle.com/c/inclusive-images-challenge] and addressed the downstream effects surrounding change in predictive power of the models trained on the blurred versions of the dataset curated. We believe that replication of this template, which also clearly included avenues for recourse in case of an erroneously non-blurred image being sighted by a researcher, will be a step in the right direction for the community at large.
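As a rough illustration of region-level obfuscation, the sketch below applies a plain box blur to a rectangular region of a grayscale image. It is not DP-Blur and carries no formal privacy guarantee; the function name `box_blur_region`, the nested-list image layout, and the assumption that a face bounding box is supplied by some upstream detector are all hypothetical choices made for this example.

```python
def box_blur_region(image, box, radius=2):
    """Return a copy of `image` with the pixels inside `box` box-blurred.

    image: list of rows, each a list of grayscale intensities (0-255).
    box:   (top, left, bottom, right), bottom/right exclusive; assumed to
           come from some face detector (hypothetical, not shown here).
    """
    height, width = len(image), len(image[0])
    top, left, bottom, right = box
    blurred = [row[:] for row in image]  # leave the input untouched
    for y in range(max(top, 0), min(bottom, height)):
        for x in range(max(left, 0), min(right, width)):
            # Average over a (2*radius+1)^2 window clipped to the image.
            y0, y1 = max(y - radius, 0), min(y + radius + 1, height)
            x0, x1 = max(x - radius, 0), min(x + radius + 1, width)
            window = [image[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            blurred[y][x] = sum(window) // len(window)
    return blurred


# Toy 6x6 image: a bright 2x2 patch standing in for a face.
toy_image = [[0] * 6 for _ in range(6)]
for yy in (2, 3):
    for xx in (2, 3):
        toy_image[yy][xx] = 255
obfuscated = box_blur_region(toy_image, box=(1, 1, 5, 5), radius=1)
```

Pixels outside the box are left unchanged, so a curator could apply this only to detected person regions. A larger `radius` removes more detail, at the cost of the change in predictive power discussed above.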
The basic idea here is to utilize (or augment) synthetic images in lieu of real images during model training. Approaches include using hand-drawn sketch images (ImageNet-Sketch ), using GAN generated images  and techniques such as Dataset distillation , where a dataset or a subset of a dataset is distilled down to a few representative synthetic samples. This is a nascent field with some promising results emerging in unsupervised domain adaptation across visual domains  and universal digit classification .
The specific ethical transgressions that emerged during our longitudinal analysis of ImageNet could have been prevented if there were explicit instructions provided to the MTurkers during the dataset curation phase to enable filtering of these images at the source (See Fig.9 in  for example). We hope ethics checks become an integral part of the User-Interface deployed during the humans-in-the-loop validation phase for future dataset curation endeavors.
As emphasized in Section 4, context is crucial in determining whether a certain dataset is ethical or problematic, as it provides vital background information, and datasheets are an effective way of providing context. Much along the lines of model cards and datasheets for datasets, we propose dissemination of dataset audit cards. This allows large scale image dataset curators to publish the goals, curation procedures, known shortcomings and caveats alongside their dataset dissemination. In Figure 7, we have curated an example dataset audit card for the ImageNet dataset using the quantitative analyses carried out in Section 5.
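One way to picture such a card is as a small structured record with the four elements listed above. The sketch below is only an assumption about how a card might be represented in code; the class name `DatasetAuditCard`, its fields, and the placeholder entries are hypothetical and do not reproduce the card in Figure 7.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetAuditCard:
    """Hypothetical container for the four elements named in the text."""
    dataset_name: str
    goals: list = field(default_factory=list)
    curation_procedures: list = field(default_factory=list)
    known_shortcomings: list = field(default_factory=list)
    caveats: list = field(default_factory=list)

    def render(self):
        """Render the card as plain text, marking sections left empty."""
        lines = [f"Dataset audit card: {self.dataset_name}"]
        sections = [
            ("Goals", self.goals),
            ("Curation procedures", self.curation_procedures),
            ("Known shortcomings", self.known_shortcomings),
            ("Caveats", self.caveats),
        ]
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items or ["(not stated)"])
        return "\n".join(lines)


# Placeholder content for an imaginary dataset.
card = DatasetAuditCard(
    dataset_name="ExampleSet",
    goals=["Benchmark image classification"],
    known_shortcomings=["Contains images of people collected without consent"],
)
card_text = card.render()
```

Rendering empty sections as "(not stated)" makes omissions visible to a reader of the card instead of hiding them.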
|File name||Shape||Description|
|df_insightface_stats.csv||(1000, 30)||24 classwise statistical parameters obtained by running the InsightFace model () on the ImageNet dataset|
|df_audit_age_gender_dex.csv||(1000, 12)||11 classwise (ordered by the wordnet-id) statistical parameters obtained from the json files (of the DEX paper) |
|df_nsfw.csv||(1000, 5)||The mean and std of the NSFW scores of the train and val images arranged per-class. (Unnamed: 0: WordNetID of the class)|
|df_acc_classwise_resnet50.csv||(1000, 7)||Classwise accuracy metrics (& the image level preds) obtained by running the ResNet50 model on ImageNet train and Val sets|
|df_acc_classwise_NasNet_mobile.csv||(1000, 7)||Classwise accuracy metrics (& the image level preds) obtained by running the NasNet model on ImageNet train and Val sets|
|df_imagenet_names_umap.csv||(1000, 5)||Dataframe with 2D UMAP embeddings of the Glove vectors of the classes of the ImageNet dataset|
|df_census_imagenet_61.csv||(1000, 61)||The MAIN census dataframe covering class-wise metrics across 61 parameters, all of which are explained in df_census_columns_interpretation.csv|
|df_census_columns_interpretation.csv||(61, 2)||The interpretations of the 61 metrics of the census dataframe above|
|df_hand_survey.csv||(61, 3)||Dataframe contaimning the details of the 61 images unearthed via hand survey (Do not pay heed to 61. it is a mere coincidence)|
|df_classes_tiny_images_3.csv||(75846, 3)||Dataframe containing the class_ind, class_name (wordnet noun) and n_images|
|df_dog_analysis.csv||(7, 4)||Dataframe containing breed, gender_ratio and survey result from the paper Breed differences in canine aggression’|
We performed a cross-categorical quantitative analysis of ImageNet to assess the extent of the ethical transgressions and the feasibility of model-annotation based approaches. This resulted in an ImageNet census, entailing both image-level and class-level analysis across the different metrics (see supplementary section) covering Count, Age and Gender (CAG), NSFW scoring, semanticity of class labels and accuracy of classification using pre-trained models. We have distilled the important revelations of this census into a dataset audit card presented in Figure 7. This audit also entailed a human-in-the-loop hybrid approach that used the pre-trained-model annotations (along the lines of [30, 115]) to segment the large dataset into smaller subsets, which were then hand-labelled to generate two lists covering 62 misogynistic images and 30 image classes with co-occurring children.
We used the DEX and the InsightFace pre-trained models (footnote 17) to generate the cardinality, gender skewness, and age-distribution results captured in Figure 4. This resulted in discovery of 83,436 images with persons, encompassing 101,070 to 132,201 individuals, thus constituting of the dataset. Further, we munged together gender, age, class semanticity (footnote 18) and NSFW content flagging information from the pre-trained NSFW-MobileNet-v2 model to help perform a guided search of misogynistic consent-violating transgressions. This resulted in discovery of five dozen plus images (footnote 19) across four categories: beach-voyeur-photography, exposed-private-parts, verifiably pornographic and upskirt in the following classes: 445-bikini, 638-maillot, 639-tank suit, 655-miniskirt and 459-brassiere (see Figure 5).
Footnote 17: While harnessing these pre-trained gender classification models, we would like to strongly emphasize that the specific models and the problems that they were intended to solve, when taken in isolation, stand on ethically dubious grounds themselves. In this regard, we strongly concur with previous work such as  that gender classification based on appearance of a person in a digital image is both scientifically flawed and is a technology that bears a high risk of systemic abuse.
Footnote 18: Obtained using GloVe embeddings on the labels.
Footnote 19: Listed in df_hand_survey.csv.
Lastly, we harnessed literature from areas spanning from dog-ownership bias to the gendering of musical instruments to generate an analysis of subtle forms of human co-occurrence-based gender bias in Figure 6.
Captured in Table 2 are the details of the csv formatted data assets curated for the community to build on. The CAG statistics are covered in df_insightface_stats.csv and df_audit_age_gender_dex.csv. Similarly, we have also curated NSFW scoring (df_nsfw.csv), accuracy (df_acc_classwise_resnet50/_NasNet_mobile.csv) and semanticity (df_imagenet_names_umap.csv) datasets. df_census_imagenet_61.csv contains the 61 cumulative parameters for each of the 1000 classes (with their column interpretations in df_census_columns_interpretation.csv). We have duly open-sourced these meta-datasets and 14 tutorial-styled Jupyter notebooks (spanning both ImageNet and Tiny-Images datasets) for community access (footnote 20).
Footnote 20: Link: https://rb.gy/zccdps
| Metric | Models used |
|---|---|
| Count, Age and Gender | DEX, InsightFace, RetinaFace, ArcFace |
| Semanticity | GloVe, UMAP |
| Classification Accuracy | ResNet-50, NasNet-mobile |
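| Class index | Class label | Mean gender score | Mean age | Mean NSFW score |
|---|---|---|---|---|
| 639 | maillot, tank suit | 0.18 | 26.67 | 0.769 |
| 459 | brassiere, bra, bandeau | 0.16 | 25.03 | 0.61 |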
We have sought to draw the attention of the machine learning community towards the societal and ethical implications of large scale datasets, such as the problem of non-consensual images and the oft-hidden problems of categorizing people. ImageNet has been championed as one of the most incredible breakthroughs in computer vision, and AI in general. We indeed celebrate ImageNet's achievement and recognize the creators' efforts to grapple with some ethical questions. Nonetheless, ImageNet as well as other large image datasets remain troublesome. In hindsight, perhaps the ideal time to have raised ethical concerns regarding LSVD curation would have been in 1966 at the birth of The Summer Vision Project. The right time after that was when the creators of ImageNet embarked on the project to "map out the entire world of objects". Nonetheless, these are crucial conversations that the computer vision community needs to engage with now, given the rapid democratization of image scraping tools ([92, 91, 105]) and dataset-zoos ([56, 102, 84]). Continued silence will only serve to cause more harm than good in the future. In this regard, we have outlined a few solutions, including audit cards, that can be considered to ameliorate some of the concerns raised. We have also curated meta-datasets and open-sourced the code to carry out quantitative auditing using the ILSVRC2012 dataset as a template. However, we posit that the deeper problems are rooted in the wider structural traditions, incentives, and discourse of a field that treats ethical issues as an afterthought: a field where "in the wild" is often a euphemism for "without consent". We are up against a system that has veritably mastered ethics shopping, ethics bluewashing, ethics lobbying, ethics dumping, and ethics shirking.
Within such an ingrained tradition, even the most thoughtful scholar can find it challenging to pursue work outside the frame of the "tradition". Consequently, radical ethics that challenge deeply ingrained traditions need to be incentivised and rewarded in order to bring about a shift in culture that centres justice and the welfare of disproportionately impacted communities. We urge the machine learning community to pay close attention to the direct and indirect impact of our work on society, especially on vulnerable groups. Awareness of the historical antecedents and the contextual and political dimensions of current work is imperative in this regard. We hope this work contributes to raising awareness regarding the need to cultivate a justice-centred practice and motivates the constitution of IRBs for large scale dataset curation processes.
This work was supported, in part, by Science Foundation Ireland grant 13/RC/2094 and co-funded under the European Regional Development Fund through the Southern & Eastern Regional Operational Programme to Lero - the Irish Software Research Centre (www.lero.ie).
The authors would like to thank Alex Hanna, Andrea E. Martin, Anthony Ventresque, Elayne Ruane, John Whaley, Mariya Vasileva, Nicolas Le Roux, Olivia Guest, Os Keyes, Reubs J. Walsh, Sang Han, and Thomas Laurent for their useful feedback on an earlier version of this manuscript.
As covered in the main paper, reverse image search engines that facilitate face search such as  have gotten remarkably and worryingly efficient in the past year. For a small fee, anyone can use their portal or their API to run an automated process and uncover the "real-world" identities of the humans of the ImageNet dataset. While all genders in the ImageNet dataset are under this risk, the risk is asymmetric, as the high-NSFW classes such as bra, bikini and maillot are often the ones with a higher female-to-male ratio (see Figure 15). Figure 8 showcases a snapshot image of one such reverse image search portal to demonstrate how easy it is for anyone to access their GUI and uncover "real world" identities of people, which can lead to catastrophic downstream risks such as blackmailing and other forms of online abuse.
In this section, we cover the details of performing the quantitative analysis on the ImageNet dataset, including the following metrics: person CAG (Count - Age - Gender), NSFW scoring of the images, semanticity and classification accuracy. The pre-trained models used in this endeavor are covered in Table 3. All of these analyses and the generated meta-datasets have been open sourced at https://rb.gy/zccdps. Figure 9 covers the details of all the Jupyter notebooks authored to generate the datasets covered in Table 2.
In order to perform a human-centric census covering metrics such as count, age, and gender, we used the InsightFace toolkit for face analysis, which provided implementations of ArcFace for deep face recognition and RetinaFace for face localisation (bounding-box generation). We then combined the results of these models with the results obtained from , which used the DEX model. The results are as shown in Table 4, which captures the summary statistics for the ILSVRC2012 dataset. In this table, the lower case denotes the number of images with persons identified in them whereas indicates the number of persons (footnote 21). The superscript indicates the algorithm used (DEX or InsightFace (if)) whereas the subscript has two fields: the train or validation subset indicator and the census gender-category. For example, implies that there were 3096 images in the ImageNet validation set (out of ) where the InsightFace models were able to detect a person's face.
Footnote 21: The difference is simply on account of more than one person being identified by the model in a given image.
As shown, the InsightFace model identified 101,070 persons across 83,436 images (including the train and validation subsets) which puts the prevalence rate of persons whose presence in the dataset exists sans explicit consent to be around which is less aggressive compared to the predicted by the DEX model (that focussed on the training subset), which has a higher identification false positive rate. An example of this can be seen in Fig 10 which showcases an example image with the bounding boxes of the detected persons in the image.
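The counts quoted above can be turned into prevalence rates with a few lines of arithmetic. The sketch below is illustrative only: it assumes the standard ILSVRC2012 split sizes of 1,281,167 training and 50,000 validation images (these totals are not stated in the text above), and the helper name `prevalence` is hypothetical.

```python
# Assumed ILSVRC2012 split sizes (an assumption, not taken from the text above).
N_TRAIN = 1_281_167
N_VAL = 50_000


def prevalence(count: int, total: int) -> float:
    """Return `count` expressed as a fraction of `total`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


total_images = N_TRAIN + N_VAL

# Counts reported in the text for the InsightFace model.
images_with_persons = 83_436
persons_detected = 101_070

image_rate = prevalence(images_with_persons, total_images)
person_rate = prevalence(persons_detected, total_images)

print(f"share of images containing a detected person: {image_rate:.2%}")
print(f"detected persons relative to dataset size:    {person_rate:.2%}")
```

Under these assumed totals, the share of images with a detected person and the person-to-image ratio differ because a single image can contain more than one detected person.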
Much akin to , we found a strong bias towards (relatively older) male presence (73,746 with a mean age of 33.24 compared to 26,840 with a mean age of 25.58). At this juncture, we would like to reemphasize that these high accuracy pre-trained models can indeed be highly error prone conditioned on the ethnicity of the person, as analyzed in [30, 14] and we would like to invite the community to re-audit these images with better and more ethically responsible tools (See Fig 11 for example of errors we could spot during the inference stage).
Figure 14(a) presents the class-wise estimates of the number of persons in the dataset using the DEX and the InsightFace models. In Figure 14(b), we capture the variation in the estimates of count, gender and age of the DEX and the InsightFace models.
Before delving into the discussion of the results obtained, we define the parameters that were measured. To begin, we denote a binary face-present indicator variable for each image, the algorithm used (DEX or InsightFace, shown in the superscripts), and the number of images in each class. We then define, for each class and each algorithm, the class-level mean person count, the mean-gender-skewness score and the mean age. Here, the age-estimate of a person in an image is the one generated by the given algorithm, and the gender-skewness score is computed from the mean and standard deviation of the gender-estimates of the images belonging to the class, as estimated by that algorithm.
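As a rough illustration of these class-level parameters, the sketch below computes a mean person count, a gender-skewness score and a mean age for one class and one algorithm. It rests on assumptions that the text does not confirm: the skewness score is taken to be the usual third standardized moment of the per-image gender estimates, the mean age is averaged over images with a detected face, and the record layout and the function name `class_metrics` are hypothetical.

```python
from statistics import mean, pstdev


def class_metrics(records):
    """Summarize one class for one algorithm.

    `records` holds one dict per image with the (hypothetical) keys:
      'face_present': 0 or 1, the binary face-present indicator
      'gender': gender estimate in [0, 1], or None if no face was found
      'age': age estimate, or None if no face was found
    """
    n_images = len(records)
    if n_images == 0:
        raise ValueError("a class needs at least one image")

    # Mean person count: fraction of images whose indicator is 1.
    mean_person_count = sum(r["face_present"] for r in records) / n_images

    with_face = [r for r in records if r["face_present"]]
    genders = [r["gender"] for r in with_face]
    ages = [r["age"] for r in with_face]

    mean_age = mean(ages) if ages else None

    # Assumed definition: third standardized moment of the gender estimates.
    gender_skewness = None
    if len(genders) >= 2:
        mu = mean(genders)
        sigma = pstdev(genders)
        if sigma > 0:
            gender_skewness = mean([((g - mu) / sigma) ** 3 for g in genders])

    return {
        "mean_person_count": mean_person_count,
        "gender_skewness": gender_skewness,
        "mean_age": mean_age,
    }


example = [
    {"face_present": 1, "gender": 0.0, "age": 20.0},
    {"face_present": 1, "gender": 0.0, "age": 20.0},
    {"face_present": 1, "gender": 0.0, "age": 20.0},
    {"face_present": 1, "gender": 1.0, "age": 40.0},
    {"face_present": 0, "gender": None, "age": None},
]
print(class_metrics(example))
```

With the 0:FEMALE|1:MALE convention mentioned later in the text, a positive skewness under this assumed definition indicates that most gender estimates in the class sit near 0 with a tail towards 1.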
With regards to the first scatter-plot in Figure 14(b), we observe that the estimated class-wise counts of persons () detected by the DEX and InsightFace models in the images were in strong agreement () which helps to further establish the global person prevalence rate in the images to be in the order of . These scatter-plots constitute Figure 4 of the dataset audit card (Figure 7).
Now, we would like to draw the attention of the reader towards the weaker correlation () when it came to gender-skewness () and the mean age-estimates () scatter-plots in Figure 14(b). Given that the algorithms used are state-of-the-art with regards to the datasets they have been trained on (see  and ), the high disagreement on a “neutral” dataset like ImageNet exposes the frailties of these algorithmic pipelines upon experiencing population shifts in the test dataset. This, we believe, lends further credence to the studies that have demonstrated poor reliability of these so-termed accurate models upon change of the underlying demographics (see  and ) and further supports the need to move away from gender classification on account of not just the inherent moral and ethical repugnance of the task itself but also on its lack of merit for scientific validity .
Previous journalistic efforts (see ) had revealed the presence of strongly misogynistic content in the ImageNet dataset, specifically in the categories of beach-voyeur-photography, upskirt images, verifiably pornographic and exposed private-parts. These four categories have been well researched in digital criminology and intersectional feminism (see [49, 66, 82, 81]) and have formed the backbone of several pieces of legislation worldwide (see ). In order to help generate a hand-labelled dataset of these images from amongst more than 1.3 million images, we used a hybrid human-in-the-loop approach where we first formed a smaller subset of images from image classes filtered using a model-annotated NSFW-average score as a proxy. For this, we used the NSFW-MobileNet-v2 model, which is an image-classification model with the output classes [drawings, hentai, neutral, porn, sexy]. We defined the NSFW score of an image as the sum of the softmax values of the [hentai, porn, sexy] subset of classes, and averaged this score over all of the images of a class to obtain the results portrayed in Figure 16. On the left hand side of Figure 16, we see the scatter-plot of the mean-NSFW scores plotted against the mean-gender scores (obtained from the DEX model estimates) for the 1000 ImageNet classes. We then found five natural clusters upon using the Affinity Propagation algorithm. Given the 0:FEMALE|1:MALE gender assignments in the model we used (see ), classes with lower mean-gender scores allude towards a women-majority class. The specific details of the highlighted cluster in the scatter-plot in Figure 16 are displayed in Table 5. Further introducing the age dimension (by way of utilising the mean-age metric for each class), we see in the right hand side of Figure 16 that the classes with the highest NSFW scores were those where the dominating demographic was that of young women.
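The NSFW scoring rule described above (sum the softmax mass on the hentai, porn and sexy outputs, then average per class) can be sketched as follows. This is a minimal illustration that takes already-computed softmax outputs as input; the function names and the input layout are hypothetical, and no call into the actual NSFW-MobileNet-v2 model is shown.

```python
from collections import defaultdict

# Output classes of the classifier, as listed in the text.
OUTPUT_CLASSES = ("drawings", "hentai", "neutral", "porn", "sexy")
# Subset of classes whose softmax values are summed into the NSFW score.
NSFW_SUBSET = ("hentai", "porn", "sexy")


def nsfw_score(softmax):
    """NSFW score of one image: sum of softmax values over NSFW_SUBSET.

    `softmax` maps each label in OUTPUT_CLASSES to its probability.
    """
    return sum(softmax[label] for label in NSFW_SUBSET)


def mean_nsfw_per_class(predictions):
    """Average the image-level NSFW scores within each class.

    `predictions` is an iterable of (class_id, softmax_dict) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for class_id, softmax in predictions:
        totals[class_id] += nsfw_score(softmax)
        counts[class_id] += 1
    return {class_id: totals[class_id] / counts[class_id] for class_id in totals}


# Toy softmax outputs for two hypothetical class identifiers.
toy_predictions = [
    ("class_a", {"drawings": 0.05, "hentai": 0.05, "neutral": 0.20, "porn": 0.10, "sexy": 0.60}),
    ("class_a", {"drawings": 0.10, "hentai": 0.00, "neutral": 0.40, "porn": 0.10, "sexy": 0.40}),
    ("class_b", {"drawings": 0.10, "hentai": 0.00, "neutral": 0.88, "porn": 0.01, "sexy": 0.01}),
]
print(mean_nsfw_per_class(toy_predictions))
```

Classes can then be ranked by this per-class mean to shortlist candidates for manual review, which is the role the score plays as a proxy in the approach described above.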
With this shortlisting methodology, we were left with approximately 7000 images, which were then hand-labelled by a team of five volunteers (three male, two female, all aged between 23 and 45) to curate a list of images where there was complete agreement over the four-class assignment. We have open-sourced the hand-curated list (see Table 6), and the summary results are as showcased in Figure 19. In Figure 19(a), we see the cross-tabulated class-wise counts of the four categories of images (footnote 22) across the ImageNet classes, and in Figure 19(b), we present the histogram-plots of these 61 hand-labelled images across the ImageNet classes. As seen, the bikini, two-piece class with a mean NSFW score of was the main image class with 24 confirmed beach-voyeur pictures.
Footnote 22: This constitutes Figure 5 in the data audit card.
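The complete-agreement criterion can be sketched as a simple filter over the volunteers' labels. The snippet below is only an illustration: the category strings, the image identifiers, the "none" label for non-flagged images and the function names are all hypothetical stand-ins, not the labels used in the actual survey files.

```python
# Hypothetical label strings for the four categories named in the text.
CATEGORIES = {
    "beach-voyeur",
    "exposed-private-parts",
    "verifiably-pornographic",
    "upskirt",
}


def unanimous_label(labels, n_annotators=5):
    """Return the shared category if every annotator chose it, else None."""
    if len(labels) != n_annotators:
        return None
    first = labels[0]
    if first in CATEGORIES and all(label == first for label in labels):
        return first
    return None


def curate(annotations, n_annotators=5):
    """Keep only images with complete agreement on one of the categories.

    `annotations` maps an image identifier to the list of labels it received.
    """
    curated = {}
    for image_id, labels in annotations.items():
        label = unanimous_label(labels, n_annotators)
        if label is not None:
            curated[image_id] = label
    return curated


toy_annotations = {
    "img_001": ["upskirt"] * 5,
    "img_002": ["beach-voyeur"] * 4 + ["none"],
    "img_003": ["none"] * 5,
    "img_004": ["beach-voyeur"] * 5,
}
print(curate(toy_annotations))
```

Requiring unanimity trades recall for precision: images on which even one annotator disagrees are dropped from the curated list.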
Here, we would like to strongly reemphasise that we are disseminating this list as a community resource so as to facilitate further scholarly engagement and also, if need be, to allow scholars in countries where incriminating laws (see ) may exist to deal with it in whatever way is deemed appropriate. We certainly admit to the primacy of the context in which the objectionable content appears. For example, the image n03617480_6206.jpeg in the class n03617480 - kimono, which contained genital exposure, turned out to be a photographic bondage art piece shot by Nobuyoshi Araki that straddles the fine line between scopophilic eroticism and pornography. But, as explored in , the mere possession of a digital copy of this picture would be punishable by law in other nations, and we believe that these factors have to be considered contextually while disseminating a large scale image dataset and should be detailed as caveats in the dissemination document.
We also analyzed the relationship between the semanticity of classes and NSFW scores. Firstly, we obtained a representative word for each of the 1000 class labels in ILSVRC2012 and used  to generate dense 300-dimensional GloVe word-vector embeddings. Further, in order to generate the 2D/3D scatter-plots in Figure 15, we used the UMAP algorithm to perform dimensionality reduction. df_imagenet_names_umap.csv contains the 2D UMAP embeddings of the resultant GloVe vectors of the classes, which are then visualized in Figure 15(a). In Figure 15(b), we see the 3D surface plot of the 2D UMAP semantic dimensions versus the NSFW scores. As seen, it peaks at specific points of the semantic space of the label categories, mapping to classes such as brassiere, bikini and maillot.
Social, historical, and cultural biases prevalent in society feed into datasets and the statistical models trained on them. In the context of Natural Language Processing (NLP), the framework of lexical co-occurrence has been harnessed to tease out these biases, especially in the context of gender biases. In , the authors analyzed occupation words stereotypically perceived as male (which they termed M-biased words) as well as occupation words stereotypically perceived as female (F-biased words) in large text corpora, and the ensuing downstream effects when used to generate contextual word representations in SoTA models such as BERT and GPT-2. Further, in , direct normalized co-occurrence associations between the word and the representative concept words were proposed as a novel corpus bias measurement method, and its efficacy was demonstrated with regards to the actual gender bias statistics of the U.S. job market and its estimates measured via the text corpora.
In the context of the ImageNet dataset, we investigated whether such co-occurrence biases exist in the context of human co-occurrence in the images. Previously, in , the authors had explored the biased representation learning of an ImageNet-trained model by considering the class basketball, where images containing black persons were deemed prototypical. Here, we tried to investigate whether the gender of the person co-occurring in the background alongside the non-person class was skewed along the lines that it is purported to be in related academic work. We performed these investigations in the context of person-occurrence with regards to dog breeds as well as musical instruments. Presented in Figure 22(a) are the conditional violin plots relating the dog-breed group of the image class of a subset of the ImageNet dataset to the mean gender score obtained from the DEX model analyses. We obtained these measurements in two phases. In the first phase, we grouped the ImageNet classes of dog breeds into the following 7 groups: [Toy, Hound, Sporting, Terrier, Non-Sporting, Working, Herding], following the formal American Kennel Club (AKC) groupings (footnote 23; see ). The remaining breeds not in the AKC list were placed into the Unknown group.
Footnote 23: AKC claims that registered breeds are assigned to one of seven groups representing characteristics and functions the breeds were originally bred for.
Once grouped, we computed the gender-conditioned population spreads of person co-occurrence using the mean-gender value of the constituent image classes estimated from . Prior literature (see [55, 86]) has explored the nexus between the perceived manliness of dog groups and the gender of their owners. These stereotypical associations were indeed reflected in the person co-occurrence gender distributions in Figure 22(a), where we see that the dog groups perceived as masculine, belonging to the set [Non-Sporting, Working, Herding], had a stronger male-gender co-occurrence bias.
We found this category to be particularly pertinent both in the wake of strong legislation protecting the privacy of children's digital images and because of its extent. We found pictures of infants and children across the following 30 image classes (and possibly more): ['bassinet', 'cradle', 'crib', 'bib', 'diaper', 'bubble', 'sunscreen', 'plastic bag', 'hamper', 'seat belt', 'bath towel', 'mask', 'bow-tie', 'tub', 'bucket', 'umbrella', 'punching bag', 'maillot - tank suit', 'swing', 'pajama', 'horizontal bar', 'computer keyboard', 'shoe-shop', 'soccer ball', 'croquet ball', 'sunglasses', 'ladles', 'tricycle - trike - velocipede', 'screwdriver', 'carousel']. What was particularly unsettling was the prevalence of entire classes such as 'bassinet', 'cradle', 'crib' and 'bib' that had a very high density of images of infants. We believe this might have legal ramifications as well. For example, Article 8 of the European Union General Data Protection Regulation (GDPR) specifically deals with the conditions applicable to a child's consent in relation to information society services. The associated Recital 38 states verbatim that "Children merit specific protection with regard to their personal data, as they may be less aware of the risks, consequences and safeguards concerned and their rights in relation to the processing of personal data. Such specific protection should, in particular, apply to the use of personal data of children for the purposes of marketing or creating personality or user profiles and the collection of personal data with regard to children when using services offered directly to a child." Further, Article 14 of GDPR explicitly covers the "Information to be provided where personal data have not been obtained from the data subject". We advocate allying with the legal community in this regard to address the concerns raised above.
Akin to the ivory carving-illegal poaching and diamond jewelry art-blood diamond nexuses, we posit there is a similar moral conundrum at play here and would like to instigate a conversation amongst the neural artists in the community. The emergence of tools such as BigGAN and GAN-breeder has ushered in an exciting new flavor of generative digital art, generated using deep neural networks (see  for a survey). A cursory search on Twitter (footnote 25) reveals hundreds of interesting art-works created using BigGANs. There are many detailed blog-posts (footnote 26) on generating neural art by beginning with seed images and performing nifty experiments in the latent space of BigGANs. At the point of writing the final version of this paper (6/26/2020, 10:34 PM PST), users on the ArtBreeder app (footnote 27) had generated 64683549 images. Further, Christie's, the British auction house behemoth, recently hailed the sale of the neural network generated Portrait of Edmond Belamy as signalling the arrival of AI art on the world auction stage. Given the rapid growth of this field, we believe this is the right time to have a critical conversation about a particularly dark ethical consequence of using such frameworks that entail models trained on the ImageNet dataset, which has many images that are pornographic, non-consensual, voyeuristic and also entail underage nudity. We argue the use of ill-considered seed images to train the models trickles down to the final art-form in a way similar to the blood-diamond syndrome in jewelry art.
Footnote 25: https://twitter.com/hashtag/biggan?lang=en
Footnote 26: https://rb.gy/pr9pwb
Footnote 27: https://ganbreeder.app
An example: Consider the neural art image in Figure 23 that we generated using the GanBreeder app. On first appearance, it is not very evident what the constituent seed classes are that went into the creation of this neural artwork image. When we solicited volunteers online to critique the artwork (see the collection of responses in Table 7), none had an inkling regarding a rather sinister trickle-down effect at play here. As it turns out, we craftily generated this image using hand-picked specific instances of images of children emanating from what we have shown are two problematic seed image classes: Bikini and Brassiere. More specifically, for this particular image, we set the Gene weights to be: [Bikini: 42.35, Brassiere: 31.66, Comic Book: 84.84]. We would like to strongly emphasize at this juncture that the problem does not emanate from a visual patriarchal mindset, whereby we associate female undergarment imagery to be somehow unethical, but the root cause lies in the fact that many of the images curated into the dataset (at least with regards to the 2 above mentioned classes) were voyeuristic, pornographic, non-consensual and also entailed underage nudity.
|A- Grad student, CMU SCS||
|B- Grad student, Stanford CS||
|C- Data Scientist, Facebook Inc||Futurism|
|D- CS undergrad, U-Michigan||
|E - Senior software engineer, Mt View||
|F- Data Scientist, SF||
Given how besotted the computer vision community is with regards to classification accuracy metrics, we decided to indulge in devil’s advocacy by delving into the nature of variation of class-wise top-5 accuracies in those classes where humans co-occur asymmetrically between the training and validation sets. For this, we performed inference using the ResNet50  and NasNet  models and sorted all the 1000 classes as per the ratios (termed human-delta in the figure) and compared their accuracies with regards to the general population (amongst the 1000 classes). As gathered from Figure 24, we saw a statistically significant drop in top-5 accuracies () for the top-25 human-delta classes, thereby motivating that even for the purveyors of scientism fuelled pragmatism, there is motivation here to pay heed to the problem of humans in images. We would like to reemphasize that we are most certainly not advocating this to be the prima causa for instigating a cultural change in the computer vision community, but are sharing these resources and nuances for further investigation.
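The comparison described above can be sketched as: rank classes by how asymmetrically persons occur between the training and validation sets, then compare the mean top-5 accuracy of the most asymmetric classes against that of all classes. The exact ratio used for human-delta is not recoverable from the text, so the definition below (larger person-prevalence over smaller, with a small smoothing constant) is an assumption, as are the field names and function names. No significance test is included.

```python
def human_delta(p_train, p_val, eps=1e-6):
    """Assumed asymmetry ratio between train and val person prevalence.

    Returns a value >= 1; larger means more asymmetric co-occurrence.
    `eps` avoids division by zero when a prevalence is exactly 0.
    """
    high, low = max(p_train, p_val), min(p_train, p_val)
    return (high + eps) / (low + eps)


def accuracy_gap(classes, k=25):
    """Compare mean top-5 accuracy of the top-k human-delta classes vs all.

    `classes` is a list of dicts with the hypothetical keys
    'p_train', 'p_val' (person prevalence per split) and 'top5' (accuracy).
    Returns (mean_top5_of_top_k, mean_top5_of_all).
    """
    if not classes:
        raise ValueError("need at least one class")
    ranked = sorted(
        classes,
        key=lambda c: human_delta(c["p_train"], c["p_val"]),
        reverse=True,
    )
    top = ranked[:k]
    mean_top = sum(c["top5"] for c in top) / len(top)
    mean_all = sum(c["top5"] for c in classes) / len(classes)
    return mean_top, mean_all


toy_classes = [
    {"name": "a", "p_train": 0.5, "p_val": 0.1, "top5": 0.7},
    {"name": "b", "p_train": 0.2, "p_val": 0.2, "top5": 0.9},
    {"name": "c", "p_train": 0.1, "p_val": 0.3, "top5": 0.8},
]
mean_top, mean_all = accuracy_gap(toy_classes, k=2)
print(f"top-k mean top-5 accuracy: {mean_top:.3f}, overall: {mean_all:.3f}")
```

In practice the per-class inputs would come from files such as df_acc_classwise_resnet50.csv and the CAG statistics, whose exact column names are not assumed here.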
We embarked on this project with an aspiration to illustrate how problematic large scale image dataset curation is both in academia and industry, and the need for a fundamental change. Through the course of this work, we solicited and incorporated feedback from scholars in the field who have pointed us towards three valid critiques that we would like to address first. To begin with, we solemnly acknowledge the moral paradox in our use of pre-trained gender classification models for auditing the dataset and duly address this in the previous section. Secondly, as covered in Section 3 on the threat landscape, we also considered the risks of a possible Streisand effect with regards to deanonymization of the persons in the dataset, which ultimately led us to not dive further into the quantitative or qualitative aspects of our findings in this regard, besides conveying a specific example via email to the curator of the dataset from which the deanonymization arose. Thirdly, we would like to acknowledge the continued efforts of ImageNet curators to improve the dataset. Although there remains much work to be done, in the larger scheme of things and compared to secretive and opaque datasets, the ImageNet dataset allows examination. Having said that, curating large datasets comes with responsibility (especially given that such datasets directly or indirectly impact individual lives and the social world) and all curators need to be held accountable for what they create. With these caveats firmly in tow, we now proceed to conclude with the following Wish List of the impact we hope this work may bring about.
We aspire to see the institutions and individuals curating these large scale datasets be proactive in establishing the primacy of ethics in the dataset curation process, and not just react to exposés and pursue post hoc course corrections as an afterthought. We would be well served to remind ourselves that it took the community 11 years to go from the first peer-reviewed dissemination of the ImageNet dataset to achieving the first meaningful course correction in , whereas the number of floating-point operations required to train a classifier to AlexNet-level performance on ImageNet decreased by a factor of 44x between 2012 and 2019. This, we believe, demonstrates where the priorities lie, and this is precisely where we seek to see the most impact.
At the outset, we question if Big Data can ever operate in a manner that caters to the needs and welfare of marginalized communities, those disproportionately impacted by algorithmic injustice. Automated large scale data harvesting forays, by their very nature, tend to be BIG, in the sense that they are inherently prone to Bias, are Imperceptive to the lessons of the human condition and the recorded history of vulnerable people, and are Guileful in exploiting the loopholes of legal frameworks that allow siphoning off of the lived experiences of disfranchised individuals who have little to no agency and recourse to contend with Big Data practices. Both collective silence and empty lip service (footnote 28), i.e. caricatured appropriations of ethical transgressions entailing ethics shopping, ethics bluewashing, ethics lobbying, ethics dumping and ethics shirking, cause harm and damage. Given that these datasets emerged from institutions such as Google, Stanford, NYU and MIT, all with a substantial number of staff researching AI ethics and policy, we cannot help but feel that this hints towards not just compartmentalization and fetishization of ethics as a hot topic but also shrewd usage of the ethicists as agents of activism outsourcing.
Footnote 28: https://www.media.mit.edu/articles/beware-corporate-machinewashing-of-ai/
As covered in the main paper, we would like to see an end to the trend of using the Creative Commons loophole as an excuse for circumventing the difficult terrain of informed consent. We should, as a field, aspire to treat consent in the same rigorous way as researchers and practitioners in fields such as anthropological studies or medical studies. In this work, we have sought to draw the attention of the Machine Learning community towards the societal and ethical implications of large scale datasets, such as the problem of non-consensual images and the oft-hidden problems of categorizing people. We were inspired by the adage "Secrecy begets tyranny" (footnote 29) and wanted to issue this as a call to the Machine Learning community to pay close attention to the direct and indirect impact of our work on society, especially on vulnerable groups. We hope this work contributes to raising awareness and adds to a continued discussion of ethics in Machine Learning, along with many other scholars who have been elucidating algorithmic bias, injustice, and harm.
Footnote 29: From Robert A. Heinlein's 1961 science fiction novel titled Stranger in a Strange Land.
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.