Large image datasets: A pyrrhic win for computer vision?

06/24/2020
by Vinay Uday Prabhu, et al.

In this paper, we investigate problematic practices and consequences of large-scale vision datasets. We examine broad issues such as the question of consent and justice as well as specific concerns such as the inclusion of verifiably pornographic images in datasets. Taking the ImageNet-ILSVRC-2012 dataset as an example, we perform a cross-sectional model-based quantitative census covering factors such as age, gender, NSFW content scoring, class-wise accuracy, human-cardinality analysis, and the semanticity of the image class information in order to statistically investigate the extent and subtleties of ethical transgressions. We then use the census to help hand-curate a look-up table of images in the ImageNet-ILSVRC-2012 dataset that fall into the categories of verifiably pornographic: shot in a non-consensual setting (up-skirt), beach voyeuristic, and exposed private parts. We survey the landscape of harms and threats that both society at large and individuals face due to uncritical and ill-considered dataset curation practices. We then propose possible courses of correction and critique their pros and cons. We have open-sourced all of the code and the census meta-datasets generated in this endeavor for the computer vision community to build on. By unveiling the severity of the threats, we hope to motivate the constitution of mandatory Institutional Review Boards (IRB) for large-scale dataset curation processes.

