Beyond web-scraping: Crowd-sourcing a geographically diverse image dataset

01/05/2023
by   Vikram V. Ramaswamy, et al.

Current dataset collection methods typically scrape large amounts of data from the web. While this technique is extremely scalable, data collected in this way tends to reinforce stereotypical biases, can contain personally identifiable information, and typically originates from Europe and North America. In this work, we rethink the dataset collection paradigm and introduce GeoDE, a crowd-sourced, geographically diverse dataset with 61,940 images from 40 classes and 6 world regions that contains no personally identifiable information. We analyse GeoDE to understand how images collected in this manner differ from web-scraped ones. Despite the smaller size of this dataset, we demonstrate its use as both an evaluation and a training dataset, highlight shortcomings in current models, and show improved performance when even small amounts of GeoDE (1,000 to 2,000 images per region) are added to a training dataset. We release the full dataset and code at https://geodiverse-data-collection.cs.princeton.edu/


