Efficient Deduplication and Leakage Detection in Large Scale Image Datasets with a focus on the CrowdAI Mapping Challenge Dataset

Recent advancements in deep learning and computer vision have led to widespread use of deep neural networks to extract building footprints from remote-sensing imagery. The success of such methods relies on the availability of large databases of high-resolution remote-sensing images with high-quality annotations. The CrowdAI Mapping Challenge Dataset is one such dataset and has been used extensively in recent years to train deep neural networks. It consists of ∼280k training images and ∼60k testing images, with polygonal building annotations for all images. However, issues such as low-quality and incorrect annotations, extensive duplication of image samples, and data leakage significantly reduce the utility of deep neural networks trained on the dataset. It is therefore imperative to adopt a data validation pipeline that evaluates the quality of the dataset prior to its use. To this end, we propose a drop-in pipeline that employs perceptual hashing techniques for efficient de-duplication of the dataset and identification of instances of data leakage between training and testing splits. In our experiments, we demonstrate that nearly 250k (∼90%) of the images in the training split were identical. Moreover, our analysis of the validation split demonstrates that roughly 56k of the 60k images also appear in the training split, resulting in a data leakage of 93%. The source code for the analysis and de-duplication of the CrowdAI Mapping Challenge dataset is publicly available at https://github.com/yeshwanth95/CrowdAI_Hash_and_search .
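
To make the idea of hash-based de-duplication and leakage detection concrete, the following is a minimal sketch, not the pipeline from the repository above. It assumes a simple average hash (one of several perceptual hashes; the abstract does not say which one is used), images already decoded into 2D lists of grayscale intensities, and exact hash matches as the duplicate criterion. The function names `average_hash`, `group_by_hash`, `find_duplicates`, and `find_leakage` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def average_hash(pixels, hash_size=8):
    """Return a (hash_size * hash_size)-bit average hash of a grayscale image.

    Assumption: `pixels` is a 2D list of intensities with at least
    `hash_size` rows and `hash_size` columns.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    # Downsample to hash_size x hash_size by averaging rectangular blocks.
    for i in range(hash_size):
        r0, r1 = i * height // hash_size, (i + 1) * height // hash_size
        for j in range(hash_size):
            c0, c1 = j * width // hash_size, (j + 1) * width // hash_size
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    # One bit per cell: set when the cell is brighter than the overall mean.
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def group_by_hash(images):
    """Map each hash value to the names of the images that produce it."""
    groups = defaultdict(list)
    for name, pixels in images.items():
        groups[average_hash(pixels)].append(name)
    return groups


def find_duplicates(images):
    """Return groups of image names within one split that share a hash."""
    groups = group_by_hash(images)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def find_leakage(train_images, test_images):
    """Return names of test images whose hash also occurs in the training split."""
    train_hashes = set(group_by_hash(train_images))
    return sorted(
        name
        for name, pixels in test_images.items()
        if average_hash(pixels) in train_hashes
    )
```

Because every image is reduced to a single integer and compared through dictionary and set lookups, the cost grows roughly linearly with the number of images rather than with the number of image pairs. In practice the pixel arrays would come from an image library, and a Hamming-distance threshold between hashes could replace exact matching if near-duplicates should also be caught.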
