Measuring Swampiness: Quantifying Chaos in Large Heterogeneous Data Repositories

10/13/2018
by Luann Jung, et al.

As scientific data repositories and filesystems grow in size and complexity, they become increasingly disorganized. The coupling of massive quantities of data with poor organization makes it challenging for scientists to locate and utilize relevant data, thus slowing the process of analyzing data of interest. To address these issues, we explore an automated clustering approach for quantifying the organization of data repositories. Our parallel pipeline processes heterogeneous filetypes (e.g., text and tabular data), automatically clusters files based on content and metadata similarities, and computes a novel "cleanliness" score from the resulting clustering. We demonstrate the generation and accuracy of our cleanliness measure using both synthetic and real datasets, and conclude that it is more consistent than other potential cleanliness measures.
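The abstract does not define how the "cleanliness" score is computed, so the sketch below is not the authors' measure. It only illustrates the general idea of deriving an organization score from a clustering, assuming a simple purity-style stand-in: for each content cluster, count the files that sit in that cluster's most common parent directory, then divide by the total number of files. The function name `directory_purity`, the input mapping, and the example paths and labels are all hypothetical.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def directory_purity(cluster_by_path):
    """Hypothetical stand-in score, NOT the paper's cleanliness measure.

    cluster_by_path maps a file path (POSIX-style string) to the label of
    the content cluster the file was assigned to. The result is the
    fraction of files located in the most common parent directory of
    their own cluster, so it always lies in (0, 1].
    """
    if not cluster_by_path:
        raise ValueError("need at least one file")

    # For every cluster, count how many of its files live in each directory.
    dirs_per_cluster = defaultdict(Counter)
    for path, label in cluster_by_path.items():
        parent = str(PurePosixPath(path).parent)
        dirs_per_cluster[label][parent] += 1

    # Files in the majority directory of their cluster count as "in place".
    in_place = sum(
        counts.most_common(1)[0][1] for counts in dirs_per_cluster.values()
    )
    return in_place / len(cluster_by_path)


# Assumed example: cluster labels would come from a separate clustering step.
example = {
    "/data/a/x.csv": 0,
    "/data/a/y.csv": 0,
    "/data/b/z.txt": 1,
    "/data/a/w.txt": 1,
}
score = directory_purity(example)
```

In this example, cluster 0 lies entirely in `/data/a`, while cluster 1 is split between two directories, so three of the four files count as in place and `score` is 0.75. A repository whose directory layout mirrors the content clusters would score 1.0 under this assumed measure.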


