Measuring Swampiness: Quantifying Chaos in Large Heterogeneous Data Repositories

As scientific data repositories and filesystems grow in size and complexity, they become increasingly disorganized. The coupling of massive quantities of data with poor organization makes it challenging for scientists to locate and utilize relevant data, thus slowing the process of analyzing data of interest. To address these issues, we explore an automated clustering approach for quantifying the organization of data repositories. Our parallel pipeline processes heterogeneous filetypes (e.g., text and tabular data), automatically clusters files based on content and metadata similarities, and computes a novel "cleanliness" score from the resulting clustering. We demonstrate the generation and accuracy of our cleanliness measure using both synthetic and real datasets, and conclude that it is more consistent than other potential cleanliness measures.






1. Introduction

Traditional modes of organizing data repositories and filesystems are increasingly ineffective due to the size, heterogeneity, and complexity of data. Researchers are now turning to alternative organizational models such as data lakes—repositories for large quantities of raw data that are integrated in a pay-as-you-go fashion (Jeffery et al., 2008; Madhavan et al., 2007). However, users are often unwilling to spend time describing and organizing data, causing repositories to become opaque “data swamps” (Hai et al., 2016) with poor metadata and confusing directory structures.

To combat this problem, we propose a set of tools that automate the process of identifying content-based relationships between files. We present a parallel pipeline that crawls repositories, collects key information regarding data composition and distribution, and automatically clusters files based on extracted content and metadata. Our unsupervised clustering models aim to detect latent similarities in file subject, provenance, or purpose (Brackenbury et al., 2018) and to cluster files accordingly. We use these clusters to define a novel “cleanliness” measure to quantify the organization of the data repository. This measure consists of a newly proposed frequency drop score which takes into account the directory composition and density of clusters generated by the pipeline. We explore the efficacy of our approach using synthetic data as well as a real-world climate science dataset (U.S. Dept. of Energy, 2017).

2. Methodology

We implement a clustering-based pipeline to identify similar data irrespective of how it is organized. The pipeline is composed of four major steps: crawling, preprocessing, clustering, and calculating cleanliness.

Figure 1. Clustering pipeline.

We focus on two data types: unstructured text and structured tabular data. First, we convert files into common formats (.txt/.csv). Then, we preprocess file contents according to their data type. Text data are tokenized, stemmed, and vectorized into a TF-IDF matrix, while schemas are extracted from tabular data and used to compute a pairwise Jaccard distance matrix.
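The two preprocessing paths above can be sketched as follows; the example documents and schema sets are hypothetical placeholders, and the stemming step is omitted for brevity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical inputs: raw text documents and per-file column-name sets.
documents = ["global mean temperature anomaly", "ocean carbon flux measurements"]
schemas = [{"year", "temp_c", "station"}, {"year", "co2_ppm", "site"}]

# Text path: tokenize and vectorize into a TF-IDF matrix.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)

# Tabular path: pairwise Jaccard distance between extracted schemas.
def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

dist = [[jaccard_distance(a, b) for b in schemas] for a in schemas]
```

The TF-IDF matrix feeds the text clusterer directly, while the precomputed distance matrix is what allows the tabular clusterer to avoid any Euclidean embedding.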

For text files, we implement classic k-means clustering and the faster MiniBatch k-means clustering. For tabular files, we use agglomerative hierarchical clustering since it does not rely on centroids or other features of Euclidean space. After clustering both filetypes, we generate output clusters, composition statistics, and a dataset cleanliness score. The pipeline is then repeated over a user-specified range of k values to identify the k which best represents the data.

To measure cleanliness, we first define the frequency drop score for a clustering C of some dataset by examining the distribution of directories constituting each cluster c ∈ C. Given the number of files from each directory in a cluster, we identify the location of the largest “frequency drop” in the sorted frequency distribution—representing the point where the tail of the distribution begins. Let D_c be the set of all directories containing files from cluster c. We define the head H_c as the set of all directories before the drop, and the tail T_c as the set of all remaining directories of D_c. Under the assumptions that similar data are physically close in well-organized datasets and that the clustering is sufficiently cohesive, the resulting function yields a value in [0, 1] representing the cleanliness of the dataset. We define a logarithm-like function which is well-defined for a base of 1:

    log̃_b(x) = log_b(x) for b > 1,  with  log̃_1(1) = 0.

The frequency drop score for each cluster c is given by

    fd(c) = 1 − log̃_{|D_c|}(|H_c|),

and the score for the entire clustering C is given by the size-weighted average

    S(C) = ( Σ_{c ∈ C} |c| · fd(c) ) / ( Σ_{c ∈ C} |c| ).
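A minimal sketch of this measure follows. The split of the sorted frequency distribution at its largest consecutive drop, and the size-weighted average over clusters, are our reading of the definitions above; `frequency_drop_score` and `cleanliness` are illustrative names, not the pipeline's actual API:

```python
import math
from collections import Counter

def frequency_drop_score(cluster_dirs):
    """Cleanliness of one cluster, given the directory of each member file.

    Sorts directory frequencies, splits head from tail at the largest
    drop, and scores how small the head is relative to all directories
    holding the cluster (1.0 = one dominant directory).
    """
    counts = sorted(Counter(cluster_dirs).values(), reverse=True)
    n_dirs = len(counts)
    if n_dirs == 1:
        return 1.0  # the base-1 case of the logarithm-like function
    # The largest drop between consecutive frequencies marks the head/tail split.
    drops = [counts[i] - counts[i + 1] for i in range(n_dirs - 1)]
    head_size = drops.index(max(drops)) + 1
    return 1.0 - math.log(head_size, n_dirs)

def cleanliness(clusters):
    """Dataset-level score: cluster scores weighted by cluster size."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * frequency_drop_score(c) for c in clusters) / total
```

A cluster drawn entirely from one directory scores 1.0; a cluster whose head spans every directory it touches scores 0.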
3. Evaluation

We evaluate our approach using synthetic data as well as the Carbon Dioxide Information and Analysis Center’s (CDIAC) data repository.

As a baseline, we generated four synthetic datasets based on n-ary trees. Each synthetic dataset includes one parent directory (root node) with n children, each of which has n children, extended to any chosen height h. Each leaf node contains twenty .txt files and twenty .csv files, with each file containing the same word repeated 100 times. Each word is unique to its leaf node, such that the number of expected clusters is equal to the number of leaf nodes. These datasets, when run through our pipeline, yield:

  • perfect clusters, where each cluster contains all and only the files with the same word.

  • a cleanliness score of 1.0.

With this as a baseline, we then shuffled the datasets such that files were randomly assigned to leaf directories. Table 1 shows that the cleanliness scores decrease as the dataset is shuffled.

% Scrambled
Dataset 0% 20% 40% 60% 80% 100%
2-ary,   5-height 1.000 0.806 0.619 0.420 0.227 0.093
3-ary,   3-height 0.963 0.765 0.595 0.429 0.188 0.079
6-ary,   2-height 1.000 0.792 0.593 0.451 0.225 0.106
40-ary, 1-height 0.950 0.780 0.579 0.341 0.217 0.109
Table 1. Cleanliness scores for shuffled synthetic datasets.
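The construction and shuffling of these synthetic datasets can be sketched as follows. `make_tree` and `shuffle_files` are hypothetical helper names, files are modeled as their repeated word, and the per-leaf file count is reduced for brevity:

```python
import random

def make_tree(n, height, files_per_leaf=2):
    """n-ary tree of leaf directories: each leaf gets files that all repeat
    a word unique to that leaf, so expected clusters == number of leaves."""
    leaves = {}
    def grow(path, depth):
        if depth == height:
            leaves[path] = [f"word{len(leaves)}"] * files_per_leaf
        else:
            for i in range(n):
                grow(f"{path}/{i}", depth + 1)
    grow("root", 0)
    return leaves

def shuffle_files(leaves, fraction, seed=0):
    """Reassign the given fraction of files to randomly chosen leaves."""
    rng = random.Random(seed)
    items = [(d, f) for d, fs in leaves.items() for f in fs]
    movers = set(rng.sample(range(len(items)), int(fraction * len(items))))
    dirs = list(leaves)
    shuffled = {d: [] for d in dirs}
    for i, (d, f) in enumerate(items):
        shuffled[rng.choice(dirs) if i in movers else d].append(f)
    return shuffled
```

Running the pipeline on `shuffle_files(make_tree(n, h), p)` for increasing p reproduces the degradation pattern in Table 1.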

We compared our cleanliness score with two other measures: cluster cohesion and a modified Silhouette score (Rousseeuw, 1987), both computed with naïve filesystem tree distance. Figure 2 shows these measures calculated on progressively more shuffled synthetic datasets and real scientific data (from the pub8 subset of CDIAC). We conclude that the silhouette scores are inconsistent and noisy when compared to our cleanliness measure. The naïve tree distance score is comparable, but still fails to discriminate between repositories with vastly different organizational structures in some adversarial examples.

Figure 2. Comparison of cleanliness measures: n-ary tree synthetic dataset of tabular files with height h (left), and tabular files from pub8 (right).

4. Summary

We introduce a parallel pipeline for automated content-based clustering of files from large heterogeneous data repositories. These clusters are then used to derive a novel measure of the organizational cleanliness of a repository. The measure we developed exhibits better consistency than existing measures when tested on a variety of datasets. The code for our pipeline is publicly available.


  • Beckman et al. (2017) Paul Beckman, Tyler J Skluzacek, Kyle Chard, and Ian Foster. 2017. Skluma: A statistical learning pipeline for taming unkempt data repositories. In 29th International Conference on Scientific and Statistical Database Management. 41.
  • Brackenbury et al. (2018) Will Brackenbury, Rui Liu, Mainack Mondal, Aaron J. Elmore, Blase Ur, Kyle Chard, and Michael J. Franklin. 2018. Draining the Data Swamp: A Similarity-based Approach. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics (HILDA’18). ACM, New York, NY, USA, Article 13, 7 pages.
  • Chessell et al. (2014) M. Chessell, F. Scheepers, N. Nguyen, R. van Kessel, and R. van der Starre. 2014. Governing and Managing Big Data for Analytics and Decision Makers.
  • Hai et al. (2016) Rihan Hai, Sandra Geisler, and Christoph Quix. 2016. Constance: An intelligent data lake system. In Proceedings of the 2016 International Conference on Management of Data. ACM, 2097–2100.
  • Jeffery et al. (2008) Shawn R Jeffery, Michael J Franklin, and Alon Y Halevy. 2008. Pay-as-you-go user feedback for dataspace systems. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 847–860.
  • Madhavan et al. (2007) Jayant Madhavan, Shawn R Jeffery, Shirley Cohen, Xin Dong, David Ko, Cong Yu, and Alon Halevy. 2007. Web-scale data integration: You can only afford to pay as you go. CIDR.
  • Rousseeuw (1987) Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20 (1987), 53–65.
  • U.S. Dept. of Energy (2017) U.S. Dept. of Energy. 2017. Carbon Dioxide Information Analysis Center. Visited Feb. 28, 2017.