JSONoid: Monoid-based Enrichment for Configurable and Scalable Data-Driven Schema Discovery

07/06/2023
by Michael J. Mior, et al.

Schema discovery is an important aspect of working with data in formats such as JSON. Unlike relational databases, JSON datasets often have no associated structural information. Consumers of such datasets are often left to browse through the data, looking for commonalities in structure across documents, before they can write suitable data processing code. This process is time-consuming and error-prone. Existing distributed approaches to mining schemas offer a significant usability advantage because they provide useful metadata for large data sources. However, depending on the data source, the ad hoc queries needed to estimate other properties that help in crafting an efficient data pipeline can be expensive. We propose JSONoid, a distributed schema discovery process augmented with additional metadata in the form of monoid data structures, which are easy to maintain in a distributed setting. JSONoid subsumes several existing approaches to distributed schema discovery while offering similar performance. Our approach also enriches discovered schemas with substantial additional information about data values and scales linearly.
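To illustrate why monoid data structures are easy to maintain in a distributed setting, here is a minimal sketch of one such structure. It is not taken from JSONoid itself: the `NumStats` class, its fields, and the `summarize` helper are hypothetical names chosen for illustration. The sketch assumes only the two monoid properties, an identity element (the empty summary) and an associative `merge`, which together let each partition be summarized independently and the partial results combined in any grouping.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Optional


@dataclass(frozen=True)
class NumStats:
    """Hypothetical per-field summary of numeric values (illustration only)."""
    count: int = 0                    # the default instance is the identity element
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    @staticmethod
    def of(value: float) -> "NumStats":
        # Summary of a single observed value.
        return NumStats(1, value, value)

    def merge(self, other: "NumStats") -> "NumStats":
        # Associative combine; merging with the empty summary is a no-op.
        if self.count == 0:
            return other
        if other.count == 0:
            return self
        return NumStats(
            self.count + other.count,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )


def summarize(values: Iterable[float]) -> NumStats:
    # Fold one partition's values into a single summary.
    return reduce(NumStats.merge, (NumStats.of(v) for v in values), NumStats())


# Two partitions summarized independently, then merged.
partition_a = summarize([3, 9, 4])
partition_b = summarize([7, 1])
combined = partition_a.merge(partition_b)
```

Because `merge` is associative and has an identity, `combined` equals the summary computed over all five values in a single pass, regardless of how the data is partitioned.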


