Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)

01/14/2023
by   George Chernishev, et al.

Pioneering data profiling systems such as Metanome and OpenClean brought public attention to science-intensive data profiling. This type of profiling aims to extract complex patterns (primitives) such as functional dependencies, data constraints, association rules, and others. However, these tools are research prototypes rather than production-ready systems. This work presents Desbordante, an open-source, high-performance, science-intensive data profiler. Unlike similar systems, it is built with an emphasis on industrial application in a multi-user environment. It is efficient, resilient to crashes, and scalable: efficiency comes from implementing the discovery algorithms in C++, resilience from extensive use of containerization, and scalability from replication of containers. Desbordante aims to open industrial-grade primitive discovery to a broader public, focusing on domain experts who are not IT professionals. Aside from discovering various primitives, Desbordante offers primitive validation, which not only reports whether a given instance of a primitive holds, but also uses special screens to point out what prevents it from holding. Desbordante also supports pipelines: ready-to-use functionality built on the discovered primitives, for example, typo detection. We provide built-in pipelines, and users can construct their own via the provided Python bindings. Unlike other profilers, Desbordante works not only with tabular data, but also with graph and transactional data. In this paper, we present Desbordante, the vision behind it, and its use cases. To provide a more in-depth perspective, we discuss its current state, its architecture, and the design decisions it is built on. Additionally, we outline our future plans.
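To make the idea of primitive validation more concrete, the sketch below checks a single functional dependency over a small table and reports which rows prevent it from holding. It is written in plain Python and does not use Desbordante or its Python bindings; the function name `validate_fd`, the sample columns `zip` and `city`, and the data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def validate_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds over rows.

    Illustrative helper (not part of any library). Returns (holds, violations),
    where violations maps each lhs value that is paired with more than one
    rhs value to a dict of {rhs value: [row indices]}.
    """
    # Group row indices by lhs value, then by rhs value.
    groups = defaultdict(lambda: defaultdict(list))
    for index, row in enumerate(rows):
        key = tuple(row[column] for column in lhs)
        value = tuple(row[column] for column in rhs)
        groups[key][value].append(index)

    # The dependency is violated wherever one lhs value maps to several rhs values.
    violations = {
        key: dict(by_rhs)
        for key, by_rhs in groups.items()
        if len(by_rhs) > 1
    }
    return (not violations, violations)


# Hypothetical sample data: the third row contains a typo in the city name.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New Yrok"},
    {"zip": "94105", "city": "San Francisco"},
]

holds, violations = validate_fd(rows, ["zip"], ["city"])
print("zip -> city holds:", holds)
for key, by_rhs in violations.items():
    print("conflicting values for", key, ":", by_rhs)
```

In this toy example the violating group isolates the misspelled row, which hints at how a typo detection pipeline could be layered on top of dependency validation. A production profiler would rely on far more efficient data structures than this row-by-row grouping.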


research
07/27/2023

Solving Data Quality Problems with Desbordante: a Demo

Data profiling is an essential process in modern data-driven industries....
research
02/10/2019

ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"

This paper documents the release of the ELKI data mining framework, vers...
research
05/12/2020

Compositional Few-Shot Recognition with Primitive Discovery and Enhancing

Few-shot learning (FSL) aims at recognizing novel classes given only few...
research
09/21/2020

Towards application-specific query processing systems

Database systems use query processing subsystems for enabling efficient ...
research
12/23/2022

Detecting Exploit Primitives Automatically for Heap Vulnerabilities on Binary Programs

Automated Exploit Generation (AEG) is a well-known difficult task, espec...
research
06/25/2019

Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives

Reinforcement learning agents that operate in diverse and complex enviro...
research
03/17/2022

Beauty and the beast: A case study on performance prototyping of data-intensive containerized cloud applications

Data-intensive container-based cloud applications have become popular wi...
