Solving Data Quality Problems with Desbordante: a Demo

07/27/2023
by George Chernishev, et al.

Data profiling is an essential process in modern data-driven industries. One of its critical components is the discovery and validation of complex statistics, including functional dependencies, data constraints, association rules, and others. However, most existing data profiling systems that focus on complex statistics do not integrate properly with the tools used by contemporary data scientists, which creates a significant barrier to their adoption in industry. Moreover, existing systems were not created with industrial-grade workloads in mind. Finally, they do not aim to provide descriptive explanations, i.e., why a given pattern is not found. This is a significant issue, since understanding why a specific pattern is absent is essential for making informed decisions based on the data. As a result, these patterns are effectively left hanging in thin air: their application scope is rather limited, and they are rarely used by the broader public. At the same time, as we demonstrate in this presentation, complex statistics can be used efficiently to solve many classic data quality problems. Desbordante is an open-source data profiler that aims to close this gap. It is built with an emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations. Furthermore, it provides seamless Python integration by offloading various costly operations, not only mining, to the C++ core. In this demonstration, we show several scenarios that allow end users to solve different data quality problems. Namely, we showcase typo detection, data deduplication, and data anomaly detection scenarios.
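To make the idea of validating a functional dependency with an explanation concrete, here is a minimal sketch in plain Python. It does not use Desbordante or reproduce its API. The function name `fd_violations`, the column names, and the toy rows are all invented for illustration. The sketch checks whether a dependency such as zip -> city holds and, when it does not, reports which rows disagree, which is the kind of evidence that can point to a typo.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Group rows by their lhs values and report groups with more than one rhs value.

    Hypothetical helper, not part of Desbordante. Returns a dict that maps each
    violating lhs key to {rhs_value: [row indices]}. An empty dict means the
    dependency lhs -> rhs holds on the given rows.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for index, row in enumerate(rows):
        key = tuple(row[column] for column in lhs)
        groups[key][row[rhs]].append(index)
    # A group violates the dependency when one lhs key maps to several rhs values.
    return {
        key: dict(values)
        for key, values in groups.items()
        if len(values) > 1
    }


# Toy data invented for this example; row 2 contains a misspelled city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New Yrok"},
    {"zip": "94105", "city": "San Francisco"},
]

violations = fd_violations(rows, lhs=["zip"], rhs="city")
for key, values in violations.items():
    print(key, values)
```

The returned mapping acts as a simple explanation: rather than only stating that the dependency fails, it lists the conflicting values and the rows that carry them, so the rare value in a group can be inspected as a likely typo. A production profiler would handle this far more efficiently and at scale; this sketch only illustrates the concept.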


