Detecting Quality Problems in Data Models by Clustering Heterogeneous Data Values

11/12/2021
by Viola Wenz, et al.

Data is of high quality if it is fit for its intended use. The quality of data is influenced by the quality of the underlying data model. One major quality problem is the heterogeneity of data, which impairs quality aspects such as understandability and interoperability. This heterogeneity may be caused by quality problems in the data model. In particular, data heterogeneity can occur when information is insufficiently structured and captured only in data values, often because the underlying data model lacks suitable structure. We propose a bottom-up approach to detecting quality problems in data models that manifest in heterogeneous data values. It supports an explorative analysis of existing data and can be configured by domain experts according to their domain knowledge. All values of a selected data field are clustered by syntactic similarity, which provides an overview of the syntactic diversity of the data values. This overview is intended to help domain experts understand how the data model is used in practice and derive potential quality problems of the data model. We outline a proof-of-concept implementation and evaluate our approach using cultural heritage data.
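To make the idea of clustering a field's values by syntactic similarity concrete, here is a minimal sketch. The abstract does not state which similarity measure or clustering algorithm the authors use, so the sketch substitutes a simple stand-in: each value is abstracted into a coarse character-class pattern, and values with the same pattern are grouped. The function names `syntactic_pattern` and `cluster_by_syntax` and the sample date values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def syntactic_pattern(value: str) -> str:
    """Abstract a value into a coarse syntactic pattern (illustrative assumption).

    Runs of letters become 'A', runs of digits become '9', runs of
    whitespace become a single space; all other characters are kept as-is.
    """
    pattern = []
    for ch in value.strip():
        if ch.isalpha():
            symbol = "A"
        elif ch.isdigit():
            symbol = "9"
        elif ch.isspace():
            symbol = " "
        else:
            symbol = ch
        # Collapse a run of the same character class into one symbol.
        if symbol in ("A", "9", " ") and pattern and pattern[-1] == symbol:
            continue
        pattern.append(symbol)
    return "".join(pattern)


def cluster_by_syntax(values):
    """Group values that share the same syntactic pattern."""
    clusters = defaultdict(list)
    for value in values:
        clusters[syntactic_pattern(value)].append(value)
    return dict(clusters)


# Hypothetical values of a single "date" field; not taken from the paper.
dates = ["1890", "ca. 1890", "1890-1895", "12.03.1890", "1901", "um 1900", "1910-1920"]
clusters = cluster_by_syntax(dates)

# Print the largest clusters first to give an overview of syntactic diversity.
for pattern, members in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
    print(f"{pattern!r}: {members}")
```

In this toy example, a domain expert reviewing the clusters might notice that date ranges and uncertainty markers such as "ca." are encoded inside free-text values, which could hint at missing structure in the data model. Exact pattern matching is a deliberately crude choice; a configurable similarity measure would allow near-identical patterns to be merged.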

