Eris: Measuring discord among multidimensional data sources

01/31/2022
by Alberto Abelló, et al.

Data integration is a classical problem in databases, typically decomposed into schema matching, entity matching and record merging. To solve the last of these, it is mostly assumed that ground truth can be determined, either as master data or from user feedback. In many cases, however, this assumption does not hold: the merging processes cannot be accurate enough, and the data gathering processes in the different sources are imperfect and cannot provide high-quality data. Instead of enforcing consistency, we propose to evaluate how concordant or discordant sources are as a measure of trustworthiness (the more discordant the sources, the less we can trust their data). Thus, we define the discord measurement problem: given a set of uncertain raw observations or aggregate results (such as case/hospitalization/death data relevant to COVID-19) and information on the alignment of different data (for example, cases and deaths), we wish to assess whether the different sources are concordant and, if not, to measure how discordant they are. We also define a set of algebraic operators to describe the alignments, together with two alternative relational implementations that reduce the problem to linear or quadratic programming. These are evaluated against both COVID-19 and synthetic data, and our experimental results show that discordancy measurement can be performed efficiently in realistic situations.
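
To make the idea of measuring discord concrete, the following is a minimal sketch and not the authors' operators or implementation. It assumes a single linear alignment constraint (two regional counts must add up to a separately reported total) and defines discord as the smallest sum of squared adjustments that makes the observations satisfy it. That is a tiny quadratic program which, for one equality constraint, has a closed-form solution. The function name `discord`, the coefficient encoding and the figures are all hypothetical.

```python
def discord(observations, coefficients):
    """Illustrative discord score under ONE linear alignment constraint.

    The constraint is sum(c_i * x_i) = 0. The score is the minimum of
    sum(e_i ** 2) such that the adjusted values x_i + e_i satisfy it.
    This is an assumed formulation, not the one defined in the paper.
    """
    if len(observations) != len(coefficients):
        raise ValueError("observations and coefficients must align")
    norm = sum(c * c for c in coefficients)
    if norm == 0:
        raise ValueError("the constraint must have a non-zero coefficient")

    # How far the raw observations are from satisfying the constraint.
    violation = sum(c * x for c, x in zip(coefficients, observations))

    # Orthogonal projection onto the constraint: e_i = -c_i * violation / norm.
    step = violation / norm
    repaired = [x - c * step for c, x in zip(coefficients, observations)]

    # Squared length of the adjustment vector e.
    score = violation * violation / norm
    return score, repaired


# Made-up figures: regions report 120 and 80 cases, another source reports
# a total of 206. The alignment is region1 + region2 - total = 0.
score, repaired = discord([120.0, 80.0, 206.0], [1.0, 1.0, -1.0])
print(score, repaired)
```

A score of zero means the sources are concordant under the given alignment, and larger scores mean larger adjustments are needed to reconcile them. Several simultaneous constraints or per-source uncertainty weights would need a general linear or quadratic programming solver rather than this closed form.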

