DomainNet: Homograph Detection for Data Lake Disambiguation

03/17/2021
by Aristotelis Leventidis, et al.

Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management, and data science, we show that data lakes provide a new opportunity for disambiguation of data values, since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values. DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs has a precision and a recall of 38% [...] method on a synthetic benchmark. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake our top-200 precision is 89%.
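
To make the graph idea concrete, here is a minimal sketch of ranking values by centrality on a value-attribute bipartite graph. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which centrality measure is used, so betweenness centrality (computed with Brandes' algorithm) is chosen here as one plausible option, and the toy lake `LAKE` and the helper names `build_graph`, `betweenness`, and `rank_values` are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical toy data lake: attribute name -> set of values in that column.
LAKE = {
    "zoo.animal": {"Jaguar", "Puma", "Lion", "Panda"},
    "wildlife.species": {"Jaguar", "Puma", "Lion", "Tiger"},
    "cars.make": {"Jaguar", "Toyota", "Volvo", "Fiat"},
    "dealers.brand": {"Jaguar", "Toyota", "Volvo", "Ford"},
}


def build_graph(lake):
    """Bipartite graph: an edge links each value to every attribute it occurs in."""
    adj = defaultdict(set)
    for attr, values in lake.items():
        for value in values:
            a, v = ("attr", attr), ("val", value)  # tagged to keep the two node kinds apart
            adj[a].add(v)
            adj[v].add(a)
    return dict(adj)


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted graph)."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                        # nodes in order of non-decreasing distance from s
        preds = {n: [] for n in adj}      # predecessors on shortest paths from s
        sigma = dict.fromkeys(adj, 0)     # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:           # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)   # dependency of s on each node
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints in an undirected graph.
    return {n: c / 2 for n, c in score.items()}


def rank_values(lake):
    """Return (value, score) pairs, most central first; no labels are needed."""
    scores = betweenness(build_graph(lake))
    values = {name: sc for (kind, name), sc in scores.items() if kind == "val"}
    return sorted(values.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for value, sc in rank_values(LAKE):
        print(f"{value:8s} {sc:.2f}")
```

In this toy lake, "Jaguar" is the only value that bridges the animal columns and the car columns, so it sits on every shortest path between the two groups and ranks first. Values such as "Puma" that repeat only within one group receive a much lower score, and values that occur once score zero.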
