Recognizing Variables from their Data via Deep Embeddings of Distributions

09/11/2019
by   Jonas Mueller, et al.
13

A key obstacle in automated analytics and meta-learning is the inability to recognize when different datasets contain measurements of the same variable. Because provided attribute labels are often uninformative in practice, this task may be more robustly addressed by leveraging the data values themselves rather than just relying on their arbitrarily selected variable names. Here, we present a computationally efficient method to identify high-confidence variable matches between a given set of data values and a large repository of previously encountered datasets. Our approach enjoys numerous advantages over distributional similarity based techniques because we leverage learned vector embeddings of datasets which adaptively account for natural forms of data variation encountered in practice. Based on the neural architecture of deep sets, our embeddings can be computed for both numeric and string data. In dataset search and schema matching tasks, our methods outperform standard statistical techniques and we find that the learned embeddings generalize well to new data sources.

READ FULL TEXT
research
10/05/2020

LEAPME: Learning-based Property Matching with Embeddings

Data integration tasks such as the creation and extension of knowledge g...
research
06/12/2021

Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)

In this paper, we present SANTA, a scalable framework to automatically n...
research
05/19/2022

Gender Bias in Meta-Embeddings

Combining multiple source embeddings to create meta-embeddings is consid...
research
09/03/2019

Local Embeddings for Relational Data Integration

Integrating information from heterogeneous data sources is one of the fu...
research
03/14/2023

AutoTransfer: AutoML with Knowledge Transfer – An Application to Graph Neural Networks

AutoML has demonstrated remarkable success in finding an effective neura...
research
09/05/2018

Merging datasets through deep learning

Merging datasets is a key operation for data analytics. A frequent requi...
research
06/07/2023

Answering Compositional Queries with Set-Theoretic Embeddings

The need to compactly and robustly represent item-attribute relations ar...

Please sign up or login with your details

Forgot password? Click here to reset