Unsupervised Matching of Data and Text

12/16/2021
by Naser Ahmadi, et al.

Entity resolution is a widely studied problem, with several proposals for matching records across relations. Matching textual content is likewise a common task in applications such as question answering and search. While recent methods achieve promising results on these two tasks, there is no clear solution for the more general problem of matching textual content against structured data. We introduce a framework that supports this new task in an unsupervised setting for any pair of corpora, whether relational tables or text documents. Our method builds a fine-grained graph over the content of the corpora and derives word embeddings that represent the objects to match in a low-dimensional space. The learned representation enables effective and efficient matching at different granularities, from relational tuples to text sentences and paragraphs. Our flexible framework can exploit pre-trained resources, but it does not depend on their existence, and it achieves higher matching quality when the vocabulary is domain specific. We also introduce optimizations in the graph creation process with an "expand and compress" approach that first identifies new valid relationships across elements, to improve matching, and then prunes nodes and edges, to reduce the graph size. Experiments on real use cases and public datasets show that our framework produces embeddings that outperform word embeddings and fine-tuned language models in both result quality and execution time.
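To make the pipeline in the abstract concrete, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes a simple bipartite graph linking each object (tuple or sentence) to its tokens, and it replaces the learned embeddings with plain co-occurrence counts collected from random walks. All function names, parameters, and sample data are hypothetical and chosen only for illustration; the "expand and compress" step is not modeled.

```python
import math
import random
from collections import Counter, defaultdict


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, strip basic punctuation.
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return cleaned.split()


def build_graph(objects):
    # Bipartite graph: each object id is linked to the tokens it contains.
    graph = defaultdict(set)
    for obj_id, text in objects.items():
        for tok in tokenize(text):
            node = "tok:" + tok
            graph[obj_id].add(node)
            graph[node].add(obj_id)
    return graph


def random_walks(graph, walk_len=10, walks_per_node=20, seed=0):
    # Fixed-length uniform random walks started from every node.
    rng = random.Random(seed)
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            for _ in range(walk_len - 1):
                walk.append(rng.choice(sorted(graph[walk[-1]])))
            walks.append(walk)
    return walks


def cooccurrence_vectors(walks, window=2):
    # Stand-in for learned embeddings: count neighbors within a window.
    vecs = defaultdict(Counter)
    for walk in walks:
        for i, node in enumerate(walk):
            lo, hi = max(0, i - window), min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[node][walk[j]] += 1
    return vecs


def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def best_match(query_id, candidate_ids, vecs):
    # Return the candidate whose vector is closest to the query's vector.
    return max(candidate_ids, key=lambda c: cosine(vecs[query_id], vecs[c]))


# Hypothetical toy corpora: two relational tuples and two sentences.
objects = {
    "t1": "Inception Christopher Nolan 2010",
    "t2": "Titanic James Cameron 1997",
    "s1": "Nolan directed Inception in 2010.",
    "s2": "Cameron released Titanic in 1997.",
}
vecs = cooccurrence_vectors(random_walks(build_graph(objects)))
match_for_s1 = best_match("s1", ["t1", "t2"], vecs)
match_for_s2 = best_match("s2", ["t1", "t2"], vecs)
```

In this toy setting each sentence is paired with the tuple it shares the most graph neighborhood with. A real system would train dense embeddings on the walks and would need far more careful tokenization and graph construction.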

