Matching Table Metadata with Business Glossaries Using Large Language Models

09/08/2023
by Elita Lobo, et al.

Enterprises often own large collections of structured data in the form of large databases or an enterprise data lake. Such collections come with limited metadata and strict access policies that can restrict access to the data contents and, therefore, limit the application of classic retrieval and analysis solutions. As a result, there is a need for solutions that can effectively utilize the available metadata. In this paper, we study the problem of matching table metadata to a business glossary containing data labels and descriptions. The resulting matching enables the use of an available or curated business glossary for retrieval and analysis before, or without, requesting access to the data contents. One solution to this problem is to apply manually defined rules or similarity measures to column names and glossary descriptions (or their vector embeddings) to find the closest match. However, such approaches need to be tuned through manual labeling, and they cannot handle the many business glossaries that mix simple descriptions with long and complex ones. In this work, we leverage large language models (LLMs) to design generic matching methods that do not require manual tuning and can identify complex relations between column names and glossaries. We propose methods that utilize LLMs in two ways: (a) by generating additional context for column names that can aid matching, and (b) by directly inferring whether a relation exists between column names and glossary descriptions. Our preliminary experimental results show the effectiveness of our proposed methods.
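
To make the baseline mentioned in the abstract concrete, the following is a minimal sketch of a similarity-based matcher that scores a column name against each glossary entry and returns the closest one. It is not the authors' method: the Jaccard token overlap, the function names, and the two-entry `GLOSSARY` are all assumptions chosen for illustration, and a real system might use vector embeddings instead.

```python
import re

# Hypothetical business glossary: label -> description (illustrative only).
GLOSSARY = {
    "Date of Birth": "The birth date of the customer",
    "Account Balance": "Current balance of the account in the local currency",
}


def tokenize(text):
    """Split snake_case, camelCase, and free text into a set of lowercase tokens."""
    # Insert a space at lower-to-upper boundaries so camelCase splits into words.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    return set(re.findall(r"[a-z0-9]+", spaced.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def closest_glossary_entry(column_name, glossary):
    """Return (score, label) for the glossary entry most similar to the column name."""
    col_tokens = tokenize(column_name)
    scored = [
        (jaccard(col_tokens, tokenize(label + " " + description)), label)
        for label, description in glossary.items()
    ]
    # Ties on score keep the first entry in glossary order.
    return max(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    for column in ("cust_birth_date", "acctBal"):
        print(column, closest_glossary_entry(column, GLOSSARY))
```

In this sketch, `cust_birth_date` shares the tokens `birth` and `date` with the first entry and is matched to it, while the abbreviated `acctBal` shares no token with any entry and scores zero everywhere. This is the kind of case where purely lexical measures need manual tuning, and where additional context for the column name could help.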

