RetClean: Retrieval-Based Data Cleaning Using Foundation Models and Data Lakes

03/29/2023
by Mohammad Shahmeer Ahmad, et al.

Can foundation models (such as ChatGPT) clean your data? In this proposal, we demonstrate that ChatGPT can indeed assist in data cleaning by suggesting corrections for specific cells in a data table (scenario 1). However, ChatGPT may struggle with datasets it has never encountered before (e.g., local enterprise data) or when the user requires an explanation of the source of the suggested clean values. To address these issues, we developed a retrieval-based method that complements ChatGPT's power with a user-provided data lake. The data lake is first indexed; we then retrieve the top-k tuples most relevant to the user's query tuple and finally leverage ChatGPT to infer the correct value (scenario 2). Nevertheless, sharing enterprise data with ChatGPT, an externally hosted model, might not be feasible for privacy reasons. To support this case, we developed a custom RoBERTa-based foundation model that can be deployed locally. Fine-tuned on a small number of examples, it can effectively infer values from the retrieved tuples (scenario 3). Our proposed system, RetClean, seamlessly supports all three scenarios and provides a user-friendly GUI that enables the VLDB audience to explore and experiment with the system.
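
The index, retrieve, and infer flow of scenario 2 can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from RetClean itself: the function names and toy data are hypothetical, relevance is approximated by Jaccard token overlap, and a simple majority vote over the retrieved tuples stands in for the call to ChatGPT. The sketch also returns the supporting tuples, which is one way the source of a suggested value could be surfaced to the user.

```python
from collections import Counter


def tokens(row):
    """Serialize a tuple (a dict) into a set of lowercase tokens, skipping missing values."""
    return {tok for v in row.values() if v is not None for tok in str(v).lower().split()}


def build_index(data_lake):
    """Toy 'index': precompute the token set of every tuple in the data lake."""
    return [(row, tokens(row)) for row in data_lake]


def retrieve_top_k(index, query, k=3):
    """Return the k tuples with the highest Jaccard overlap with the query tuple."""
    q = tokens(query)

    def score(entry):
        _, t = entry
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    ranked = sorted(index, key=score, reverse=True)
    return [row for row, _ in ranked[:k]]


def infer_value(retrieved, target_attr):
    """Placeholder for the model's inference step (assumption: majority vote).

    Returns the suggested value and the retrieved tuples that support it.
    """
    candidates = [r for r in retrieved if r.get(target_attr) is not None]
    if not candidates:
        return None, []
    value, _ = Counter(r[target_attr] for r in candidates).most_common(1)[0]
    sources = [r for r in candidates if r[target_attr] == value]
    return value, sources


# Hypothetical data lake and a query tuple with a missing "country" cell.
data_lake = [
    {"name": "Alice Smith", "city": "Doha", "country": "Qatar"},
    {"name": "Bob Jones", "city": "Doha", "country": "Qatar"},
    {"name": "Carol White", "city": "Paris", "country": "France"},
]
query = {"name": "Dana Black", "city": "Doha", "country": None}

index = build_index(data_lake)
retrieved = retrieve_top_k(index, query, k=2)
value, sources = infer_value(retrieved, "country")
print(value, sources)
```

In a real deployment, the token-overlap index would likely be replaced by a proper search or embedding index, and `infer_value` by a prompt to a hosted or locally deployed model. The sketch only shows how the three steps fit together.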

