Niffler: A Reference Architecture and System Implementation for View Discovery over Pathless Table Collections by Example

06/03/2021
by Yue Gong, et al.

Identifying a project-join view (PJ-view) over collections of tables is the first step of many data management projects, e.g., assembling a dataset to feed into a business intelligence tool or creating a training dataset to fit a machine learning model. When the table collections are large and lack join information, such as when combining databases or working on data lakes, query-by-example (QBE) systems can help identify relevant data. However, these systems are designed under the assumption that join information is available in the schema, and they do not perform well on pathless table collections, i.e., collections that have no join path information. We present a reference architecture that explicitly divides the end-to-end problem of discovering PJ-views over pathless table collections into a human problem and a technical problem. We then present Niffler, a system built to address the technical problem. We introduce algorithms for the main components of Niffler, including a signal generation component that helps reduce the number of candidate views, which may be large due to errors and ambiguity in both the data and the input queries. We evaluate Niffler on real datasets to demonstrate the effectiveness of the new engine in discovering PJ-views over pathless table collections.
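To make the problem setting concrete, the following is a minimal Python sketch of query by example over a pathless collection. It is not Niffler's algorithm: the toy tables, the function names, the value-containment heuristic for guessing join edges, and the 0.9 threshold are all assumptions made for illustration. The sketch matches example values to columns, infers possible joins from value overlap because the schema provides none, and lists single-hop candidate PJ-views.

```python
from itertools import combinations, product

# Hypothetical toy collection: table name -> {column name -> values}.
# No join information is declared anywhere (a "pathless" collection).
TABLES = {
    "employees": {"emp_id": [1, 2, 3], "name": ["Ana", "Bo", "Cy"]},
    "salaries": {"eid": [1, 2, 3], "amount": [50, 60, 70]},
    "offices": {"city": ["Oslo", "Rome"], "floor": [1, 2]},
}


def containment(a, b):
    """Fraction of the distinct values of a that also appear in b."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa) if sa else 0.0


def columns_matching(example_values, tables):
    """Columns that contain every example value given for one attribute."""
    wanted = set(example_values)
    return [
        (table, column)
        for table, columns in tables.items()
        for column, values in columns.items()
        if wanted <= set(values)
    ]


def inferred_joins(tables, threshold=0.9):
    """Guess join edges across tables from value containment (an assumption)."""
    all_columns = [(t, c) for t, columns in tables.items() for c in columns]
    edges = []
    for (t1, c1), (t2, c2) in combinations(all_columns, 2):
        if t1 == t2:
            continue
        score = max(
            containment(tables[t1][c1], tables[t2][c2]),
            containment(tables[t2][c2], tables[t1][c1]),
        )
        if score >= threshold:
            edges.append(((t1, c1), (t2, c2)))
    return edges


def candidate_views(example_columns, tables, threshold=0.9):
    """List candidate PJ-views as (projected columns, join edge or None)."""
    matches = [columns_matching(ex, tables) for ex in example_columns]
    joins = inferred_joins(tables, threshold)
    views = []
    for combo in product(*matches):
        needed = {table for table, _ in combo}
        if len(needed) == 1:
            views.append((combo, None))  # projection over a single table
        elif len(needed) == 2:
            for left, right in joins:  # single-hop joins only
                if {left[0], right[0]} == needed:
                    views.append((combo, (left, right)))
    return views


# Example query: one attribute with names, one with salary amounts.
views = candidate_views([["Ana", "Bo"], [50, 60]], TABLES)
```

In this toy data the containment heuristic also links `offices.floor` to `employees.emp_id`, because small integers overlap by accident. Spurious edges of this kind are one reason the candidate set can grow on real collections, which is the motivation the abstract gives for a component that narrows down candidate views.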


