Data Discovery using Natural Language Questions via a Self-Supervised Approach

01/09/2023
by Qiming Wang, et al.

Data discovery systems help users identify relevant data among large table collections. Users express their discovery needs with a program or a set of keywords. Programs can express complex queries, but writing them requires expertise. Keyword search is accessible to a larger audience but limits the types of queries supported. A promising alternative is learned discovery systems, which find tables given natural language questions. Unfortunately, these systems require a training dataset for each table collection, and because collecting training data is expensive, their adoption is limited. In this paper, we introduce a self-supervised approach to assemble training datasets and train learned discovery systems without human intervention. Doing so requires addressing several challenges, including the design of self-supervised strategies for data discovery, table representation strategies for feeding tables to the models, and relevance models that work well with synthetically generated questions. We combine these contributions into a system, S2LD, that solves the problem end to end. The evaluation results demonstrate that the new techniques outperform state-of-the-art approaches on well-known benchmarks. All in all, the technique is a stepping stone towards building learned discovery systems. The code is open-sourced at https://github.com/TheDataStation/open_table_discovery.
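
The abstract does not spell out how the training data is assembled, so the following is only a minimal sketch of the general idea of self-supervision for table discovery: questions are synthesized from the tables themselves, and each question is labeled with the table it came from. The `Table` class, the `synthesize_pairs` function, and the question template are hypothetical and are not taken from S2LD.

```python
import random
from dataclasses import dataclass


@dataclass
class Table:
    # Hypothetical minimal table representation (not the S2LD format).
    table_id: str
    columns: list[str]
    rows: list[list[str]]


def synthesize_pairs(tables, per_table=2, seed=0):
    """Return (question, table_id) pairs built from sampled rows.

    Assumption: a fixed template stands in for whatever question
    generation strategy a real system would use.
    """
    rng = random.Random(seed)
    pairs = []
    for table in tables:
        if len(table.columns) < 2 or not table.rows:
            continue  # need two distinct columns and at least one row
        for _ in range(per_table):
            row = rng.choice(table.rows)
            # Pick two different columns: one to condition on, one to ask about.
            key_idx, target_idx = rng.sample(range(len(table.columns)), 2)
            question = (
                f"What is the {table.columns[target_idx]} "
                f"where {table.columns[key_idx]} is {row[key_idx]}?"
            )
            # The source table is the positive label, so no human annotation is needed.
            pairs.append((question, table.table_id))
    return pairs


tables = [
    Table("cities", ["city", "country", "population"],
          [["Paris", "France", "2100000"], ["Lima", "Peru", "9700000"]]),
    Table("empty", ["only_column"], []),
]
pairs = synthesize_pairs(tables)
for question, table_id in pairs:
    print(table_id, "|", question)
```

Pairs produced this way could serve as positive examples for a relevance model, with other tables in the collection acting as negatives. How S2LD actually generates questions and represents tables is described in the full paper and the linked repository.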


