Technical Report: Optimizing Human Involvement for Entity Matching and Consolidation

06/15/2019
by Ji Sun, et al.

An end-to-end data integration system requires human feedback in several phases: collecting training data for entity matching, debugging the resulting clusters, confirming transformations applied to these clusters for data standardization, and finally reducing each cluster to a single, canonical representation (or "golden record"). The conventional wisdom is to apply the human feedback sequentially, obtained by asking specific questions, within some budget in each phase. However, these questions are highly correlated; the answer to one can influence the outcome of any phase of the pipeline. Hence, interleaving them has the potential to offer significant benefits. In this paper, we propose a human-in-the-loop framework that interleaves different types of questions to optimize human involvement. We propose benefit models to measure the quality improvement from asking a question, and cost models to measure the human time it takes to answer a question. We develop a question scheduling framework that judiciously selects questions to maximize the accuracy of the final golden records. Experimental results on three real-world datasets show that our holistic method significantly improves the quality of golden records from 70% to 90%, compared with state-of-the-art approaches.
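To make the idea of trading off benefit against cost concrete, the following is a minimal sketch of budgeted question selection. It is not the scheduling algorithm of the paper: it swaps in a plain greedy benefit-per-cost heuristic, and the `Question` class, its fields, the `schedule` function, and all numbers are hypothetical. It also treats benefit estimates as fixed, whereas the abstract stresses that questions are correlated, so a real scheduler would presumably re-estimate benefits after each answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """A candidate question for the human (all fields are hypothetical)."""
    kind: str        # e.g. "training", "cluster", "transformation"
    benefit: float   # estimated quality improvement if answered
    cost: float      # estimated human time to answer, in seconds (must be > 0)


def schedule(questions, budget):
    """Greedily pick questions by benefit-per-cost until the budget runs out."""
    chosen = []
    remaining = budget
    # Rank all question types together, so they interleave instead of
    # being asked phase by phase.
    ranked = sorted(questions, key=lambda q: q.benefit / q.cost, reverse=True)
    for q in ranked:
        if q.cost <= remaining:
            chosen.append(q)
            remaining -= q.cost
    return chosen


# Illustrative values only; they are not taken from the paper.
candidates = [
    Question("training", benefit=0.04, cost=5.0),
    Question("cluster", benefit=0.10, cost=20.0),
    Question("transformation", benefit=0.09, cost=10.0),
    Question("training", benefit=0.01, cost=5.0),
]
selected = schedule(candidates, budget=30.0)
for q in selected:
    print(q.kind, q.benefit, q.cost)
```

With these made-up numbers, the expensive cluster question is skipped because cheaper questions with a better benefit-per-cost ratio have already consumed part of the budget; a per-phase budget would not allow that kind of trade across question types.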


