Ground Truth Inference for Weakly Supervised Entity Matching

11/13/2022
by Renzhi Wu, et al.

Entity matching (EM) refers to the problem of identifying pairs of data records in one or more relational tables that refer to the same entity in the real world. Supervised machine learning (ML) models currently achieve state-of-the-art matching performance; however, they require many labeled examples, which are often expensive or infeasible to obtain. This has inspired us to approach data labeling for EM using weak supervision. In particular, we use the labeling function abstraction popularized by Snorkel, where each labeling function (LF) is a user-provided program that can generate many noisy match/non-match labels quickly and cheaply. Given a set of user-written LFs, the quality of data labeling depends on a labeling model that accurately infers the ground-truth labels. In this work, we first propose a simple but powerful labeling model for general weak supervision tasks. Then, we tailor the labeling model specifically to the task of entity matching by considering the EM-specific transitivity property. The general form of our labeling model is simple, yet it substantially outperforms the best existing method across ten general weak supervision datasets. To tailor the labeling model for EM, we formulate an approach to ensure that the final predictions of the labeling model satisfy the transitivity property required in EM, utilizing an exact solution where possible and an ML-based approximation in the remaining cases. On two single-table and nine two-table real-world EM datasets, we show that our labeling model results in a 9[...]. We also show that a deep learning EM end model (DeepMatcher) trained on labels generated from our weak supervision approach is comparable to an end model trained using tens of thousands of ground-truth labels, demonstrating that our approach can significantly reduce the labeling effort required in EM.
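
To make the labeling-function abstraction and the transitivity property concrete, here is a minimal Python sketch. It is not the labeling model proposed in the paper: a plain majority vote stands in for the labeling model, and a transitive closure (union-find) stands in for the exact and ML-based transitivity handling the abstract mentions. The toy records, the two LFs, and all thresholds are hypothetical and chosen only for illustration.

```python
from itertools import combinations

MATCH, NON_MATCH, ABSTAIN = 1, 0, -1

# Hypothetical toy records (not taken from the paper's datasets).
records = {
    0: {"name": "Apple iPhone 13", "price": 799},
    1: {"name": "Apple iPhone 13 128GB", "price": 805},
    2: {"name": "iPhone 13 128GB black", "price": 815},
    3: {"name": "Samsung Galaxy S21", "price": 699},
}

def lf_name_overlap(a, b):
    """Hypothetical LF: vote on the Jaccard overlap of name tokens."""
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    jaccard = len(ta & tb) / len(ta | tb)
    if jaccard >= 0.5:
        return MATCH
    if jaccard <= 0.2:
        return NON_MATCH
    return ABSTAIN

def lf_price(a, b):
    """Hypothetical LF: vote on the absolute price difference."""
    diff = abs(a["price"] - b["price"])
    if diff <= 10:
        return MATCH
    if diff >= 50:
        return NON_MATCH
    return ABSTAIN

LFS = [lf_name_overlap, lf_price]

def majority_vote(votes):
    """Stand-in labeling model: ignore abstains, default to non-match on ties."""
    pos = sum(v == MATCH for v in votes)
    neg = sum(v == NON_MATCH for v in votes)
    return MATCH if pos > neg else NON_MATCH

def transitivity_violations(labels, ids):
    """Triples where exactly two of the three pairs are labeled as matches."""
    return [
        (a, b, c)
        for a, b, c in combinations(ids, 3)
        if labels[(a, b)] + labels[(b, c)] + labels[(a, c)] == 2
    ]

def transitive_closure(labels, ids):
    """Merge matched records into clusters with union-find, then relabel pairs."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), y in labels.items():
        if y == MATCH:
            parent[find(i)] = find(j)
    return {(i, j): MATCH if find(i) == find(j) else NON_MATCH for (i, j) in labels}

ids = sorted(records)
raw = {
    (i, j): majority_vote([lf(records[i], records[j]) for lf in LFS])
    for i, j in combinations(ids, 2)
}
closed = transitive_closure(raw, ids)

print("violations before:", transitivity_violations(raw, ids))
print("violations after: ", transitivity_violations(closed, ids))
```

In this toy example both LFs abstain on the pair (0, 2), so the majority vote labels it a non-match even though (0, 1) and (1, 2) are matches. The closure step repairs that inconsistency, but it does so by trusting every predicted match, which is one reason a more careful treatment of transitivity is attractive.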

research
06/21/2021

Demonstration of Panda: A Weakly Supervised Entity Matching System

Entity matching (EM) refers to the problem of identifying tuple pairs in...
research
08/16/2019

AutoER: Automated Entity Resolution using Generative Modelling

Entity resolution (ER) refers to the problem of identifying records in o...
research
08/02/2023

MultiEM: Efficient and Effective Unsupervised Multi-Table Entity Matching

Entity Matching (EM), which aims to identify all entity pairs referring ...
research
07/27/2022

Learning Hyper Label Model for Programmatic Weak Supervision

To reduce the human annotation efforts, the programmatic weak supervisio...
research
12/17/2021

A data-centric weak supervised learning for highway traffic incident detection

Using the data from loop detector sensors for near-real-time detection o...
research
03/01/2020

GPM: A Generic Probabilistic Model to Recover Annotator's Behavior and Ground Truth Labeling

In the big data era, data labeling can be obtained through crowdsourcing...
research
10/21/2020

Complex data labeling with deep learning methods: Lessons from fisheries acoustics

Quantitative and qualitative analysis of acoustic backscattered signals ...
