In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling

03/02/2017
by Neil G. Marchant, et al.

Entity resolution (ER) presents unique challenges for evaluation methodology. While crowdsourcing platforms acquire ground truth, sound approaches to sampling must drive labelling efforts. In ER, extreme class imbalance between matching and non-matching records can lead to enormous labelling requirements when seeking statistically consistent estimates for rigorous evaluation. This paper addresses this important challenge with the OASIS algorithm: a sampler and F-measure estimator for ER evaluation. OASIS draws samples from a (biased) instrumental distribution, chosen to ensure estimators with optimal asymptotic variance. As new labels are collected, OASIS updates this instrumental distribution via a Bayesian latent variable model of the annotator oracle, to quickly focus on unlabelled items providing more information. We prove that the resulting estimates of F-measure, precision, and recall converge to the true population values. Thorough comparisons of sampling methods on a variety of ER datasets demonstrate significant labelling reductions of up to 83% when estimating accuracy.
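The core idea of sampling from a biased instrumental distribution and reweighting can be illustrated with a minimal sketch. This is not the OASIS algorithm: it uses a fixed instrumental distribution, with no adaptive updates and no Bayesian oracle model. The function name `estimate_f_measure`, the parameters `oracle`, `q` and `alpha`, and the synthetic pool are all assumptions made for illustration.

```python
import random


def estimate_f_measure(preds, oracle, q, n_samples, alpha=0.5, seed=0):
    """Importance-weighted F-measure estimate from a FIXED instrumental distribution.

    preds  : list of 0/1 predictions for every item in the pool
    oracle : callable, oracle(i) -> 0/1 true label of item i (the costly step)
    q      : non-negative sampling weights, one per item (need not be normalised)
    alpha  : 0.5 gives the balanced F-measure (F1)
    """
    rng = random.Random(seed)
    n = len(preds)
    q_total = sum(q)
    # Draw item indices with replacement, with probability proportional to q.
    draws = rng.choices(range(n), weights=q, k=n_samples)

    num = 0.0  # weighted true positives
    den = 0.0  # weighted alpha * predicted positives + (1 - alpha) * actual positives
    for i in draws:
        # Importance weight: uniform target density over instrumental density.
        w = (1.0 / n) / (q[i] / q_total)
        y = oracle(i)
        num += w * y * preds[i]
        den += w * (alpha * preds[i] + (1 - alpha) * y)
    return num / den if den > 0 else float("nan")


# Hypothetical imbalanced pool: 2000 record pairs, 40 of them true matches.
labels = [1 if i < 40 else 0 for i in range(2000)]
# Predictions give 30 true positives, 10 false positives, 10 false negatives.
preds = [1 if (i < 30 or 40 <= i < 50) else 0 for i in range(2000)]
# Assumed instrumental weights: oversample predicted matches tenfold.
q = [10.0 if p == 1 else 1.0 for p in preds]

estimate = estimate_f_measure(preds, labels.__getitem__, q, n_samples=20000)
```

By construction the true F1 of this synthetic pool is 0.75, so `estimate` should land close to that value. Choosing `q` well is what determines how many labels are needed; the abstract describes OASIS as choosing and updating this distribution to achieve optimal asymptotic variance, which the fixed `q` above does not attempt.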

research
06/12/2020

A general framework for label-efficient online evaluation with asymptotic guarantees

Achieving statistically significant evaluation with passive sampling of ...
research
02/21/2021

Adaptive Importance Sampling for Efficient Stochastic Root Finding and Quantile Estimation

In solving simulation-based stochastic root-finding or optimization prob...
research
10/21/2020

Optimal Off-Policy Evaluation from Multiple Logging Policies

We study off-policy evaluation (OPE) from multiple logging policies, eac...
research
03/20/2019

Adaptive importance sampling by kernel smoothing

A key determinant of the success of Monte Carlo simulation is the sampli...
research
05/02/2018

Selection of proposal distributions for generalized importance sampling estimators

The standard importance sampling (IS) method uses samples from a single ...
research
06/26/2019

Near Optimal Stratified Sampling

The performance of a machine learning system is usually evaluated by usi...
