Fast and Reliable Missing Data Contingency Analysis with Predicate-Constraints

04/08/2020
by Xi Liang, et al.
Today, data analysts largely rely on intuition to determine whether missing or withheld rows of a dataset significantly affect their analyses. We propose a framework that automatically produces a contingency analysis, i.e., the range of values an aggregate SQL query could take, under formal constraints describing the variation and frequency of missing data tuples. We describe how to process SUM, COUNT, AVG, MIN, and MAX queries under these conditions, yielding hard error bounds based on testable constraints. We propose an optimization algorithm based on an integer program that reconciles a set of such constraints into such bounds, even if the constraints are overlapping, conflicting, or unsatisfiable. Our experiments on real-world datasets against several statistical imputation and inference baselines show that statistical techniques can have a deceptively high and often unpredictable error rate. In contrast, our framework offers hard bounds that are guaranteed to hold as long as the constraints are not violated. Despite providing hard bounds, our framework achieves accuracy competitive with the statistical baselines.
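To make the idea of a result range concrete, the following is a minimal sketch and not the paper's algorithm. It assumes the simplest possible case: each constraint covers a disjoint group of missing rows, so no integer program is needed to reconcile overlaps. The names `Constraint`, `count_bounds`, and `sum_bounds`, and all numbers, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """Hypothetical constraint on one disjoint group of missing rows."""
    min_value: float  # smallest value a missing row may take
    max_value: float  # largest value a missing row may take
    min_rows: int     # fewest missing rows in this group
    max_rows: int     # most missing rows in this group


def count_bounds(observed_count, constraints):
    """Range of COUNT(*) once the missing rows are accounted for."""
    lo = observed_count + sum(c.min_rows for c in constraints)
    hi = observed_count + sum(c.max_rows for c in constraints)
    return lo, hi


def sum_bounds(observed_sum, constraints):
    """Range of SUM(value) once the missing rows are accounted for.

    For a fixed per-row value the contribution is linear in the row
    count, so the extremes occur at min_rows or max_rows. Taking
    min/max over both endpoints handles negative values correctly.
    """
    lo = hi = observed_sum
    for c in constraints:
        lo += min(c.min_rows * c.min_value, c.max_rows * c.min_value)
        hi += max(c.min_rows * c.max_value, c.max_rows * c.max_value)
    return lo, hi


# Assumed example: 50 observed rows summing to 1000.0, plus two
# disjoint groups of possibly missing rows.
constraints = [
    Constraint(min_value=10.0, max_value=20.0, min_rows=0, max_rows=5),
    Constraint(min_value=-5.0, max_value=2.0, min_rows=1, max_rows=3),
]
count_range = count_bounds(50, constraints)
sum_range = sum_bounds(1000.0, constraints)
print(count_range, sum_range)
```

The bounds hold only if the stated constraints are not violated, which mirrors the guarantee described above. Overlapping or conflicting constraints, which the abstract says are handled by an integer program, are outside the scope of this sketch.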

research
03/31/2022

QUIP: Query-driven Missing Value Imputation

Missing values widely exist in real-world data sets, and failure to clea...
research
11/17/2015

Optimized Linear Imputation

Often in real-world datasets, especially in high dimensional data, some ...
research
09/01/2021

RIFLE: Robust Inference from Low Order Marginals

The ubiquity of missing values in real-world datasets poses a challenge ...
research
05/26/2021

ReStore – Neural Data Completion for Relational Databases

Classical approaches for OLAP assume that the data of all tables is comp...
research
09/16/2011

High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity

Although the standard formulations of prediction problems involve fully-...
research
03/26/2021

Synthesizing Linked Data Under Cardinality and Integrity Constraints

The generation of synthetic data is useful in multiple aspects, from tes...
