Sample Debiasing in the Themis Open World Database System (Extended Version)

02/23/2020
by Laurel Orr, et al.

Open world database management systems, which assume that tuples missing from the database still exist, are becoming an increasingly important area of research. We present Themis, the first open world database that automatically rebalances arbitrarily biased samples to approximately answer queries as if they were issued over the entire population. We leverage a priori population aggregate information to develop and combine two different approaches for automatic debiasing: sample reweighting and Bayesian network probabilistic modeling. We build a prototype of Themis and demonstrate that it achieves higher query accuracy than the default AQP approach, an alternative sample reweighting technique, and a variety of Bayesian network models, while maintaining interactive query response times. We also show that Themis is robust to differences in the support between the sample and population, a key use case when using social media samples.
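To make the idea of reweighting a biased sample against known population aggregates concrete, here is a minimal sketch of simple post-stratification on a single attribute. It is not Themis's algorithm, which the abstract does not specify; the function names (`reweight`, `weighted_count`), the toy rows, and the population counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def reweight(sample, attr, population_counts):
    """Assign each sample row the weight N_g / n_g, where g is the row's
    value of `attr`, N_g the known population count for g, and n_g the
    number of sample rows with that value. (Illustrative assumption:
    every group in the sample also appears in population_counts.)"""
    sample_counts = Counter(row[attr] for row in sample)
    return [population_counts[row[attr]] / sample_counts[row[attr]]
            for row in sample]

def weighted_count(sample, weights, predicate):
    """Approximate a population COUNT(*) WHERE predicate by summing weights."""
    return sum(w for row, w in zip(sample, weights) if predicate(row))

# Hypothetical biased sample: young people are over-represented.
sample = [
    {"age": "young", "uses_app": True},
    {"age": "young", "uses_app": True},
    {"age": "young", "uses_app": False},
    {"age": "old", "uses_app": False},
]
# Hypothetical population aggregate known a priori (counts per age group).
population_counts = {"young": 300, "old": 700}

weights = reweight(sample, "age", population_counts)
estimate = weighted_count(sample, weights, lambda r: r["uses_app"])
```

With these weights the sample's age distribution matches the population aggregate, so the count query is answered as if over the 1,000-person population rather than the 4-row sample. A sketch like this fails when a population group has no sample rows at all, which is the kind of support mismatch the abstract highlights.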


