Mosaic: A Sample-Based Database System for Open World Query Processing

12/17/2019
by Laurel Orr, et al.

Data scientists have relied on samples to analyze populations of interest for decades. Recently, with the increase in the number of public data repositories, sample data has become easier to access. It has not, however, become easier to analyze. This sample data is arbitrarily biased with an unknown sampling probability, meaning data scientists must manually debias the sample with custom techniques to avoid inaccurate results. In this vision paper, we propose Mosaic, a database system that treats samples as first-class citizens and allows users to ask questions over populations represented by these samples. Answering queries over biased samples is non-trivial as there is no existing, standard technique to answer population queries when the sampling probability is unknown. In this paper, we show how our envisioned system solves this problem by having a unique sample-based data model with extensions to the SQL language. We propose how to perform population query answering using biased samples and give preliminary results for one of our novel query answering techniques.
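To make the bias problem concrete, the sketch below compares a naive average over a skewed sample with a reweighted one. It uses post-stratification, a standard survey-statistics technique, purely as an illustration. It is not the query answering technique proposed in the paper, and it assumes the population's group counts are known, which is a stronger assumption than the unknown-sampling-probability setting the abstract describes. All names and data (`poststratify_weights`, `weighted_mean`, the urban/rural groups and their values) are hypothetical.

```python
from collections import Counter


def poststratify_weights(sample_groups, population_counts):
    """Return one weight per sample row so that the weighted group totals
    match the (assumed known) population counts for each group."""
    sample_counts = Counter(sample_groups)
    return [population_counts[g] / sample_counts[g] for g in sample_groups]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical population: half urban, half rural.
population_counts = {"urban": 500, "rural": 500}

# Hypothetical biased sample: urban rows are over-represented 3 to 1.
sample_groups = ["urban", "urban", "urban", "rural"]
sample_values = [10.0, 10.0, 10.0, 20.0]

# Naive estimate treats the sample as if it were the population.
naive = sum(sample_values) / len(sample_values)

# Reweighted estimate scales each row to stand in for its share of the population.
weights = poststratify_weights(sample_groups, population_counts)
adjusted = weighted_mean(sample_values, weights)

print(f"naive mean:    {naive:.2f}")
print(f"adjusted mean: {adjusted:.2f}")
```

In this toy data the naive mean is pulled toward the over-sampled urban group, while the reweighted mean gives each group its population share. When no population counts are available, this approach does not apply directly, which is the gap the abstract points to.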

research
02/23/2020

Sample Debiasing in the Themis Open World Database System (Extended Version)

Open world database management systems assume tuples not in the database...
research
04/26/2021

Provenance-based Data Skipping (TechReport)

Database systems analyze queries to determine upfront which data is need...
research
02/25/2020

The Power of Many Samples in Query Complexity

The randomized query complexity R(f) of a boolean function f{0,1}^n→{0,1...
research
09/06/2012

The Sample Complexity of Search over Multiple Populations

This paper studies the sample complexity of searching over multiple popu...
research
04/03/2022

Probability and Non-Probability Samples: Improving Regression Modeling by Using Data from Different Sources

Non-probability sampling, for example in the form of online panels, has ...
research
03/12/2018

The Everlasting Database: Statistical Validity at a Fair Price

The problem of handling adaptivity in data analysis, intentional or not,...
research
08/29/2020

STULL: Unbiased Online Sampling for Visual Exploration of Large Spatiotemporal Data

Online sampling-supported visual analytics is increasingly important, as...
