A Random Sample Partition Data Model for Big Data Analysis

12/12/2017
by Salman Salloum, et al.

Big data sets must be carefully partitioned into statistically similar data subsets that can be used as representative samples for big data analysis tasks. In this paper, we propose the random sample partition (RSP) to represent a big data set as a set of non-overlapping data subsets, i.e., RSP data blocks, where each RSP data block has the same probability distribution as the whole big data set. Block-based sampling is then used to directly select representative samples for a variety of data analysis tasks. We show how RSP data blocks can be employed to estimate statistics and build models that are equivalent (or approximate) to those obtained from the whole big data set.
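As a rough illustration of the idea, the sketch below (NumPy) shuffles a synthetic data set, splits it into non-overlapping blocks of roughly equal size, and then estimates a statistic from a handful of randomly chosen blocks. This is only a minimal sketch of the RSP concept described in the abstract; the function names random_sample_partition and block_based_estimate, and all parameter values, are illustrative assumptions and are not taken from the paper's implementation.

    import numpy as np

    def random_sample_partition(data, n_blocks, seed=0):
        # Shuffle the records first so that each resulting block is
        # (approximately) an independent random sample of the full data
        # set, then cut the shuffled array into non-overlapping blocks.
        rng = np.random.default_rng(seed)
        shuffled = data[rng.permutation(len(data))]
        return np.array_split(shuffled, n_blocks)

    def block_based_estimate(blocks, n_sample_blocks, stat=np.mean, seed=0):
        # Block-based sampling: pick a few RSP blocks at random and use
        # their union as a representative sample for estimating a statistic.
        rng = np.random.default_rng(seed)
        chosen = rng.choice(len(blocks), size=n_sample_blocks, replace=False)
        sample = np.concatenate([blocks[i] for i in chosen])
        return stat(sample)

    if __name__ == "__main__":
        # Synthetic "big" data set: one million values from a skewed distribution.
        data = np.random.default_rng(42).gamma(shape=2.0, scale=3.0, size=1_000_000)

        blocks = random_sample_partition(data, n_blocks=1000)

        full_mean = data.mean()
        est_mean = block_based_estimate(blocks, n_sample_blocks=10)

        print(f"mean from full data set : {full_mean:.4f}")
        print(f"mean from 10 RSP blocks : {est_mean:.4f}")

Because the blocks are formed after a random shuffle, a small number of them already mirrors the distribution of the whole data set, which is why the estimate from 10 blocks should land close to the full-data mean in this toy example.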


research 08/14/2022
Sharp Frequency Bounds for Sample-Based Queries
A data sketch algorithm scans a big data set, collecting a small amount ...

research 05/03/2019
Big Data Model "Entity and Features"
The article deals with the problem which led to Big Data. Big Data infor...

research 02/22/2021
Divide-and-conquer methods for big data analysis
In the context of big data analysis, the divide-and-conquer methodology ...

research 11/01/2018
Score-Matching Representative Approach for Big Data Analysis with Generalized Linear Models
We propose a fast and efficient strategy, called the representative appr...

research 08/05/2018
Mining CFD Rules on Big Data
Current conditional functional dependencies (CFDs) discovery algorithms ...

research 06/02/2017
ICABiDAS: Intuition Centred Architecture for Big Data Analysis and Synthesis
Humans are expert in the amount of sensory data they deal with each mome...

research 01/15/2018
Divide and Recombine for Large and Complex Data: Model Likelihood Functions using MCMC
In Divide & Recombine (D&R), big data are divided into subsets, each ana...
