How Many Samples Required in Big Data Collection: A Differential Message Importance Measure

01/12/2018
by   Shanyun Liu, et al.

Information collection is a fundamental problem in big data, where the size of the sampling set plays a very important role. This work considers the information collection process by taking message importance into account. By analogy with differential entropy, we define the differential message importance measure (DMIM) as a measure of message importance for continuous random variables. We prove that the change in DMIM describes the gap between the distribution of a set of sample values and a theoretical distribution. In fact, the deviation of DMIM is equivalent to the Kolmogorov-Smirnov statistic, but it offers a new way to characterize distribution goodness-of-fit. Numerical results illustrate some basic properties of DMIM and the accuracy of the proposed approximate values. Furthermore, the empirical distribution approaches the real distribution as the DMIM deviation decreases, which helps in selecting a suitable number of sampling points in practical systems.
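As a rough illustration of the goodness-of-fit quantity the abstract refers to, the sketch below computes the classical Kolmogorov-Smirnov statistic (the largest gap between an empirical CDF and a theoretical CDF) for growing sample sizes. It does not implement DMIM, whose definition is not given in this abstract. The standard normal target, the sample sizes, the seed, and the helper names `normal_cdf` and `ks_statistic` are all assumptions chosen for illustration.

```python
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution (assumed theoretical distribution)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(samples, cdf):
    """Kolmogorov-Smirnov statistic: sup_x |F_n(x) - F(x)|.

    The empirical CDF F_n is a step function, so the supremum is reached
    just before or at one of the sorted sample points.
    """
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Gap just after the step at x, and gap just before it.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Hypothetical experiment: draw standard normal samples of increasing size.
rng = random.Random(0)  # fixed seed, assumed only for repeatability
ks_by_size = {}
for n in (10, 100, 1000, 10000):
    samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ks_by_size[n] = ks_statistic(samples, normal_cdf)
    print(n, round(ks_by_size[n], 4))
```

In a run of this kind the statistic tends to shrink as the sample size grows, which is the behavior the abstract relies on when it links a smaller deviation to an empirical distribution that is closer to the real one.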


research
01/22/2018

Differential Message Importance Measure: A New Approach to the Required Sampling Number in Big Data Structure Characterization

Data collection is a fundamental problem in the scenario of big data, wh...
research
01/12/2018

State Variation Mining: On Information Divergence with Message Importance in Big Data

Information transfer which reveals the state variation of variables can ...
research
01/04/2019

Information Measure Similarity Theory: Message Importance Measure via Shannon Entropy

Rare events attract more attention and interests in many scenarios of bi...
research
03/26/2018

A Switch to the Concern of User: Importance Coefficient in Utility Distribution and Message Importance Measure

This paper mainly focuses on the utilization frequency in receiving end ...
research
06/27/2019

Chi-squared Test for Binned, Gaussian Samples

We examine the χ^2 test for binned, Gaussian samples, including effects ...
research
10/25/2021

A Constructive Proof of the Glivenko-Cantelli Theorem

The Glivenko-Cantelli theorem states that the empirical distribution fun...
research
04/16/2018

BELIEF: A distance-based redundancy-proof feature selection method for Big Data

With the advent of Big Data era, data reduction methods are highly deman...
