General hypergeometric distribution: A basic statistical distribution for the number of overlapped elements in multiple subsets drawn from a finite population

by Xing-gang Mao, et al.

Suppose K balls are drawn at random from a finite population of N balls containing M red and N-M green balls. The classical hypergeometric distribution (HGD) describes the probability that exactly x of the selected balls are red, p(X=x | N, M, K); here x is the number of overlapped elements (NOE) in two subsets (containing M and K elements, respectively). However, when more than two subsets are selected, no statistical distribution is available to describe the NOE, and we define this family of distributions as the general hypergeometric distribution (GHGD). The accelerating accumulation of big data creates a large potential demand for such a theory, especially in biological research, where elements (such as genes) that overlap to a certain degree across several datasets (which are subsets of the genome) are often considered worth further study. The overlapped elements are often visualized with Venn diagrams, but their statistical distribution has not yet been established, mainly because of the difficulty of the problem. Suppose there are T subsets in total; an element can then be overlapped in 2 to T of them. Therefore, besides the population size N and the number of elements in each subset, M_i (i = 0, ..., T-1), the GHGD has an additional parameter: the level of overlap (LO). The GHGD describes not only the distribution of the NOE overlapped in all of the subsets (LO=T), but also the NOE overlapped in a portion of the subsets (LO=t or LO>=t, with 0<=t<=T). Here, we developed algorithms to calculate the GHGD and derived elegant formulas for its essential statistics, including the mathematical expectation, variance, and higher-order moments. In addition, we established a statistical theory for estimating the significance of the NOE based on these formulas by applying Chebyshev's inequalities.
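The setup above can be made concrete with a small sketch. It is not the algorithm of the paper, which the abstract does not spell out: it computes the classical HGD exactly for two subsets and uses plain Monte Carlo simulation as a stand-in for the GHGD with more subsets. It assumes each subset is drawn independently and uniformly at random without replacement, and the function names (hgd_pmf, simulate_noe, mean_full_overlap) are hypothetical. The mean for LO=T used here follows from linearity of expectation under that assumption and is not quoted from the paper.

```python
import random
from collections import Counter
from math import comb, prod


def hgd_pmf(x, N, M, K):
    """Classical hypergeometric probability p(X = x | N, M, K)."""
    if x < max(0, M + K - N) or x > min(M, K):
        return 0.0
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)


def simulate_noe(N, sizes, level, exact=True, trials=20000, seed=0):
    """Monte Carlo estimate of the NOE distribution (illustrative only).

    Assumption: every subset is an independent uniform draw without
    replacement from a population of N elements.
    exact=True counts elements found in exactly `level` subsets (LO = t);
    exact=False counts elements found in at least `level` subsets (LO >= t).
    """
    if level < 1:
        raise ValueError("this sketch only handles level >= 1")
    rng = random.Random(seed)
    population = range(N)
    counts = Counter()
    for _ in range(trials):
        hits = Counter()  # element -> number of subsets containing it
        for m in sizes:
            hits.update(rng.sample(population, m))
        if exact:
            noe = sum(1 for c in hits.values() if c == level)
        else:
            noe = sum(1 for c in hits.values() if c >= level)
        counts[noe] += 1
    return {k: v / trials for k, v in sorted(counts.items())}


def mean_full_overlap(N, sizes):
    """Expected NOE for LO = T: each element lies in all subsets
    with probability prod(M_i / N), so the mean is N times that."""
    return N * prod(m / N for m in sizes)


if __name__ == "__main__":
    # With T = 2 and LO = 2 the simulation should approximate the HGD.
    sim = simulate_noe(20, [7, 5], level=2)
    for x in range(6):
        print(x, round(hgd_pmf(x, 20, 7, 5), 4), round(sim.get(x, 0.0), 4))
    # Three subsets: compare the simulated mean with the closed form.
    sim3 = simulate_noe(30, [10, 12, 15], level=3)
    print(sum(k * p for k, p in sim3.items()), mean_full_overlap(30, [10, 12, 15]))
```

With two subsets and LO=2 the simulated frequencies should track the exact HGD probabilities, which gives a simple sanity check before moving to T>2, where only the simulation applies in this sketch.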


