General hypergeometric distribution: A basic statistical distribution for the number of overlapped elements in multiple subsets drawn from a finite population
Suppose K balls are drawn at random from a finite population of N balls containing M red and N-M green balls. The classical hypergeometric distribution (HGD) describes the probability that exactly x of the selected balls are red, p(X=x | N, M, K); here x is the number of overlapped elements (NOE) between two subsets (containing M and K elements, respectively). However, when more than two subsets are selected, no established statistical distribution describes the NOE, and we define this family of distributions as the general hypergeometric distribution (GHGD). Demand for such a theory is growing with the accelerating accumulation of big data, especially in biological research, where elements (such as genes) that overlap to a certain degree across several datasets (which are subsets of the genome) are often considered worth further study. The overlapped elements are often visualized with Venn diagrams, but their statistical distribution has not yet been established, mainly because of the difficulty of the problem. Suppose there are T subsets in total; an element can then be shared by 2 to T of them. Therefore, besides the population size N and the number of elements in each subset, M_i (i = 0, ..., T-1), the GHGD has an additional parameter: the level of overlap (LO). The GHGD describes not only the distribution of the NOE shared by all of the subsets (LO = T), but also the NOE shared by a portion of the subsets (LO = t or LO >= t, 0 <= t <= T). Here, we develop algorithms to calculate the GHGD and derive elegant formulas for its essential statistics, including the mathematical expectation, the variance, and higher-order moments. In addition, we establish a statistical theory for estimating the significance of the NOE based on these formulas by applying Chebyshev's inequalities.
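To make the setting concrete, the following is a minimal sketch of the simplest case, LO = T, and not the algorithm developed in the paper. It assumes that each subset is drawn uniformly at random and independently of the others; under that assumption, the size of the running intersection after adding one more subset follows a classical HGD, so the full distribution can be built by chaining HGDs. The function names (`hypergeom_pmf`, `full_overlap_pmf`, `mean_var`, `chebyshev_upper_bound`) are hypothetical and chosen only for illustration.

```python
from math import comb


def hypergeom_pmf(x, N, M, K):
    """Classical HGD: P(X = x) red balls among K drawn from N balls, M of them red."""
    if x < max(0, K + M - N) or x > min(M, K):
        return 0.0
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)


def full_overlap_pmf(N, sizes):
    """Distribution of the number of elements shared by ALL subsets (LO = T).

    Assumption (not taken from the paper): subsets are independent uniform
    draws of the given sizes from a population of N elements.
    """
    dist = {sizes[0]: 1.0}  # the first subset trivially "overlaps" with itself
    for K in sizes[1:]:
        nxt = {}
        for k, p in dist.items():
            # Given k elements shared so far, the overlap with the next
            # subset of size K is hypergeometric with parameters (N, k, K).
            for x in range(min(k, K) + 1):
                q = hypergeom_pmf(x, N, k, K)
                if q > 0.0:
                    nxt[x] = nxt.get(x, 0.0) + p * q
        dist = nxt
    return dist


def mean_var(dist):
    """Mean and variance of a distribution given as {value: probability}."""
    mean = sum(x * p for x, p in dist.items())
    var = sum((x - mean) ** 2 * p for x, p in dist.items())
    return mean, var


def chebyshev_upper_bound(dist, observed):
    """Chebyshev bound on P(X >= observed); returns 1.0 when it is uninformative."""
    mean, var = mean_var(dist)
    d = observed - mean
    if d <= 0:
        return 1.0
    return min(1.0, var / d ** 2)


if __name__ == "__main__":
    # Illustrative numbers only: three subsets drawn from 20 elements.
    dist = full_overlap_pmf(20, [5, 6, 7])
    print(mean_var(dist))
    print(chebyshev_upper_bound(dist, observed=4))
```

With two subsets the chain reduces to the classical HGD, and under the independence assumption the mean equals N times the product of M_i/N over all subsets. The Chebyshev bound is conservative: it needs only the mean and variance, but it can greatly overstate the tail probability compared with summing the exact distribution.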