 # The maximum negative hypergeometric distribution


## Abstract

An urn contains a known number of balls of two different colors. We describe the random variable counting the smallest number of draws needed in order to observe at least c balls of both colors when sampling without replacement, for a pre-specified value of c = 1, 2, .... This distribution is the finite sample analogue of the maximum negative binomial distribution described by Zhang, Burtness, and Zelterman (2000). We describe the modes, approximating distributions, and estimation of the contents of the urn.

Keywords: discrete distributions; negative binomial distribution; riff-shuffle distribution; hypergeometric distribution; negative hypergeometric distribution; maximum negative binomial distribution

## 1 Introduction

> And the LORD said unto Noah, Come thou and all thy house into the ark; for thee have I seen righteous before me in this generation. Of every clean beast thou shalt take to thee by sevens, the male and his female: and of beasts that are not clean by two, the male and his female.
>
> Genesis 7:1–2. King James translation

This charge to Noah required seven pairs of clean animals. How many animals did Noah plan on catching in order to be reasonably sure of achieving male and female pairs? He didn’t want to handle more dangerous, wild creatures than necessary. In the case of a rare or endangered species, the finite population size could be small. A “clean” animal meant it was suitable for consumption or sacrifice.

In a sequence of independent and identically distributed Bernoulli(p) random variables, the negative binomial distribution describes the behavior of the number Y of failures observed before observing c successes, for integer-valued parameter c = 1, 2, .... This well-known distribution has probability mass function

$$\Pr[Y=y]=\binom{c+y-1}{c-1}p^c(1-p)^y \qquad (1)$$

defined for y = 0, 1, 2, ....
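As a quick numerical sanity check, the mass function (1) can be evaluated directly. The following is a minimal sketch in Python (the function name is ours, not from the paper) verifying that the probabilities sum to one:

```python
from math import comb

def neg_binomial_pmf(y, c, p):
    # Eq. (1): probability of y failures before the c-th success.
    return comb(c + y - 1, c - 1) * p**c * (1 - p)**y

# The mass function sums to one over y = 0, 1, 2, ...
total = sum(neg_binomial_pmf(y, c=3, p=0.4) for y in range(500))
```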

The negative binomial distribution (1) is discussed in detail by Johnson, Kotz, and Kemp (1992, Ch. 5). In this introductory section we will describe several sampling schemes closely related to the negative binomial. Table 1 may be useful in illustrating the various relations between these distributions.

The maximum negative binomial distribution is the distribution of the smallest number of trials needed in order to observe at least c successes and c failures, for integer-valued parameter c. This distribution is motivated by the design of a medical trial in which we want to draw inference on the Bernoulli parameter p in an infinitely large population. If the prevalence of a binary valued genetic trait in cancer patients is very close to either zero or one then there is little to be gained in screening them for it. The statistical test of interest, then, is whether p is moderate or whether it is extremely close to either 0 or 1.

In order to test this hypothesis we have decided to sequentially test patients until we have observed at least c of both the wildtype (normal) and abnormal genotypes. A small number of observations necessary to obtain at least c of both genotypes is statistical evidence that p is not far from 1/2. Similarly, a large number of samples needed to observe at least c of both genotypes is statistical evidence that the Bernoulli parameter p is extreme.

Let Y denote the 'excess' number of trials needed beyond the minimum of 2c. The probability mass function of the maximum negative binomial distribution is

$$\Pr[Y=y]=\binom{2c+y-1}{c-1}(p^y+q^y)(pq)^c \qquad (2)$$

for y = 0, 1, ... and q = 1 − p.

The maximum negative binomial distribution is so named because it represents the larger of two negative binomial distributions: the number of failures before the cth success is observed and the number of successes until the cth failure is observed. This distribution is also a mixture of two negative binomial distributions (1) that are left-truncated at c.

An intuitive description of the terms in (2) is as follows. There are c successes and c failures that occur with probability (pq)^c. All of the y extra trials beyond 2c must be either successes or failures, hence the (p^y + q^y) term. Finally, the last Bernoulli trial must be the one that completes the experiment, ending with either the cth success or the cth failure.
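The mass function (2) is easy to check numerically. A small sketch (function names ours) confirms that it sums to one and that it is unchanged when p and q are interchanged:

```python
from math import comb

def max_neg_binomial_pmf(y, c, p):
    # Eq. (2): excess trials beyond 2c needed to observe at least
    # c successes and c failures in Bernoulli(p) sampling.
    q = 1 - p
    return comb(2*c + y - 1, c - 1) * (p**y + q**y) * (p*q)**c

total = sum(max_neg_binomial_pmf(y, c=2, p=0.3) for y in range(2000))
swap = max_neg_binomial_pmf(5, 2, 0.7)   # p and q interchanged
```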

In Zhang et al. (2000) we describe properties of the distribution (2). The maximum negative hypergeometric distribution given at (7) below and developed in the following sections is the finite sample analogue of the maximum negative binomial distribution (2).

The parameters p and q are not identifiable in (2). Specifically, the same distribution in (2) results when p and q are interchanged. Similarly, it is impossible to distinguish between inference on p and on q without additional information. In words, we can't tell if we are estimating p or q unless we also know how many successes and failures were observed at the point at which we obtained at least c of each. A similar identifiability problem is presented for the maximum negative hypergeometric distribution described in Section 4.

The minimum negative binomial or riff-shuffle distribution is the distribution of the smallest number of Bernoulli trials needed in order to observe either c successes or c failures. Clearly, at least c and fewer than 2c Bernoulli trials are necessary. The random variable Y counts the number of trials beyond this minimum of c until either c successes or c failures are observed, for y = 0, 1, ..., c − 1. The experiment ends with sample numbered c + y from the Bernoulli population.

The mass function of the minimum negative binomial distribution is

$$\Pr[Y=y]=\binom{c+y-1}{c-1}\left(p^cq^y+p^yq^c\right) \qquad (3)$$

for y = 0, 1, ..., c − 1.

The naming of (3) as the minimum negative binomial refers to the smaller of two dependent negative binomial distributions: the number of failures before the cth success, and the number of successes before the cth failure. In words, distribution (3) says that there will be either c Bernoulli successes and y failures or else c failures and y successes. This distribution is introduced by Uppuluri and Blot (1970) and described in Johnson, Kotz, and Kemp (1992, pp 234–5). Lingappaiah (1987) discusses parameter estimation for distribution (3).
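The finite range of (3) makes its normalization a one-line check; the sketch below (names ours) sums the riff-shuffle mass function over y = 0, ..., c − 1:

```python
from math import comb

def riff_shuffle_pmf(y, c, p):
    # Eq. (3): minimum negative binomial (riff-shuffle) mass function.
    q = 1 - p
    return comb(c + y - 1, c - 1) * (p**c * q**y + p**y * q**c)

total = sum(riff_shuffle_pmf(y, c=4, p=0.35) for y in range(4))
```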

The three discrete distributions described up to this point are based on sampling from an infinitely large Bernoulli parent population. Each of these distributions also has a finite sample analogue. These are described next.

The negative hypergeometric distribution (Johnson, Kotz, and Kemp, 1992, pp 239–42) is the distribution of the number of unsuccessful draws from an urn with two different colored balls until a specified number of successful draws have been obtained. If m out of the N balls are of the 'successful' type then the number Y of unsuccessful draws observed before c of the successful type are obtained has mass function

$$\Pr[Y=y]=\binom{c+y-1}{c-1}\binom{N-c-y}{m-c}\bigg/\binom{N}{m} \qquad (4)$$

with parameters satisfying 1 ≤ c ≤ m ≤ N and range y = 0, 1, ..., N − m. The expected value of Y in (4) is c(N − m)/(m + 1).
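Both the normalization of (4) and the stated mean c(N − m)/(m + 1) can be verified exactly with rational arithmetic. A brief sketch (names ours):

```python
from math import comb
from fractions import Fraction

def neg_hypergeometric_pmf(y, N, m, c):
    # Eq. (4): y unsuccessful draws before the c-th success,
    # sampling without replacement from an urn with m 'successes'.
    return Fraction(comb(c + y - 1, c - 1) * comb(N - c - y, m - c), comb(N, m))

N, m, c = 10, 6, 3
pmf = [neg_hypergeometric_pmf(y, N, m, c) for y in range(N - m + 1)]
total = sum(pmf)
mean = sum(y * p for y, p in enumerate(pmf))
```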

The negative hypergeometric distribution (4) is the finite sample analogue of the negative binomial distribution (1). Unlike the negative binomial distribution, the negative hypergeometric distribution has a finite range. The maximum negative hypergeometric distribution described in the following sections is the larger of two dependent negative hypergeometric distributions.

The minimum negative hypergeometric distribution describes the smallest number of urn draws needed in order to observe either c successes or c failures. This distribution is the finite sample analogue of the riff-shuffle distribution (3). The probability mass function of the minimum negative hypergeometric distribution is

$$\Pr[Y=y]=\binom{c+y-1}{c-1}\left\{\binom{m}{c}\binom{N-m}{y}+\binom{m}{y}\binom{N-m}{c}\right\}\bigg/\left\{\binom{c+y}{c}\binom{N}{c+y}\right\} \qquad (5)$$

for y = 0, 1, ..., c − 1.
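Equation (5) can be checked the same way; this sketch (names ours) confirms that the probabilities sum to one over y = 0, ..., c − 1:

```python
from math import comb
from fractions import Fraction

def min_neg_hypergeometric_pmf(y, N, m, c):
    # Eq. (5): draws beyond the minimum c needed to see either
    # c of one color or c of the other, sampling without replacement.
    num = comb(m, c) * comb(N - m, y) + comb(m, y) * comb(N - m, c)
    den = comb(c + y, c) * comb(N, c + y)
    return Fraction(comb(c + y - 1, c - 1) * num, den)

total = sum(min_neg_hypergeometric_pmf(y, N=4, m=2, c=2) for y in range(2))
```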

In the example of the charge to Noah, we have c = 7 male/female pairs of animals captured from a finite population of m males and N − m females.

In Section 2 we give the probability mass function of the maximum negative hypergeometric distribution. Section 3 details some approximations to this distribution. In Section 4 we discuss estimation of the parameter m that describes the contents of the urn.

## 2 The distribution

An urn contains N balls: m of one color and the remaining N − m of another color. We continue sampling from the urn without replacement until we have observed c balls of both colors, for integer parameter c ≥ 1. Sampling with replacement is the same as sampling from the maximum negative binomial distribution (2) with parameter p = m/N.

Let Y denote the random variable counting the number of extra draws needed beyond the minimum 2c. That is, on draw numbered 2c + Y we will have first observed at least c of both colors. All of the Y extra draws from the urn must be of the same color, so there will be c balls of one color and c + Y of the other color at the end of the experiment. We will describe the distribution and properties of this random variable.

For integer k ≥ 1 define the factorial polynomial

$$z^{(k)}=z(z-1)\cdots(z-k+1).$$

We also define z^{(0)} = 1.

The maximum negative hypergeometric distribution probability mass function can be written as

$$\Pr[Y=y]=\binom{2c+y-1}{c-1}\left\{m^{(c+y)}(N-m)^{(c)}+m^{(c)}(N-m)^{(c+y)}\right\}\big/N^{(2c+y)} \qquad (6)$$

defined for the range of y:

$$0\le y\le\max\{m-c,\;N-m-c\}.$$

The integer valued parameters are constrained to

$$1\le c\le\min\{m,\;N-m\}.$$

Similarly,

$$\Pr[Y=y]=\{c/(2c+y)\}\left\{\binom{m}{c+y}\binom{N-m}{c}+\binom{m}{c}\binom{N-m}{c+y}\right\}\bigg/\binom{N}{2c+y} \qquad (7)$$

expresses the maximum negative hypergeometric distribution (6) in terms of binomial coefficients.
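Expressions (6) and (7) are easy to compare numerically. The sketch below (function names ours) implements the factorial polynomial, evaluates both forms exactly, and checks that they agree and sum to one:

```python
from math import comb, prod
from fractions import Fraction

def falling(z, k):
    # Factorial polynomial z^(k) = z(z-1)...(z-k+1), with z^(0) = 1.
    return prod(z - i for i in range(k))

def pmf_eq6(y, N, m, c):
    num = falling(m, c + y) * falling(N - m, c) + falling(m, c) * falling(N - m, c + y)
    return comb(2*c + y - 1, c - 1) * Fraction(num, falling(N, 2*c + y))

def pmf_eq7(y, N, m, c):
    num = comb(m, c + y) * comb(N - m, c) + comb(m, c) * comb(N - m, c + y)
    return Fraction(c, 2*c + y) * Fraction(num, comb(N, 2*c + y))

N, m, c = 12, 5, 2
ys = range(max(m - c, N - m - c) + 1)
agree = all(pmf_eq6(y, N, m, c) == pmf_eq7(y, N, m, c) for y in ys)
total = sum(pmf_eq7(y, N, m, c) for y in ys)
```

The last assertion below also confirms the closed form for Pr[Y = 0] given in the special cases.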

The same distribution in (6) and (7) results when the parameter m is interchanged with N − m. This remark illustrates the identifiability problem with the parameters in the maximum negative hypergeometric distribution. A similar identifiability problem occurs in the maximum negative binomial distribution given at (2). We will describe the estimation of the parameter m in Section 4.

Special cases of this distribution are as follows. For general parameter values,

$$\Pr[Y=0]=\binom{N-2c}{m-c}\binom{2c}{c}\bigg/\binom{N}{m}.$$

If m = N − m = c then the maximum negative hypergeometric distribution is degenerate and all of its probability is a point mass at y = 0. In words, if N = 2c and m = c then there can be only one possible outcome. In this case, all of the balls in the urn must be drawn before we can observe c balls of both colors.

The special case of with has the form

and zero otherwise. This is also the form of the distribution for and .

The special case for and has mass function

and zero otherwise. This is also the distribution of for and . In words, this represents the distribution of the color of the last ball remaining after all but one have been drawn from the urn.

## 3 Properties and Approximations

There are five basic shapes that the maximum negative hypergeometric distribution will assume. These are illustrated in Figs. 1 through 5. In each figure, the limiting maximum negative binomial distribution (2) is also presented. This limit can be expressed, more formally, as follows.

Lemma 1. For fixed values of c, let m and N both grow large such that m/N → p for p bounded between zero and one. Then the behavior of the maximum negative hypergeometric random variable (6) approaches the maximum negative binomial distribution (2) with parameters c and p.

Proof. Values of Y remain bounded with high probability under these conditions. In (6) we write

$$m^{(c+y)}(N-m)^{(c)}/N^{(2c+y)}\;\ge\;(m-c-y)^{c+y}(N-m-c)^{c}/N^{2c+y}
\;=\;(m/N)^{c+y}\{(N-m)/N\}^{c}\{1-(c+y)/m\}^{c+y}\{1-c/(N-m)\}^{c}
\;=\;p^{c+y}q^{c}\{1+O_p(N^{-1})\}$$

where q = 1 − p.

We can also write

$$m^{(c+y)}(N-m)^{(c)}/N^{(2c+y)}\;\le\;m^{c+y}(N-m)^{c}/(N-2c-y)^{2c+y}
\;=\;(m/N)^{c+y}\{(N-m)/N\}^{c}\big/\{1-(2c+y)/N\}^{2c+y}
\;=\;p^{c+y}q^{c}\{1+O_p(N^{-1})\}.$$

A similar argument shows

$$m^{(c)}(N-m)^{(c+y)}/N^{(2c+y)}=p^{c}q^{c+y}\{1+O_p(N^{-1})\}$$

completing the proof.

In words, if N and m are both large then sampling from the urn without replacement is almost the same as sampling with replacement. Sampling with replacement is the same as sampling from a Bernoulli parent population, yielding the maximum negative binomial distribution (2).
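Lemma 1 can be illustrated numerically: as N grows with m/N = p held fixed, the total variation distance between (7) and the maximum negative binomial (2) shrinks. A sketch under assumed illustrative parameter values (all names ours):

```python
from math import comb

def hyper_pmf(y, N, m, c):
    # Maximum negative hypergeometric, Eq. (7).
    num = comb(m, c + y) * comb(N - m, c) + comb(m, c) * comb(N - m, c + y)
    return c * num / ((2*c + y) * comb(N, 2*c + y))

def binom_pmf(y, c, p):
    # Maximum negative binomial, Eq. (2).
    q = 1 - p
    return comb(2*c + y - 1, c - 1) * (p**y + q**y) * (p*q)**c

def tv_distance(N, p, c):
    m = int(p * N)
    ymax = max(m - c, N - m - c)
    return 0.5 * sum(abs(hyper_pmf(y, N, m, c) - binom_pmf(y, c, p))
                     for y in range(ymax + 1))

tv_small, tv_large = tv_distance(200, 0.4, 2), tv_distance(2000, 0.4, 2)
```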

We next describe the modes for this distribution. The maximum negative hypergeometric distribution can have either one or two modes. Write

$$\Pr[Y=0]/\Pr[Y=1]=(c+1)/c>1$$

to show that this distribution always has at least one local mode at y = 0.
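The ratio Pr[Y = 0]/Pr[Y = 1] = (c + 1)/c can be confirmed exactly, for any valid urn, with rational arithmetic; a short sketch (names and parameter values ours):

```python
from math import comb
from fractions import Fraction

def pmf(y, N, m, c):
    # Maximum negative hypergeometric mass function, Eq. (7).
    num = comb(m, c + y) * comb(N - m, c) + comb(m, c) * comb(N - m, c + y)
    return Fraction(c, 2*c + y) * Fraction(num, comb(N, 2*c + y))

ratios = [pmf(0, N, m, c) / pmf(1, N, m, c)
          for (N, m, c) in [(20, 8, 3), (50, 30, 5), (14, 7, 2)]]
```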

The maximum negative binomial distribution (2) also has at least one local mode at y = 0 for all values of the parameter p. The local mode of the maximum negative hypergeometric distribution at y = 0 is clearly visible in Figs. 1, 2, 4, and 5. The local mode at y = 0 in Fig. 3 is also present but it is very small.

Table 2 presents examples of parameter values corresponding to unimodal distributions in (7). In general, there will be only one mode at y = 0 when m/N is not too far from 1/2. The range of m/N with unimodal distributions becomes narrower as N becomes larger when c is fixed. If N = 2m then the distribution is always unimodal.

### 3.1 A gamma approximation

An approximate gamma distribution is illustrated in Fig. 4. Under the conditions of the following lemma, the local mode at y = 0 becomes negligible.

Lemma 2. For fixed c, if m grows as θN^{1/2} for large N and some θ > 0, then X = θY/N^{1/2} behaves approximately as the sum of c independent standard exponential random variables.

Proof. Begin at (6) and write

$$\binom{2c+y-1}{c-1}=\prod_{i=1}^{c-1}\frac{y+2c-i}{i}=y^{c-1}\{1+O_p(N^{-1/2})\}\big/\Gamma(c).$$

Define Δ as

$$\Delta=m^{(c)}(N-m)^{(y+c)}\big/N^{(2c+y)}.$$

Under the conditions of this lemma, the term

$$m^{(c+y)}(N-m)^{(c)}\big/N^{(2c+y)}$$

will be much smaller than Δ and can be ignored.

We have

$$\log\Delta=\sum_{i=0}^{c-1}\log\{(m-i)/(N-i)\}+\sum_{j=0}^{y+c-1}\log\{(N-m-j)/(N-c-j)\}
=c\log\{\theta N^{-1/2}+O(N^{-1})\}+\sum_{j=0}^{y+c-1}\log\{1-(m-c)/(N-c-j)\}.$$

For ε near zero, write

$$\epsilon-\epsilon^2/2\le\log(1+\epsilon)\le\epsilon$$

so that

$$\log\Delta=c\log\{\theta N^{-1/2}+O(N^{-1})\}-\theta y/N^{1/2}+O_p(y/N).$$

The transformation x = θy/N^{1/2} has Jacobian N^{1/2}/θ so

$$(N^{1/2}/\theta)\Pr[Y=y]=x^{c-1}e^{-x}\big/\Gamma(c)$$

ignoring terms that tend to zero for large values of N. This is the density function of the sum of c independent, standard exponential random variables.
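A numeric check of Lemma 2, choosing N = 10 000, θ = 2 (so m = θN^{1/2} = 200) and c = 2 (parameter values are ours, for illustration): the scaled variable X = θY/N^{1/2} should have mean near c, the mean of a sum of c standard exponentials.

```python
from math import comb

def pmf(y, N, m, c):
    # Maximum negative hypergeometric, Eq. (7).
    num = comb(m, c + y) * comb(N - m, c) + comb(m, c) * comb(N - m, c + y)
    return c * num / ((2*c + y) * comb(N, 2*c + y))

N, theta, c = 10_000, 2.0, 2
m = int(theta * N**0.5)            # m grows as theta * sqrt(N)
scale = theta / N**0.5             # X = theta * Y / sqrt(N)

ys = range(600)                    # covers X up to 12; Gamma(2) tail beyond is tiny
probs = [pmf(y, N, m, c) for y in ys]
total = sum(probs)
mean_x = sum(scale * y * p for y, p in zip(ys, probs))
```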

### 3.2 A half-normal approximation

If Z has a standard normal distribution then the distribution of X = |Z| is said to be standard half-normal or folded normal. The density function of the random variable X is

$$(2/\pi)^{1/2}\exp(-x^2/2)$$

for x ≥ 0 (Stuart and Ord, 1987, p 117). The approximate half-normal behavior of the maximum negative hypergeometric distribution is illustrated in Fig. 5.

Lemma 3. When N becomes large, if N = 2m and c grows as N^{1/2} then Y/(2c)^{1/2} behaves approximately as a standard half-normal random variable.

The proof involves expanding all factorials in (7) using Stirling's approximation. The details are provided in Appendix A.
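A rough numeric check of Lemma 3 under assumed illustrative values N = 10 000, m = N/2, c = 100 (choices ours): the scaled variable Y/(2c)^{1/2} should have mean near the half-normal mean (2/π)^{1/2} ≈ 0.798.

```python
from math import comb

def pmf(y, N, m, c):
    # Maximum negative hypergeometric, Eq. (7).
    num = comb(m, c + y) * comb(N - m, c) + comb(m, c) * comb(N - m, c + y)
    return c * num / ((2*c + y) * comb(N, 2*c + y))

N, c = 10_000, 100                 # c grows as sqrt(N); the urn is balanced
m = N // 2
scale = 1 / (2 * c)**0.5           # X = Y / sqrt(2c)

ys = range(90)                     # covers X up to about 6
probs = [pmf(y, N, m, c) for y in ys]
total = sum(probs)
mean_x = sum(scale * y * p for y, p in zip(ys, probs))
# A standard half-normal random variable has mean sqrt(2/pi) ~ 0.798.
```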

### 3.3 A normal approximation

The normal approximation to the maximum negative hypergeometric distribution can be seen in Fig. 3. This is proved more formally in Lemma 4, below. No generality is lost by requiring m ≥ N − m because m and N − m can be interchanged to yield the same distribution.

Lemma 4. For large values of N, suppose c grows as N^{1/2} and m/N → p for 1/2 < p < 1. Then (Y − μ)/σ behaves approximately as standard normal, where

$$\mu=c(p-q)/q$$

for q = 1 − p, and

$$\sigma=(cp)^{1/2}\big/q.$$

The proof of this lemma is given in Appendix B. The details involve using Stirling's approximation to all of the factorials in (7) and expanding these in a two-term Taylor series.
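Lemma 4 can also be checked numerically; the sketch below (parameter values ours, for illustration) compares the exact mean and standard deviation of (7) with μ and σ.

```python
from math import comb

def pmf(y, N, m, c):
    # Maximum negative hypergeometric, Eq. (7).
    num = comb(m, c + y) * comb(N - m, c) + comb(m, c) * comb(N - m, c + y)
    return c * num / ((2*c + y) * comb(N, 2*c + y))

N, c, p = 10_000, 100, 0.7
q = 1 - p
m = int(p * N)
mu = c * (p - q) / q               # asymptotic mean of Y
sigma = (c * p)**0.5 / q           # asymptotic standard deviation of Y

ys = range(int(mu + 6 * sigma))
probs = [pmf(y, N, m, c) for y in ys]
total = sum(probs)
mean_y = sum(y * pr for y, pr in zip(ys, probs))
sd_y = sum((y - mean_y)**2 * pr for y, pr in zip(ys, probs)) ** 0.5
```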

## 4 Estimation

The most practical situation concerning parameter estimation involves estimating the parameter m when N and c are both known. In terms of the original motivating example, drawing inference on the genetic markers in cancer patients, the finite population size N will be known, and the parameter c is chosen by the investigators in order to achieve specified power and significance levels. The parameter m describes the composition of the individuals in the finite-sized population. The value of N is known without error if all subjects are observed.

The estimation of m in this section is made on the basis of a single observation of the random variable Y. We will treat the unknown parameter m as continuous valued rather than as a discrete integer as it has been used in previous sections.

The log-likelihood kernel function of m in (6) is

$$\Lambda(m)=\log\left[\left\{m^{(c)}(N-m)^{(c+y)}+m^{(c+y)}(N-m)^{(c)}\right\}\big/N^{(2c+y)}\right].$$

As a numerical illustration, the function Λ(m) is plotted in Fig. 6 for several observed values of y. Smaller observed values of y in this example exhibit log-likelihood functions with a single mode corresponding to the maximum likelihood estimate m = N/2. For larger values of y the likelihood has two modes, symmetric about m = N/2.

Intuitively, if the observed value of Y is small then we are inclined to believe that the urn is composed of an equal number of balls of both colors. That is, if we quickly observe c of both colored balls then this is good statistical evidence of an even balance of the two colors in the urn. Conversely, if the observed Y is relatively large then we will estimate an imbalance in the composition of the urn. Without the additional knowledge of the number of successes and failures observed, we are unable to tell if we are estimating m or N − m.

More generally, there will be either one mode of Λ at m = N/2 or else two modes symmetric about N/2, depending on the sign of

$$\Lambda''(m)=(\partial/\partial m)^2\,\Lambda(m)$$

evaluated at m = N/2. If Λ''(N/2) is negative then there will be one mode of Λ at m = N/2.

Useful rules for differentiating factorial polynomials are as follows. For c ≥ 1,

$$(\partial/\partial m)\,m^{(c)}=m^{(c)}\sum_{i=0}^{c-1}(m-i)^{-1}$$

and for c ≥ 2,

$$(\partial/\partial m)^2\,m^{(c)}=2m^{(c)}\sum_{0\le i<j\le c-1}(m-i)^{-1}(m-j)^{-1}.$$
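The first differentiation rule can be spot-checked with a finite difference, treating m as continuous (the values below are ours, for illustration):

```python
def falling(z, k):
    # Factorial polynomial z^(k) = z(z-1)...(z-k+1), defined for real z.
    out = 1.0
    for i in range(k):
        out *= z - i
    return out

def d_falling(z, k):
    # (d/dz) z^(k) = z^(k) * sum_i 1/(z - i)
    return falling(z, k) * sum(1.0 / (z - i) for i in range(k))

m, c, h = 7.3, 3, 1e-6
numeric = (falling(m + h, c) - falling(m - h, c)) / (2 * h)
```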

Use these rules to write

$$(\partial/\partial m)\,m^{(c)}(N-m)^{(y+c)}=m^{(c)}(N-m)^{(y+c)}\left\{\sum_{i=0}^{c-1}(m-i)^{-1}-\sum_{j=0}^{y+c-1}(N-m-j)^{-1}\right\}$$

and show that the likelihood always has a critical value at m = N/2:

$$(\partial/\partial m)\,\Lambda(m)\Big|_{m=N/2}=0.$$

The critical point of Λ at m = N/2 may either be a global maximum or else a local minimum, as seen in the example of Fig. 6. This distinction depends on the sign of the second derivative of Λ.

The second derivative of Λ can be found from

$$(\partial/\partial m)^2\,m^{(c)}(N-m)^{(y+c)}=m^{(c)}(N-m)^{(y+c)}\left[\left\{\sum_{i=0}^{c-1}(m-i)^{-1}-\sum_{j=0}^{y+c-1}(N-m-j)^{-1}\right\}^2-\sum_{i=0}^{c-1}(m-i)^{-2}-\sum_{j=0}^{y+c-1}(N-m-j)^{-2}\right].$$

The sign of Λ''(N/2) is the same as that of

$$\phi(N,c,y)=\sum_{c\le j<k\le c+y-1}(N/2-j)^{-1}(N/2-k)^{-1}-\sum_{i=0}^{c-1}(N/2-i)^{-2}.$$

The first summation in φ is empty, and hence zero, when y = 0 or 1. The function φ is then negative for small values of y, demonstrating that the maximum likelihood estimate of m is N/2 in these cases. Similarly, φ is an increasing function of y and may eventually become positive for larger values of y, so that Λ will have two modes. These modes are symmetric about m = N/2 because

$$\Lambda(m)=\Lambda(N-m)$$

for all 0 ≤ m ≤ N.

In other words, a small observed value of Y leads us to believe that there are an equal number of balls of both colors in the urn and to estimate m by N/2. Similarly, a large observed value of Y relative to N leads us to estimate an imbalance in the composition of the urn.
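The likelihood shapes described above can be reproduced by evaluating the kernel of (6) on an integer grid of m (the grid and the parameter values are ours, for illustration):

```python
from math import prod

def falling(z, k):
    # Factorial polynomial z^(k) = z(z-1)...(z-k+1).
    return prod(z - i for i in range(k))

def kernel(m, N, c, y):
    # Numerator of (6); N^(2c+y) does not depend on m and can be dropped.
    return falling(m, c + y) * falling(N - m, c) + falling(m, c) * falling(N - m, c + y)

N, c = 20, 2
grid = range(c, N - c + 1)

# Small observed y: a single mode at m = N/2.
mle_small = max(grid, key=lambda m: kernel(m, N, c, y=0))

# Larger observed y: two modes, symmetric about N/2.
k8 = {m: kernel(m, N, c, y=8) for m in grid}
```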

## References

Johnson, N.L., S. Kotz, and A.W. Kemp (1992). Univariate Discrete Distributions. New York: John Wiley & Sons.

Lingappaiah, G.S. (1987). Some variants of the binomial distribution. Bulletin of the Malaysian Mathematical Society 10: 82–94.

Stuart, A. and J.K. Ord (1987). Kendall’s Advanced Theory of Statistics Vol 1, 5th Edition: Distribution Theory, New York: Oxford University Press.

Uppuluri, V.R.R. and W.J. Blot (1970). A probability distribution arising in a riff-shuffle. In Random Counts in Scientific Work, 1: Random Counts in Models and Structures, G.P. Patil (editor), University Park: Pennsylvania State University Press, pp 23–46.

Zhang, Z., B.A. Burtness, and D. Zelterman (2000). The maximum negative binomial distribution. Journal of Statistical Planning and Inference 87: 1–19.

## Appendix A: Proof of the half-normal approximation

The proof of Lemma 3 is provided here. We assume that N = 2m and that c grows as N^{1/2}. The random variable of interest is X = Y/(2c)^{1/2}.

Expand all of the factorials in (7) using Stirling's approximation, giving

$$\log\Pr[Y=y]=-\tfrac12\log(2\pi)+T_1(c)+T_2(N)+O_p(N^{-1/2})$$

where

$$T_1=\log\{2c/(2c+y)\}-(c+y+\tfrac12)\log(c+y)-(c+\tfrac12)\log(c)+(2c+y+\tfrac12)\log(2c+y)$$

contains terms in c and y, and

$$T_2=(N+1)\log(N/2)-(N+\tfrac12)\log(N)-(N/2-c+\tfrac12)\log(N/2-c)-(N/2-c-y+\tfrac12)\log(N/2-c-y)+(N-2c-y+\tfrac12)\log(N-2c-y)$$

contains terms that are O(N).

In all of the following expansions it is useful to keep in mind that y is approximately equal to x(2c)^{1/2} and is therefore O_p(N^{1/4}). Write T₁ as

$$T_1=-\log\{(2c+y)/2c\}+(2c+y+\tfrac12)\log2-\tfrac12\log(c)+c\log\{(c+y/2)^2/c(c+y)\}+(y+\tfrac12)\log\{(c+y/2)/(c+y)\}
=-\log(1+y/2c)+(2c+y+\tfrac12)\log2-\tfrac12\log(c)+c\log\{1+y^2/4c(c+y)\}+(y+\tfrac12)\log\{1-y/2(c+y)\}.$$

Expand every appearance of log(1 + ε) for ε near zero to show

$$T_1=(2c+y+\tfrac12)\log2-\tfrac12\log(c)-y^2/4c+O_p(N^{-1/4}).$$

Similarly, we can write

$$T_2=(N/2-c)\log\left\{\frac{(N-2c-y)^2}{(N-2c)(N-2c-2y)}\right\}+y\log\left\{\frac{N-2c-2y}{N-2c-y}\right\}-(2c+y)\log2+\tfrac12\log\left\{\frac{N}{N-2c}\right\}+\tfrac12\log\left\{\frac{N-2c-y}{N-2c-2y}\right\}
=(N/2-c)\log\{1+y^2/O(N^2)\}+y\log\{1-y/O_p(N)\}-(2c+y)\log2+\tfrac12\log\{1+O(c/N)\}+\tfrac12\log\{1+y/O_p(N)\}.$$

Then write log(1 + ε) ≈ ε for ε near zero, giving

$$T_2=-(2c+y)\log2+O_p(N^{-1/2}).$$

These expressions for T₁ and T₂ give

$$\log\Pr[Y=y]=-\tfrac12\log(\pi c)-y^2/4c+O_p(N^{-1/4}).$$

Finally, we note that (2c)^{1/2} is the Jacobian of the transformation x = y/(2c)^{1/2}. Then

$$\log\left\{(2c)^{1/2}\Pr[Y=y]\right\}=\tfrac12\log(2/\pi)-x^2/2+O_p(N^{-1/4})$$

is the logarithm of the density of the folded normal distribution, except for terms that tend to zero with high probability.

## Appendix B: Standard normal approximate distribution

The details of the proof of Lemma 4 are given here. Define Ω as

$$\Omega=\{c/(2c+y)\}\binom{m}{c+y}\binom{N-m}{c}\bigg/\binom{N}{2c+y}.$$

The term

$$\{c/(2c+y)\}\binom{m}{c}\binom{N-m}{c+y}\bigg/\binom{N}{2c+y}$$

in (7) is much smaller than Ω and can be ignored under the conditions of this lemma.

Expand all of the factorials in Ω using Stirling's formula, giving

$$\log\Omega=-\tfrac12\log(2\pi)+S_1+S_2+S_3+O_p(N^{-1/2})$$

where

$$S_1=(Np+\tfrac12)\log(Np)+(Nq+\tfrac12)\log(Nq)-(N+\tfrac12)\log N$$

corresponds to the factorials m! = (Np)!, (N − m)! = (Nq)!, and N!;

$$S_2=(N-2c-y+\tfrac12)\log(N-2c-y)-(Np-c-y+\tfrac12)\log(Np-c-y)-(Nq-c+\tfrac12)\log(Nq-c)$$

corresponds to (N − 2c − y)!, (m − c − y)!, and (N − m − c)!; and

$$S_3=\log\{c/(2c+y)\}+(2c+y+\tfrac12)\log(2c+y)-(c+y+\tfrac12)\log(c+y)-(c+\tfrac12)\log(c)$$

corresponds to (2c + y)!, (c + y)!, and c!.

Write out all of the terms in S₁ to show

$$S_1=Np\log p+Nq\log q+\tfrac12\log(Npq).$$

We can write S₂ as

$$S_2=(N-2c-y+\tfrac12)\log N+(N-2c-y+\tfrac12)\log\{1-(2c+y)/N\}-(Np-c-y+\tfrac12)\log(Np)-(Np-c-y+\tfrac12)\log\{1-(c+y)/Np\}-(Nq-c+\tfrac12)\log(Nq)-(Nq-c+\tfrac12)\log(1-c/Nq).$$

Then

$$S_1+S_2=c\log(pq)+y\log p+(N-2c-y+\tfrac12)\log\{1-(2c+y)/N\}-(Np-c-y+\tfrac12)\log\{1-(c+y)/Np\}-(Nq-c+\tfrac12)\log(1-c/Nq).$$

Since

$$(c+y)/N=O_p(N^{-1/2})$$

we can expand

$$\log(1+\epsilon)=\epsilon-\epsilon^2/2+O(\epsilon^3)$$

for ε near zero to show

$$S_1+S_2=c\log(pq)+y\log p+(2c+y)^2/2N-(c+y)^2/2Np-c^2/2Nq+O_p(N^{-1/2}).$$

Then write

$$y=\mu+Z\sigma$$

where Z is a standard normal random variable, giving

$$S_1+S_2=c\log(pq)+y\log p+\{(2c+\mu)^2pq-(c+\mu)^2q-c^2p\}/2Npq+Z\sigma\{(2c+\mu)p-c-\mu\}/Np+O_p(N^{-1/2}).$$

Substitute μ = c(p − q)/q to show

$$S_1+S_2=c\log(pq)+y\log p+O_p(N^{-1/2}). \qquad (8)$$

Next write S₃ as

$$S_3=(c+y)\log\{(2c+y)/(c+y)\}+(c-\tfrac12)\log\{(2c+y)/c\}-\tfrac12\log(c+y).$$

Expand the argument of the first logarithm here in a two-term Taylor series, showing

$$\frac{2c+y}{c+y}=\frac{2c+\mu+Z\sigma}{c+\mu+Z\sigma}=\frac{2c+\mu}{c+\mu}\cdot\frac{1+Z\sigma/(2c+\mu)}{1+Z\sigma/(c+\mu)}
=\frac{2c+\mu}{c+\mu}\left[1-\frac{Z\sigma c}{(c+\mu)(2c+\mu)}+\frac{Z^2\sigma^2c}{(c+\mu)^2(2c+\mu)}+O(N^{-3/4})\right].$$

Then expand

$$\log(1+\epsilon)=\epsilon-\epsilon^2/2+O(\epsilon^3)$$

to show

$$(c+y)\log\{(2c+y)/(c+y)\}=(c+\mu+Z\sigma)\log\{(2c+\mu)/(c+\mu)\}-\frac{(c+\mu+Z\sigma)Z\sigma c}{(c+\mu)(2c+\mu)}+\frac{Z^2\sigma^2c}{(c+\mu)(2c+\mu)}-\frac{(Z\sigma c)^2}{2(c+\mu)(2c+\mu)^2}+O_p(N^{-1/4}).$$

Similarly,

$$(c-\tfrac12)\log\{(2c+y)/c\}=(c-\tfrac12)\log(2+\mu/c)+(c-\tfrac12)\log\{1+Z\sigma/(2c+\mu)\}
=(c-\tfrac12)\log(2+\mu/c)+\frac{Z\sigma c}{2c+\mu}-\frac{Z^2\sigma^2c}{2(2c+\mu)^2}+O_p(N^{-1/4})$$

and

$$\tfrac12\log(c+y)=\tfrac12\log\{(c+\mu)(1+Z\sigma/(c+\mu))\}=\tfrac12\log(c+\mu)+O_p(N^{-1/4})$$

so that

$$S_3=(c+y)\log\{(2c+\mu)/(c+\mu)\}-\tfrac12\log(c+\mu)+(c-\tfrac12)\log\{(2c+\mu)/c\}-\frac{Z^2\sigma^2c}{2(c+\mu)(2c+\mu)}+O_p(N^{-1/4}).$$

Substitute the values of μ and σ giving