Mixture distribution; partial integration; stochastic dominance; summation by parts; discrete distribution; special functions
Parametric models of probability distributions are essential for statistical inference. Hence a vast number of distributions of all types has been created, and a common way to generate new distributions is to modify an old one. Jones (2015) reviews the main techniques for generalizing univariate symmetric distributions, and Lai (2012) gives a comprehensive account of ways to modify survival distributions.
Transforming the random variable is probably the most popular method. It is a technique that is often used to evaluate unknown integrals, and is used ‘in reverse’ where the integral of the pdf (unity) is already known, to generate new pdfs. For example, the exponential distribution with survival functionbecomes the Weibull distribution with on setting .
Other integration techniques can be similarly used to create new distributions. For example, Azzalini’s method (e.g. Azzalini and Capitanio, 2018) of transforming a symmetric pdf to , where , can be thought of as a trick to simplify asymmetric integrands of type used in reverse. These reflections prompt the thought that since a probability density function (pdf) must integrate to unity, all the ‘tricks’ used to simplify unknown integrals, if used in reverse, can be used to generate more complex integrands (pdfs).
A method for simplifying integrands is integration by parts (IBP), and the creation of new distributions using this method was introduced by Baker (2019). This article further explores the use of IBP and its discrete analogue, summation by parts (SBP) to generate new distributions.
Before giving the general methodology, we show the power of the method with an example, starting with the exponential distribution. This is a special case of a distribution given in Baker (2019). Write the exponential pdf for and as , where ; this is just one of many choices of and that could be made.
Then , and on evaluating the integral by changing variable to where we see that , where
is the normal distribution function. We have that, and hence using the method of parts, integrating and differentiating , the integrand (the new pdf) is
This is a 1-parameter distribution, that can be used for modelling lifetimes when the hazard function initially high, and the pdf and hazard function are shown in figure 1. Further details are given in appendix A.
This example shows how potentially useful and fairly tractable new distributions can be generated quite easily using IBP, and illustrates the point that the pdf is often a special function.
The next section gives the general methodology and some general properties of the new distributions for the continuous case, and then some promising distribution classes are introduced. For discrete distributions, the method of summation by parts can be used analogously and gives broadly similar results, discussed in section 6.
2 General theory
2.1 Deriving probability density functions
Some notation is now introduced and the basic idea described more formally. The method gives a transformation of the integrand (pdf) leading to a new pdf, and the two pdfs can conveniently be called ‘L’ and ‘R’, where the R form stochastically dominates the L form, so that the probability mass is further to the right. The original distribution is sometimes referred to here as the ‘base’ distribution.
Let the support of a distribution be , which covers distributions defined on the whole real line, survival distributions, and doubly-bounded distributions. Write the R-pdf as , where is a positive monotone increasing function so that , with , and a positive decreasing function so that , with . Then applying integration by parts,
The first term on the right vanishes, giving the L-pdf
When and , (1) yields a pdf
Here by construction . In Baker (2019) are crafted so that the simpler form (2) could be used. One can of course transform the integrand by going from the base distribution instead of as shown above, to obtain from by integrating and differentiating .
2.2 Distribution functions and Stochastic dominance
Let the distribution functions corresponding to be respectively. Write the survival distributions etc. Then integrating by parts again
and . The L-distribution has probability mass shifted to the left, and the R-distribution dominates it stochastically, i.e. . The mean can be written as . The R-distribution may or may not dominate when (3) is used, and
The shifting of probability mass can be better understood by tagging the mass at using a Dirac delta-function. Then we apply IBP to , to obtain for , else zero. Hence a unit probability mass at has been smeared out onto the range with pdf , distribution function . Going from the L to the R distribution, the probability mass is smeared out over with survival function .
2.3 Random numbers
The above suggests a method of generating random numbers from the L-distribution, given that we can generate them from the R-distribution . Let be a random number from and a uniform random number. Then a random variable from satisfies so that . If can be inverted, this is a simple way to generate random numbers from the L-distribution.
2.4 Families of distributions
One usually seeks to generalize a pdf so that a family of distributions can be generated indexed by one or more parameters, with the original distribution as a special case. The freedom available in choosing and makes this possible because if decreases infinitely fast, the integral is zero, and for some value of the indexing parameter ( in the examples given later). This makes large-sample inference easier, as one can e.g. use twice the log-likelihood increase on adding the new parameter in a chi-squared test of whether the model fit to a dataset has improved. It also enables parametric tests of stochastic dominance: appendix B gives a brief discussion.
The next section discusses particular classes of distributions generated using IBP.
3 Some distribution classes
3.1 Using a function of the distribution function for
With a function of , will be a function of , so this yields general families of distributions. The most interesting classes have as the parameter . In general, when as , the pdf will be infinite at the lower limit , if is finite.
Moving , one choice is to take , where , so that . Then . Hence the L-distribution has pdf
and distribution function
This is a negative mixture of the original pdf with weight and the top -th order statistic with weight . As , and as , , i.e. the probability mass is zero at above the lower limit. When , from (6) or directly, and .
Expanding in a Taylor series for small , we see in the right tail that , showing the pulling in of the right tail that follows from the movement of probability mass to the left. The hazard function is , twice that of the base distribution.
One may wonder how this is possible, given that as . The solution to the paradox is that the tail where the hazard is double that of the base distribution occurs at larger and larger as . For small , when we have that the hazard function , so that the hazard and pdf is similar to the base distribution, but both are larger by a factor . For , we have that , so that if is finite, the hazard would be infinite.
Going , set , so and and . This is a negative mixture that is longer-tailed than was the L-distribution. When , .
On taking and using (5),
Since is a distribution function, is the negative mixture
When is exponential, is a truncated extreme-value distribution.
AS , (8) yields , and as , , e.g. a component lifetime distribution tends to the distribution of the lifetime of a series system where either of two components must fail for a failure of the system.
In the tail where is small, expanding (8) shows that the hazard function is twice that of the base distribution (which is true for all as ).
In the left tail,
i.e. the pdf is scaled up from the base distribution by a factor not exceeding 2. For survival distributions as base distribution, this distribution therefore tends to give an increasing hazard function.
Other increasing functions of can also be used, e.g. . This gives
which is messy. Any increasing function can be used for , leading to a vast number of possible (and messy) distributions.
3.2 Using the transformation for survival and doubly bounded distributions
When is a power of the random variable (time ), more specific results follow. All lifetime distributions must have a scale factor . Without loss of generality, we set for now. The base pdf may include a power of . Let this term be , where
is positive for Weibull and gamma distributions, andfor the lognormal distribution. Take . Then from (2)
We must have that so that the pdf is positive, hence .
Baker (2019) discusses this case in detail and defines reliability growth as
The mean lifetime is , so the proportional increase in expected lifetime on eliminating inferior items is
and the proportional decrease in lifetime caused by inferior items is . As increases and the base distribution is regained, goes to zero.
4 Special cases of continuous distributions
This section briefly discusses specific distributions; a great variety of distributions can be generated, of which these are only a small sample.
With the exponential distribution as base, shifting right using the transformation from section 3, we have and . This is a 2-parameter phase-type distribution that occurs e.g. as the prevalence of intermediate radioactive decay products. The hazard function increases linearly with slope and levels off at
. The moments are tractable, e.g. the mean is.
Using the transformation to shift right yields , a mixture of exponential and gamma distributions.
Baker (2019) explored left-shifting distributions such as gamma and Weibull. These are special cases of the Stacy distribution. The Stacy or generalized gamma distribution has pdf
where and is the gamma function. It includes gamma, Weibull and lognormal distributions as special cases. Going with yields a simple mixture of Stacy distributions, but going is more productive. In integrating , the incomplete gamma function will be needed, defined as
where . The gamma function itself is . If , is still defined but cannot be computed using software that computes the incomplete gamma function.
The resulting pdf is
and the survival function is
The relation between the moments for L and R-distributions for the case was given in section 3.2.
The beta distribution is the most common doubly-bounded distribution. Here the pdf, where denotes the beta function. Taking , we have that , where . The constant of integration had to be zero for survival distributions as , but here does not. Hence
where is the complement of the unregularized incomplete beta function. This yields the pdf
The distribution function is
On integrating by parts, we have that
We have that . This 4-parameter distribution allows a much more flexible pdf than the beta distribution. The transformation can of course also be applied to by changing in the transformed distribution. Note that the beta distribution is label-invariant ( also follows a beta distribution) but the transformed distribution is not.
Random numbers can be generated as follows:
With a uniform r.v., if , generate , where is a uniform r.v.;
if generate , a r.v. from the parent beta distribution ;
Then , where is a uniform r.v.
To keep generation of random numbers to a minimum, and could be generated by affine transformations on .
On the whole real line, the normal distribution is the most important by far, and has been skewed in many ways, e.g. Azzalini and Capitanio (2018). Takingleads to the lagged-normal or normal-exponential distribution (e.g. Johnson et al 1995). This is exponential in the tail, and can also be derived as the sum of normal and exponential random variables, so is not new. The transformation yields the class (7) which with , where is the normal distribution function, yields a skew distribution where . The moments cannot be found simply.
5 Data fitting
The method generates vast numbers of distributions, some of which are new. The aim of the analysis was merely to show by an example that the new distributions can be useful.
In their book on survival analysis, Klein and Moeschberger (2003) use a dataset of days to death of 863 kidney-transplant patients whose transplants were performed at the Ohio State University Transplant Center between 1982 and 1992. Available covariates are age, gender and white/black. Only 140 patients’ survival times were not censored. We discretized age into 5 bands: 1-16, 17-32, 33-48, 49-64 and 65+ and fitted the distribution from (12) with , i.e. a modified Weibull distribution. An accelerated time model was used (e.g. Chiou et al, 2014), so that , where
is a vector of covariates anda vector of regression coefficients. The model parameters are then the , the baseline time-scale, the Weibull shape parameter and , reparameterized so that as defined in (10) is used instead of . The fit was done by maximum-likelihood in a purpose-written fortran program. The incomplete gamma function for negative argument is not available as a standard special function, and so was evaluated as an integral for in (11). The NAG library routine D01AMF was used. This routine transforms the integration range to be finite and then integrates adaptively. Similar routines exist on many platforms.
The fit to data cannot be shown because of the censoring, but figure 2 shows the hazard function of the fitted distribution for the central and oldest age-bands. From this it can be seen that the hazard of death decreases steadily after transplant, effectively to a constant for the central age-band, but starts increasing again for the top age band. This result is consistent with hazard plots using all patients produced by various methods by Klein and Moeschberger (2003).
The table shows fitted parameters and standard errors. Standard errors are large, sometimes very large, because little information comes from the censored survival times. The baseline age group was 33-48 years. It can be seen that the hazard of death increases rapidly from the lowest age band, and then increases more slowly with age. Women have a slightly higher hazard of death than men, and people of colour higher than white, but these effects are not statistically significant. Because, the hazard of death would rise eventually for all age groups. The value of 1.6 for means that, for a given set of covariate values, the expected lifetime could be increased by 160% of its current value, if all patients could be made to respond as well as the best ones. This suggests that much could potentially be gained by studying and improving survival rates.
|1-16 age band||-29.685||168.886|
|17-32 age band||-2.354||0.530|
|49-64 age band||-.558||0.266|
|65+ age band||1.338||0.377|
6 Discrete distributions from summation by parts
A way to generalize discrete (count data) distributions, is to use partial sum distributions. Wimmer and Mačutek (2012) give a general definition of partial-sum distributions, where for parent (base) distributions with support on the non-negative integers, with probabilities the descendant probabilities are given by , where is a real function. The simplest such distribution has where is the mean of the parent distribution; this is a discrete form of the length-biased distributions found in renewal theory. Wimmer and Altmann (2001) describe more general distributions of this type, and Johnson et al (2005) give a summary.
Here a methodology is given that yields classes of partial-sum distributions. The two classes that have been explored in detail reduce to the parent distribution when a parameter or . We shall call them -Poisson,
-binomial, etc. The r-class interpolates between a distribution and a variant of its length-biased form.
The population mean is needed for statistical inference. For example, one often wishes to regress the mean on some covariates. This is computationally more difficult (but still possible) when the mean cannot be readily computed in terms of model parameters. The class of r-distributions has the advantage of tractable moments and so allows more straightforward inference.
The general methodology is next given, followed by the properties of the 2 classes of distributions and example of a fit to data is given.
6.1 The general pmf
The mathematical technique of summation by parts (SBP) can be used to transform distributions into new ones. The SBP identity can be written
for any . The proof follows from the telescoping property of the sum . Distributions are commonly defined on the integers to , where usually,
, the main exception being the binomial distribution. Let a ‘parent’ probability mass function (pmf), and let with . Since , . Then from (13)
is also a pmf.
Set and set the pmf . Then from (14) for
Since and , it follows that . Since , . The function can be chosen to be zero at , e.g. where , or where . In this case simply . Otherwise
Using (13) in this way, both parent and descendant distributions have the same upper limit and hence the same support.
Note that for we require . Since and , we have that as required. If is indexed by a parameter so that as , then from (15) as . This is so because , , and . Hence the class of descendant distributions generalizes the parent distribution.
6.2 The discrete distribution function and moments
Denoting distribution functions for parent and descendant distributions by respectively, we have from (13) that
Hence the distribution function is readily calculable once the are calculated and the pmf is known. If , and the parent stochastically dominates the descendant distribution.
From (15) the mean is
On reversing the order of summation,
For the -th non-central moment,
6.3 Random numbers
A unit probability mass at for the parent distribution becomes a pmf for , so random numbers can be generated by generating a random number from the parent distribution,then choosing a random number with probability . If is not too large, this can be done by generating a uniform random number , and choosing the smallest such that . This is the table lookup method (e.g. Shmerling, 2013).
From (16) on setting , we have that , so when , which makes these distributions useful when the parent distribution (e.g. Poisson) underfits the probability of no events occurring.
6.4 The distribution
With the choice , and the new system of distributions
As , , and as , and all the probability mass is at zero.
The case where is briefly mentioned in Johnson et al (2005). In general, the moments are intractable, unless . There (18) enables the mean to be found, and higher moments are also tractable. Letting and setting gives
This is a class of distributions that generalizes the Bissinger system of distributions (Johnson et al 2005, p. 509), in which .
The choice where is more tractable, and is discussed next.
6.5 The distribution
Here for . Define , then and
as . Also, . The pmf for a Poisson parent distribution is shown in figure 3.
The function is the probability generating function (pgf) for the parent distribution. From (21) on reversing the order of summation, the pgf for the derived distribution is
From this, the moments can be read off, e.g. the mean is
The pgf is known in analytic form for the major discrete distributions, e.g. for the Poisson, . Hence the moments of the descendant distribution can be found as functions of the parameters of the parent distribution.
Random numbers can be computed using the general method described earlier. This now particularizes to the following: generate a random number from the parent distribution. Then compute
where is a uniform random number. Then the integer part of is , a r.v. from the descendant distribution. This is the inverse-transform method (Shmerling, 2013).
As we have that , while when the numerator and denominator of (21) go to zero. Applying L’Hospital’s rule and using , , a partial sum distribution corresponding to the parent distribution.
6.6 Long-tailed discrete distributions
Going from to a new class of long-tailed distributions can be obtained. For doubly-bounded discrete distributions, one can use (15), with . In general, (13) is used in reverse. For example, for the -distributions, set so . This yields so that the pmf is given by
As this simplifies to
For such distributions , so , and the probability of zero events is reduced. The probabilities are easy to compute, as an infinite sum is not required.
The -distribution with is trivial: the pgf is simply
the product of the parent pgf and the pgf of a geometric distribution with. Hence the effect of SBP has been to add a geometric random variable to the original.
6.7 Other discrete distributions
Each functional form for gives a new family of distributions, and if can increase arbitrarily fast with , the parent distribution can be regained, so the descendant distribution will generalize the parent. For example, with one obtains
as . As , . In general, the mean cannot be simply expressed. Clearly, more parameters than one can be introduced, and a host of new distributions created. In general, they will be quite messy, for example when . However, there may well be other functional forms for that give attractive distributions.
An interesting curiosity is a distribution obtained by setting , where is a positive integer. This has , and . The pmf is therefore
) the variance is
where is the variance of the parent distribution. As the parent distribution is regained. This distribution generalizes the case mentioned briefly in Johnson et al (2005).
We usually wish to study the effect of covariates on the mean . The mean is therefore predicted as , where is a vector of covariates. For the example, for the -distributions, the model parameter was found for each case from (22), using Newton-Raphson iteration, which usually converged in 4 iterations. The pgf for the negative binomial distribution. The probabilities were then computed and fits were by maximum likelihood.
Note that for the -distributions the mean is not readily calculable in terms of parent distribution parameters. Model fitting is still possible however, because the mean can be computed using the probabilities taken up to some large cutoff value of , and the Newton-Raphson iteration still done for .
Hilbe (2011) fits negative binomial distributions to a number of datasets. One is the ‘affairs’ dataset, with 601 observations from Fair (1978), reporting counts of extramarital affairs over a year in the USA. Table 2 shows results of fitting the Poisson and NB distributions, and on adding the parameter, and table 3 shows the covariates and the regression coefficients. The mean quoted is for the average values of the covariates.
|Negative Binomial + r||711.45||1444.91|
|Years married||.107 (.0355)||.0024|
|Educ. level||-.000610 (.0560)||.991|
Clearly, adding the parameter to the negative binomial model has improved the fit significantly. The model has gone to the limiting case of the partial-sum distribution where . The conclusions about the effect of covariates agree broadly with Hilbe’s analysis. The significant predictors are self-rating of the marriage from unhappy to happy on a scale of 1-5, degree of religiosity on a scale from 1 to 5 (anti to very) and years of marriage. Religious people with a happy marriage who have not been married long have fewest affairs.
This example shows how the new distributions can improve model fit and so enable better inference. Computations were done with a purpose-written fortran program, that used the Numerical Algorithms Group (NAG) function minimisers.
Integration by parts is a general method that yields a cornucopia of distributions, only a few of which have been explored here. Some are new, while some have been derived by other methods and now have a new characterization. This could be useful e.g. in generating random numbers.
IBP is an addition to other general methods for modifying distributions, such as transforming variables. Summation by parts can be used similarly for discrete distributions, and its usefulness may be relatively greater, as the variety of methods for generating new distributions from old is more limited in the discrete case.
One can shift the probability mass of a distribution left or right. Shifting it left is useful when dealing with failure-time distributions in reliability, where the left-shifted distributions can reproduce the bathtub-shaped hazard functions sometimes seen in practice (e.g. Lai and Xie, 2006). For discrete distributions, the left shifted distributions have a higher probability of zero events occurring. This zero-inflation can substantially improve the model fit to data. Shifting probability mass right gives long-tailed distributions, which are needed in many areas, e.g. finance.
Comparing modifying distributions by integration by parts to using the transformation of variables method, one could sum up as follows:
transformation gives a simple pdf, whereas IBP gives the pdf as an integral, which may be simple or may be a special function;
the distribution function for the transformation of variables method may not be tractable, but for IBP it is easily derived once the pdf is known and is simply related to it by (4);
the IBP method has the further property of yielding a new distribution that either stochastically dominates the original, or is dominated by it. This property can be used in a test of stochastic dominance as described in appendix B.
The IBP method changes the skewness of distributions, but cannot change tail length whilst leaving skewness unchanged, as does e.g. the arcsinh transformation applied to the normal distribution. However, IBP could be applied twice e.g. and then to do this by lengthening each tail in turn.
The methodology does not apply to circular distributions, but it is possible that an analogous method could be developed here. Possible multivariate applications have not been mentioned. Clearly, with a bivariate distribution one can shift the probability mass of given , to obtain a new distribution with a different marginal distribution for , but the original marginal distribution for . This type of procedure will not lead to a copula, because the marginal distribution of has changed, but such distributions could still be useful.
It is possible that other integration techniques, such as contour integration, could also be profitably used. Much further work could be done in modifying distributions using partial integration, either developing the general methodology, or using it to derive more new distributions. Fast computation of the resulting special functions such as (11) is also needed.
The mathematical method of summation by parts enables the creation of new families of discrete distributions that generalize existing distributions. Any increasing function defined on non-negative integers (strictly, also on -1) gives a family of distributions, and two such functions, and have been discussed. The latter gives a relatively tractable distribution, with moments that can be found analytically in terms of the parameters of the parent distribution when the pgf can be found analytically.
The distributions can be used for modelling data and statistical inference, and could give a useful sensitivity analysis when a model such as negative binomial has been fitted. The class of long-tailed distributions of the form is completely new.
Further work could include derivation of new classes of distribution using other functions for , and more detailed exploration of their properties. Bivariate distributions have not been considered here, but new bivariate distributions could be derived from old by applying SBP to for each level of .
-  Azzalini, A. and Capitanio, A. (2018). The skew normal and related families, Cambridge University Press, Cambridge UK.
-  Baker, R. D. (2019). New survival distributions that quantify the gain from eliminating flawed components, Reliability Engineering and System Safety, in press.
-  Barrett, G. F. and Donald, S. G. (2003). Consistent tests for stochastic dominance, Econometrica, 71, 71 104.
-  Chiou, S. H., Kang, S. and Yan, J. (2014). Fast accelerated failure time modeling for case-cohort data, Statistics and Computing, 24, 559–568.
-  Fair, R. (1978). A theory of extramarital affairs. Journal of political economy 86, 45-61.
-  Hilbe, J. M. (2011). Negative Binomial Regression, 2nd. ed., Cambridge University press, Cambridge.
-  Johnson, N.L., Kemp, A. W., Kotz, S. (2005). Univariate Discrete Distributions, 3rd ed., Wiley, New York.
-  Johnson, N.L., Kotz, S., Balakrishnan, N. (1995). Continuous Univariate Distributions, 2nd ed., Wiley, New York.
-  Johnson, N.L., Kemp, A. W., Kotz, S. (2005). Univariate Discrete Distributions, 3rd ed., Wiley, New York.
-  Jones, M. C. (2015). On families of distributions with shape parameters, International Statistical Review, 83, 175-192.
-  Klein, J. P. and Moeschberger, M. L. (2003). Survival Analysis: techniques for censored and truncated data, 2nd ed., Springer, New York.
-  Lai, C-D., Xie, M. (2006). Stochastic ageing and Dependence for Reliability, Springer, Berlin.
-  Lai, C. D. (2012). Constructions and applications of lifetime distributions, Applied Stochastic Models in Business and Industry, 29, 127-140.
Shmerling, E. (2013). A range reduction method for generating discrete random variables, Statistics and Probability Letters,83, 1094-1099.
-  Wimmer, G. and Altmann, G. (2001). A new type of partial-sums distributions, Statistics and Probability Letters, 52, 359-364.
-  Wimmer, G. and Mačutek, J. (2012). New integrated view at partial-sums distributions, Tatra Mountains Mathematical Publications, 51, 183-190.
Appendix A: the modified exponential distribution
The distribution is a special case of the modified Stacy distribution where , . Some properties of this distribution are now given without proof, for completeness, and to show the tractability of the distribution. The pdf is initially infinite, and is exponential in the tail. The survival function is
The hazard function decreases from infinity at to a constant , at large .
The moments are , giving , , . The coefficient of variation is thus , compared with 1 for the exponential distribution. Random numbers are generated as , where
are uniformly distributed on.
The third moment about the mean is , giving skewness of
, larger than 2 for the exponential distribution. The excess kurtosis is, compared to 6 for the exponential distribution.
Appendix B: Stochastic dominance tests
Given two samples of returns from investments, one might wish to test for (first order) stochastic dominance of investment B by investment A (e.g. Barrett and Donald, 2003). A possible test is to fit a parametric form to the sample, with the indexing parameter fixed at zero (giving the R-distribution) for sample A, and floating for sample B. The chi-squared test or other test that can then be used to test whether investment A dominates B stochastically. An exact test would compute , then permute observations between the two groups A and B, refitting to generate the null distribution of . The p-value is the proportion of permuted samples for which is no greater than the observed value. This test can also be done with discrete distributions of gains from investment/gambling. Note that unlike the description here, is often taken that there is stochastic dominance. This area of inference is complex and not the main subject of this article, but clearly good parametric tests could be devised.