A faster and more accurate algorithm for calculating population genetics statistics requiring sums of Stirling numbers of the first kind

03/11/2020
by   Swaine L. Chen, et al.
Centrum Wiskunde & Informatica

Stirling numbers of the first kind are used in the derivation of several population genetics statistics, which in turn are useful for testing evolutionary hypotheses directly from DNA sequences. Here, we explore the cumulative distribution function of these Stirling numbers, which enables a single direct estimate of the sum, using representations in terms of the incomplete beta function. This estimator enables an improved method for calculating an asymptotic estimate for one useful statistic, Fu's F_s. By reducing the calculation from a sum of terms involving Stirling numbers to a single estimate, we simultaneously improve accuracy and dramatically increase speed.

1 Introduction

The dominant paradigm in population genetics is based on a comparison of observed data with parameters derived from a theoretical model [1, 2]. Specifically for DNA sequences, many techniques have been developed to test for extreme relationships between average sequence diversity (number of DNA differences between individuals) and the number of alleles (distinct DNA sequences in the population). In particular, such methods are widely used to predict selective pressures, where certain mutations confer increased or decreased survival to the next generation [2]. Such selective pressures are relevant for understanding and modeling practical problems such as influenza evolution over time [3] and during vaccine production [4]; adaptations in human populations, which may impact disease risk [5, 6]; and the emergence of new infectious diseases and outbreaks [7].

Many population genetics tests are therefore formulated as unidimensional test statistics, where the pattern of DNA mutations in a sample of individuals is reduced to a single number [2, 1, 8]. Such statistics are heavily informed by combinatorial sampling and probability distribution theories, many of which are built upon the foundational Ewens’s sampling formula [9]. Ewens’s sampling formula describes the expected distribution of the number of alleles in a sample of individuals, given the nucleotide diversity. Calculation of subsets of this distribution is useful for testing deviations of observed data from a null model; such subsets often require the calculation of Stirling numbers of the first kind (hereafter referred to simply as Stirling numbers). In particular, two population genetics statistics, Fu’s F_s and Strobeck’s S, utilize this approach [8, 10]. The former has recently been shown to be potentially useful for detecting genetic loci under selection during population expansions (such as an infectious outbreak) both in theory and in practice [7]. However, Stirling numbers rapidly grow large and overwhelm the standard floating point range of modern computers.

In previous work, an asymptotic estimate for individual Stirling numbers was used to solve the problem of computing Fu’s F_s for large datasets that are now becoming common due to rapid progress in DNA sequencing technology [11]. This algorithm solved problems of numerical overflow and underflow, maintained good accuracy, and substantially increased speed compared with other existing software packages. However, because individual Stirling numbers were estimated, the estimator had to be applied separately to the terms of the sum. Here, we explore the potential for further increasing both accuracy and speed in calculating Fu’s F_s by using a single estimator.

The new estimator for Fu’s F_s has been implemented in R and is available at https://github.com/swainechen/hfufs.

2 Background Theory

2.1 General definitions

We take a population of n individuals, each of which carries a particular DNA sequence (referred to as the allele of individual i). We define a metric, $d_{i,j}$, to be the number of positions at which sequence i differs from sequence j. Then, we denote the average pairwise nucleotide difference as $\theta_\pi$ (hereafter referred to simply as $\theta$), defined as:

$$\theta = \theta_\pi = \binom{n}{2}^{-1} \sum_{i<j} d_{i,j}.$$ (2.1)

We also define a set of unique alleles, no two of which are identical (that is, $d_{i,j} > 0$ for any two distinct members i and j). The cardinality of this set is denoted m, i.e. the number of distinct alleles in the data set.

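Building upon Ewens’s sampling formula [8, 9], it has been shown that the probability that, for given n and $\theta$, at least m alleles would be found, is

$$S_{n,m}(\theta) = \frac{1}{(\theta)_n} \sum_{k=m}^{n} (-1)^{n-k} S_n^{(k)} \theta^k,$$ (2.2)

where $(\theta)_n$ is the Pochhammer symbol, defined by

$$(\theta)_n = \theta(\theta+1)\cdots(\theta+n-1) = \frac{\Gamma(\theta+n)}{\Gamma(\theta)}.$$ (2.3)

$S_n^{(m)}$ is a Stirling number and is defined by:

$$(\theta)_n = \sum_{m=0}^{n} (-1)^{n-m} S_n^{(m)} \theta^m.$$ (2.4)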

Fu’s F_s is then defined as:

$$F_s = \ln\frac{S_{n,m}(\theta)}{1 - S_{n,m}(\theta)}.$$ (2.5)

Fu’s F_s thus measures, on a logit scale, the probability of finding a more extreme (equal or higher) number of alleles than actually observed. It requires computing a sum of terms containing Stirling numbers, which rapidly become large and therefore impractical to calculate explicitly even with modern computers [11].
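As an illustration of this definition, the following Python sketch evaluates F_s exactly for small samples. It assumes the standard form of Fu’s statistic, F_s = ln(S / (1 - S)), where S is the probability of observing at least m alleles: the sum of |S_n^(k)| θ^k for k from m to n, divided by θ(θ+1)⋯(θ+n-1). The function names are hypothetical, and this is neither the algorithm of [11] nor the hfufs package; exact rational arithmetic is used only to obtain a reference value, and its cost grows quickly with n.

```python
from fractions import Fraction
import math


def unsigned_stirling_row(n):
    """Coefficients of x^k in x(x+1)...(x+n-1), i.e. |S_n^(k)| for k = 0..n."""
    row = [1]  # empty product for n = 0
    for j in range(n):
        # Multiply the current polynomial by (x + j).
        new = [0] * (len(row) + 1)
        for k, c in enumerate(row):
            new[k + 1] += c      # x * (c x^k)
            new[k] += j * c      # j * (c x^k)
        row = new
    return row


def tail_probability(n, m, theta):
    """Exact probability of at least m alleles, returned as a Fraction."""
    theta = Fraction(theta)
    row = unsigned_stirling_row(n)
    terms = [c * theta**k for k, c in enumerate(row)]
    total = sum(terms)           # equals the Pochhammer symbol (theta)_n
    return sum(terms[m:]) / total


def fu_fs(n, m, theta):
    """F_s = ln(S / (1 - S)); requires 2 <= m <= n so that 0 < S < 1."""
    s = tail_probability(n, m, theta)
    t = 1 - s
    # Take logarithms of exact integers to avoid overflow and cancellation.
    return (math.log(s.numerator * t.denominator)
            - math.log(s.denominator * t.numerator))


if __name__ == "__main__":
    print(fu_fs(20, 5, 3))
```

Passing θ as an integer or a `Fraction` keeps the whole computation exact up to the final logarithm.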

Because of the relation in (2.4), the statistical quantity $S_{n,m}(\theta)$ satisfies $0 \le S_{n,m}(\theta) \le 1$. Also, this relation and (2.3) show that the numbers $(-1)^{n-m} S_n^{(m)}$ are non-negative. We have the special values

(2.6)

There is a recurrence relation

$$S_{n+1}^{(m)} = S_n^{(m-1)} - n\,S_n^{(m)},$$ (2.7)

which easily follows from (2.4). For a concise overview of properties, with a summary of the uniform approximations, see [12, §11.3].

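We introduce a complementary relation

$$T_{n,m}(\theta) = 1 - S_{n,m}(\theta) = \frac{1}{(\theta)_n} \sum_{k=0}^{m-1} (-1)^{n-k} S_n^{(k)} \theta^k,$$ (2.8)

leading to an alternate calculation for Fu’s F_s of

$$F_s = \ln\frac{S_{n,m}(\theta)}{1 - S_{n,m}(\theta)} = \ln\frac{1 - T_{n,m}(\theta)}{T_{n,m}(\theta)}.$$ (2.9)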

The recent algorithm considered in [11] is based on asymptotic estimates of $S_n^{(m)}$ derived in [13], which are valid for large values of n, with unrestricted values of m. It avoids the use of the recursion relation given in (2.7).

In the present paper we derive an integral representation of and of the complementary function , for which we can use the same asymptotic approach as for the Stirling numbers without calculating the Stirling numbers themselves. From the integral representation we also obtain a representation in which the incomplete beta function occurs as the main approximant. In this way we have a convenient representation, which is available as well for many classical cumulative distribution functions. We show numerical tests based on a first-order asymptotic approximation, which includes the incomplete beta function. In a future paper we give more details on the complete asymptotic expansion of , and, in addition, we will consider an inversion problem for large and : to find either from the equation , when is given, or from the equation , when is given.

2.2 Remarks on computing

When computing the quantity defined in (2.5), numerical instability may happen when is close to 1. In that case, the computation of suffers from cancellation of digits. For example, take , , . Then , and becomes about when using the first relation in (2.9). However, when we calculate and use the second relation, then we obtain the more reliable result .

We conclude that, when , it is better to switch and obtain from the sum in (2.8), and by using the second relation of in (2.9). A simple criterion to decide about this can be based on using the saddle point (see Remark 3.1 below).
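The cancellation described above can be reproduced in a small experiment. The sketch below assumes the two forms F_s = ln(S / (1 - S)) and F_s = ln((1 - T) / T) with T = 1 - S, computes S and T exactly with rational arithmetic, and then evaluates both forms in double precision. The parameter values n = 50, m = 2, θ = 10 are chosen only for illustration (they are not the example from the text), and all function names are hypothetical.

```python
from fractions import Fraction
import math


def unsigned_stirling_row(n):
    """Coefficients of x^k in x(x+1)...(x+n-1), i.e. |S_n^(k)| for k = 0..n."""
    row = [1]
    for j in range(n):
        new = [0] * (len(row) + 1)
        for k, c in enumerate(row):
            new[k + 1] += c
            new[k] += j * c
        row = new
    return row


def exact_s_and_t(n, m, theta):
    """Exact upper sum S (k >= m) and lower sum T (k < m), both as Fractions."""
    theta = Fraction(theta)
    terms = [c * theta**k for k, c in enumerate(unsigned_stirling_row(n))]
    total = sum(terms)
    return sum(terms[m:]) / total, sum(terms[:m]) / total


def fs_from_s(s):
    """First form: subtracts a double close to 1 from 1, which loses digits."""
    return math.log(s / (1.0 - s))


def fs_from_t(t):
    """Second form: starts from the small quantity T itself."""
    return math.log((1.0 - t) / t)


s_exact, t_exact = exact_s_and_t(50, 2, 10)
# Reference value: the ratio S/T is formed exactly before rounding to a double.
reference = math.log(float(s_exact / t_exact))
err_first = abs(fs_from_s(float(s_exact)) - reference)
err_second = abs(fs_from_t(float(t_exact)) - reference)

if __name__ == "__main__":
    print("T =", float(t_exact))
    print("error using S:", err_first)
    print("error using T:", err_second)
```

Here T is so small that S rounds to a double very close to 1, and the subtraction 1 - S recovers only a few of the digits of T. For larger θ the rounded S can equal 1 exactly, in which case the first form fails outright.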

A second point is the overflow in numerical computations when is large, because of the large values of when is small with respect to . For example, when , we have

(2.10)

Therefore, it is convenient to scale the Stirling number in the form . In addition, the Pochhammer term in front of the sum in (2.2) will also be large with ; we have .

We can write the sum in (2.2) in the form

(2.11)

leading to a corresponding modification in the recurrence relation in (2.7) for the scaled Stirling numbers:

(2.12)

To control overflow, we can consider the ratio

(2.13)

This function satisfies if . For small values of we can use recursion in the form

(2.14)

For large values of and all we can use a representation based on asymptotic forms of the gamma function.
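As a simple illustration of controlling overflow with the gamma function, the sketch below evaluates the logarithm of the Pochhammer symbol, using the standard identity that θ(θ+1)⋯(θ+n-1) equals Γ(θ+n)/Γ(θ). It relies on `math.lgamma` from the Python standard library rather than on the asymptotic forms referred to in the text, so it should be read as a stand-in technique under that assumption; the function names are hypothetical.

```python
import math


def direct_pochhammer(theta, n):
    """theta (theta+1) ... (theta+n-1) as a plain double; overflows for large n."""
    result = 1.0
    for j in range(n):
        result *= theta + j
    return result


def log_pochhammer(theta, n):
    """Natural log of the Pochhammer symbol for theta > 0, via log-gamma."""
    return math.lgamma(theta + n) - math.lgamma(theta)


if __name__ == "__main__":
    # The direct product overflows to infinity, the logarithm stays moderate.
    print(direct_pochhammer(10.0, 300))
    print(log_pochhammer(10.0, 300))
```

Working with logarithms throughout, and exponentiating only ratios that are known to lie in a safe range, avoids forming the large factors explicitly.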

Remark 2.1.

It should be observed that using the recursion in (2.7) and (2.12) is a rather tedious process when is large. For example, when we use it to obtain for all , we need all previous with for all . Table look-up for in floating point form may be a solution. When is large enough, the algorithm mentioned in [11] evaluates each needed Stirling number by using the asymptotic approximation derived in [13].
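One generic way to keep such a recursion within floating point range is to carry logarithms of the unsigned Stirling numbers. The sketch below applies the standard recurrence |S_{n+1}^(k)| = |S_n^(k-1)| + n |S_n^(k)| in log space. It is not the scaled recurrence (2.12) and not the asymptotic estimator of [13]; it still needs on the order of n² operations to reach row n, which is the cost the remark refers to. Function names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Return ln(exp(a) + exp(b)) without overflow; -inf encodes a zero term."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def log_unsigned_stirling_row(n):
    """Return [ln |S_n^(k)| for k = 0..n], with -inf where the number is zero."""
    row = [0.0]  # row for n = 0: |S_0^(0)| = 1
    for j in range(n):
        new = []
        for k in range(len(row) + 1):
            # Term |S_j^(k-1)| of the recurrence.
            a = row[k - 1] if k >= 1 else NEG_INF
            # Term j * |S_j^(k)|, which vanishes when j = 0 or k is out of range.
            b = row[k] + math.log(j) if (k < len(row) and j > 0) else NEG_INF
            new.append(log_add(a, b))
        row = new
    return row


if __name__ == "__main__":
    logs = log_unsigned_stirling_row(300)
    print(logs[1], logs[150], logs[300])
```

The entry for k = 1 equals ln((n-1)!), which gives a convenient check against `math.lgamma(n)`.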

3 Results and Discussion

3.1 An integral representation of

We use the integral representation of the Stirling numbers that follows from the definition given in (2.4). That is, by using Cauchy’s formula,

(3.1)

where is a circle around the origin with radius . We can take as large as we like. As in [13, §3], it is convenient to proceed with

(3.2)

We derive an integral representation of

(3.3)

We use (3.2) and obtain

(3.4)

We can take to have on the circle , and we can perform the summation to , because all terms with do not give contributions. In this way we obtain the requested integral representation

(3.5)

To obtain this result we need , but in the integral representation we can take when we pick up the residue at . The result is

(3.6)

and we find for (see (2.8))

(3.7)

For the asymptotic analysis we write (3.5) in the form

(3.8)

where

(3.9)

Then the saddle point of the integral in (3.8) follows from the equation

(3.10)

There is a positive saddle point when .

Remark 3.1.

When crosses the value , becomes (almost) . Especially when the parameters and are large, starts with very small values for small , its value is about when and it quickly becomes 1 as increases. We call the transition value for .

For fixed values of there is also a transition value for , say, . When is large, starts at values near 1 for small , it becomes about when crosses the transition value , and it becomes quickly small as .

3.2 An asymptotic representation of

We use the transformation, as in [13, §3],

(3.11)

with condition , where is the saddle point in the -domain and also the zero of

(3.12)

Also

(3.13)

With this choice of , the variables and correspond with each other at the respective saddle points.

Using the transformation we obtain

(3.14)

where

(3.15)

and

(3.16)

The contour runs around the origin and includes a pole at that corresponds with the pole in the -plane at .

3.3 A representation in terms of the incomplete beta function

The integrands of the integral representations of have a pole at . For the contour integrals this is not a complication, because by using the theory of analytic functions we can deform the contour to avoid the pole, and we can even cross the pole and pick up the residue as we did to obtain the representation in (3.6).

For the integral in the -domain given in (3.14) the same can be done. The function has a pole at a point , say, that follows from the transformation given in (3.11). That means, is defined by the equation

(3.17)

and we can show the existence of the pole of the function defined in (3.16) writing

(3.18)

In asymptotic analysis the presence of such a pole is of great interest, especially when (in the -domain) the saddle point (here ) is close to a pole (here ), or even when these points coalesce. See, for example, [14, Chapter 21]. Usually, the error function is introduced to handle the asymptotic analysis; in the present case we use an incomplete beta function.

We split off the pole from and write

(3.19)

where we assume that is well defined at . To find we use the analytical relation in (3.11) between and , in particular at (or ). Applying l’Hôpital’s rule, we conclude that as , which gives . Hence, substituting this form of in (3.14), we find

(3.20)

where we have used (see (3.15) and (3.17))

(3.21)

The radius of the circle in the first integral is larger than , for the second integral we take a circle around the origin such that the singularities of are outside the circle.

The first integral can be evaluated in terms of the incomplete beta function defined by

(3.22)

where is the complete beta function

(3.23)

We will show in the Appendix that

(3.24)

Hence,

(3.25)

where

(3.26)

A first-order approximation of this function follows from replacing by its value at the saddle point . This gives

(3.27)

where

(3.28)

This expression of follows from the definition of given in (3.16). In a future publication we will give details about the complete asymptotic expansion of the term .

For the complementary function (see (2.8)) we obtain

(3.29)

where we have used

(3.30)

Remark 3.2.

The incomplete beta function in (3.25) has the representation (see [15, §8.17(i)])

(3.31)

and from the complementary relation in (3.30) it follows that the function in (3.29) has the expansion

(3.32)
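For readers who want to experiment with the incomplete beta function, the sketch below checks two standard identities from the NIST Handbook (§8.17) for integer parameters: the finite binomial sum I_x(m, n-m+1) = Σ_{j=m}^{n} C(n,j) x^j (1-x)^{n-j}, and the complement I_x(a,b) = 1 - I_{1-x}(b,a). Whether these coincide exactly with (3.31) and (3.30) is an assumption, since they are quoted here in their textbook form. The quadrature routine is a plain Simpson rule included only as an independent check, and all function names are hypothetical.

```python
import math


def inc_beta_binomial(x, m, n):
    """Regularized I_x(m, n - m + 1) as a binomial tail sum, integers 1 <= m <= n."""
    return sum(math.comb(n, j) * x**j * (1.0 - x)**(n - j)
               for j in range(m, n + 1))


def inc_beta_quadrature(x, p, q, panels=2000):
    """Regularized I_x(p, q) from its defining integral (Simpson rule, p, q >= 1)."""
    def integrand(t):
        return t**(p - 1) * (1.0 - t)**(q - 1)

    h = x / panels  # panels must be even for Simpson's rule
    total = integrand(0.0) + integrand(x)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * integrand(i * h)
    integral = total * h / 3.0
    # Complete beta function B(p, q) = Gamma(p) Gamma(q) / Gamma(p + q).
    log_beta = math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)
    return integral / math.exp(log_beta)


if __name__ == "__main__":
    print(inc_beta_binomial(0.3, 4, 10))
    print(inc_beta_quadrature(0.3, 4, 7))
    print(1.0 - inc_beta_binomial(0.7, 7, 10))
```

The three printed values should agree to many digits, which confirms the sum and complement identities numerically for this choice of parameters.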

3.4 Numerical tests

We summarize the steps to obtain the first-order approximations (see (3.25) or (3.29) and (3.27))

(3.33)

or

(3.34)

for given n, m, and θ, and to compute Fu’s F_s by using (2.9).

  1. Compute the saddle point , the positive zero of ; see (3.10).

  2. With , the positive zero of (see (3.12)), compute , the solution of the equation (see (3.17))

    (3.35)

    with defined in (3.9) and defined in (3.11). When there is one solution . When there are two positive solutions, and we take the one that satisfies the condition .

  3. When , hence , compute the approximation of by using (3.33), and from the first relation in (2.9).

  4. When , hence, , compute the approximation of by using (3.34), and from the second relation in (2.9).

Table 1: Relative errors in the computation of F_s defined in (2.5). We have used the asymptotic result (3.27).

In Table 1 we give the relative errors in the computation of F_s defined in (2.5). The values of n, m, and θ correspond with those in Table 1 of [11]. We have used the asymptotic result (3.27). Computations are done with Maple, with Digits = 16. The "exact" values are obtained by using Maple’s built-in code for computing the Stirling numbers of the first kind.

We additionally performed a comparison with the recently published algorithm in [11]. We performed 10,000 calculations with each algorithm and compared the results with an exact calculator. As expected, since the previous algorithm required estimating a Stirling number for each term of the sum, while the current asymptotic estimate directly calculates the sum, both error and compute speed were improved. Relative error for the single term estimate in (3.25) was well controlled at for nearly 99% of the calculations; for 411 calculations where the previous hybrid estimator had an error , the estimate in (3.25) was more accurate in all but one case (; 3.08e-3 relative accuracy using [11]; 3.32e-3 relative accuracy using (3.25)) (Figure 1). The fewer calculations led to a clear improvement in calculation speed (median 54.6x faster; Figure 2).

Figure 1: Comparison of relative error of the estimator from [11] and the single term asymptotic estimator in (3.25). Relative error for each is calculated against the arbitrary precision implementation described in [11]. In total, 10,000 calculations were performed with n randomly sampled from a uniform distribution between 50 and 500; m between 2 and n; and θ between 1 and 50. A solid diagonal line is drawn at y = x. Dotted lines are drawn at a relative error of 0.001. Numbers within each quadrant defined by the dotted lines indicate the number of points in each quadrant. The red dot indicates the one case where the relative error was and the error of (3.25) was greater than the estimator from [11].

Figure 2: Comparison of run times between the hybrid algorithm from [11] and the single term asymptotic estimator in (3.25). 100 iterations were run, each with 10,000 calculations. The same set of parameters was used for each algorithm. The order of running the algorithms was alternated with each iteration. The dark horizontal line indicates the median, the box indicates the first and third quartiles, the whiskers are drawn at 1.5x the interquartile range, and outliers are represented by open circles. The median for the hybrid algorithm is 62.64 s; the median for the asymptotic algorithm is 1.17 s.

4 Conclusion

The rapid growth of sequencing data has been an enormous boon to population genetics and the study of evolution. Traditional population genetics statistics are still in common use today. The statistics Fu’s F_s and Strobeck’s S have been difficult to calculate using previous methods; we now further improve both accuracy and speed for the calculation of Fu’s F_s for large, modern data sets, using the main estimator in (3.25). Our plan for a paper about the ability to invert the calculation provides additional future directions in understanding the performance of these statistics, and the methods used herein may be useful for the development of new statistics that more effectively capture different types of selection.

Acknowledgments

SLC acknowledges Shyam Prabhakar and members of the Chen lab for fruitful discussions. NMT acknowledges CWI, Amsterdam, for scientific support. SLC was supported by the National Medical Research Council, Ministry of Health, Singapore (grant numbers NMRC/OFIRG/0009/2016 and NMRC/CIRG/1467/2017). NMT was supported by the Ministerio de Ciencia e Innovación, Spain, projects MTM2015-67142-P (MINECO/FEDER, UE) and PGC2018-098279-B-I00 (MCIU/AEI/FEDER, UE). The authors affirm that all data necessary for confirming the conclusions of the article are present within the article, figures, and tables.

5 Appendix

We give a proof of the incomplete beta integral in (3.24). We use the integral representation of the hypergeometric function (see [16, §15.6])

(5.1)

where the contour starts at the origin, encircles the point in the anti-clockwise direction, and returns to the origin. The main conditions are and that is outside the contour.

We also use the relation between the incomplete beta function and the -function (see [15, §8.17(ii)])

(5.2)

It follows that

(5.3)

and after the substitution and writing , we obtain with , , and as in (3.11)

(5.4)

where the pole at is outside the contour. We modify the contour and pick up the residue. In this way we find the relation in (3.24).

References

  • [1] S. Casillas and A. Barbadilla. Molecular Population Genetics. Genetics, 205(3):1003–1035, Mar 2017.
  • [2] R. Nielsen. Statistical tests of selective neutrality in the age of genomics. Heredity (Edinb), 86(Pt 6):641–647, Jun 2001.
  • [3] B. T. Grenfell, O. G. Pybus, J. R. Gog, J. L. Wood, J. M. Daly, J. A. Mumford, and E. C. Holmes. Unifying the epidemiological and evolutionary dynamics of pathogens. Science, 303(5656):327–332, Jan 2004.
  • [4] H. Chen, J. J. S. Alvarez, S. H. Ng, R. Nielsen, and W. Zhai. Passage Adaptation Correlates With the Reduced Efficacy of the Influenza Vaccine. Clin. Infect. Dis., 69(7):1198–1204, Sep 2019.
  • [5] A. Wollstein and W. Stephan. Inferring positive selection in humans from genomic data. Investig Genet, 6:5, 2015.
  • [6] L. Quintana-Murci. Understanding rare and common diseases in the context of human evolution. Genome Biol., 17(1):225, Nov 2016.
  • [7] Z. Wu, B. Periaswamy, O. Sahin, M. Yaeger, P. Plummer, W. Zhai, Z. Shen, L. Dai, S. L. Chen, and Q. Zhang. Point mutations in the major outer membrane protein drive hypervirulence of a rapidly expanding clone of Campylobacter jejuni. Proc. Natl. Acad. Sci. U.S.A., 113(38):10690–10695, Sep 2016.
  • [8] Y. X. Fu. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics, 147(2):915–925, Oct 1997.
  • [9] W. J. Ewens. The sampling theory of selectively neutral alleles. Theor Popul Biol, 3(1):87–112, Mar 1972.
  • [10] C. Strobeck. Average number of nucleotide differences in a sample from a single subpopulation: a test for population subdivision. Genetics, 117(1):149–153, Sep 1987.
  • [11] S. L. Chen. Implementation of a Stirling number estimator enables direct calculation of population genetics tests for large sequence datasets. Bioinformatics, 35(15):2668–2670, 2019.
  • [12] A. Gil, J. Segura, and N. M. Temme. Numerical Methods for Special Functions. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2007.
  • [13] N. M. Temme. Asymptotic estimates of Stirling numbers. Stud. Appl. Math., 89(3):233–243, 1993.
  • [14] N. M. Temme. Asymptotic methods for integrals, volume 6 of Series in Analysis. World Scientific Publishing Co. Pte. Ltd., Hackensack, NJ, 2015.
  • [15] R. B. Paris. Chapter 8, Incomplete gamma and related functions. In NIST Handbook of Mathematical Functions, pages 173–192. U.S. Dept. Commerce, Washington, DC, 2010. http://dlmf.nist.gov/8.
  • [16] A. B. Olde Daalhuis. Chapter 15, Hypergeometric function. In NIST Handbook of Mathematical Functions, pages 383–401. Cambridge University Press, Cambridge, 2010. http://dlmf.nist.gov/15.