The Wright-Fisher (WF) model is a popular discrete-time model for the evolution of gene frequencies in a population. Consider a population of individuals, i.e. chromosomes, and assume that each individual has an associated genetic type, with being the set of possible types. In the classical WF model the population has constant (large) size and it evolves in discrete non-overlapping generations according to the following random processes: i) each individual in the next generation chooses, uniformly at random, an individual in the current generation and copies it, with the choice made by different individuals being independent; ii) the type of each progeny of an individual of type is with probability , and with probability , that is mutations occur with probability
per individual per generation, according to a Markov chain with zero-diagonal transition matrix. Two additional common assumptions are that has a unique stationary distribution and that the evolution of the population is neutral, namely all variants in the population are equally fit and are thus equally likely to be transmitted. The assumption of neutrality allows for a crucial simplification of the above WF evolution. Indeed under this assumption the random process describing the demography of the population becomes independent of the random process describing the genetic types carried by the individuals. Although rather simple, the neutral WF model captures many important features of the evolution of human and other populations, thus providing a statistical model which is at the basis of most existing inference methods in population genetics. We refer to the monographs by Ewens  and Tavaré  for a comprehensive and stimulating account on the WF model.
In the WF model one can describe the genetic composition of the population at any point in time by giving a list of the genetic types currently present, and the corresponding proportion of the population currently of each type. Note that such a description corresponds to giving a probability measure on the set of possible population types. In such a framework one obtains a discrete time probability-measure-valued Markov process, namely a discrete time Markov process whose state space corresponds to the space of the probability measures on . As the population size becomes large a suitable rescaling of the Markov process converges to a diffusion limit: time is measured in units of generations, and the mutation rates are rescaled as . The limiting process, called the Fleming-Viot (FV) process, is formulated as a diffusion process whose state space is the space of probability measures on an arbitrary compact metric space . See Ethier and Kurtz  and references therein for a rigorous treatment with a view towards population genetics. Intuitively, the FV process can thus be thought of as an approximation to a large population evolving in time according to the WF model. For instance, the classical WF diffusion on the set is a special case of the FV process which arises when there are only two possible genetic types and one tracks the population frequency of one of the types.
The coalescent arises by looking backward in time at the evolution described by the WF model. See, e.g., the seminal works by Griffiths , Kingman , Kingman  and Tavaré , and the monographs by Ewens  and Tavaré . Consider a population evolving in time as a WF model with scaled mutation rate , for , and with parent-independent transition matrix . In the large population limit the genealogical history of a sample of individuals from the population may be described by a rooted random binary tree, where branches represent genealogical lineages, i.e., lines of descent. The tree initially has lineages for a period of time , after which a lineage is lost for the occurrence of one of the following events: i) a mutation according to the transition matrix ; ii) a coalescence, namely a pair of lineages, chosen uniformly at random and independently of all other events, join. Recursively, the times , for for which the tree has
lineages are independent exponential random variables with parameter, after which a lineage is lost by mutation with probability or by coalescence with probability . When , the resulting random tree is referred to as the -coalescent, or simply the coalescent, and was first described in the seminal work of Kingman . When the process is a coalescent with mutation, with antecedents including Griffiths . As shown in Donnelly and Kurtz , in the large population limit the -coalescent describes the genealogy of a sample of individuals from a population evolving as the FV process. There exists even a natural limit, as the sample size , of the -coalescent. This can be thought of as the limit of the genealogy of the whole population, or alternatively as the genealogy of the infinite population described by the FV process.
This paper considers the problem of making ancestral inference, i.e. inference on the genealogy of a genetic population, from a Bayesian nonparametric predictive perspective. The statistical setting we deal with can be described as follows. We consider a population with an (ideally) infinite number of types, and we assume that the population’s composition evolves in time according to a neutral FV process whose unique stationary distribution is the law of the Dirichlet process by Ferguson . From a Bayesian perspective, the law of the FV process, or its dual law determined with respect to the Kingman’s coalescent process, plays the role of a nonparametric prior for the evolution of the population. Given the observed data, which are assumed to be a random sample sample of individuals from the population at time , we characterize the posterior distribution of some statistics of the enlarged -coalescent induced by an additional unobservable sample of size
. Corresponding Bayesian nonparametric estimators, with respect to a squared loss function, are then given in terms of posterior expectations. Of special interest is the posterior distribution of the number of non-mutant lineages surviving from timeto time , that is the number of non-mutant ancestors in generation in a sample at time . This, in turn, leads to the posterior distribution of the time of the most recent common ancestor in the -coalescent. As a by-product of our posterior characterizations we introduce a class of Bayesian nonparametric estimators of the probability of discovery of non-mutant lineages. This is a novel class of estimators which can be thought of as ancestral counterparts of the celebrated Good-Turing type estimators developed in Good  and Good and Toulmin .
Ancestral inference has had much attention in the statistical literature, both in practical data analysis, and from a theoretical and methodological point of view. See, e.g., Griffiths and Tavaré , Griffiths and Tavaré , Stephens and Donnelly , Stephens  and Griffiths and Tavaré . Given an (observable) random sample sample of individuals from the population at time
, most contributions in the literature have aimed at making frequentist or Bayesian inference on quantities related to the genealogy of the sample, e.g., the number of non-mutant lineages, the age of the alleles in the sample, the time of the most recent common ancestor, the age of particular mutations in the ancestry, etc. This is typically done in a parametric setting by using suitable summary statistics of the data, or by combining the full data with suitable approximations of the likelihood function obtained via importance sampling or Markov chain Monte Carlo techniques. In this paper, instead, we introduce a Bayesian nonparametric predictive approach that makes use of the observed sample ofindividuals to infer quantities related to the genealogy of an additional unobservable sample. For instance, how many non-mutant lineages would I expect a time ago if I enlarged my initial observable sample by unobservable samples? How many of these non-mutant lineages have small frequencies? In the context of ancestral inference, these questions are of great interest because they relate directly to the speed of evolution via the rate of turnover of alleles. See Stephens and Donnelly  and references therein for a comprehensive discussion. Our approach answers these and other questions under the (prior) assumption that the genealogy of the population follows the Kingman coalescent. To the best of our knowledge, this is the first predictive approach to ancestral inference in this setting.
The paper is structured as follows. In Section 2 we recall some preliminaries on the neutral FV process and Kingman’s coalescent, we introduce new results on ancestral distributions, and we characterize the posterior distribution of the number of non-mutant lineages at time back in the enlarged -coalescent. A suitable refinement of this posterior distribution and a class of Good-Turing type estimators for ancestral inference are also introduced. In Section 3 we show how to implement our results, and we present a numerical illustration based on genetic data. Section 4 contains a discussion of the proposed methodology, and outlines future research directions. Proofs are deferred to the Appendix.
2 Ancestral posterior distributions
All the random elements introduced in this section are meant to be assigned on a probability space , unless otherwise stated. Let be a compact metric space. For any and any non-atomic probability measure on , let be the distribution of a Dirichlet process on with base measure . We refer to Ferguson  for a definition and distributional results on the Dirichlet process. In our context will correspond to the mutation rate, to the stationary distribution of the mutation process, and to the stationary distribution of the population type frequencies in the diffusion limit. For any let where is a pure death process, almost surely, with rate . It is known from Griffiths  that
for each . Here and elsewhere, for any nonnegative we use and, for any , and , i.e. rising and falling factorial numbers. If , with being the mutation rate of the WF model, then is the number of non-mutant lineages surviving from time to time in the large population limit of the WF model when the sample size , and is the total backwards-in-time rate of loss of lineages when there are currently lineages. The pure death process is typically referred to as the ancestral (genealogical) process. See, e.g., Griffiths  and Tavaré  for a detailed account on the ancestral process.
Let be the space of probability measures on equipped with the topology of weak convergence. The neutral FV process is a diffusion process on , namely a probability-measure-valued diffusion. Here we focus on the neutral FV process whose unique stationary distribution is . Among various definitions of this FV process, the most intuitive is in terms of its transition probability functions. In particular Ethier and Griffiths  shows that has transition function given for any and by
For each and , the transition probability function (2.2) is a (compound) mixture of distributions of Dirichlet processes. More precisely, recalling the conjugacy property of the Dirichlet process with respect to multinomial sampling (see, e.g., Ferguson ), Equation (2.2) reads as the posterior law of a Dirichlet process with base measure , where: i) the conditioning sample is randomized with respect to the -fold product measure ; ii) the sample size is randomized with respect to the marginal distribution of the ancestral process, i.e. in (2.1).
Consider a population whose composition evolves in time according to the transition probability function (2.2). Given a sample from such a population at time , the -coalescent describes the genealogy of such a sample. We denote by the -coalescent of the sample , and by the number of non-mutant lineages surviving from time to time in . The distribution of was first introduced by Griffiths , and further investigated in Griffiths  and Tavaré . In particular, Griffiths  showed that
for any and each time . If denotes the time until there are non-mutant lineages left in the sample, then the following identity is immediate:
Note that (2.4) with gives the distribution of the time of the most recent common ancestor in the sample, which is of special interest in genetic applications. For any the stochastic process may be thought as the sampling version of the ancestral process . Indeed it can be easily verified that (2.3) with coincides with the probability (2.1). There are two different, but equivalent, ways to describe the evolution of with respect to the transition probability functions (2.2). Let be a sample from the population at time , i.e. a sample from a non-atomic probability measure. The first way follows Kingman’s coalescent and looks backward in time: based on , for any define to be the total number of equivalent classes in the -coalescent starting with , that is is the total number of non-mutant ancestors of at time in the past. An alternative way is forward looking in time and follow the lines of descent: based on , for any define to be the number of individuals that have non-mutant descendants at time . It is known from the works of Griffiths  and Tavaré  that and have the same distribution, which coincides with (2.3).
2.1 New results on ancestral distributions
We start by introducing a useful distributional identity for . For any let be independent random variables identically distributed according to a non-atomic probability measure and, for any , let be a random sample from a Dirichlet process with atomic base measure . The random variables will be used to denote the genetic types of the ancestors. In order to keep track of the different ancestors, we add the non-atomic requirement for the law so that different ancestors will be represented by different types. Due to the almost sure discreteness of the Dirichlet process, the composition of the sample can be described as follows. We denote by the labels identifying the distinct types in which do not coincide with any of the atoms ’s. Moreover, we set
where denotes the number of ’s that coincide with the atom , for any ;
where denotes the number of ’s that coincide with the label , for any ;
denotes the number of ’s which do not coincide with any of the labels .
Observe that the statistic includes all the information of , i.e., is sufficient for . See Appendix .1 for a detailed description of the distribution of . Now, consider the random variable
Precisely, denotes the number of distinct types in the sample that coincide with the atoms ’s. In the next theorem we derive the distribution of , and we introduce a distributional identity between and a suitable randomization of with respect to . See Appendix .1 for the proof.
For any let be a sample from a Dirichlet process with atomic base measure , for . Then, for
for each , where is the death process with marginal distribution (2.1).
The distributional identity (2.7) introduces a Bayesian nonparametric interpretation on the sampling ancestral process , in the sense that it establishes an interplay between and the sampling from a Dirichlet process prior with atomic base measure . Intuitively, the identity (2.7) can be explained by taking the view of the forward looking description of the evolution of . Starting at time with an infinite number of individuals sampled (at random) from a non-atomic probability measure, the number of non-mutant lineages that survive at time is described by the ancestral process . Now, consider a random sample of individuals at time 0, that is a random sample from a non-atomic probability measure which allows us to distinguish individuals. Then the sampling ancestral process describes the number of non-mutant lineages surviving at time , for any . The transition probability function (2.2) then says that conditionally on , the genetic types of the descendants (including mutants) of the individuals at time correspond to a sample from a Dirichlet process prior with atomic base measure . That the population size remains constant comes from the Wright-Fisher approximation. Given , the genetic types of the ancestors must have types that belong to , the distinct types of the ancestors. The distribution of is independent of the exact distribution of as long as they are distinct. Thus the conditional distribution of given is simply the distribution of .
We now present a novel refinement of the ancestral distribution (2.3). Such a refinement takes into account the lines of descent frequencies at time of lines beginning at individual roots at time and surviving to time ; these frequencies do not include new mutants. Among the non-mutant lineages surviving from time to time , we denote by the number of non-mutant lineages at time in the past having frequency at time , for any . We are interested in the distribution on . Let be the usual random sample from a Dirichlet process with atomic base measure , and define
Precisely, denotes the number of distinct types in that coincide with the atoms ’s and have frequency . From the discussion above it is clear that, given , has the same distribution as . Thus we obtain that
and the marginal distribution of is given by(2.1). Note that the random variable represents a natural refinement of in the sense that
We stress the fact that may be different from , since frequency counts do not include new mutants. Although a large amount of literature has been devoted to the study of distributional properties of , to the best of our knowledge has never been investigated, and not even introduced, before. In the next theorem we derive the distribution of . See Appendix .1 for the proof.
For any let be a sample from a Dirichlet process with atomic base measure , for . Then, for
where denotes the minimum between and the integer part of .
According to the distributional identity (2.8), the distribution of follows by combining the ancestral process (2.1) with the distribution (2.9), for any . As a representative example of the distribution of , we consider the case . The distribution of is of special interest because it corresponds to the sampling ancestral distribution of ”rare” non-mutant lineages with frequency 1, i.e. non-mutant lineages composed by a unique individual. If we apply the distribution (2.1) to randomize in (2.9) with , then
for any and each . The study of finite and asymptotic properties of is out of the scope of the present paper, and it is deferred to future work. In the rest of this section we introduce a Bayesian nonparametric predictive approach to ancestral inference under the prior assumption that the composition of the population evolves in time according to (2.2). In particular, we consider a sample from such a population at time and we make use of (2.7) and (2.3) to determine the conditional, or posterior, distribution of given . A natural refinement of this conditional distribution is also obtained by means of the identity (2.8).
2.2 Ancestral conditional distributions
Let be a sample from a Dirichlet process with atomic base measure and, for any , let be an additional sample. More precisely may be viewed as a sample from the conditional distribution of the Dirichlet process with base measure , given the initial sample . We refer to Appendix .2 for a detailed description of the composition of . We denote by the number of ’s that coincide with the atom , and we introduce the random variable
which denotes the number of distinct types in the enlarged sample that coincide with the atoms ’s. Observe that if we set then reduces to in (2.5). We also introduce the following random variable
which is the number of distinct types in the additional sample that coincide with the atoms ’s that have frequency in the samples . In the next theorem we derive the conditional distribution of given , and the conditional distribution of given . Interestingly, it turns out that such conditional distributions depend on solely through the statistics and , respectively. See Appendix .2 for the proof.
For any and let be a sample from a Dirichlet process with atomic base measure , for . Then one has
Therefore, and are sufficient to predict and , respectively.
The predictive sufficiency of in (2.13) plays a fundamental role for deriving the conditional counterpart of the sampling ancestral distribution (2.3). In particular, consider a population whose composition evolves in time according to (2.2), and let be a sample of individuals from the population at time . Furthermore, for any let be an additional unobservable sample. The identity (2.7) and the sufficiency of imply that the conditional distribution of given can be obtained by randomizing the parameter in (2.13) with respect to the distribution
). Moments of (2.16) are obtained by randomizing , with respect to (2.15), in the corresponding moments of (2.13) given in (.2). Equation (2.16) introduces a novel sampling ancestral distribution under the Kingman coalescent. Observe that, due to the identity (2.4), the distribution (2.16) leads to the conditional distribution of the time until there are non-mutant lineages left in the -coalescent.
We now consider a refinement of (2.16) which takes into account the frequency counts of non-mutant lineages. Specifically, we determine the conditional distribution of the number of non-mutant lineages surviving from time to time in whose frequency in the lineages ancestral to the initial sample is . As a representative example we focus on . Due to (2.8) and the sufficiency of to predict , the conditional distribution of given is obtained by randomizing the parameter in (2.14) with respect to the distribution
for any and each . Moments of (2.18) are obtained by randomizing the parameter , with respect to (2.17), in the corresponding moments of (2.14) given in (.28). We stress that the sufficiency of to predict plays a fundamental role for determining the conditional distribution of .
If we interpret the FV transition probability function (2.2) as a prior distribution on the evolution in time of the composition of the population, then the conditional distributions (2.16) and (2.18) take on a natural Bayesian nonparametric meaning. Specifically, they correspond to the posterior distributions of and , respectively, given the initial sample whose ancestry features non-mutant lineages of which are of frequency . Given the information on and from the initial observed sample, the expected values of (2.16) and (2.18) provide us with Bayesian nonparametric estimators, under a squared loss function, of and . It is worth pointing out that and , and in general the -coalescent , are latent quantities, in the sense that they are not directly observable from the data. However, one can easily infer and , as well as the mutation parameter , from the observed data and then combine their estimates with the posterior distributions (2.16) and (2.18). This approach for making predictive ancestral inference will be detailed in Section 3. We conclude this section with a proposition that introduces an interesting special case of the posterior distributions (2.16) and (2.18). See Appendix .2 for the proof. Let
be the number of new non mutant lineages, that is is the number non-mutant lineages at back in the additional sample of size which do not coincide with any of the non-mutant lineages at time back in the initial sample of size .
Consider a population whose composition evolves in time according to the FV transition probability function (2.2). Then for each one has
Proposition 2.1 introduces two Bayesian nonparametric estimators for the probability of discovering non-mutant lineages surviving from time to time . This proposition makes explicit the link between our results and the work of Good , where the celebrated Good-Turing estimator has been introduced. Given a sample of size from a population of individuals belonging to an (ideally) infinite number of species with unknown proportions, the Good-Turing estimator provides with an estimate of the probability of discovering at the -th draw a species observed with frequency in the initial sample. Of course corresponds to the case of the probability of discovering a new species at the -th draw. Within our framework for ancestral inference under the FV prior assumption (2.2), the probabilities (2.19) and (2.20) may be considered as natural Bayesian nonparametric counterparts of the celebrated Good-Turing estimators. Precisely: (2.19) is the Bayesian nonparametric estimator of the probability of discovery in one additional sample a new non-mutant lineage surviving from time to time ; (2.20) is the Bayesian nonparametric estimator of the probability of discovery in one additional sample a non-mutant lineages surviving from time to time and whose frequency is in the initial sample.
In this section we show how to use the results of the previous section by applying them to real genetic dataset. Consider a population whose composition evolves in time according to the FV transition probability function (2.2), and suppose we observe a sample of individuals taken from a Dirichlet process with base measure . Recall that, under this assumption on the evolution of the population, the law of the Dirichlet process with base measure is the unique stationary distribution of the neutral FV process. The sample then consists of a collection of distinct genetic types with corresponding frequencies . In particular if denotes the probability of a sample , which features genetic types with frequencies , then
see Ewens  for details. With a slight abuse of notation, we denote by a random variable whose distribution coincides with the conditional distribution of given . As we pointed out at the end of Section 2, in order to apply the posterior distributions (2.16) and (2.18) we have to estimate the unobservable quantities and , respectively. Using a fully Bayesian approach, estimates of and are obtained as the expected values of the posterior distributions of and given , with respect to some prior choice for . For simplicity we focus on the posterior distributions of and , and we resort to an empirical Bayes approach for estimating . Specifically, we use the maximum likelihood estimate for originally proposed by Ewens , which is obtained from the likelihood (3.1) by numerically finding the root of a certain polynomial in . From the point of view of ancestral inference, the distributions of and correspond respectively to the questions: How many non-mutant genetic ancestors to the sample existed a time ago? And how many non-mutant genetic ancestors existed whose type appeared with frequency among those ancestors?
First we consider the posterior distribution of . Under the Kingman coalescent model in which mutation is parent-independent, the distribution of this random variable is straightforward: indeed it is well known that the distribution of coincides with the distribution of , for any . This holds because the coalescent process for a sample of size can be decomposed into its ancestral process , and a skeleton chain taking values in marked partitions of the set . See Watterson  and references therein for details. These two processes are independent, and the sample is informative about only the skeleton chain. Thus, the distribution of is given by (2.3). In particular if we denote by the maximum likelihood estimate of , then an estimate of is given by the following expression,
which is the expected value of the distribution (2.3) with replaced by its estimate . Thus, we can plug in the estimate to the posterior distribution (2.16) and then predict the number of non-mutant lineages surviving from time to time in the enlarged sample of size , given the initial observed sample of size . Observe that under parent-independent mutation the information of the initial sample affects the prediction only through the estimate .
The posterior distribution of is not trivial. Differently from the distribution of , which is independent of , the sample is informative for . In order to derive the distribution of , one strategy would be to study a posterior analogue of the marked-partition-valued process introduced in the work by Watterson , and then project it onto its block sizes. The resulting formulas are, however, unwieldy. Our preferred approach is via Monte Carlo simulation, since the posterior transition rates of evolving backwards in time are easy to describe. In particular, if we set
then the transition rate matrix for when currently is
with initial condition
See Hoppe  for details on . In words, lineages are lost at rate . At such an event, the lineage selected to be lost is chosen uniformly at random. If that lineage contributed to