Recent years have witnessed a proliferation in the number of methods for inferring species trees from whole genomes (Bryant and Hahn, 2019). Some have gone so far as to describe this as a paradigm shift in phylogenetics (Edwards, 2009)
. It is now widely accepted that (i) phylogenetic analysis needs to take account of the varying evolutionary histories of different parts of the genome, and (ii) estimation of evolutionary relationships between populations (or species) should take account of evolutionary dynamicswithin populations (or species).
In the systematics community, taking account of population dynamics in a phylogenetic context has meant implementing some version of the multispecies coalescent (Liu et al., 2008; Liu, 2008; Larget et al., 2010; Heled et al., 2013). The coalescent models the distribution of gene trees within a population; the multispecies coalescent is its natural extension to multiple populations or species. One of the key advantages of the coalescent is that, for neutral mutations, the gene tree can be decoupled from the mutation process, a feature which forms the basis of many implementations of the model. Nevertheless, the coalescent has its limitations. It is difficult to incorporate selection, for example. Also, the running time of coalescent based methods (e.g. BEAST, *BEAST, etc.) depends critically on the number of individuals being sampled. In *BEAST a gene tree is sampled for each loci, with the number of leaves in the gene tree given by the number of individuals sampled from all populations.
Bryant et al. (2012) showed that explicit sampling of gene trees for each locus could be avoided in the case of unlinked binary markers. This is appropriate for the estimation of species trees from unlinked single nucleotide polymorphisms (SNPs). Their algorithm was implemented in Snapp, and can analyse data sets with hundreds of thousands of loci and up to 200 individuals (depending on computing resources and sampling difficulties). Unfortunately Snapp does not scale that well as the number of individuals increases. The running time of the algorithm algorithm is with individuals, which quickly becomes limiting. The far most common user complaint about Snapp to developers has been running speed.
Rather than tinker with the Snapp algorithm, we have taken a completely new approach with a different kind of model. Like Gutenkunst et al. (2010), Sirén et al. (2010) and Lukic and Hey (2012) we use diffusion models, though we apply them in a new way. The net result is that we can carry out Snapp-type analyses but with essentially no limits on the number of individuals being sampled.
Diffusion models, like the coalescent, are a convenient approximation of the standard Wright-Fisher discrete models. In fact, diffusion models pre-date coalescent models by a good fifty years (Wright, 1931; Kingman, 1982). Whereas coalescent models describe the distribution of the genealogy or gene tree for a sample of individuals, diffusion models describe the frequency of an allele in the population as a whole. There are straightforward extensions incorporating selection.
There are three challenges to overcome working with a model of allele frequencies. First, we do not actually observe the population frequencies, we just observe a sample drawn from that population. This problem is easily solved by incorporating the sampling step explicitly into our likelihood.
Second, gene frequencies are continuous traits, so it is not possible to sum over ancestral trait values as in Felsenstein’s pruning algorithm (Felsenstein, 1981). For this, we follow a numerical approach developed in Hiscott et al. (2016), though with some new twists to improve efficiency and accuracy.
Third, diffusion models do not give explicit transition probabilities or densities: these are only available via partial differential equations (PDEs). We solve these differential equations numerically, extending a standard spectral approach from a single population to the entire species tree. We note that there is significant potential for extending our methods to other diffusion models.
The outline of this paper is as follows: In Section 2 we briefly review the population genetic diffusion models and some relevant theory, describe the model on a species tree and discuss other related models. In Section 3 we spend some time setting up the mathematics developed to compute the model likelihood. Thereafter we present a numerical algorithm for computing the likelihood and discuss the Bayesian software package Snapper. In Section 4 we assess the perfomance of Snapper and compare running times to Snapp. In Section 5 we showcase the capability of Snapper on a SNP dataset from freshwater turtles. We conclude the paper with a summary of method implementation and discuss some possible model extensions for future work.
2 Models and Methods
2.1 The Wright-Fisher diffusion
Our starting point is the Wright-Fisher discrete model of drift and mutation. Suppose we have a population of diploid individuals, giving copies of each autosomal gene. We consider a gene with two alleles (say ‘red’ and ‘green’) and, to being with, a model with non-overlapping generations. With random mating the number of red alleles in generation
has a binomial distribution with parametersand , where is the probability of mutating from red to green and the probability of mutating from green to red. In this way the number of red alleles will follow a random walk with discrete time (generation number) and states (from to ).
The idea behind the diffusion models is to approximate this discrete random walk by a continuous random walk that is easier to work with analytically. We set up the approximation so that the larger gets, the better the approximation fits. We describe the proportion of gene copies having the red allele, rather than the total count. The state space of the process will be the interval . Instead of considering discrete generations, we construct a random walk which is continuous in time.
Diffusion models for gene frequencies involve a rescaling of time. There are simple, practical, reasons for this. As the population size gets larger, the rate of genetic drift decreases, ultimately approaching zero drift in the limit. However the effect of drift is something that we would like to model. For this reason we change the units of time so that the rate of drift remains approximately the same as increases, eventually converging to some non-zero amount. The convention for diploid populations is to use a scale where unit of time corresponds to generations (one coalescent unit).
If we are to change the units of time, we need to adjust the rate of mutation accordingly. Therefore the overall rate of change due to mutations from green alleles to red alleles is . We adopt the standard notation and define and .
For each value of we let denote the density of the allele proportion at time
, noting that this allele proportion is a continuous random variable with state space. Surprisingly, there is very little choice over what the function might be, after a few basic assumptions are made. See Etheridge (2011) for technical details and McKane and Waxman (2007) for a discussion about how we need to incorporate point masses at and into .
As is often the case in mathematical modelling, we work with the function indirectly using a PDE. Stochastic process theory (Øksendal, 2003, chap. 5) tells us that the function satisfies the PDE
This equation by itself is not enough to uniquely determine . We also need to specify what looks like at the boundaries. To specify that the distribution of the initial state is given by some density we add the initial condition
Note that to specify the initial value exactly, we need to be a Dirac-delta function which is essentially a infinitely thin spike on one value . Even with these initial conditions, the PDE (1) does not uniquely determine the function . We also need to add boundary conditions at and to guarantee that the probability of going outside the interval is always zero. Writing this as a condition on integrals of and then plugging in the PDE, eventually leads to what in physics is known as a zero-flux condition
when or . See McKane and Waxman (2007) for details.
The PDE (1) together with the boundary conditions (2) and (3) are just a mathematically convenient, if not particularly transparent, way to describe the function . The function , in turn, is just a way to approximate the probabilities in the original discrete process. The diffusion approximation works well only if mutations rates are assumed small to begin with. This will give a reasonably small error in the diffusion approximation. As shown in Ethier and Norman (1977) we still have convergence in expectation with an error bound on the diffusion approximation behaving like . Note that if or are large (compared to, say, ) then the diffusion approximation will fail miserably. In a recent simulation study, Tataru et al. (2017) used simulations to quantify the error from the diffusion approximation in small populations, and found that diffusion models gave reasonable approximations even when the population has fewer than 100 individuals.
2.2 Modelling allele frequencies on a species tree
The diffusion model describes how allele frequencies change over time in a single population. The model extends directly to multiple populations in a species tree (Sirén et al., 2010). As in Bryant et al. (2012) we think of the root of the species tree to be at the top of the tree and the leaves at the bottom. The model describes evolution of the allele frequencies from the top of the tree to the bottom.
The allele frequency at the root has a distribution given by the stationary density of the diffusion model (in this case, a beta distribution). The allele frequency at the bottom of a branch has a distribution given by the diffusion model with an initial value equal to the allele frequency at the top of the branch. At a speciation, the two daughter populations have the same allele frequency as the parent population.
We now formalise these ideas. Suppose that the branches in the species tree are numbered where, for convenience, the branches adjacent to the leaves are numbered . Let denote the allele frequency immediately below the top of branch . Let denote the allele frequency immediately above the bottom of branch . Let denote the allele frequency at the root.
At the root, , the proportion of red alleles in the ancestral population equals the stationary distribution of the diffusion model. The stationary distribution of the diffusion equation (1) has a beta distribution (Ewens, 2004) with density
Along any branch in the species tree, the changes in allele frequencies are modelled using the diffusion process. The allele frequency at the start of the branch gives the initial density, , for . The distribution of the allele frequency at the end of the branch is then given by , where is the length of the branch in units of generations and .
At a speciation, we make the assumption that there is no correlation between allele type and speciation. Hence the two descendant species are assumed to have identical allele frequencies to the parent node.
We have, therefore, a model for allele frequencies over the whole tree. However allele frequencies at the leaves are not directly observed. Instead we have a sample of individuals, giving allele copies taken from each species . If is the red allele frequency for the population then the observed number of red alleles in the the sample for this gene has binomial distribution with parameters and , so that
see Sirén et al. (2010).
In Algorithm 1 we describe how to simulate allele frequencies under this model. Note that in a pre-order traversal we visit every node in order so that the children of a node are always visited after the node itself.
We use the algorithm of Jenkins et al. (2017) to simulate diffusions along each branch. Alternative methods for simulating diffusions include truncating a spectral expansion of the transition function (Song and Steinrücken, 2012; Steinrücken et al., 2014) or numerical solutions of the Kolmogorov equations (Williamson et al., 2005; Bollback et al., 2008; Gutenkunst et al., 2009). Care needs to be taken when implementing these methods since approximation errors close to the boundary can become large, a problem that Jenkins et al. (2017) explicitly avoid.
2.3 Analytical formula for partial likelihoods
We now describe how to describe the likelihood functions analytically. Overall our approach is to use dynamic programming, as in Felsenstein’s pruning algorithm (Felsenstein, 1981). For each node and each ancestral state , which is our case is an allele frequency, we define two partial likelihoods, and . The first is the probability of observing all allele counts for taxa in the subtree rooted at , conditional on the state at the bottom of branch being equal to . The second is defined in the same way, but is instead conditioned on the state at the top of the branch. A similar approach was used in Bryant et al. (2012).
We now show how to describe these partial likelihoods recursively for an individual site .
2.3.1 Partial likelihoods at the leaves
Suppose that node is a leaf and that we have sampled diploid individuals from this population. Let denote the number of observed gene copies carrying the red allele for site . If is the proportion of red alleles in the population then has a binomial distribution with parameters and . Hence the partial likelihood at a leaf is given by
2.3.2 Partial likelihoods at a speciation
Let and be the children of node . We assumed that the allele frequencies for daughter populations after a speciation are exactly those of the parent population before speciation. The partial likelihoods at the bottom of the branch above is then the product of partial likelihoods at the tops of the branches immediately below , so
for all .
2.3.3 Partial likelihoods along a branch
Let be the effective population size for the branch directly above node , and let denote the length of the branch above node , measured in numbers of generations. Then equals the length of the branch in coalescent units. To derive the partial likelihood at the top of this branch, we integrate over all frequency values at the bottom of the branch.
In this equation, is the density for the allele frequency the bottom of the branch conditional on an initial value at the top of the branch, while is the partial likelihood conditional on allele frequency at the bottom of the branch.
To get a handle on (7) we define
Then satisfies the PDE,
with initial condition
2.3.4 Likelihood at the root of a tree
Equations (5), (6) and (7) can be applied to the entire tree, starting with the leaves and moving upwards towards the root. They define the root likelihood , which is the probability of observing all allele counts for all populations, conditioning on the allele frequency being equal to .
To define the likelihood of the tree we now integrate over the allele frequency at the root, multiplied by the stationary density of given in (4):
This is the probability for a single site . The likelihood from all markers is found by multiplying the probabilities from each site together (or, in practice, summing the log-likelihoods)
2.4 Translation of parameters
One of the more confusing aspects of working across population genetics and phylogenetics is the different ways that different communities parameterize different variables. In this section we summarise the parameters and outputs for our model, and show how they might be converted into other formulations.
There are four groups of input variables in our model. These are the variables that could conceivably be inferred from data using the likelihood function. They are
The species tree.
The branch lengths in the species tree. In our model, we measure the length of a branch in number of generations. Let denote the length of the branch above node .
The mutation rates and giving the probabilities of mutating from a red to a green allele or a green to a red allele, each generation.
The effective population size for each branch (that is, the branch above node ).
Different methods for inferring species trees describe these parameters in different ways. The convention in phylogenetics is to express branch lengths in terms of expected substitutions per site. The use of the term ’substitution’ is confusing, as the whole premise behind the multispecies coalescent is to model incomplete substitutions. However for neutral mutations, the rate of substitutions equals the rate of mutations. Under our model, the rate of mutations per generation is
so a branch of length generations corresponds to a branch length of expected substitutions (mutations) per site.
Methods which ignore the branch lengths in gene trees are not able to infer both population sizes and branch lengths, and instead use a single parameter per branch, typically measured in coalescent units (e.g. (Vachaspati and Warnow, 2015; Liu et al., 2010)). If is the length of a branch in numbers of generations, the length of the branch in coalescent units is given by .
In our model, we parameterize effective population size for branch directly as . Snapp (Bryant et al., 2012), Best (Liu, 2008) and BPP (Yang, 2015) instead use as the parameter for effective population size. Under an infinite sites model, equals the expected proportion of sequence differences between two individuals from the same population. In a finite sites model, is an overestimate of this expectation, due to backward mutations.
2.5 Dealing with confounded rates and rate matrices
An important issue with the diffusion model as we have described it, and indeed with many multispecies coalescent models, is that there is an identifiability problem with rates. If the mutation rates are multiplied by some constant and at the same time branch lengths and population sizes are multiplied by , the probability of the data remains the same.
One solution is to estimate the mutation rate beforehand and use this value, or a prior distribution around that value, to include as part of our model. The prior distribution for and is then reformulated so that (14) is satisfied. This strategy is used by StarBeast, where the average substitution (mutation) rate is fixed ahead of time.
An alternative strategy is to express branch lengths in terms of expected substitutions (mutations) per site and population sizes in terms of the parameter, as both of these are invariant to a choice of , (Yang, 2015; Bryant et al., 2012). As a consequence, the effective population sizes cannot be inferred without additional information. This approach adapts well to phylogenetics where it is customary to describe mutation rates using a normalised rate matrix, such as the Jukes-Cantor and HKY85 matrices
In this matrix, the stationary probabilities for the nucleotides are , the parameter controls the ratio of transitions to transversions, and the diagonal elements are chosen to make rows sum to zero. The Jukes-Cantor matrix on the left is already normalised, while in the HKY85 model would be chosen to make the overall substitution rate equal to .
These matrices can be incorporated directly into Snapper, either using prior information for , or when using the same branch length and population parameter scheme as BPP. For any site, if and are the two nucleotides present and we (arbitrarily) assign red to and green to then the appropriate choices for and are simply and . For example, under the HKY85 model, if the red allele is nucleotide A and green allele is nucleotide C then the corresponding rates are and . We are approximating a four state model with a two state model, but the error introduced should be minimal in the absence of divergences or highly variable sites.
2.6 Comparison with allele frequency spectrum methods
We note that diffusion models are already used in population genetics to approximate changes in the allele frequency spectrum (AFS) (Gutenkunst et al., 2009; Lukic and Hey, 2012; Racimo et al., 2016). For each value of between and , the AFS gives the proportion of sites for which the derived allele appears in the proportion of the sample. For example, we might consider all sites where the the derived allele appears in 30 of the 100 sampled genomes. If of all sites fall into this category then the AFS value for 0.3 will be .
Methods based on the AFS to infer genetic parameters for a single population are widespread in the literature (Boitard et al., 2016; Lapierre et al., 2017; Nielsen, 2000). In all of these analyses, it is assumed that the population dynamics have achieved stationarity when the AFS predicted by the model is used.
When there are multiple populations divergence times become important. Suppose that there are
species. For every vectorof numbers between and , the joint-AFS gives the proportion of sites where the derived allele appears in a proportion of the individuals in the first population, of the individuals in the second population, and so on. While the AFS for each population by itself will be the stationary AFS, the joint-AFS for multiple populations depend on the time since separation. Fortunately the PDE determining the predicted joint-AFS is a straight-forward extension of the PDE for a single population (Gutenkunst et al., 2009; Lukic and Hey, 2012; Racimo et al., 2016).
There are two main advantages of an approach based on the joint-AFS when compared to the method we describe here. First, the PDE for the joint-AFS only needs to be solved once for all sites, whereas in our approach we end up solving the PDEs once for each (distinct) site. Second, migration between populations can be incorporated quite simply into AFS calculation, whereas it would break down the dynamical programming strategy that we use.
The main disadvantage of the joint-AFS strategy is that the PDE has as many dimensions as the number of species, and suffers from the curse of dimensionality (Gutenkunst et al., 2009), leading to an exponential growth in the size of grid used by standard numerical methods. This factor severely limits the number of species which can be considered concurrently, whereas in our approach the algorithm scales linearly with the number of species.
3 Computing the likelihood efficiently
In the previous section, we showed how the recursions for the partial likelihoods can be derived analytically. We did not show how to actually compute those likelihoods. Indeed, computing the likelihoods amounts to an extremely high dimensional integration problem. Our approach is to use numerical techniques combined with dynamic programming. While the likelihoods we compute are approximate, we can still bound the error.
3.1 Shifted Chebyshev polynomials
Hiscott et al. (2016) describe a general strategy for computing likelihoods numerically whereby partial likelihoods are evaluated on a mesh of values at each node, and accurate quadrature methods used to carry out the actual computation. We will extend the same strategy by using a set of basis functions to express approximate partial likelihoods instead of a mesh of values. The basis functions are cleverly chosen to help solve equation (9) efficiently and accurately.
The basis functions we use are called the shifted Chebyshev polynomials of the first kind, denoted
Shifted Chebyshev polynomials are defined on and have particularly nice properties (Fox and Parker, 1968). They also have a lot of equivalent definitions, but the simplest uses the following recursion,
The shifted Chebyshev polynomials are related to the better-known Chebyshev polynomials by the identity: . That is, they are obtained by shifting and scaling the Chebyshev polynomials to have domain , see Mason and Handscomb (2002).
There are two main ways of using shifted Chebyshev polynomials to approximate and as functions of . The first is to approximate the function directly as a linear combination of the shifted Chebyshev polynomials, that is, finding sets of coefficients and so that for all in we have
It can be shown that the error in this approximation drops exponentially quickly as the number of basis functions increases. We therefore only need to store a few coefficients in order to evaluate the partial likelihood function at any , with small error.
The second way of obtaining an approximation is by determining the values and at a pre-specified set of points and then finding the unique combination of coefficients such that
. This is called polynomial interpolation. As it happens, there is a particular choice of pointsfor which we can switch back and forth between function values
and interpolation coefficients
These points are all in the interval with a denser packing of points nearer and . They equal the -coordinates of points spaced regularly around a semi-circle.
We use both coefficient values and function values when approximating the partial likelihoods and .
3.2 Approximate partial likelihoods at a leaf
We compute our approximate likelihood at the bottom of a leaf branch by simply evaluating the function
at the Gauss-Lobatto points. We then compute corresponding coefficients using the FFT algorithm (Trefethen, 2013).
3.3 Solving the diffusion model numerically
Shifted Chebyshev polynomials provide the foundation for the numerical methods we use to solve the PDE (9). Note that, unlike the forward diffusion (1), solutions to this backward diffusion are nice smooth functions, meaning that we can avoid some of the numerical headaches encountered by those using the forward diffusion directly (Lukic and Hey, 2012).
We approximate in terms of shifted Chebyshev polynomials
Let denote the vector of values . Formulas for the derivative of shifted Chebyshev polynomials lead eventually to a matrix such that
Indeed after some tedious algebra we can derive a matrix so that
for an approximate solution to the PDE, giving the system
with initial condition given as the Chebyshev coefficients of .
Like Bryant et al. (2012) we use rational approximations to on the negative real axis computed using a Caratheodory Fejer-approximation. Additionally, we use techniques adapted from Fox and Parker (1968) and outlined in the appendix (A.1) to take advantage of structure in the matrix , allowing an implementation of the Caratheodory-Fejer approximation which runs in time.
To summarise, consider a node and let denote the length (in coalescent units) of the branch connecting to its parent. Suppose with corresponding coefficients is already computed at the bottom of the branch. We then compute partial likelihood at the top of the branch by
Computing a numerical approximation for .
3.4 Approximate partial likelihoods at a speciation
Consider a node with two children and . We suppose that the approximations for the partial likelihoods and . We then have
|for all .||(21)|
This computation takes time.
3.5 Approximate likelihoods at the root
The final integration to carry out is over the partial likelihood at the root
see (11). The density for a beta distribution (4) can be infinite at the boundaries, meaning that numerical integration techniques such as Clenshaw-Curtis quadrature can give poor approximations. The solution is to separate out those parts which are difficult to integrate numerically and determine them analytically. Suppose we have an approximation of as a polynomial
We factor this polynomial as
and then compute
Noting that the first integral is now well-behaved while the second integral evaluates to by properties of the Beta distribution.
3.6 Computing the log-likelihood of a species trees
Algorithm 2 summarises the numerical calculation of the likelihood. This algorithm takes time per site where pre-processing, where is the number of species, the number of Chebyshev basis functions and is the number of individuals at a site. The pre-processing step involves counting the frequencies of allele types in each population. In practice this step could be carried out once per data set, rather than once per tree evaluated.
The conversion to and from coefficients to function values in the Chebyshev expansion in lines 7 and 9 each takes time, using the FFT algorithm reviewed above. Solution of the PDE in line 10 takes time.
The likelihood algorithm forms the core of a Bayesian inference software package, Snapper. The software is open-source and available to download at https://github.com/rbouckaert/snapper. Like Snapp, it takes biallelic data at multiple loci for multiple individuals, in a set of species and returns samples from the joint posterior distribution of (i) species phylogenies, (ii) species divergence times and (iii) effective population sizes. As in Snapp we implemented a dynamic cache-based system to store partial likelihoods on different subtrees and multithreading to take advantage of parallel computation on multiple core machines or graphics processing units. The range of prior distributions, and flexible prior specification, remains essentially unchanged.
The MCMC proposal function implemented is almost the same subset of BEAST2 (Bouckaert et al., 2019) move proposals implemented for Snapp. We add one new proposal selects a subtree of the species tree and scales all population sizes within that subtree simultaneously, a refinement of an existing proposal.
The user can specify number of basis functions. Some recommendations are made in the Snapper manual regarding the number of basis functions to use given the size of a dataset. Also contained in the software package are simulators and python scripts to integrate Snapper with the iPyRad data pipeline to assist with streamlining data analysis.
4 Performance assessment
We have tested the likelihood algorithm and the Snapper software extensively. Here we highlight some results related to the performance of Snapper. Details of additional simulations, including tests checking that posterior distributions recovered model parameters, can be found in the appendix (A.2).
We conducted the following experiments: (i) An assessment of error convergence as the number of Chebyshev basis functions increase (ii) A comparison of inferred parameters between Snapp and Snapper given a dataset (iii) An assessment of how computational time estimates for Snapp and Snapper scale in terms of size of the data set. All simulations and inference were run on a laptop with an Intel i7-8565U CPU.
4.1 Numerical error convergence
We assessed the rate of error convergence for different tree heights, tree sizes and tree topologies. In particular we wanted to verify that we have exponential convergence of error.
We drew parameter sets from the prior distributions specified by Table 1, for 4 species trees and 16 species trees with ‘caterpillar’ and ‘well-balanced’ tree shapes. Furthermore for each tree we specified expected tree heights to be either very small or very large. We then simulated a 1000 sites for each parameter set and computed the relative error of the likelihood for number of basis function, . Note that for this experiment we do not limit to values of the form . As there is no analytical expression for the likelihood so to assess convergence we compared values calculated to those for a large () number of basis functions. We summarize the results for 4 species trees in Figure 1 and for 16 species trees in Figure 2.
|Short trees||Tall trees|
|Population size||(2, 200)||0.01||(2, 200)||0.01|
|Tree||Shape and branch length||Yule(100)||0.00111…||Yule(0.1)||10|
We see that in all the cases we have that the error decreases exponentially with , that is, it decreases like for some . Error is smaller for long branch lengths since approximate solutions are typically lower degree polynomials. However when the branch lengths are very short the error decreases more slowly. There are good reasons for this. Firstly, when the partial likelihood function is spiked, the approximate solutions are high degree polynomials requiring more basis functions. Secondly, population sizes for short branches are intrinsically more difficulty to estimate, no matter which model or method is used. The only information we have about population sizes comes from the distribution of coalescent events, and on short branches there are simply insufficient coalescent events to make sound inference. Later, we address this by adopting a prior distribution on population sizes which introduces correlation between neighbouring branches.
Apart from branch length distribution, tree shape does not seem to have a noticeable effect on the error convergence.
4.2 Comparing inferences from Snapp and Snapper
Theory states that the multispecies coalescent model underlying Snapp and the diffusion model behind Snapper are approximations of one another (Griffiths and Spano‘, 2010), with differences decreasing as effective population size increases. To demonstrate this we analyze the rattlesnake data set in Chifman and Kubatko (2014). The data 3000 SNPs from a total of 59 individuals in 7 populations.
In Chifman and Kubatko (2014) it was reported that Snapp was not sampling from the posterior distribution efficiently enough. It quickly became apparent that the sole cause of difficulty was the inference of population sizes on branches with close to zero length. We selected the CIR prior in Snapp and in Snapper, which implements a model with correlated population sizes on nearby branches. Convergence was also improved by implementing a proposal which scales all population sizes within a randomly selected subtree.
We ran the MCMC chain for 750,000 iterations for both Snapp and Snapper. It took Snapp approximately hours to run and Snapper approximately hours to run. We give the summary statistics in the appendix (A.3).
Figure 3 uses Densitree (Bouckaert, 2010) to depict samples of the posterior distributions of the most likely densitrees for Snapp and Snapper. We print the posterior mean of effective population size above each branch. In both cases we see that there is only one topology in the 95% highest posterior density. Posterior distributions of branch lengths are mostly indistinguishable. There are some small differences in posterior distributions of populations sizes. However in all cases population sizes follow the same apparent distributions.
4.3 Running time of Snapper
We compare the computation time between Snapper and Snapp on a 4- and 8-taxa tree scaling the number of individuals at a leaf. We generated datasets of a 1000 SNPs given the number of species () and the number of individuals at a leaf (). The number of basis functions for Snapper was fixed at K = 33. Likelihood computation without caching was done on a computer with an Intel i3-7100 CPU. In Figure 4 we report the average times (in seconds) to compute the likelihood of a 1000 sites on 4- and 8-taxa trees, for Snapp and Snapper. As expected we can clearly see the advantage of Snapper, in that computation time is hardly affected by the number of individuals at a leaf. Note that there is some dependence on the number of individuals. As having large samples can lead to a need for more basis functions to ensure accuracy.
5 Analysis of freshwater turtle; Emydura macquarii
To illustrate the application of Snapper we reanalyse SNP data of Georges et al. (2018) from a group of freshwater turtles known collectively as Emydura. The range of Emydura extends almost the full length of the Australian continent from north to south. The group is currently recognized as a complex of closely related and morphologically distinct allopatric forms, variously regarded as species, subspecies or distinct morphological lineages (Georges et al., 2018). The individual samples were taken from 57 distinct water bodies to study and clarify some of the evolutionary relationships among different Emydura populations found in Eastern Australia river basins and along the coast. The sampling covers the coastal drainages of eastern Australia, from the Hunter River in the south (New South Wales) to the Normanby River (Queensland) in the north; the rivers of the Murray-Darling Basin (MDB), including the Paroo drainage, and the Lake Eyre Basin (LEB), and the intervening Bulloo Basin (Georges et al., 2018).
For the Snapper analysis we used 399 individuals divided into 41 populations with a total of 5,186 filtered SNPs. We expect short branch lengths for some of the populations due to geographical proximity of sampling sites. Therefore we specify a CIR prior that allows for correlation between populations on nearby branches. The analysis include two out-taxon populations therefore we expect tree height to be large. Hence we use a Yule prior with appropriate mean tree height. We have no prior information on mutation rates therefore we fix . See Table 2 for more details.
|Population size||CIR(2, 200,1)||0.01|
|Tree||Shape and branch length||Yule(2)||0.2|
We conducted a number of initial investigations to determine appropriate proposals and proposal weights. We then ran the sampler for 2,000,000 iterations with convergence assessed in Tracer (Rambaut et al., 2018) using estimated effective sample size. It took a total of hours for the sampler to run on a computer with an Intel i3-7100. CPU. We provide a complete list of summary statistics in the appendix (A.3).
depicts the tree with largest posterior probability with uncertainty. The shape of the tree in Figure5 agrees with the genetic distances and SVDquartets trees computed in Georges et al. (2018). We extend the anaylsis in Georges et al. (2018) by quantifying uncertainty around branch lengths and include an additional parameter, namely population size. In Figure 5 we see some uncertainty surrounding topology in the Fitzroy clade (samples 38 - 43). The Cooper Creek clade (samples 96 - 102) is sister to the North-east Coast clade (samples 10 - 19) in all three analyses, as is the sister relationship between the Hunter R (sample 92) and the Murray-Darling Basin (samples 96 - 115) populations. The East Coast (samples 29 - 60) and South-east Coast (samples 11 - 20) clade is sister to the Hunter-MDB clade (samples 20 - 31) in the Snapper tree, in agreement with the Fitch-Margoliash distance tree, whereas the arrangement is unresolved in the SVDquartets tree. Sensibly, the Kolan-Burnett-Mary (sampels 48 - 58) clade is sister to the Fitzroy clade in the Snapper tree, whereas the SVDquartets tree has the Mary-Burnett clade in an unresolved trichotomy with the Cooper-NE Coast clade (samples 96 - 102; samples 10 - 19) and the Fitzroy clade. The Snapper analysis supports relationships within the NE Coast clade that reflect drainage proximity better than does the SVD quartets analysis.
Also, we note that the bottom two population samples, i.e Emydura subglobosa and Emydura victoriae populations, are considered as out-taxon groups. Therefore we expect to see the divergence time from the rest of the populations to be much earlier. This however leads to poor resolution for the MDB clade (samples 112 - 131). Thus we present the MDB clade in Figure 6. The figure also depicts the extent of uncertainty surrounding the tree due to short branch lengths. As we discuss above, population size estimates here will be mostly dependent on the prior due to little information available on such short branch lengths.
In this paper we present Snapper a computationally more efficient method to supersede Snapp. Like Snapp the method takes biallelic markers sampled from individuals in multiple populations and computes the likelihood of the species tree topology together with branch lengths and population sizes. We achieved this computational efficiency by computing the likelihood using diffusion models rather than the multispecies coalescent. The Wright-Fisher diffusion and Coalescent are dual processes and it is therefore not surprising that the same inference can be made under these distinct but related model frameworks.
We utilize observed allele frequencies in the sample as initial conditions for the backwards diffusion. The advantages of using the backward diffusion are two-fold. Firstly the numerical solutions are stable and bounded. Secondly, it is clear how to define the boundary conditions.
The Snapper sampler is based on the Snapp sampler and uses the same move proposals, which are standard in the Beast2 software package. However to improve sampling efficiency for large trees we implemented an additional move to scale population sizes on subtrees.
We have reported some of the analyses performed in order to validate and more importantly convince the reader of the ability of Snapper to infer population genetic parameters. Further work needs to be done to establish regions in the appropriate parameter space where specific parameters are irrecoverable.
Finally, we note that the general framework for defining likelihoods in terms of backward diffusions can be extended to other difffusion models. These models can include additional parameters such as selection, migration and linkage disequilibrium. In some cases even the computational framework is reusable.
We thank Arthur Georges for making the SNP dataset of Emydura macquarii available. Also for his help with the biogeographic interpretation of the Emydura macquarii species phylogeny and comparison of Snapper analysis with previous analysis done on the Emydura dataset. Marnus Stolz received a doctoral scholarship from the NZ Marsden Fund (PIs David Bryant and Steven Higgins).
- Boitard et al. (2016) Boitard, S., W. Rodríguez, F. Jay, S. Mona, and F. Austerlitz. 2016. Inferring population size history from large samples of genome-wide molecular data-an approximate Bayesian computation approach. PLoS genetics 12:e1005877.
- Bollback et al. (2008) Bollback, J. P., T. L. York, and R. Nielsen. 2008. Estimation of 2Nes from temporal allele frequency data. Genetics 179:497–502.
- Bouckaert (2010) Bouckaert, R. R. 2010. Densitree: making sense of sets of phylogenetic trees. Bioinformatics 26:1372–1373.
- Bouckaert et al. (2019) Bouckaert, R., T. G. Vaughan, J. Barido-Sottani, S. Duchêne, M. Fourment, A. Gavryushkina, J. Heled, G. Jones, D. Kühnert, N. De Maio and others. 2019. BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. PLoS computational biology 15, 4:e1006650.
- Bryant et al. (2012) Bryant, D., R. Bouckaert, J. Felsenstein, N. A. Rosenberg, and A. RoyChoudhury. 2012. Inferring Species Trees Directly from Biallelic Genetic Markers: Bypassing Gene Trees in a Full Coalescent Analysis. Molecular Biology and Evolution 29:1917–1932.
- Bryant and Hahn (2019) Bryant, D. and M. Hahn. 2019. The concatenation question. in Phylogenetics in the Genomics Era (F. Delsuc, N. Galtier, and C. Scornavacca, eds.). (in press).
- Chifman and Kubatko (2014) Chifman, J. and L. Kubatko. 2014. Quartet inference from SNP data under the coalescent model. Bioinformatics 30:3317–3324.
- Drummond and Rambaut (2007) Drummond, A. J. and A. Rambaut. 2007. Beast: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology 7:214.
- Edwards (2009) Edwards, S. V. 2009. Is a new and general theory of molecular systematics emerging? Evolution: International Journal of Organic Evolution 63:1–19.
- Epstein and Mazzeo (2010) Epstein, C. L. and R. Mazzeo. 2010. Wright–Fisher diffusion in one dimension. SIAM Journal on Mathematical Analysis 42:568–608.
- Etheridge (2011) Etheridge, A. 2011. Some mathematical models from population genetics : École d’éte de probabilités De Saint-flour XXXIX-2009. Springer.
- Ethier and Norman (1977) Ethier, S. N. and M. F. Norman. 1977. Error estimate for the diffusion approximation of the Wright–Fisher model. Proceedings of the National Academy of Sciences 74:5096–5098.
- Ewens (2004) Ewens, W. 2004. Mathematial population genetics. I. Theoretical introduction. Interdisciplinary Applied Mathematics 2 ed. Springer, New York.
- Felsenstein (1981) Felsenstein, J. 1981. Evolutionary trees from dna sequences: a maximum likelihood approach. Journal of Molecular Evolution 17:368–376.
- Felsenstein (1992) Felsenstein, J. 1992. Phylogenies from restriction sites: a maximum-likelihood approach. Evolution 46:159–173.
- Fox and Parker (1968) Fox, L. and I. Parker. 1968. Chebyshev Polynomials in Numerical Analysis. Oxford Mathematical Handbooks Oxford University Press, London.
- Georges et al. (2018) Georges, A., B. Gruber, G. B. Pauly, D. White, M. Adams, M. J. Young, A. Kilian, X. Zhang, H. B. Shaffer, and P. J. Unmack. 2018. Genomewide SNP markers breathe new life into phylogeography and species delimitation for the problematic short-necked turtles (Chelidae: Emydura) of eastern Australia. Molecular Ecology 27:5195–5213.
- Griffiths and Spano‘ (2010) Griffiths, R. C. and D. Spano‘. 2010. Diffusion processes and the coalescent. Pages 358–379 in Probability and Mathematical Genetics: Papers in Honour of Sir John Kingman (N. H. Bingham and C. M. Goldie, eds.) London Mathematical Society Lecture Note Series. Cambridge University Press.
- Gutenkunst et al. (2010) Gutenkunst, R., R. Hernandez, S. Williamson, and C. Bustamante. 2010. Diffusion approximations for demographic inference: DaDi. Nature Precedings 5:1–1.
- Gutenkunst et al. (2009) Gutenkunst, R. N., R. D. Hernandez, S. H. Williamson, and C. D. Bustamante. 2009. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics 5.
- Heled et al. (2013) Heled, J., D. Bryant, and A. J. Drummond. 2013. Simulating gene trees under the multispecies coalescent and time-dependent migration. BMC evolutionary biology 13:44.
- Hiscott et al. (2016) Hiscott, G., C. Fox, M. Parry, and D. Bryant. 2016. Efficient Recycled Algorithms for Quantitative Trait Models on Phylogenies. Genome biology and evolution 8:1338–1350.
- Jenkins et al. (2017) Jenkins, P. A., D. Spano, et al. 2017. Exact simulation of the Wright–Fisher diffusion. The Annals of Applied Probability 27:1478–1509.
- Kingman (1982) Kingman, J. 1982. The coalescent. Stochastic Processes and their Applications 13:235–248.
- Lapierre et al. (2017) Lapierre, M., A. Lambert, and G. Achaz. 2017. Accuracy of demographic inferences from the site frequency spectrum: the case of the Yoruba population. Genetics 206:439–449.
- Larget et al. (2010) Larget, B. R., S. K. Kotha, C. N. Dewey, and C. Ané. 2010. BUCKy: gene tree/species tree reconciliation with Bayesian concordance analysis. Bioinformatics 26:2910–2911.
- Liu (2008) Liu, L. 2008. Best: Bayesian estimation of species trees under the coalescent model. Bioinformatics 24:2542–2543.
- Liu et al. (2008) Liu, L., D. K. Pearl, R. T. Brumfield, and S. V. Edwards. 2008. Estimating species trees using multiple-allele DNA sequence data. Evolution: International Journal of Organic Evolution 62:2080–2091.
- Liu et al. (2010) Liu, L., L. Yu, and S. V. Edwards. 2010. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evolutionary Biology 10:302.
- Lukic and Hey (2012) Lukic, S. and J. Hey. 2012. Demographic inference using spectral methods on SNP data, with an analysis of the human out-of-Africa expansion. Genetics 192:619–639.
- Mason and Handscomb (2002) Mason, J. C. and D. C. Handscomb. 2002. Chebyshev polynomials. Chapman and Hall/CRC.
- McKane and Waxman (2007) McKane, A. J. and D. Waxman. 2007. Singular solutions of the diffusion equation of population genetics. Journal of Theoretical Biology 247:849–858.
- Nielsen (2000) Nielsen, R. 2000. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics 154:931–942.
- Øksendal (2003) Øksendal, B. 2003. Stochastic differential equations, an Introduction with Applications. Universitext Springer.
- Racimo et al. (2016) Racimo, F., G. Renaud, and M. Slatkin. 2016. Joint estimation of contamination, error and demography for nuclear dna from ancient humans. PLoS genetics 12:e1005972.
- Rambaut et al. (2018) Rambaut, A., A. J. Drummond, D. Xie, G. Baele, and M. A. Suchard. 2018. Posterior summarization in Bayesian phylogenetics using tracer 1.7. Systematic Biology 67:901–904.
- Sirén et al. (2010) Sirén, J., P. Marttinen, and J. Corander. 2010. Reconstructing population histories from single nucleotide polymorphism data. Molecular Biology and Evolution 28:673–683.
- Song and Steinrücken (2012) Song, Y. S. and M. Steinrücken. 2012. A simple method for finding explicit analytic transition densities of diffusion processes with general diploid selection. Genetics 190:1117–29.
- Steinrücken et al. (2014) Steinrücken, M., A. Bhaskar, and Y. S. Song. 2014. A novel spectral method for inferring general diploid selection from time series genetic data. The Annals of Applied Statistics 8:2203–2222.
- Tataru et al. (2017) Tataru, P., M. Simonsen, T. Bataillon, and A. Hobolth. 2017. Statistical inference in the Wright–Fisher model using allele frequency data. Systematic biology 66:e30–e46.
- Trefethen (2013) Trefethen, L. N. 2013. Approximation theory and approximation practice. SIAM, Philadelphia.
- Vachaspati and Warnow (2015) Vachaspati, P. and T. Warnow. 2015. ASTRID: accurate species trees from internode distances. BMC Genomics 16:S3.
- Waldvogel (2006) Waldvogel, J. 2006. Fast construction of the Fejér and Clenshaw–Curtis quadrature rules. BIT Numerical Mathematics 46:195–202.
- Williamson et al. (2005) Williamson, S. H., R. Hernandez, A. Fledel-Alon, L. Zhu, R. Nielsen, and C. D. Bustamante. 2005. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proceedings of the National Academy of Sciences 102:7882–7887.
- Wright (1931) Wright, S. 1931. Evolution in mendelian populations. Genetics 16:97.
- Yang (2015) Yang, Z. 2015. The BPP program for species tree estimation and species delimitation. Current Zoology 61:854–865.
Appendix A Appendix
a.1 The algorithm for solving diffusions
In the main text, we introduced a matrix which plays a key role in the numerical solution of diffusions. The matrix is chosen so that if
The bottleneck in the numerical method we use is the repeated solution of linear systems that look like
for difference complex values and vectors . Using a direct method, these take time each as is lower triangular. However we can do better, using a trick described in Fox and Parker (1968). Here we give a very high level description of the approach.
The key idea is to apply integration twice to the LHS of (23). Using integration by parts multiple times we have
We introduce two new matrices and . The matrix is derived from properties of shifted Chebyshev polynomials, and is defined so that if is expanded as in (22) then
We then obtain
The usefulness of this follows from that fact that, with the exception of two rows, all the non-zero entries in and are on or near the digaonal: both matrices are almost banded. To solve we multiply both sides by and solve the sparse system that results. Overall, this now takes time.
a.2 Further details on the simulation studies
We conducted a simulation study to check whether the model posterior distributions recover model parameters. We simulate data from three different 4 species tree distributions. Parameter means are equal but we change the variance of the 4 species tree distribution. We then draw a 100 parameter sets from each 4 species tree distribution. Thereafter we simulate a 1000 SNPs for 32 individuals distributed over 4 species for each parameter set. For every simulated SNP dataset we ran two chains for 100,000 iterations, one with correct prior, i.e 4 species tree distribution used for simulation, and one with incorrect prior. See Table4 for details on priors. We then count the number of parameters that were recovered, i.e fall within the 95% highest posterior density of the MCMC chain. MCMC chains took on average 200 seconds to run. We report these counts in Table 3. We see that correct priors on the model has a high rate of parameter recovery. However, when incorrect priors are used recovery rates suffer for population size parameters close to the root. This does not come as a surprise due to information loss on branches as we go up the tree.
|Correct prior||Incorrect prior|
|Population size||(2, 200)||0.01||(2, 20)||0.1|
|Tree||Topology and branch length||Yule(10)||0.0111…||Yule(1)||1|
a.3 Tabulated posterior statistics for Snapp and Snapper analyses
(Note: these could go online or in supplementary data)