Robust model selection between population growth and multiple merger coalescents

06/18/2018 ∙ by Jere Koskela, et al. ∙ Ruhr-Universität Bochum, University of Warwick

We study the effect of biological confounders on the model selection problem between Kingman coalescents with population growth, and Xi-coalescents involving simultaneous multiple mergers. We use a low dimensional, computationally tractable summary statistic, dubbed the singleton-tail statistic, to carry out approximate likelihood ratio tests between these models. The singleton-tail statistic has been shown to distinguish between the two classes with high power in the simple setting of neutrally evolving, panmictic populations without recombination. We extend this work by showing that cryptic recombination and selection do not diminish the power of the test, but that misspecifying population structure does. Furthermore, we demonstrate that the singleton-tail statistic can also solve the more challenging model selection problem between multiple mergers due to selective sweeps, and multiple mergers due to high fecundity, with moderate power of up to 60%.

1 Introduction

The Kingman coalescent (Kingman, 1982a, b, c; Hudson, 1983a, b; Tajima, 1983) models ancestral relations of samples from large populations as random, binary trees, and is an important tool for predicting genetic diversity. A central assumption of the Kingman coalescent is low variance of family sizes, so that large populations always consist of many relatively small families. Violations of this assumption call for models with infinite variance family sizes, and lead to so-called $\Lambda$-coalescents, which allow more than two lineages to merge to a common ancestor simultaneously (Donnelly and Kurtz, 1999a; Pitman, 1999; Sagitov, 1999).

There is growing evidence that $\Lambda$-coalescents are an appropriate model for organisms with high fecundity coupled with a skewed offspring distribution (Beckenbach, 1994; Árnason, 2004; Eldon and Wakeley, 2006; Sargsyan and Wakeley, 2008; Hedgecock and Pudovkin, 2011; Birkner et al., 2011; Steinrücken et al., 2013; Tellier and Lemaire, 2014). Consequently, development of statistical techniques for distinguishing the Kingman coalescent from $\Lambda$-coalescents has also been an active area of research; see Eldon et al. (2015), Koskela (2018), and references therein. In particular, attention has focused on distinguishing $\Lambda$-coalescents from Kingman coalescents with population growth, because both classes of models predict an excess of singletons (mutations carried by only one individual in a sample of DNA sequences) relative to the standard Kingman coalescent under the infinitely many sites model of mutation (Watterson, 1975).

Koskela (2018) introduced a simple, two-dimensional summary statistic, referred to here as the singleton-tail statistic, which distinguishes between these model classes with high power even from a data set consisting of 500 samples from bi-parental, diploid organisms sequenced at around 10 unlinked chromosomes. The correct model could be selected with high power without knowing the population-rescaled mutation rate, provided it was not very low (see also Eldon et al., 2015, Supporting Information 12). In this paper we investigate the impact of other confounders on the prospect of discriminating between these models based on the singleton-tail statistic, again in the bi-parental, diploid setting. In particular, we will focus on each of

  1. weak natural selection modelled by the Ancestral Selection Graph (Krone and Neuhauser, 1997; Neuhauser and Krone, 1997; Donnelly and Kurtz, 1999b; Baake et al., 2016),

  2. crossover recombination within chromosomes modelled by the Ancestral Recombination Graph (Hudson, 1983a; Griffiths and Marjoram, 1997; Donnelly and Kurtz, 1999b; Birkner et al., 2013),

  3. population structure modelled by the structured coalescent (Herbots, 1997; Limic and Sturm, 2006; Eldon, 2009).

We will demonstrate that the presence or absence of the first two has minimal effect on the performance of the hypothesis test developed in Koskela (2018), while population structure is a significant confounder that must be correctly incorporated into the model.

There are four parental copies of each chromosome involved in each merger in the diploid, bi-parental setting, allowing for up to four simultaneous mergers. Hence the models considered in this paper are actually $\Xi$-coalescents (Schweinsberg, 2000; Möhle and Sagitov, 2001), which allow simultaneous multiple mergers, despite the fact that the population will be assumed to reproduce in a fashion consistent with the more restrictive $\Lambda$-coalescent permitting only one multiple merger at a time.

We also use the singleton-tail statistic to distinguish two classes of $\Xi$-coalescents: those arising from high fecundity reproduction, and those arising from selective sweeps (Durrett and Schweinsberg, 2005). This problem is more challenging than a null hypothesis consisting of Kingman coalescents with population growth, because the marginal coalescent process at each chromosome can be identical under the two hypotheses. However, high fecundity reproduction results in positively correlated coalescence times between unlinked chromosomes, whereas unlinked chromosomes are independent under the selective sweep model. The positive correlation results in increased sampling variance of the singleton-tail statistic, which yields tests with moderate statistical power of up to 60%.

The rest of the paper is organised as follows. In Section 2 we recall the singleton-tail statistic of Koskela (2018) as well as the associated hypothesis test for model selection. Section 3 presents a unified, diploid coalescent model incorporating high fecundity reproduction and population growth, as well as the three confounders of weak selection, crossover recombination, and discrete spatial structure. Models with only population growth or high fecundity reproduction, as well as any desired subset of confounders, can be recovered as special cases. Section 4 provides simulation studies on the effect of each of the three confounders on the sampling distribution of the singleton-tail statistic, as well as the associated hypothesis test. In Section 5 we introduce a different model in which rapid selective sweeps result in multiple mergers acting locally on the genome, and investigate whether the singleton-tail statistic can distinguish it from the $\Xi$-coalescent introduced in Section 3. Section 6 concludes with a discussion.

2 The singleton-tail statistic

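Suppose a sample of $n$ DNA sequences from a single chromosome is available, and that derived mutations can be distinguished from ancestral states. Let $i \in \{1, \ldots, n - 1\}$, and let $\xi_i^{(n)}$ be the number of sites at which a mutant allele appears $i$ times. Then

$$\xi^{(n)} := \left( \xi_1^{(n)}, \ldots, \xi_{n-1}^{(n)} \right)$$

is the unfolded site-frequency spectrum (SFS). If mutant and ancestral types cannot be distinguished, the folded spectrum $\eta^{(n)} := ( \eta_1^{(n)}, \ldots, \eta_{\lfloor n/2 \rfloor}^{(n)} )$ (Fu, 1995) is used instead, where

$$\eta_i^{(n)} := \frac{\xi_i^{(n)} + \xi_{n-i}^{(n)}}{1 + \delta_{i, n-i}},$$

and $\delta_{i,j} = 1$ if $i = j$, and is zero otherwise. Let $\zeta^{(n)}$ be the normalised unfolded SFS, whose entries are given by $\zeta_i^{(n)} := \xi_i^{(n)} / |\xi^{(n)}|$, where $|\xi^{(n)}| := \sum_{i=1}^{n-1} \xi_i^{(n)}$ is the total number of segregating sites, and with the convention that $\zeta^{(n)} = 0$ if there are no segregating sites.

Now, for any $k \in \{2, \ldots, n - 1\}$ define the lumped tail of the SFS as

$$\bar{\zeta}_k^{(n)} := \sum_{j=k}^{n-1} \zeta_j^{(n)},$$

and consider the summary statistic $( \zeta_1^{(n)}, \bar{\zeta}_k^{(n)} )$ for some fixed $k$. Data from multiple chromosomes is incorporated by averaging: if $L$ unlinked chromosomes are available, then the singleton-tail statistic is

$$\frac{1}{L} \sum_{\ell=1}^{L} \left( \zeta_1^{(n)}(\ell), \bar{\zeta}_k^{(n)}(\ell) \right),$$

where $( \zeta_1^{(n)}(\ell), \bar{\zeta}_k^{(n)}(\ell) )$ denotes the singleton class and lumped tail computed from the $\ell$th chromosome.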
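The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and each unfolded SFS is assumed to be stored as a list whose entry at position i - 1 counts the sites at which the mutant allele appears i times.

```python
from typing import List, Sequence, Tuple


def normalised_sfs(sfs: Sequence[int]) -> List[float]:
    """Divide each SFS entry by the number of segregating sites.

    By convention the result is all zeros when there are no segregating sites.
    """
    total = sum(sfs)
    if total == 0:
        return [0.0] * len(sfs)
    return [count / total for count in sfs]


def singleton_tail(sfs: Sequence[int], k: int) -> Tuple[float, float]:
    """Return (singleton class, lumped tail from class k upwards) for one chromosome."""
    zeta = normalised_sfs(sfs)
    singletons = zeta[0]
    tail = sum(zeta[k - 1:])  # classes k, k + 1, ..., n - 1
    return singletons, tail


def averaged_singleton_tail(
    sfs_per_chromosome: Sequence[Sequence[int]], k: int
) -> Tuple[float, float]:
    """Average the per-chromosome pairs over all unlinked chromosomes."""
    pairs = [singleton_tail(sfs, k) for sfs in sfs_per_chromosome]
    num_chromosomes = len(pairs)
    mean_singletons = sum(p[0] for p in pairs) / num_chromosomes
    mean_tail = sum(p[1] for p in pairs) / num_chromosomes
    return mean_singletons, mean_tail
```

For a sample of five sequences with SFS `[4, 2, 1, 1]` and cutoff 3, half of the segregating sites are singletons and a quarter fall in the lumped tail.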

For two classes of models and

, the likelihood ratio test statistic is

where denotes the sampling distribution of the singleton-tail statistic under coalescent . A corresponding hypothesis test of size given an observed value of the singleton-tail statistic is

(1)

where corresponds to rejecting the null hypothesis , and

is the quantile

The sampling distribution and the quantile are both intractable, but can easily be approximated by simulation due to the low dimensionality of the singleton-tail statistic to obtain an implementable hypothesis test with approximate size Koskela (2018). In particular, we consider the hypotheses
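One way to write the test out explicitly is sketched below. The symbols $\Theta_0$, $\Theta_1$, $P^{\Pi}$, $s$, $S$, $\varrho$, $\Phi$, $\omega$ and $\rho_\omega$ are notation assumed for this sketch, and the quantile is written in one common convention, which may differ in detail from the definition used in the source.

```latex
% Likelihood ratio of the alternative class \Theta_1 against the null class \Theta_0,
% evaluated at an observed value s of the singleton-tail statistic:
\varrho(s) := \frac{\sup_{\Pi \in \Theta_1} P^{\Pi}(s)}{\sup_{\Pi \in \Theta_0} P^{\Pi}(s)}.

% Test of size \omega: \Phi(s) = 1 means the null hypothesis is rejected.
\Phi(s) := \mathbb{1}\{ \varrho(s) > \rho_\omega \}.

% Critical value: the smallest threshold whose worst-case rejection
% probability over the null class does not exceed \omega
% (S denotes a random draw of the statistic under \Pi).
\rho_\omega := \inf\Big\{ \rho \geq 0 : \sup_{\Pi \in \Theta_0} P^{\Pi}\big( \varrho(S) > \rho \big) \leq \omega \Big\}.
```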

Kingman coalescent with exponential growth at population-rescaled rate
(2)

In brief, data is simulated under both the null and the alternative hypothesis, and kernel density estimates (KDEs) of the intractable sampling distributions are obtained for each model in either class. These KDEs, along with more simulated data, can be used to accurately approximate the intractable quantile, yielding an implementable hypothesis test. Our KDEs were obtained using the kde function in the ks package (version 1.10.4) in R under default settings. In particular, this method uses truncated Gaussian kernels, and determines bandwidths using the SAMSE estimator (Duong and Hazelton, 2003, equation (6)).
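The KDE step can be illustrated with a deliberately simplified Python sketch. It is not the ks procedure described above: it swaps in an untruncated product of Gaussian kernels with a user-supplied bandwidth in place of truncated kernels with SAMSE bandwidth selection, and all function names are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]
Density = Callable[[Point], float]


def gaussian_kde_2d(data: Sequence[Point], bandwidth: Tuple[float, float]) -> Density:
    """Build a two-dimensional KDE from a product of Gaussian kernels.

    Assumption: a fixed, diagonal bandwidth (hx, hy) supplied by the caller.
    """
    hx, hy = bandwidth
    norm = 1.0 / (len(data) * 2.0 * math.pi * hx * hy)

    def density(point: Point) -> float:
        x, y = point
        total = 0.0
        for xi, yi in data:
            total += math.exp(-0.5 * (((x - xi) / hx) ** 2 + ((y - yi) / hy) ** 2))
        return norm * total

    return density


def likelihood_ratio(
    observed: Point, null_kdes: Sequence[Density], alt_kdes: Sequence[Density]
) -> float:
    """Approximate the ratio of the largest alternative density to the largest null density."""
    numerator = max(kde(observed) for kde in alt_kdes)
    denominator = max(kde(observed) for kde in null_kdes)
    if denominator == 0.0:
        # Both densities underflowed: treat the data as uninformative.
        return math.inf if numerator > 0.0 else 1.0
    return numerator / denominator
```

One KDE would be fitted per model in each class from calibration simulations, and the maximum over each list plays the role of the supremum over that class.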

Remark 1.

The null hypothesis in Koskela (2018) was broader and included algebraic population growth, in addition to exponential. However, results in Koskela (2018) showed that the two growth models resulted in very similar sampling distributions for the singleton-tail statistic, and hence we focus on the exponential growth model.

Simulating data in order to approximate the test (1) requires specification of the cutoff for the lumped tail of the singleton-tail statistic, as well as of the mutation rate . Sensitivity analyses conducted in Koskela (2018) showed that the test was highly insensitive to the choice of provided , as well as to misspecification of the mutation rate by up to a factor of ten. We fix throughout, and use the known, true mutation rate in our simulation studies. For biological data sets with an unknown mutation rate, the analysis in Koskela (2018) demonstrated that it is sufficient to use the generalised Watterson estimator,

where is the expected branch length from leaves under coalescent .
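As a point of reference, the sketch below implements the classical Watterson estimator for the standard Kingman coalescent, together with a generic version that divides the number of segregating sites by an expected total branch length supplied by the caller. The function names are hypothetical, and any factor of two relating the per-branch-length mutation rate to the population-rescaled rate depends on a convention that is not specified here.

```python
def harmonic_number(m: int) -> float:
    """Return 1 + 1/2 + ... + 1/m."""
    return sum(1.0 / i for i in range(1, m + 1))


def watterson_estimator(segregating_sites: int, n: int) -> float:
    """Classical Watterson estimator: S divided by the (n - 1)th harmonic number."""
    return segregating_sites / harmonic_number(n - 1)


def generalised_watterson(segregating_sites: int, expected_branch_length: float) -> float:
    """Estimate the mutation rate per unit of branch length.

    Assumption: expected_branch_length is the expected total tree length
    under the coalescent of interest, obtained analytically or by simulation.
    """
    return segregating_sites / expected_branch_length
```

For the standard Kingman coalescent, with time measured so that each pair of lineages merges at rate one, the expected total tree length is twice the (n - 1)th harmonic number, so the generic estimate is half the classical one.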

3 An umbrella model

In this section we describe a general class of models incorporating diploidy, bi-parental, high fecundity reproduction, population growth, weak natural selection, population structure in a discrete geography, and crossover recombination. This generalises both the Ancestral Influence Graph (Donnelly and Kurtz, 1999b), as well as time-inhomogeneous multiple merger coalescents (Möhle, 2002; Matuszewski et al., 2018). Models with any subset of the above forces can be recovered as special cases.

Consider a geography of demes, with the population size on deme at time given by , where is a scaling parameter. We also define the total population size

and the shorthand .

Each individual carries a diploid genome consisting of pairs of unlinked chromosomes. Each chromosome carries one of alleles, identified with , which are acted upon by natural selection. In addition, each chromosome is identified with the unit interval , on which neutral mutations and crossover recombination take place. For definiteness, we assume that the selective allele is fully linked to the left end of the neutral interval.

The populations evolve in discrete time with non-overlapping generations. At each time , the individuals in each deme form pairs uniformly at random. The pairs are ordered in a fixed but arbitrary way, and pair on deme has a random number of offspring denoted by . The two summands will be associated with neutral reproduction and natural selection, respectively. As such, the distribution of will depend on the alleles of the two parents, though this dependence is suppressed for legibility. Likewise, we will frequently suppress the time-dependence in the family sizes, and write and . For future convenience, we define as the random number of selective offspring that pair on deme at time would have had if they carried the fittest possible combination of alleles.

The neutral offspring vectors are assumed to be exchangeable, and independent across demes as well as time steps. The selective offspring vectors are independent across demes and time steps. In addition, both vectors on each deme are assumed to satisfy the almost sure constraint

Each offspring inherits one copy of each of its chromosome pairs from each of its parents. Each inherited chromosome is a mosaic of the two chromosomes carried by the parent, with the number of recombination breakpoints having the Poisson distribution with parameter , and each break point being uniformly distributed along the chromosome. All of the Poisson and uniform random variables are independent of each other, as well as of the wider reproduction mechanism. Each locus inherits its allele from the parental chromosome assigned to its leftmost segment, with selective mutations happening independently at random with probability . Mutant types are drawn from a stochastic matrix , where is the probability of a mutant locus having allele given its parent had allele .
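The mosaic inheritance step can be sketched as follows. This is an illustration under stated assumptions, not the simulation code used in the paper: the function names are hypothetical, the recombination parameter is passed in as `rate`, and the breakpoints are generated as the points of a Poisson process on the unit interval, which gives a Poisson number of breakpoints with independent uniform positions.

```python
import random
from typing import List, Sequence


def sample_breakpoints(rate: float, rng: random.Random) -> List[float]:
    """Sample recombination breakpoints on the unit interval, in increasing order.

    Summing exponential gaps until the total exceeds 1 yields a Poisson(rate)
    number of points whose positions are uniform on [0, 1].
    """
    points: List[float] = []
    if rate <= 0.0:
        return points
    position = rng.expovariate(rate)
    while position < 1.0:
        points.append(position)
        position += rng.expovariate(rate)
    return points


def parental_copy_at(locus: float, breakpoints: Sequence[float], first_copy: int) -> int:
    """Return which parental chromosome copy (0 or 1) is inherited at a locus.

    The copy switches at every breakpoint to the left of the locus;
    first_copy is the copy assigned to the leftmost segment.
    """
    crossings = sum(1 for b in breakpoints if b <= locus)
    return (first_copy + crossings) % 2
```

Under the convention in the text, the selective allele sits at the left end of the interval, so it always travels with `first_copy`.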

After the reproduction step is complete, a deterministic fraction of children chosen uniformly at random from deme migrate to deme , for each pair of demes. These migration fractions are assumed to satisfy

(3)

for each , so that the population sizes of demes remain unchanged by migration. For notational brevity we set .

We now reverse the direction of time, so that time corresponds to generations in the past in the model specified above. For and let denote the falling factorial, and define

(4)

as the probability that two chromosomes sampled uniformly at random from deme (resp. the whole population) at time (resp. ) were born to a common family in the previous generation, made the same choice from two available parents, and also the same choice of chromosome within that parent. In other words, is the probability of two time chromosomes on island merging to a common ancestor in one generation, while is the same probability for two chromosomes sampled uniformly from the whole population at time .

We make the following assumptions for each , and each , as , where each is a probability measure on , and each is a positive function bounded away from 0, with and , and and are constant independent of , , and :

(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
Remark 2.

It is well known that if , the Dirac delta-measure at 0, in (10), then the assumption is equivalent to

for each and , where the second representation follows from (8) and

(16)

itself a consequence of (7) and (8). See (Möhle and Sagitov, 2003, Section 5) for details. This assumption disallows multiple mergers in the limiting ancestry, which will thus only consist of isolated binary mergers. Any other choice of will yield an ancestry with up to four simultaneous multiple mergers at each chromosome, corresponding to the four possible parental chromosomes involved in the forwards-in-time reproduction event, and thus produce ancestries described by a -coalescent. See (Möhle and Sagitov, 2003, Section 6) for details of -coalescents arising out of diploid reproduction in this way.


Remark 3.

Before showing that (5) – (15) lead to the desired ancestral process, some intuition behind the role of each assumption is in order. (5) yields a limit process evolving in continuous time. Assumptions (6) – (8) ensure that the population sizes and time scales on demes are comparable. The conditions on the relative population sizes are sufficient to ensure finite waiting times between merger and migration events, and could be relaxed in specific examples. For exponential population growth, they hold as long as the growth rates on all demes coincide. For models in the domain of attraction of Kingman’s coalescent, (7) will typically hold with , while e.g. the -coalescents of Schweinsberg (2003) have (c.f. (4) and (Schweinsberg, 2003, Lemma 13)). The condition ensures that (5) and (7) can hold simultaneously. Conditions (9) and (10) are well known to be necessary and sufficient for a -coalescent limit, resulting in no more than four simultaneous multiple mergers in the diploid, biparental setting. (11) – (14) ensure that mutation, recombination, migration, and selection all take place on the coalescent time scale, while (15) disallows multiple selective branching events, as well as simultaneous selective and neutral merger events.

The aim is to show that the ancestry of a sample from the above particle system converges to a structured, time-inhomogeneous $\Xi$-Ancestral Influence Graph (Donnelly and Kurtz, 1999b) as , when time is measured in units of . To establish this fact, we identify the limiting rates of coalescence, mutation, recombination, migration and branching due to selection, and show that these are the only dynamics which affect the ancestry of the process. Specifically, we show that simultaneous mergers of sizes , with at time happen on deme at rate

events in which one lineage branches into lineages occur at rate , branching into two lineages due to crossover recombination happens at rate , mutations occur at rate , and that migration to deme happens at rate , where is the number of lineages on deme . Between migration events, the ancestries of subpopulations on different demes evolve independently. Convergence will then follow from a straightforward analogue of (Möhle and Sagitov, 2003, Theorem 4.2). Throughout, we assume that our sample consists of lineages on deme , and that each lineage carries ancestral material on only one chromosome. This assumption is justified later by verifying that a separation of timescales phenomenon (Möhle, 1998) takes place, establishing that distinct chromosomes disperse to separate active lineages instantaneously on the coalescent time scale.

Multiple mergers via a single large family

By the Kingman formula for exchangeable, diploid offspring distributions (Möhle and Sagitov, 2003, equation (9)), the probability of chromosomes merging by belonging to the same family in the previous time step, and picking the same parental chromosome out of the four possibilities, is

Analogously to (Möhle and Sagitov, 2003, equations (28) and (29)), conditions (9) and (10) imply that

where the last step follows from (16). The rate of a particular combination of simultaneous mergers with sizes for is obtained by summing over all ways in which such a merger can happen, resulting in the overall rate

(17)

where (Birkner et al., 2013, equation (27)).

Multiple mergers via two or more large families

By (9), the probability of mergers via two or more large families, i.e. families with at least two offspring in the sample, is bounded from above by

A single migration event

The event that all lineages belong to different families in the previous generation, and that one individual migrates from deme to in reverse time, has asymptotic probability

by (3), (8), and (13).

A similar calculation demonstrates that the analogous probability for more than one simultaneous migration event is , while combining the above with the first two calculations demonstrates that a single migration occurring simultaneously with one or more large families is also an event.

A single mutation event

An analogous calculation to the migration case using (11) shows that the probability of one site mutating in the previous time step with no other accompanying events converges to .

A single recombination event

Likewise, an analogous calculation to the migration case using (12) shows that the probability of one chromosome recombining in the previous time step with no other accompanying events converges to .

A single branching event due to a selective birth

The probability of a lineage belonging to a selective birth by a family in the previous generation depends on the fitness of its parents, which is unknown. An elegant solution is to add selective events at the greatest possible rate, add the chromosomes belonging to the two potential parents into the sample along with retaining the child lineage whenever a selection event happens, and track this extended sample to its ultimate ancestor: the most recent common ancestor of the original sample, as well as all potential selective parents encountered along the way (Krone and Neuhauser, 1997; Neuhauser and Krone, 1997). The type of the ultimate ancestor can then be sampled from the stationary distribution of (or any other desired initial law), with mutations occurring along lineages and alleles propagated to children as before. Now the alleles, and thus the fitness of the selective parents are known at each potential selective event, and the true ancestry of each child lineage can be assigned to either a randomly chosen parent with probability , or to remain with the ongoing child lineage with the complementary probability.

From the point of view of the ancestral process, such selective branching events in which one lineage on deme branches into lineages (corresponding to the single-chromosome child lineage, as well as the parental chromosomes which immediately disperse into separate lineages due to separation of time scales) happens with asymptotic probability

A binomial expansion followed by (14) and (15) yield

as required.

Multiple simultaneous branching events

Multiple simultaneous selective events can take place in one of three ways: two (or more) simultaneous selective births in the same family, two (or more) simultaneous selective births in a combination of families, or a combination of selective and neutral births in the same family. The probability of all three kinds of events is bounded above by

by a binomial expansion and (15).

Dispersal of chromosomes into distinct, single-marked individuals

Finally, we abandon the assumption that all lineages carry ancestral material on only one chromosome in order to verify the separation of time scales phenomenon. The probability that individuals with ancestral material on both chromosomes in a pair (so-called double-marked individuals) disperse into parents, each of whom carries ancestral material on only one copy of the chromosome (so-called single-marked individuals), in the previous generation is . To see why, note that every individual is replaced at every time step, and individuals always inherit one chromosome from each parent. Thus, complete dispersal of double-marked individuals happens in one generation provided that all individuals originate from different families, which has probability at least

Likewise, the probability of two active chromosomes splitting apart into distinct ancestors is because the assignment of parents to chromosomes is done independently and uniformly at random. Hence, a lineage with ancestral chromosomes disperses into lineages with a single ancestral chromosome each in at most generations with probability at least

Probabilities of merger, recombination, selection or migration events were all established above to be , and thus the probability of complete dispersal before any merger, recombination, selection or migration events is of order

where is a constant independent of both and . Thus an analogue of the separation of timescales result in (Möhle, 1998) holds, which justifies considering only single-marked configurations in the previous computations of transition probabilities.

4 Robustness results

The following three subsections quantify the respective effect of selection, recombination, and population structure on the sampling distribution of the singleton-tail statistic. Each subsection specialises the model of Section 3 to consist of only the relevant force by a particular choice of parameters. We assume the model of Schweinsberg (2003) for the evolution of the population, and thus consider a one-dimensional family of coalescents specified by in (10) for , with corresponding time scaling and in (7). Under the alternative hypothesis , the population sizes on demes will be constant, i.e.  for relative deme sizes . Under the null hypothesis , populations on demes will undergo exponential growth forwards in time, corresponding to , resulting in for the population-rescaled growth rate .
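For orientation, the sketch below computes merger rates for the haploid Beta-coalescent that is commonly associated with the model of Schweinsberg (2003). Several things here are assumptions rather than statements from the text: the measure is taken to be Beta(2 - alpha, alpha) with alpha strictly between 1 and 2, the symbol alpha and the function names are introduced for illustration, and the rates are those of the haploid $\Lambda$-coalescent, not the diploid rates in (17).

```python
import math


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_coalescent_rate(b: int, k: int, alpha: float) -> float:
    """Rate at which a fixed group of k out of b lineages merges.

    Computes B(k - alpha, b - k + alpha) / B(2 - alpha, alpha), i.e. the
    integral of x^(k - 2) (1 - x)^(b - k) against the Beta(2 - alpha, alpha) law.
    """
    if not 2 <= k <= b:
        raise ValueError("need 2 <= k <= b")
    if not 1.0 < alpha < 2.0:
        raise ValueError("need 1 < alpha < 2")
    return math.exp(log_beta(k - alpha, b - k + alpha) - log_beta(2.0 - alpha, alpha))


def total_merger_rate(b: int, alpha: float) -> float:
    """Total rate of merger events when b lineages are present."""
    return sum(math.comb(b, k) * beta_coalescent_rate(b, k, alpha) for k in range(2, b + 1))
```

With two lineages the only possible event is a pairwise merger, which happens at rate one in this normalisation; values of alpha close to 2 approach Kingman-like behaviour, while values close to 1 give frequent large mergers.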

It will also be necessary to distinguish between two kinds of data sets: simulated data used to fit KDEs to approximate likelihoods, and compute the quantile in (1), as well as observed data, which will also be simulated in this instance, but which will typically be a biological data set. We will refer to the former as calibration data, and the latter as pseudo-observed data. Pseudo-observed data is reserved solely for plugging into KDE approximations of likelihoods (computed from calibration data) to obtain likelihood ratio test statistics. A C++ implementation of the algorithm used to generate the data in this section is available at https://github.com/JereKoskela/Beta-Xi-Sim.

We set the number of simulation replicates per model at 1000 (note that contains 133 models, and a further 41), the sample size at , the lumped tail cutoff at , and assume the true mutation rate is known. The number of unlinked chromosomes per sample is set to 23 to match the number of linkage groups in Atlantic cod (Tørresen et al., 2017, Supplementary Table 3), an organism for which multiple mergers have frequently been suggested as an important evolutionary mechanism (Steinrücken et al., 2013; Tellier and Lemaire, 2014). Results are averaged across chromosomes as outlined in Section 2. The lengths of the 23 chromosomes have also been set (by multiplying the total rate of mutation on each chromosome by the number of sites it contains) to match those reported in (Tørresen et al., 2017, Supplementary Table 3). The approximate size of hypothesis tests is set at throughout.
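Given likelihood ratio statistics computed from calibration and pseudo-observed data, the critical value and the empirical size and power can be estimated along the following lines. This is a simplified sketch with hypothetical function names: it treats the null calibration statistics as a single pooled sample, whereas a composite null with many models would require taking the worst case over those models.

```python
import math
from typing import Sequence


def upper_quantile(null_stats: Sequence[float], omega: float) -> float:
    """Smallest observed value exceeded by at most a fraction omega of null statistics."""
    if not 0.0 <= omega < 1.0:
        raise ValueError("need 0 <= omega < 1")
    ordered = sorted(null_stats)
    allowed = math.floor(omega * len(ordered))  # number of values allowed above the threshold
    return ordered[len(ordered) - allowed - 1]


def rejection_rate(stats: Sequence[float], threshold: float) -> float:
    """Fraction of statistics that exceed the threshold, i.e. lead to rejection."""
    return sum(1 for s in stats if s > threshold) / len(stats)
```

Applying `rejection_rate` to pseudo-observed statistics simulated under the null gives an empirical size, while applying it to statistics simulated under the alternative gives an empirical power.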

4.1 Weak selection

In this section we consider the model of Section 3 with a single deme (), and no recombination (). The resulting process is a $\Xi$-coalescent analogue of the Complex Selection Graph (CSG; Fearnhead, 2003). We compute realisations of the singleton-tail statistic by assuming that neutral, infinitely-many-sites mutations occur on each chromosome along the branches of the realised non-neutral tree sampled from the $\Xi$-CSG, but that the selective types of individuals are unobserved. This assumption is reasonable if the fitness of individuals cannot be observed, or if mutations with a fitness effect are either much less frequent than neutral mutations, or occur in unobserved regions.

Figure 1 shows the sampling distributions of the neutral and non-neutral models. The fitness model assumes two alleles, a and A, with each chromosome pair contributing fitness if either parent carries at least one A allele at that pair, and 0 otherwise. The selection rates are necessarily low, because the cost of simulating the ASGs is known to increase exponentially in (Fearnhead, 2001, Appendix A). Efficiency gains resulting from perfect simulation techniques (Fearnhead, 2001, 2003) cannot be employed because they rely on terminating the simulation before reaching the MRCA, and thus the SFS cannot be resolved.

Figure 1: percentiles of KDEs fitted to 1000 realisations of the singleton-tail statistic for each model in and . Each sample consists of 23 chromosomes, and (Left) , or (Right) per chromosome pair as varies from 1 to 2.

The results in Figure 1 show striking agreement between sampling distributions in the neutral and selective cases. We also conducted the hypothesis test (1) using calibration data simulated from a neutral model, and applied the resulting misspecified test to pseudo-observed data simulated from a model with weak selection. Figure 2 shows that the performance of the test was excellent, with high power and size well below the formal threshold of for the majority of the parameter ranges.

Figure 2: Empirical size (Left) and power (Right) of a vs test conducted using calibration data simulated under neutral models, but applied to pseudo-observed data simulated from selective models. The simulation parameters are as in Figure 1.

To investigate the effect of a larger selection coefficient, we also simulated realisations of the singleton-tail statistic under a single chromosome model. In this setting, each selective branching event results in five lineages, as opposed to as in the 23 chromosome case. The sampling variance under a single locus is too large for a powerful statistical test, but Figure 3 demonstrates that the sampling distributions with and without selection remain very similar. Taken together, these simulations show that the distribution of realised relative branch lengths under the CSG is similar to that under a neutral coalescent, at least for external branches, as well as for the oldest branches before the MRCA is reached. Hence, the singleton-tail statistic cannot be used to detect weak selection, but can discriminate between population growth and $\Xi$-coalescents without knowing whether weak selection is taking place.

Figure 3: percentiles of KDEs fitted to 1000 realisations of the singleton-tail statistic for each model in and . Each sample consists of one chromosome, and (Left) , or (Right) as varies from 1 to 2.

4.2 Recombination

In this section we consider the model of Section 3 with a single deme (), and no selection (). Realisations of the singleton-tail statistic are computed by assuming a neutral, infinitely-many-sites mutation model along the branches of the realised $\Xi$-Ancestral Recombination Graph.

Figure 4 presents a comparison between models with and without recombination. As was the case with weak selection (see Figure 1), the presence of recombination makes no discernible difference to the sampling distribution of the singleton-tail statistic, although the distribution of intermediate SFS entries was observed to be different (results not shown). Figure 5 demonstrates that the size and power of statistical tests are unaffected when a misspecified model which wrongly neglects recombination is used to generate calibration data, and the hypothesis test is conducted on pseudo-observed data with recombination.

Figure 4: percentiles of KDEs fitted to 1000 realisations of the singleton-tail statistic for each model in and . (Left) . (Right) as varies from 1 to 2.
Figure 5: Empirical size (Left) and power (Right) of a vs test conducted using calibration data from models without recombination, but applied to pseudo-observed data from models with recombination. The simulation parameters are as in Figure 4.

4.3 Population structure

In this section we consider the model of Section 3 with no selection (), no recombination (), and two different patterns of population structure.

Figure 6 shows sampling distributions corresponding to a four deme model with symmetric migration between all pairs of demes, as well as a two deme model with asymmetric migration. The contours differ markedly from the panmictic results in Figures 1 and 4, and also from each other. Figure 7 demonstrates that misspecifying spatial structure results in very poor performance of the hypothesis test, with both the size and power curves showing complex behaviour that depends on the patterns of overlap between the distribution of the misspecified calibration data, and the pseudo-observed data.

Figure 6: percentiles of KDEs fitted to 1000 realisations of the singleton-tail statistic for each model in and . (Left) Four demes with equal population sizes and symmetric migration between all pairs of demes at reverse-time rate as varies from 1 to 2. (Right) Two demes with relative population sizes , and reverse-time migration rates ranging from when to when .
Figure 7: (Top Row) Empirical size (Left) and power (Right) of a test of vs conducted using calibration data from panmictic models, but applied to pseudo-observed data from four deme models. (Bottom Row) Empirical size (Left) and power (Right) of a test of vs conducted using calibration data from four deme models, but applied to pseudo-observed data from two deme models. The simulation parameters in both cases are as in Figure 6.

5 Distinguishing high fecundity from selective sweeps

This section focuses on distinguishing multiple mergers due to selective sweeps from multiple mergers due to high fecundity. The high fecundity model is the $\Xi$-coalescent introduced in Section 3 with a single deme (), no recombination , no selection , and a constant population size . For selective sweeps, we assume a population of constant size evolving in non-overlapping generations, in which mutations providing a selective advantage occur at points of a Poisson process with rate , and sweep to fixation instantaneously on the coalescent time scale.

We also assume that recombination within chromosomes results in incomplete sweeps, so that when viewed backwards in time each individual has a random chance to participate in the merger resulting from each sweep. Recombination is specified implicitly by setting the probability of a lineage participating in a sweep arising from a mutation with advantage to . Genetic material that is unlinked to the beneficial mutation escapes the selective sweep, and thus multiple mergers affect one chromosome at a time. Neutral mutations continue to accrue along ancestral branches according to the infinite sites model with mutation rate . When the population is diploid and biparental, these dynamics result in an ancestral process in which the marginal coalescent at each locus is the $\Xi$-coalescent with merger rates given by (17).

Remark 4.

The model described above has not been derived as a scaling limit of a finite population model of evolution. Instead, it has been chosen to make the task of distinguishing between selective sweeps and high fecundity as difficult as possible. For the same reason, we also scale the mutation rate as as in the model of Schweinsberg (2003). For biological motivation, note that this model closely resembles the -coalescent of Durrett and Schweinsberg (2005), which was derived as a scaling limit of finite population models undergoing selective sweeps and recombination in much the same way as above. However, their convergence result can only be used to obtain -coalescents in which has an atom at 0, and hence it cannot be immediately used to obtain our model (Durrett and Schweinsberg, 2005, Example 2.5). A model akin to ours could be obtained as a similar scaling limit by letting selective sweeps occur more frequently than the time scale of pairwise coalescence, thus causing the atom at 0 to vanish in the large population limit (Gillespie, 2000; Durrett and Schweinsberg, 2005).

We fix our null hypothesis as the class of selective sweep models described above with the parameter discretised as in (2). The alternative hypothesis is the high fecundity $\Xi$-coalescent described at the beginning of this section. The only difference between the two model classes is whether coalescence times at unlinked chromosomes are independent (under the selective sweep model), or positively correlated (under the high fecundity model). The marginal coalescents at each chromosome coincide.
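The reason correlation between chromosomes matters for an averaged statistic can be seen from a simple calculation. The equicorrelated setting below is an assumption made purely for illustration, with $X_\ell$ denoting one coordinate of the per-chromosome statistic, $\sigma^2$ its variance, $r$ the common pairwise correlation, and $L$ the number of chromosomes; none of these symbols are taken from the source.

```latex
% Variance of an average of L identically distributed terms with
% common variance \sigma^2 and common pairwise correlation r:
\operatorname{Var}\left( \frac{1}{L} \sum_{\ell=1}^{L} X_\ell \right)
  = \frac{1}{L^2} \left( L \sigma^2 + L (L - 1) r \sigma^2 \right)
  = \frac{\sigma^2}{L} \left( 1 + (L - 1) r \right).
```

With independent chromosomes ($r = 0$) the variance decays like $\sigma^2 / L$, whereas any positive correlation leaves a term of order $r \sigma^2$ that does not vanish as $L$ grows. Identical marginals with different dependence across chromosomes therefore lead to different sampling distributions for the averaged statistic.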

Figure 8: (Top Left) percentile of a kernel density estimator fitted to 1000 realisations of the singleton-tail statistic under each model in