The zig-zag process is a piecewise deterministic Markov process introduced in [BR17, BFR19b] as a continuous-time MCMC method. It has several advantages in comparison with reversible methods such as Metropolis-Hastings [Has70] and Gibbs sampling [GS90]. It is non-reversible, and thus avoids the backtracking behavior known to inhibit the mixing of reversible methods. It is also rejection-free, so that no computation is wasted in proposing rejected moves.
In brief, a zig-zag process with velocity targeting the
-marginal probability densitydefined on has generator
where is the -partial derivative, and flips the sign of . The flip rates
ensure that is the -marginal of the unique stationary distribution of [BFR19b, Proposition 2.3]. In words, the coordinates of move at constant velocities until a flip at coordinate , where the velocity of the th coordinate changes sign.
To date, the zig-zag process has only been applied to relatively simple targets, such as the Curie-Weiss model [BR17]
, and logistic regression[BFR19b]. Both state spaces have simple geometric structures with obvious notions of direction and velocity. We construct a zig-zag process for the coalescent [Kin82]: a tree-valued target consisting of continuous branch lengths and discrete tree topologies with no canonical geometric structure. This will be done by embedding both branch lengths and tree topologies into a single continuous space, following similar strategies in Hamiltonian Monte Carlo (HMC) [DBZM17, NDL20].
MCMC algorithms for coalescents suffer from poor mixing, largely due to the difficulty of designing proposal distributions which perturb trees in a way that combines efficient exploration with a high acceptance rate [MV05, HDPD08, LVH08]. Workarounds consist of empirical searches for efficient proposals [HD12, ASR16], or Metropolis-coupled MCMC [Gey92]. The former struggles to scale to problems for which empirical optimization is infeasible. The latter helps mixing between modes, but does not overcome low acceptance probabilities or the backtracking random-walk behavior of reversible MCMC.
HMC [Nea10] is a variant of Metropolis-Hastings that shares some similarities with the zig-zag process. It augments the state space with momenta, and uses Hamiltonian dynamics to propose large steps which are accepted with high probability, although they are not rejection-free. Like the zig-zag process, HMC requires gradient information and a suitable geometric embedding of the target. These were provided by [DBZM17] for the coalescent and the finite sites model of mutation [JC69] using an orthant complex construction of phylogenetic tree space [BHV01]. This work improves on the method of [DBZM17] in three ways:
The zig-zag process is readily implementable on -space via Poisson thinning, without e.g. numerical integrators such as leap-prog [DBZM17, Algorithm 1].
The zig-zag process has simple boundary behavior between orthants without boundary smoothing [DBZM17, Section 3.3], chiefly because discontinuous gradients are easier to handle in continuous than discrete time,
and demonstrates that the zig-zag process mixes over the space of latent trees efficiently. Our techniques are not specific to the coalescent, and can also be adapted to other models.
The rest of the paper is structured as follows. Section 2 recapitulates the coalescent and -space: the state space of the zig-zag process once suitably augmented with velocities. Sections 3 and 4 recall the popular infinite and finite sites models of mutation, show derive zig-zag processes for each, and demonstrate its performance via simulation. Section 5 concludes with a discussion. The algorithms and data sets used to conduct the simulation studies in Sections 3 and 4 are available at https://github.com/JereKoskela/tree-zig-zag.
2 The coalescent and a geometric embedding
An ultrametric binary tree with labeled leaves is a rooted, binary tree in which every leaf is equidistant from the root. We follow [GD16], and encode such trees leaf labels as the pair , where is a ranked topology and . The distance from the leaves to the first merger event is encoded by , and subsequent s are separations between successive mergers. The ranked topology specifies which nodes merge in each event. It is an -tuple of label pairs, where the th pair specifies the two child nodes of the th merger. Non-leaf nodes are labeled by the leaves they subtend. For example, the ranked topology
encodes the four leaf caterpillar tree with leaves merging in order of labels: first 1 and 2, then 3, and then 4. We order the entries of by their least element.
The coalescent [Kin82] is a seminal model for the genetic ancestry of samples from large populations. Under the coalescent, a tree has probability density
which arises as the law of a tree constructed by starting a lineage from each leaf, and merging each pair at rate 1 until the most recent common ancestor (MRCA) is reached. Its popularity is due to robustness: distributions of ancestries from a large class of individual-based models converge to the coalescent in the infinite population limit provided time is suitably rescaled. For more details, see e.g. [Wak09].
We define a family of swap operators for ranked topologies such that is not an element of . The action of is
In words, exchanges the order of the th and th mergers. We also define families of pivot operators and for such that is one of the entries of . For such an , we define as
where (resp. ) is the entry of with the higher (resp. lower) least element, and is the sibling: the entry of that is not . Pivot is defined similarly, with
The pivots correspond to the two nearest neighbor interchanges between the th merger, and the mergers giving rise to its child nodes.
We can now describe -space, which gives a geometric structure to the set of pairs . For fixed , the space of
-vectors is the orthant. Each boundary point at which one corresponds to a degenerate tree in which one of three things takes place:
The two leaves of merge at time 0 if .
There are two simultaneous mergers if is not an element of .
There is a triple merger if is an element of .
Type 1 boundaries are boundary points of -space. At a type 2 boundary, we glue together the orthants corresponding to the two distinct ranked topologies separated by an -step. Continuous trajectories crossing this boundary move between the ranked topologies. At type 3 boundaries, we glue together the three orthants which resolve the triple merger into two binary mergers. They all differ by a or -step, and as with boundaries of type 2, any continuous trajectory that crosses the boundary will visit the intermediate tree containing the triple merger, and can progress into any of the three orthants.
This embedding enables us to construct the zig-zag process with paths driven by velocities , corresponding to the holding times . The process explores the full set of ranked topologies via the boundary crossings described above. For more details and properties of -space, such as existence and uniqueness of geodesics and Fréchet means, we refer to [GD16].
3 The infinite sites model
The infinite sites model [Wat75] associates the type of the MRCA with the unit interval. Mutations occur as points of a Poisson process with rate along the branches of the coalescent tree, and each mutation is labeled with a -distributed location. The type of a leaf consist of all the mutations along the branches separating it from the MRCA. Because all mutations are visible in the leaves and each mutation can only arise once, an observed pattern of mutations constrains the coalescent tree to ranked topologies that are consistent with the mutations. We will add the mutation rate into the state space by introducing the velocity , the flip rate , as well as a prior density on .
To track mutations, it will be convenient to define as the rooted graphical tree with nodes, the first of which are leaves labeled , and the remaining are labeled as in . Edges connect children to their parents, and edge lengths are determined by . For , we denote by and the respective labels of the child and parent delimiting the edge, by the number of mutations on the edge, and by the edge length, where we write if time contributes to the length of the edge .
The posterior on -space given a pattern of mutations under the infinite sites model is
provided is consistent with the mutations, and otherwise. The flip rates are
Simulating holding times with these rates is difficult due to the intervals during which they vanish. Instead, we introduce the following bounding intensities for use in Poisson thinning:
where is a positive function with a tractable integrals (e.g. a polynomial) satisfying
Holding times based on these bounding rates can be simulated by inverse CDF methods, and a proposed flip time is accepted with probability
and analogously for . It is possible for the numerator of (5) to become nonsensical, for instance if for some . In that case we set , and stop simulating the holding time for coordinate because the holding time for coordinate will be shorter. Algorithms 1 and 2 detail the procedure for simulating holding times and flips using
The fact that the boundary crossings in Algorithm 1 preserve (1) despite their simplicity follows from a Metropolis-Hastings accept/reject argument with acceptance probability one due to continuity of (1). See Proposition 1 below for details.
The Solve method on line 4 of Algorithm 2 returns the variable which solves , which is unique in all cases considered in this paper.
At a type 2 boundary, the process can either cross by performing an -move, or reflect back into its orthant. We regard arrival at the boundary as a Metropolis-Hastings proposal to cross, and apply an accept/reject step with acceptance (i.e. crossing) probability
This ensures [Bec10, Equation (36)] is satisfied, and the probability flux across the boundary leaves (1) invariant. Now, (1) is continuous at the boundary provided both orthants are consistent with the observed mutations. We show below that boundaries of inconsistent orthants are not reached, so all crossings of type 1 boundaries are accepted.
The acceptance probability of type 2 crossings (where we propose to cross into one of the two adjacent, new orthants uniformly) also equals one by an analogous argument.
It remains to verify that branches with mutations cannot reach length 0, and that is inaccessible if there is at least one mutation. A joint density of initial holding times ensures for the initial condition if , and the Poisson construction of the zig-zag dynamics then ensures that the set of times at which for has zero Lebesgue measure. Thus, an edge must span only one holding time before shrinking to length 0. Fix an edge with
, and in an abuse of notation suppose it spans one epochwith velocity (without loss of generality ). Suppose will be the next event if no flips occur, which must be true for some duration before is hit. Then, for sufficiently small , we have
and the corresponding survival function for is
Letting shows that the probability of a flip of before time is 1.
An analogous argument shows that is inaccessible if there is at least one mutation because the boundedness of means (1) diverges at . ∎
To illustrate the performance of the zig-zag process, we compared it to Metropolis-Hastings by reanalyzing the data set of [WFDP91], consisting of 55 individuals with 14 distinct types formed from 18 segregating sites. The zig-zag process used
The velocity was chosen based on trials runs so that the process crossed the posterior mode in unit time in the absence of a jump. The Appendix details the Metropolis-Hastings algorithm, and also summarizes other tuning parameters in Table 3.
We also compared both methods to a hybrid, which combined zig-zag dynamics with Metropolis-Hastings updates of the tree via subtree-prune-regraft steps, and of via a reflected random walk, both detailed in the Appendix, at fixed rate . The hybrid was not particularly sensitive to provided it was not extreme — very small values result in a zig-zag algorithm, while very large values result in Metropolis-Hastings.
Figure 1 shows that the zig-zag and hybrid processes mixing visibly better over the latent tree than Metropolis-Hastings. However, they are not as effective at exploring the upper tail of the -marginal posterior, likely because the processes do not remain in a region of short trees for long enough for to grow into the tail.
To assess the scaling of each algorithm, we simulated two further data sets: one of size with mutation rate , which is the approximate posterior mean of the data set of [WFDP91], and one with , and , which models a segment of DNA sequence 10 times as long. Figures 2 and 3 demonstrate that the zig-zag and hybrid processed perform far better than Metropolis-Hastings, particularly when
. Effective sample size estimates in Table1 quantify the improvement to 1–2 orders of magnitude.
|ESS / sec||Tree height||Branch length||Run time (min)|
4 The finite sites model
The finite sites model [JC69] is more detailed than the infinite sites model, but has greater computational cost. We consider a finite set of sites with a finite number of possible types at each site, denoted by . Typical examples include , or . Mutations occur at each site with rate
, and the type of a mutant child is sampled from a stochastic matrix, assumed to have a unique stationary distribution . We denote the transition matrix of the resulting -valued compound Poisson process with jump rate and jump transition matrix with . Extensions to heterogeneous mutation rates or type spaces across sites could be incorporated, but are neglected for ease of presentation.
The posterior can be written as a sum over the types of internal nodes:
where we define as the type at site on the node with label . The target (6) can be evaluated efficiently using the pruning algorithm of [Fel81]. The corresponding flip rates can be written in terms of branch-specific gradients as
which can in turn be evaluated using the linear-cost method of [JZH20].
We show that events with rates (7) and (8) can be simulated via the example of [GT94, Section 7.4], in which , , and is the matrix which always changes state. The corresponding transition probability is
where is the Kronecker delta, and .
with the convention that . If the zig-zag process is currently at time , then localizes it so that at most one branch can shrink to length zero on , and can fall by at most of its present value. This treatment of and is needed as (7) and (8) diverge as , provided the observed data contains at least one mutation, or as if the first merger is between two leaves of different types. In either case, or are inaccessible similarly to Proposition 1. We found gave good performance across our test cases; larger constants give tighter bounds on (7) and (8), but waste more computation as the algorithm hits time without an accepted flip.
A similar localization yields an alternate infinite sites implementation. If is such that and branch lengths with mutations remain strictly positive, then a constant intensity Poisson process can be thinned to simulate holding times with rates (2) and (3). Both approaches performed similarly in our scenarios.
Given , we have the following bounds on (9) on the time interval :
for , where , and
is the velocity of the branch length . Substituting these bounds into [JZH20, Equation (9)] provides bounds on flip rates that can be evaluated with cost.
Figure 4 demonstrates that the zig-zag process again mixes over latent trees more rapidly than Metropolis-Hastings, but struggles to explore the upper tail of the -marginal. The hybrid method was run with to account for shorter run lengths than in Section 3, and hence resembles Metropolis-Hastings rather than the zig-zag process.
Figures 5 and 6 show results for two further simulated data sets: one with and , and one with and . The superior mixing of the zig-zag process over the latent tree is plain. The lack of mixing in the upper tail of the -marginal is also stark, particularly in Figure 5
where zig-zag significantly underestimates posterior variance. The estimated posterior means of all three methods coincide in all cases (results not shown).
|ESS / sec||Tree height||Branch length||Run time (min)|
The coalescent is a gold-standard model in phylogenetics, but its practical use is hindered by slow mixing of MCMC methods over latent ancestral trees [HDPD08, LVH08, MV05]. We have constructed a zig-zag process for the coalescent and shown that it can improve mixing over ancestral trees, particularly under the infinite sites model which gives rise to sharp, cheap bounds on the zig-zag flip rates. The infinite sites model is widely used to analyze ever larger data sets, and the zig-zag process is a promising candidate for expanding the scope of feasible MCMC.
The zig-zag process was more efficient than Metropolis-Hastings under the infinite sites model in terms of effective sample size, but struggled to explore the tails of the posterior
-marginal. This is because high mutation rates are only be attainable when branch lengths are short. A Metropolis-Hastings algorithm can jump to a high mutation rate as soon as the latent tree enters a region of short branch length, while the zig-zag process has to traverse all intervening mutation rates before the branch lengths grow. The hybrid method with both types of motion interpolated between the zig-zag and Metropolis-Hastings algorithms.
The finite sites model required much higher run times from all three algorithms. Estimates of effective sample sizes in Table 2 distinguish the methods only weakly: zig-zag is superior to Metropolis-Hastings by a factor of 2 in some cases, but worse at estimating when . Trace plots tell the same story as for the infinite sites model: zig-zag shows improved mixing over latent trees, but struggles to explore the upper tail of the -marginal, particularly when where it markedly underestimates posterior variance of . The long run times of the zig-zag and hybrid methods for the finite sites model are due to the cost of gradients under the finite sites model, giving rise to an cost per evaluation of (7) or (7), and there are velocities for which to simulate holding times. However, it is noteworthy that flip times across velocities are conditionally independent given the current state, and can be generated in parallel, unlike steps of the Metropolis-Hastings algorithm. Thus, the zig-zag process is better suited to parallel architectures. Our work is the first non-reversible MCMC method for the coalescent, and contributes to recent constructions of gradient-based methods for discrete targets. We show that embedding discrete variables in continuous space [DBZM17, NDL20] is feasible for the zig-zag process, and the resulting algorithm can be simpler than HMC, on which attention has thus far focused (though see [NDL20, Section S6.4]).
The key to the derivation of the coalescent zig-zag process was -space [GD16], which enabled the sampling of discrete tree topologies using piecewise deterministic Markov processes with continuous paths and gradient evaluation of a target with discrete components. While -space is convenient for ultrametric trees, it is not unique: [GD16] also present another embedding, called -space. Good choices of embedding are likely to be problem-specific. For example, both -space and -space would be awkward for non-contemporaneous data.
It is natural to ask whether non-reversible methods can yield similar improvements in population genomics, where different regions of DNA have different but correlated ancestries due to e.g. crossover recombination. Here the standard model is the ancestral recombination graph (ARG) [GM97], in which ancestral trees are embedded into a common graph that is not itself a tree. The ARG is an even more challenging model for MCMC than the coalescent due to its larger and variable dimensionality. Reversibility is also a major difficulty for implementation of Metropolis-Hastings processes targeting ARG-posteriors. Local rearrangements of edges near the leaves can render large regions of the graph non-ancestral to the sample, at which point storing them is an unnecessary (and at worst exponentially large) overhead. Deleting the redundant regions cannot be done reversibly without a positive probability of resimulating them in a reverse move [KYF00, Implementation], which complicates the design and implementation of Metropolis-Hastings proposals considerably. A non-reversible method could in principle delete redundant regions as they appear, and thus avoid additional bookkeeping.
This appendix specifies the details of the Metropolis-Hastings algorithms used in Sections 3 and 4, both of which used the same proposal mechanisms consisting of three steps. An iteration of the algorithm corresponds to one scan through them all, with an accept/reject correction after each step. The Metropolis-Hastings step in the hybrid method is similar, but only uses steps 1 and 3.
A Gaussian random walk update of the mutation rate reflected at 0, and with proposal variance tuned to obtain an average acceptance probability .
An update of holding times under a fixed topology.
A subtree-prune-regraft step.
Type 2 updates first additively perturb the initial holding time as
Further holding times are conditionally independently perturbed as
where and are the perturbed times of the child nodes of the node at time . Again, was tuned so that the average acceptance probability was .
Type 3 updates consist of sampling an ordered pair of edges, where is an edge connecting the MRCA to . The first edge is deleted, and its child node is reattached into . Letting denote the time of the merger event with label in an abuse of notation, the reattachment time is has distribution
Under the infinite sites model, moves into topologies incompatible with observed mutations were rejected essentially instantaneously, without costly likelihood evaluations.
The author was supported by ESPRC grant EP/R044732/1, and is grateful to Jure Vogrinc and Andi Wang for productive conversations on MCMC for coalescent processe, and non-reversible MCMC in general.
A. J. Aberer, A. Stamatakis, and F. Ronquist.
An efficient independence sampler for updating branches in Bayesian Markov chain Monte Carlo sampling of phylogenetic trees.Syst. Biol., 65(1):161–176, 2016.
- [Bec10] J. Bect. A unifying approach to the Fokker-Planck-Kolmogorov equation for general stochastic hybrid systems. Nonlinear Anal-Hybri., 4:357–370, 2010.
- [BFR19a] J. Bierkens, P. Fearnhead, and G. Roberts. Supplement to the zig-zag process and super-efficient sampling for Bayesian analysis of big data. Ann. Stat., 47, 2019.
- [BFR19b] J. Bierkens, P. Fearnhead, and G. Roberts. The zig-zag process and super-efficient sampling for Bayesian analysis of big data. Ann. Stat., 47(3):1288–1320, 2019.
- [BHV01] L. J. Billera, S. P. Holmes, and K. Vogtmann. Geometry of the space of phylogenetic trees. Adv. Appl. Math., 27:733–767, 2001.
- [BR17] J. Bierkens and G. Roberts. A piecewise deterministic scaling limit of lifted Metropolis-Hastings for the Curie-Weiss model. Ann. Appl. Probab., 27(2):846–882, 2017.
V. Dinh, A. Bilge, C. Zhang, and F. A. Matsen IV.
Probabilistic path Hamiltonian Monte Carlo.
Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017.
- [Fel81] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17:368–376, 1981.
J. M. Flegal, J. Hughes, D. Vats, and N. Dai.
mcmcse: Monte Carlo Standard Errors for MCMC, 2017.
- [GD16] A. Gavryushkin and A. J. Drummond. The space of ultrametric phylogenetic trees. J. Theor. Biol., 403:197–208, 2016.
- [Gey92] C. J. Geyer. Practical Markov chain Monte Carlo. Statist. Sci., 7:473–483, 1992.
- [GM97] R. C. Griffiths and P. Marjoram. An ancestral recombination graph. In P. Donnelly and S. Tavaré, editors, Progress in Population Genetics and Human Evolution, pages 257–270. Springer-Verlang, 1997.
- [GS90] A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc., 85(410):398–409, 1990.
R. C. Griffiths and S. Tavaré.
Simulating probability distributions in the coalescent.Theor. Popln Biol., 46:131–159, 1994.
- [Has70] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.
- [HD12] S. Höhna and A. J. Drummond. Guided tree topology proposals for Bayesian phylogenetic inference. Syst. Biol., 61(1):1–11, 2012.
- [HDPD08] S. Höhna, M. Defoin-Platel, and A. J. Drummond. Clock-constrained tree proposal operators in Bayesian phylogenetic inference. In BioInformatics and BioEngineering, pages 1–7, 8th IEEE International Conference on IEEE, 2008.
- [JC69] T. H. Jukes and C. R. Cantor. Evolution of protein molecules. In H. N. Munro, editor, Mammalian protein metabolism, pages 21–132. Academic Press, New York, 1969.
- [JZH20] X. Ji, Z. Zhang, A. Holbrook, A. Nishimura, G. Baele, A. Rambaut, P. Lemey, and M. A. Suchard. Gradients do grow on trees: a linear-time -dimensional gradient for statistical phylogenetics. arXiv:1905.12146, 2020.
- [Kin82] J. F. C. Kingman. The coalescent. Stochast. Process. Applic., 13(3):235–248, 1982.
- [KYF00] M. K. Kuhner, J. Yamamoto, and J. Felsenstein. Maximum likelihood estimation of recombination rates from population data. Genetics, 156:1393–1401, 2000.
- [LVH08] C. Lakner, P. Van Der Mark, J. P. Huelsenbeck, B. Larget, and F. Ronquist. Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics. Syst. Biol., 57:86–103, 2008.
- [MV05] E. Mossel and E. Vigoda. Phylogenetic MCMC algorithms are misleading on mixtures of trees. Science, 309:2207–2209, 2005.
- [NDL20] A. Nishimura, D. B. Dunson, and J. Lu. Discontinuous Hamiltonian Monte Carlo for discrete parameters and discontinuous likelihoods. Biometrika, asz083, 2020.
- [Nea10] R. M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov chain Monte Carlo. CRC Press, 2010.
- [Wak09] J. Wakeley. Coalescent theory: an introduction. Roberts & Co, 2009.
- [Wat75] G. A. Watterson. On the number of segregating sites in genetical models without recombination. Theor. Popln Biol., 7:256–276, 1975.
- [WFDP91] R. H. Ward, B. L. Frazier, K. Dew, and S. Pääbo. Extensive mitochondrial diversity within a single Amerindian tribe. Proc. Natl. Acad. Sci. USA, 88:8720–8724, 1991.