# Efficient Projection onto the Perfect Phylogeny Model

Several algorithms build on the perfect phylogeny model to infer evolutionary trees. This problem is particularly hard when evolutionary trees are inferred from the fraction of genomes that have mutations in different positions, across different samples. Existing algorithms might do extensive searches over the space of possible trees. At the center of these algorithms is a projection problem that assigns a fitness cost to phylogenetic trees. In order to perform a wide search over the space of the trees, it is critical to solve this projection problem fast. In this paper, we use Moreau's decomposition for proximal operators, and a tree reduction scheme, to develop a new algorithm to compute this projection. Our algorithm terminates with an exact solution in a finite number of steps, and is extremely fast. In particular, it can search over all evolutionary trees with fewer than 11 nodes, a size relevant for several biological problems (more than 2 billion trees) in about 2 hours.

## Authors

• 2 publications
• 4 publications
• 6 publications
• 27 publications
11/23/2017

### Counting paths in perfect trees

We present some exact expressions for the number of paths of a given len...
11/27/2019

### Measuring similarity between two mixture trees using mixture distance metric and algorithms

Ancestral mixture model, proposed by Chen and Lindsay (2006), is an impo...
08/22/2019

### Exact inference under the perfect phylogeny model

Motivation: Many inference tools use the Perfect Phylogeny Model (PPM) t...
04/26/2021

### Efficient Evolutionary Models with Digraphons

We present two main contributions which help us in leveraging the theory...
05/12/2021

### Isomorphic unordered labeled trees up to substitution ciphering

Given two messages - as linear sequences of letters, it is immediate to ...
10/17/2019

### EvoZip: Efficient Compression of Large Collections of Evolutionary Trees

Phylogenetic trees represent evolutionary relationships among sets of or...
04/19/2018

### A third strike against perfect phylogeny

Perfect phylogenies are fundamental in the study of evolutionary trees b...

## Code Repositories

### Efficient-Projection-onto-the-Perfect-Phylogeny-Model

None

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The perfect phylogeny model (PPM) hudson1983properties ; kimura1969number is used in biology to study evolving populations. It assumes that the same position in the genome never mutates twice, hence mutations only accumulate.

Consider a population of organisms evolving under the PPM. The evolution process can be described by a labeled rooted tree, , where is the root, i.e., the common oldest ancestor, the nodes are the mutants, and the edges are mutations acquired between older and younger mutants. Since each position in the genome only mutates once, we can associate with each node , a unique mutated position, the mutation associated to the ancestral edge of . By convention, let us associate with the root , a null mutation that is shared by all mutants in . This allows us to refer to each node as both a mutation in a position in the genome (the mutation associated to the ancestral edge of ), and a mutant (the mutant with the fewest mutations that has a mutation ). Hence, without loss of generality, , , where is the length of the genome, and refers to both the oldest common ancestor and the null mutation shared by all.

One very important use of the PPM is to infer how mutants of a common ancestor evolve el2015reconstruction ; el2016multi ; jiao2014inferring ; malikic2015clonality ; popic2015fast ; satas-raphael17 . A common type of data used for this purpose is the frequency, with which different positions in the genome mutate across multiple samples, obtained, e.g., from whole-genome or targeted deep sequencing schuh2012monitoring . Consider a sample , one of samples, obtained at a given stage of the evolution process. This sample has many mutants, some with the same genome, some with different genomes. Let be such that is the fraction of genomes in with a mutation in position in the genome. Let be such that is the fraction of mutant in . By definition, the columns of must sum to . Let be such that , if and only if mutant is an ancestor of mutant , or if . We denote the set of all possible matrices, matrices and labeled rooted trees , by , and , respectively. See Figure 1 for an illustration. The PPM implies

 F=UM. (1)

Our work contributes to the problem of inferring clonal evolution from mutation-frequencies: How do we infer and from ? Note that finding is the same as finding (see Lemma B.2).

Although model (1) is simple, simultaneously inferring and from can be hard el2015reconstruction . One popular inference approach is the following optimization problem over , and ,

 minU∈UC(U), (2) C(U)=minM,F∈Rq×p∥^F−F∥ subject to F=UM,M≥0,M⊤1=1, (3)

where is the Frobenius norm, and contains the measured fractions of mutations per position in each sample, which are known and fixed. In a nutshell, we want to project our measurement onto the space of valid PPM models.

Problem (2) is a hard mixed integer-continuous optimization problem. To approximately solve it, we might find a finite subset

, that corresponds to a “heuristically good” subset of trees,

, and, for each fixed matrix , solve (3), which is a convex optimization problem. We can then return , where . Fortunately, in many biological applications, e.g., el2015reconstruction ; el2016multi ; jiao2014inferring ; malikic2015clonality ; popic2015fast ; satas-raphael17 , the reconstructed evolutionary tree involves a very small number of mutated positions, e.g., . In practice, a position might be an effective position that is a cluster of multiple real positions in the genome. For a small , we can compute for many trees, and hence approximate ,

, and get uncertainty measures for these estimates. This is important, since data is generally scarce and noisy.

Contributions: (i) we propose a new algorithm to compute exactly in steps, the first non-iterative algorithm to compute ; (ii) we compare its performance against state-of-the-art iterative algorithms, and observe a much faster convergence. In particular, our algorithm scales much faster than in practice; (iii) we implement our algorithm on a GPU, and show that it computes the cost of all (more than billion) trees with nodes, in hours.

## 2 Related work

A problem related to ours, but somewhat different, is that of inferring a phylogenetic tree from single-cell whole-genome sequencing data. Given all the mutations in a set of mutants, the problem is to arrange the mutants in a phylogenetic tree, fernandez2001perfect ; gusfield1991efficient . Mathematically, this corresponds to inferring from partial or corrupted observation of . If the PPM is assumed, and all the mutations of all the mutants are correctly observed, this problem can be solved in linear time, e.g., ding2006linear . In general, this problem is equivalent to finding a minimum cost Steiner tree on a hypercube, whose nodes and edges represent mutants and mutations respectively, a problem known to be hard garey2002computers .

We mention a few works on clonality inference, based on the PPM, that try to infer both and from . No previous work solves problem (2) exactly in general, even for trees of size . Using our fast projection algorithm, we can solve (2) exactly by searching over all trees, if . Ref. el2015reconstruction (AncesTree) reduces the space of possible trees to subtrees of a heuristically constructed DAG. The authors use the element-wise -norm in (3) and, after introducing more variables to linearize the product , reduce this search to solving a MILP, which they try to solve via branch and bound. Ref. malikic2015clonality (CITUP) searches the space of all unlabeled trees, and, for each unlabeled tree, tries to solve an MIQP, again using branch and bound techniques, which finds a labeling for the unlabeled tree, and simultaneously minimizes the distance . Refs. jiao2014inferring and deshwar2015phylowgs (PhyloSub/PhyloWGS), use a stochastic model to sample trees that are likely to explain the data. Their model is based on ghahramani2010tree

, which generates hierarchical clusterings of objects, and from which lineage trees can be formed. A score is then computed for these trees, and the highest scoring trees are returned.

Procedure (2) can be justified as MLE if we assume the stochastic model , where , and satisfy the PPM model, and represents additive, component-wise, Gaussian measurement noise, with zero mean and covariance . Alternative stochastic models can be assumed, e.g., as , where is non-negative and its columns must sum to one, and is as described before. For this model, and for each matrix , the cost is a projection of

onto the probability simplex

. Several fast algorithms are known for this problem, e.g., condat2016fast ; duchi2008efficient ; gong2011efficient ; liu2009efficient ; michelot1986finite and references therein. In a -dimensional space, the exact projection onto the simplex can be done in steps.

Our algorithm is the first to solve (3) exactly in a finite number of steps. We can also use iterative methods to solve (3). One advantage of our algorithm is that it has no tuning parameters, and requires no effort to check for convergence for a given accuracy. Since iterative algorithms can converge very fast, we numerically compare the speed of our algorithm with different implementations of the Alternating Direction Method of Multipliers (ADMM) boyd2011distributed , which, if properly tuned, has the fastest convergence rate among all first-order methods francca2016explicit under some convexity assumptions, and is known to produce good solutions for several other kinds of problems, even for non-convex ones hao2016testing ; francca2017distributed ; laurenceinvFBA2018 ; mathysparta ; zoran2014shape ; bento2015proximal ; bento2013message .

## 3 Main results

We now state our main results, and explain the ideas behind their proofs. Detailed proofs can be found in the Appendix.

Our algorithm computes and minimizers of (3), resp. and , by solving an equivalent problem. Without loss of generality, we assume that , since, by squaring the objective in (3), it decomposes into independent problems. Sometimes we denote by , since given , we can specify , and vice-versa. Let be the closest ancestor of in . Let be the set of all the ancestors of in , plus . Let be the set of children of in .

###### Theorem 3.1 (Equivalent formulation).

Problem (3) can be solved by solving

 mint∈Rt+L(t), (4) L(t)=minZ∈Rq12∑i∈V(Zi−Z¯i)2 subject to Zi≤t−Ni,∀i∈V, (5)

where , and, by convention, for . In particular, if minimizes (4), minimizes (5) for , and minimize (3), then

 (6)

Furthermore, , , and are unique.

Theorem 3.1 comes from a dual form of (3), which we build using Moreau’s decomposition moreau1962decomposition .

### 3.1 Useful observations

Let be the unique minimizer of (5) for some . The main ideas behind our algorithm depend on a few simple properties of the paths and , the derivative of with respect to . Note that is also a function of , as defined in Theorem 3.1, which depends on the input data .

###### Lemma 3.2.

is a convex function of and . Furthermore, is continuous in and , and is non-decreasing with .

###### Lemma 3.3.

is continuous as a function of and . is continuous as a function of .

Let i.e., the set of components of the solution at the boundary of (5). Variables in are called fixed, and we call other variables free. Free (resp. fixed) nodes are nodes corresponding to free (resp. fixed) variables.

###### Lemma 3.4.

is piecewise constant in .

Consider dividing the tree into subtrees, each with at least one free node, using as separation points. See Figure 4 in Appendix A for an illustration. Each belongs to at most different subtrees, where is the degree of node , and each belongs exactly to one subtree. Let be the set of resulting (rooted, labeled) trees. Let , where the root is the closest node in to . We call the subtrees induced by . We define , and, when it does not create ambiguity, we drop the index in . Note that different ’s might have elements in common. Also note that, by construction, if , then must be a leaf of , or the root of .

###### Definition 3.5.

The -problem is the optimization problem over variables

 min{Zj:j∈Vw∖B(t)}(1/2)∑j∈Vw(Zj−Z¯j)2, (7)

where is the parent of in , if , and if .

###### Lemma 3.6.

Problem (5) decomposes into independent problems. In particular, the minimizers are determined as the solution of the -problem. If , then , where and depend on but not on , and .

###### Lemma 3.7.

and are piecewise linear and continuous in . Furthermore, and change linear segments if and only if changes.

###### Lemma 3.8.

If , then . In particular, changes at most times with .

###### Lemma 3.9.

and have less than different linear segments.

### 3.2 The Algorithm

In a nutshell, our algorithm computes the solution path and the derivative . From these paths, it finds the unique , at which

 d(t+L(t))/dt=0|t=t∗⇔L′(t∗)=−1. (8)

It then evaluates the path at , and uses this value, along with (6), to find and , the unique minimizers of (3). Finally, we compute .

We know that and are continuous piecewise linear, with a finite number of different linear segments (Lemmas 3.7, 3.8 and 3.9). Hence, to describe and , we only need to evaluate them at the critical values, , at which and change linear segments. We will later use Lemma 3.7 as a criteria to find the critical values. Namely, are the values of at which, as decreases, new variables become fixed, and changes. Note that variables never become free once fixed, by Lemma 3.8, which also implies that .

The values and are computed sequentially as follows. If is very large, the constraint in (5) is not active, and . Lemma 3.7 tells us that, as we decrease , the first critical value is the largest for which this constraint becomes active, and at which changes for the first time. Hence, if , we have , , and . Once we have , we compute the rates and from and , as explained in Section 3.3. Since the paths are piecewise linear, derivatives are not defined at critical points. Hence, here, and throughout this section, these derivatives are taken from the left, i.e., and .

Since and are constant for , for we have

 Z∗(t)=Z∗(ti)+(t−ti)Z′∗(ti),L′(t)=L′(ti)+(t−ti)L′′(ti), (9)

and the next critical value, , is the largest , for which new variables become fixed, and changes. The value is found by solving for in

 Z∗(t)r=Z∗(ti)r+(t−ti)Z′∗(ti)r=t−Nr, (10)

and keeping the largest solution among all . Once is computed, we update with the new variables that became fixed, and we obtain and from (9). The process then repeats.

By Lemma 3.2, never increases. Hence, we stop this process (a) as soon as , or (b) when all the variables are in , and thus there are no more critical values to compute. If (a), let be the last critical value with , and if (b), let be the last computed critical value. We use and (9) to compute , at which and also . From we then compute and and .

The algorithm is shown compactly in Alg. 1. Its inputs are and , represented, e.g., using a linked-nodes data structure. Its outputs are minimizers to (3). It makes use of a procedure ComputeRates, which we will explain later. This procedure terminates in steps and uses memory. Line 5 comes from solving (10) for . In line 14, the symbols and remind us that and are computed from and using (6). The correctness of Alg. 1 follows from the Lemmas in Section 3.1, and the explanation above. In particular, since there are at most different linear regimes, the bound in the for-loop does not prevent us from finding any critical value. Its time complexity is , since each line completes in steps, and is executed at most times.

###### Theorem 3.10 (Complexity).

Algorithm 1 finishes in steps, and requires memory.

###### Theorem 3.11 (Correctness).

Algorithm 1 outputs the solution to (3).

### 3.3 Computing the rates

We now explain how the procedure ComputeRates works. Recall that it takes as input the tree and the set , and it outputs the derivatives and .

A simple calculation shows that if we compute , then computing is easy.

###### Lemma 3.12.

can be computed from in steps and with memory as

 L′′(ti)=∑j∈V(Z′∗(ti)j−Z′∗(ti)¯j)2, (11)

where is the closest ancestor to in . We note that if , then, by definition, . Assume now that . Lemma 3.6 implies we can find by solving the -problem as a function of , where is such that . In a nutshell, ComputeRates is a recursive procedure to solve all the -problems as an explicit function of .

It suffices to explain how ComputeRates solves one particular -problem explicitly. To simplify notation, in the rest of this section, we refer to and as and . Recall that, by the definition of and , if , then must be a leaf of , or the root of .

###### Definition 3.13.

Consider a rooted tree , a set , and variables such that, if , then for some and . We define the -problem as

 min{Zj:j∈V∖B}12∑j∈Vγj(Zj−Z¯j)2, (12)

where , is the closest ancestor to in , and if .

We refer to the solution of the -problem as , which uniquely minimizes (12). Note that (12) is unconstrained and its solution, , is a linear function of . Furthermore, the -problem is the same as the -problem, which is what we actually solve.

We now state three useful lemmas that help us solve any -problem efficiently.

###### Lemma 3.14 (Pruning).

Consider the solution of the -problem. Let be a leaf. Then . Furthermore, consider the -problem, where is equal to with node pruned, and let its solution be . We have that , for all .

###### Lemma 3.15 (Star problem).

Let be a star such that node is the center node, node is the root, and nodes are leaves. Let . Let be the solution of the -problem. Then,

 Z∗1=(γ1α2+∑ri=3γrαrγ1+r∑i=3γr)t+(γ1β2+∑ri=3γrβrγ1+∑ri=3γr). (13)

In particular, to find the rate at which changes with , we only need to know and , not .

###### Lemma 3.16 (Reduction).

Consider the -problem such that , and such that has all its children . Let be its solution. Consider the , where is equal to with nodes removed, and . Let be its solution. If for all , and , and satisfy

 ~αj=∑ri=1γiαi∑ri=1γi,~βj=∑ri=1γiβi∑ri=1γi,~γj=⎛⎝(γj)−1+(r∑i=1γi)−1⎞⎠−1, (14)

then  for all .

Lemma 3.15 and Lemma 3.16 allow us to recursively solve any -problem, and obtain for it an explicit solution of the form , where and do not depend on .

Assume that we have already repeatedly pruned , by repeatedly invoking Lemma 3.14, such that, if is a leaf, then . See Figure 2-(left). First, we find some node such that all of its children are in . If , then must be the root, and the -problem must be a star problem as in Lemma 3.15. We can use Lemma 3.15 to solve it explicitly. Alternatively, if , then we invoke Lemma 3.16, and reduce the -problem to a strictly smaller -problem, which we solve recursively. Once the -problem is solved, we have an explicit expression for all , and, in particular, we have an explicit expression . The only free variable of the -problem to be determined is . To compute , we apply Lemma 3.15 to the -problem, where is a star around , are the components of corresponding to nodes that are neighbors of , and are such that for all that are neighbors of , and for which is already known, and are all the neighbors of . See Figure 2-(right).

The algorithm is compactly described in Alg. 2. It is slightly different from the description above for computational efficiency. Instead of computing , we keep track only of , the rates, and we do so only for the variables in . The algorithm assumes that the input has been pruned. The inputs , , , and are passed by reference. They are modified inside the algorithm but, once ComputeRatesRec finishes, they keep their initial values. Throughout the execution of the algorithm, encodes (1) a doubly-linked list where each node points to its children and its parent, which we call , and (b) a a doubly-linked list of all the nodes in for which all the children are in , which we call . In the proof of Theorem 3.17, we prove how this representation of can be kept updated with little computational effort. The input , also passed by reference, starts as an uninitialized array of size , where we will store the rates . At the end, we read from .

Let be the number of nodes of the tree that is the input at the zeroth level of the recursion.

###### Theorem 3.17.

Algorithm 2 correctly computes for the -problem, and it can be implemented to finish in steps, and to use memory.

The correctness of Algorithm 2 follows from Lemmas 3.14-3.16, and the explanation above. Its complexity is bounded by the total time spent on the two lines that actually compute rates during the whole recursion, lines 3 and 8. All the other lines only transform the input problem into a more computable form. Lines 3 and 8 solve a star-shaped problem with at most variables, which, by inspecting (13), we know can be done in steps. Since, never takes the same value twice, the overall complexity is bounded by . The bound on memory is possible because all the variables that occupy significant memory are being passed by reference, and are modified in place during the whole recursive procedure.

The following lemma shows how the recursive procedure to solve a -problem can be used to compute the rates of change of of a -problem. Its proof follows from the observation that the rate of change of the solution with in (13) in Lemma 3.15 only depends on and , and that the reduction equations (14) in Lemma 3.16 never make or depend on .

###### Lemma 3.18 (Rates only).

Let be the solution of the -problem, and let be the solution of the -problem. Then, , and for some and .

We finally present the full algorithm to compute and from and .

The following theorem follows almost directly from Theorem 3.17.

###### Theorem 3.19.

Alg. 3 correctly computes and in steps, and uses memory.

## 4 Reducing computation time in practice

Our numerical results are obtained for an improved version of Algorithm 1. We now explain the main idea behind this algorithm.

The bulk of the complexity of Alg. 1 comes from line 4, i.e., computing the rates from and . For a fixed , and by Lemma 3.6, the rate , depends only on one particular -problem induced by . If exactly this same problem is induced by both and , which happens if the new nodes that become fixed in line 7 of round of Algorithm 1 are not in , then we can save computation time in round , by not recomputing any rates for , and using for the value .

Furthermore, if only a few change from round to round , then we can also save computation time in computing from by subtracting from the sum in the right hand side of equation (11) the terms that depend on the previous, now changed, rates, and adding new terms that depend on the new rates.

Finally, if the rate does not change, then the value of at which might intersect , and become fixed, given by in line 5, also does not change. (Note that this is not obvious from the formula for in line 5). If not all change from round to round , we can also save computation time in computing the maximum, and maximizers, in line 7 by storing in a maximum binary heap, and executing lines 5 and 7 by extracting all the maximal values from the top of the heap. Each time any changes, the heap needs to be updated.

## 5 Numerical results

Our algorithm to solve (3) exactly in a finite number of steps is of interest in itself. Still, it is interesting to compare it with other algorithms. In particular, we compare the convergence rate of our algorithm with two popular methods that solve (3) iteratively: the Alternating Direction Method of Multipliers (ADMM), and the Projected Gradient Descent (PGD) method. We apply the ADMM, and the PGD, to both the primal formulation (3), and the dual formulation (4). We implemented all the algorithms in C, and derived closed-form updates for ADMM and PG, see Appendix F. We ran all algorithms on a single core of an Intel Core i5 2.5GHz processor.

Figure 3-(left) compares different algorithms for a random Galton–Watson input tree truncated to have nodes, with the number of children of each node chosen uniformly within a fixed range, and for a random input

, with entries chosen i.i.d. from a normal distribution. We observe the same behavior for all random instances that was tested. We gave ADMM and PGD an advantage by optimally tuning them for each individual problem-instance tested. In contrast, our algorithm requires no tuning, which is a clear advantage. At each iteration, the error is measured as

. Our alg