Cutting an alignment with Ockham's razor

10/24/2019 ∙ by Mark Jones, et al. ∙ 0

In this article, we investigate different parsimony-based approaches towards finding recombination breakpoints in a multiple sequence alignment. This recombination detection task is crucial in order to avoid errors in evolutionary analyses caused by mixing together portions of sequences which had a different evolution history. Following an overview of the field of recombination detection, we formulate four computational problems for this task with different objective functions. The four problems aim to minimize (1) the total homoplasy of all blocks (2) the maximum homoplasy per block (3) the total homoplasy ratio of all blocks and (4) the maximum homoplasy ratio per block. We describe algorithms for each of these problems, which are fixed-parameter tractable (FPT) when the characters are binary. We have implemented and tested the algorithms on simulated data, showing that minimizing the total homoplasy gives, in most cases, the most accurate results. Our implementation and experimental data have been made publicly available. Finally, we also consider the problem of combining blocks into non-contiguous blocks consisting of at most p contiguous parts. Fixing the homoplasy h of each block to 0, we show that this problem is NP-hard when p >= 3, but polynomial-time solvable for p = 2. Furthermore, the problem is FPT with parameter h for binary characters when p = 2. A number of interesting problems remain open.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

When a multiple alignment contains sequences whose ancestors have undergone recombination, the evolutionary histories of different parts of the alignment are represented by different phylogenetic trees. In this case, using standard evolutionary analysis tools that assume a single phylogenetic tree for the entire alignment can introduce important biases for several inference tasks. For example it has been shown that in the presence of recombination, positive selection pressures (anisimova2003effect; kosakovsky2008estimating; arenas2010coalescent), as well as rate variation among sites and lineages tends to be overestimated (schierup2000consequences; schierup2000recombination). Recombination can also severely affect the inference of genetic distances (schierup2000consequences; lemey2009), of ancestral sequences (arenas2010effect), of demographic history (schierup2000consequences), and of course of the phylogeny itself, which may bear little resemblance to the true reticulate history of the sequences (posada2002effect).

An often viable solution to these problems is to partition the input alignment into recombination-free blocks, which can then be analysed with standard methods using a single phylogenetic tree per block (see, e.g., scheffler2006robust, for its application to the detection of positive selection in viruses). Inferring “locus trees” for a large number of recombination-free blocks has also become very popular in the context of species tree inference under the multi-species coalescent model (see xu2016challenges, for a recent review), where metiotic recombination is assumed to have happened between—but not within—blocks.

For these reasons, a recurring preliminary step in evolutionary bioinformatics is to infer the putative locations within an alignment where recombination has occurred, that is, the recombination breakpoints, with the goal of partitioning the alignment into blocks for further downstream analyses. In this paper, we closely examine a number of formulations of the problem of alignment partitioning in the presence of recombination, and algorithms that allow to efficiently solve them. Many increasingly sophisticated methods have been proposed for this or similar tasks (salminen2009; martin2011analysing). Below, we present an overview of the main ideas behind these methods.

Phylogenetic incompatibilities between sites in an alignment can either be due to recurrent substitutions—that is, the same or the inverse substitution having arisen on different branches in the tree—or to recombination having occurred between those sites (see box 15.1 in lemey2009). Distinguishing between these two phenomena—recurrent substitution and recombination—is strongly influenced by the prior belief about their relative frequencies. For example, if recurrent substitutions are assumed to be impossible (the infinite sites model

), every pair of incompatible sites must be separated by a recombination breakpoint, an observation that led to one of the earliest methods for breakpoints estimation 

(hudson1985statistical). On the other hand, classical phylogenetic inference assumes no recombination, and all incompatibilities between characters are explained by recurrent substitution. When both recombination and recurrent substitution are possible, an observation that underlies virtually all the methodology for recombination detection (and the present work is no exception) is that recombination leads to the spatial clustering along the alignment of compatible or nearly-compatible sites. In other words, in the presence of recombination, sites carrying similar phylogenetic signals tend to be closer than expected by chance alone.

The literature about recombination detection is very rich. A review from 2011 reported that there were already about 90 tools available at the time  (see Table S1 in martin2011analysing). In practice, however, these were conceived with many different tasks and goals in mind (e.g., detecting evidence of recombination, estimating breakpoints, identifying the parental sequences of the recombinants, etc.). Although it is beyond the scope of this article to provide a complete overview, there are a number of recurring ideas that are easy to describe. The methods allowing the identification of breakpoints in a multiple alignment can be roughly categorized in 3 groups (martin2011analysing).

Similarity- or distance-based methods constitute the conceptually simplest approach. They inspect the variation along the alignment of the (dis-)similarity between sequences, measured in terms of percent identity or of evolutionary distance (estimated via standard substitution models). Changes along the alignment in the relative similarities among sequences are interpreted as evidence of recombination. Examples of this approach are Rip (siepel1995computer), PhilPro (weiller1998phylogenetic), SimPlot (lole1999full), Rat (etherington2004recombination) and T-Recs (tsimpidis2017t). A problem with these methods is that the relation between pairwise similarities and phylogenetic relatedness is not straightforward, for example because of rate variation among lineages or selection pressures, meaning that a change in relative similarity does not always indicate phylogenetic incongruence or recombination (lemey2009).

Substitution distribution methods focus on a small subset of sequences (e.g., a triple) and, for each such subset in the alignment, they test whether some site patterns (e.g., those with the form , and ) occur in clustered locations along the alignment. Recombination breakpoints are identified with those positions where, as we move along the alignment, a change in the relative frequencies of these patterns is detected. These methods are often, but not exclusively, based on the use of a sliding window. Examples of this approach are GeneConv (sawyer1989statistical; padidam1999possible), MaxChi (smith1992analyzing), Chimaera (posada2001evaluation; martin2010rdp3), Siscan (gibbs2000sister), 3-Seq (boni2007exact; lam2017improved), Rapr (song2018tracking). The necessity to analyse only a subset at a time, instead of all sequences simultaneously, may be problematic when the alignment contains many sequences. This is not just for computational reasons (e.g., the number of triples grows cubically in the number of sequences) and statistical reasons (correcting for the rapidly increasing multiple tests may lower the power to detect recombination (martin2017detecting)). Another important issue is that the situation where the alignment only contains sequences that are direct recombinants of other sequences within the alignment is in fact the exception rather than the rule. If an alignment contains several descendants of the same recombinant, or a parental sequence of a recombinant is the ancestor of several sequences in the alignment, then the same recombination event should be detected in several partially overlapping subsets. Being able to recognize when two subsets have detected the same event is a difficult and open question (song2018tracking).

Phylogenetic-based methods seek to detect when different contiguous blocks in the alignment support different phylogenetic trees. A seminal paper in this context is due to J. Hein (hein1993heuristic). That paper introduced an optimization problem whose goal is to determine a sequence of breakpoints in the alignment and a sequence of phylogenetic trees for each of the blocks delimited by the breakpoints, so as to minimize a linear combination of the parsimony scores of the trees and of the cost of breakpoints. Each breakpoint has a cost that expresses the minimum number of recombination events that are necessary to “switch” between the two trees that it separates. Solving this problem exactly is impractical for several reasons (see the formal definition and discussion in Sec. 9.8.1 of huson2010phylogenetic)

, so a heuristic named

RecPars was proposed (hein1993heuristic; huson2010phylogenetic). Subsequent methods relied on distance-based phylogenetics (e.g., BootScan (salminen1995identification), Topal (mcguire1997graphical)), maximum-likelihood techniques (e.g., Plato (grassly1997likelihood), Lard (holmes1999phylogenetic), Gard (kosakovsky2006automated; kosakovsky2006gard)

), while more recent methods use hidden Markov models whose hidden states correspond to phylogenetic trees, and where the observations are the columns of the alignment. These models, sometimes referred to as

phylo-HMMs

, can be used to infer probable partitions of the input alignment into recombination-free blocks and are at the core of tools such as

Barce (husmeier2003detecting), DualBrothers (minin2005dual), StHmm (webb2008phylogenetic) and many others (hobolth2007genomic; bloomquist2008stepbrothers; deOliveira2008phylogenetic; dutheil2009ancestral; boussau2009mixture). Although phylogenetic-based methods are very appealing as they directly look for phylogenetic incongruence—instead of relying on indirect evidence from pairwise similarities or site pattern frequencies—they are usually much more computationally demanding than similarity-based or substitution distribution methods. In practice, sophisticated methods such as those based on phylo-HMMs are only applicable to a small number of sequences (e.g., 4 for husmeier2003detecting), unless the space of tree topologies that can be assigned to the blocks is heavily restricted (minin2005dual; webb2008phylogenetic). For this reason, some authors have proposed the use of parsimony criteria (hein1993heuristic; maydt2006recco; ruths2006recomp; ane2011detecting), well-known to lead to faster computations at the cost of some loss in accuracy. We note in particular the Mdl approach by ane2011detecting, which seeks to solve an optimization problem that is a simplification of that by hein1993heuristic mentioned above, in that all breakpoints receive the same penalty.

The work we present here falls naturally in this context of phylogenetic- and parsimony-based methods for recombination-aware alignment partitioning. Our goal is not to present a new software tool for this task—although we do provide implementations of some novel algorithms—but to investigate a number of natural formulations of the problem, and provide exact algorithms to solve them. We also investigate the relative strengths of the alternative formulations. Our optimization problems are defined in terms of the homoplasy within each block (informally, the amount of recurrent substitution that is needed to explain the sequences in that block), although equivalent formulations could be expressed in terms of parsimony scores. This choice is motivated by the observation that deciding whether a block has a level of homoplasy below a (small) constant is polynomially solvable (baca-lagergren; nearperfect).

Another key choice we made is related to the well-known observation that the number of blocks inferred by many methods for alignment partitioning is sensitive to parameters such as the size of a sliding window, or the cost/penalty of breakpoints or recombinations, relative to that of substitutions, for parsimony-based methods such as RecPars and Mdl (hein1993heuristic; ane2011detecting). To a lesser extent, this also holds for HMM-based statistical approaches, which rely on the use of priors on breakpoint frequency (husmeier2003detecting) or on the total number of breakpoints (minin2005dual). In our problem formulations, instead of introducing a parameter that indirectly influences the number of blocks inferred, we chose to directly constrain the maximum number of blocks in the partition. We believe this has the merit of making more explicit the dependency of block partitioning on user-defined parameters.

Overview of the article. We start by introducing necessary preliminaries in Section 2, and then move on to defining four different parsimony-based models for detecting recombination breakpoints in Section 2.1. Some of these models aim to minimize the maximum homoplasy in a block—or the relative frequency of homoplasy in a block—while others take an aggregate perspective, considering all blocks together. Although these models constitute fairly natural optimization criteria for parsimony-based models, one of our primary contributions is that, for all models, explicit pseudocode, rigorous proofs of correctness and detailed running time analyses are given: these are provided in Section 3. Several of the algorithms are Fixed Parameter Tractable (FPT) for binary characters, meaning—in essence—that certain natural parameters of the problems have an independent, and thus limited, contribution to the overall running time (We refer the reader to FlumGrohe:2006; downey2012parameterized, for more background on FPT algorithms). In Section 4 we describe algorithms that attempt to merge blocks, from a pre-computed block partition, such that parsimony-motivated criteria are optimized (these models are formally defined in Section 2.2). In Section 5 we describe our publicly-available software package CutAl, which implements the four algorithms described in Section 3. We have performed a number of experiments on simulated data testing the ability of the four algorithms to recover the locations of breakpoints; in particular, we look at the influence of the number and length of blocks, the number of taxa and the branch lengths (of the phylogenies used to experimentally generate alignments) on the overall performance of the algorithms. This data has also been made publicly available. Our analysis suggests that, of our four algorithms, aggregate minimization of the total amount of homoplasy seems most effective, and is highly accurate under many variations of experimental parameters. In Section 6 we present our overall conclusions and propose a number of interesting directions for future work.

2 Preliminaries

Let  be a set of character states. Throughout the paper, we assume that the cardinality  of  is a constant. An alignment  is an matrix with elements from . Denote by the -th column of and denote by the element in the -th row and -th column of , for any and . We also refer to the columns of an alignment as characters.

A block in is the alignment formed by the columns , for some . When is clear from context we will write as shorthand for . When the indices and are not important, we often write to denote a block. Note that blocks are always composed of contiguous columns. The number of columns in a block  is denoted . A block partitioning of  is a partition of the columns of , such that each is a block.

The null score of an alignment  is , with the number of different character states appearing in column  minus one.

A phylogenetic tree on  is an unrooted tree with no degree-2 vertices and leaf set . An -near perfect phylogeny for an alignment  is a phylogenetic tree  on with a mapping  such that and

with  the Hamming distance of  and  (the number of positions where they differ). The parsimony score can now be defined as the minimum value of  such that  is an -near perfect phylogeny for . A 0-near perfect phylogeny is a perfect phylogeny, which has parsimony score .

An alignment is said to have homoplasy score if is the minimum integer for which there exists an -near perfect phylogeny for . The total homoplasy of a block partitioning is the sum of the homoplasies of its blocks. Observe that the total homoplasy of a block partition of an alignment is at most the homoplasy of the alignment. Denote by the homoplasy of a block . Denote by the homoplasy ratio of .

We note some established results concerning the calculation of homoplasy scores. In particular, we note that calculating the homoplasy score of an alignment is fixed-parameter tractable for binary characters:

Theorem 2.1

(nearperfect) Given an  alignment  with elements from  with , it can be decided in  time whether  has an -near perfect phylogeny.

Theorem 2.2

(baca-lagergren) Given an  alignment  with elements from  with , it can be decided in  time whether  has an -near perfect phylogeny.

We also recall the well-known four-gamete test (Buneman1971), which characterizes alignments for which there exists a perfect phylogeny when .

Theorem 2.3 (Four-gamete test)

(Buneman1971) Let  be an alignment on  where . Call two characters  of  incompatible if there exist rows such that



.
Then  has homoplasy score  if and only if no two characters of  are incompatible.

A column  is called an uninformative site if some character state  appears times in , and every other character state appears at most once. Otherwise, is an informative site. Observe that a column is an uninformative site if and only if every phylogenetic tree on is a perfect phylogeny for . From a parsimony perspective, uninformative sites provide no meaningful information, and we will therefore ignore them when considering the accuracy of block partitions (see Section 5). For a given block in that contains at least one informative site, let be the minimal contiguous block in that contains all the informative sites of . Then we call the informative restriction of . Note that can still contain uninformative sites, but the first and last columns in are guaranteed to be informative. A block is said to be completely uninformative if every column in is an uninformative site. Given a block partition of , the informative restriction of is derived from by deleting all completely uninformative blocks, and replacing each remaining block with its informative restriction.

2.1 Partitioning into contiguous blocks

We consider a number of block partitioning problems in which the aim is to partition an alignment into a small number of blocks with homoplasy as small as possible.

Total Homoplasy Score
Given: an alignment  and integer .
Find: a block partitioning of  into  blocks , such that is minimized.

Max Homoplasy Ratio
Given: an alignment  and integer .
Find: a block partitioning of  into  blocks , such that is minimized.

Max Homoplasy Score
Given: an alignment  and integer .
Find: a block partitioning of  into  blocks , such that is minimized.

Total Homoplasy Ratio
Given: an alignment  and integer .
Find: a block partitioning of  into  blocks , such that is minimized.

Note that the variant of these four problems when the blocks in the output are not required to be contiguous is NP-complete, as it was proved by LSS2013 that the restricted case where the total homoplasy is 0 and the number of blocks is fixed to an integer (called -Character-Compatibility) is NP-complete.

2.2 Combining blocks to non-contiguous blocks

A multiblock for an alignment  is an alignment formed by some subset of columns in . Thus, the difference between a block and multiblock is that a multiblock is not necessarily contiguous. We say a multiblock is a -multiblock if it can be partitioned into at most contiguous parts—i.e., it is the union of at most blocks. A non-contiguous block partitioning of an alignment  is a partitioning of the columns of  into mulitblocks. The total homoplasy of a non-contiguous block partitioning is defined similarly as for block partitionings. The next problem studied in this paper aims at merging blocks with small total homoplasy to a small number of non-contiguous blocks. It is formally defined as follows.

Block Combining
Given: an alignment , a block partitioning  of  with total homoplasy  and an integer .
Decide: does there exist a non-contiguous block partitioning with total homoplasy  that consists of at most  non-contiguous blocks, each of which is a union of blocks from ?

Since allowing blocks to be completely atomized into many small contiguous parts seems to be too flexible, we also consider a variant where we only allow a partition into -multiblocks.

We say that two blocks  and  are mergeable if their union  has homoplasy equal to the homoplasy of  plus the homoplasy of . In the following problem, we assume that input blocks  and  are not mergeable for any . This is a reasonable assumption in cases where the input block partition is derived from a block partitioning algorithm. It holds, for example, when the blocks are those generated by Algorithm 3, and it also holds for solutions to Total Homoplasy Score provided the value of is chosen to be as small as possible without increasing .

Block Combining to -Multiblocks
Given: an alignment , a block partitioning  of  with total homoplasy  such that and are not mergeable for any , and integer .
Decide: does there exist a non-contiguous block partitioning with total homoplasy  that consists at most  -multiblocks, each of which is a union of blocks from ?

3 Block partitioning algorithms

3.1 Minimizing total homoplasy

In this section, we describe an approach that can be used to solve the Total Homoplasy Score problem. The key observation is that, given a suitable method to calculate the homoplasy score of each possible block in , an optimal block partitioning can be found using standard dynamic programming techniques. Algorithm 1 describes this dynamic programming technique, under the assumption that a value  has been calculated for each possible block . For the purposes of solving Total Homoplasy Score, we let  be the homoplasy score of block . However, by changing the function  we can also use Algorithm 1 to solve other block partitioning problems.

In particular, for the implementation described in Section 5, we will not use exact homoplasy scores for , but instead use the values returned by the heuristic parsimony solver Parsimonator (https://sco.h-its.org/exelixis/web/software/parsimonator/index.html).

We begin by proving the correctness of Algorithm 1 with respect to an arbitrary function .

Lemma 1

Given an alignment  with columns , an integer , and a value  for each block  in , Algorithm 1 returns the minimum value for which there exists a block partitioning of  into  blocks such that , or if no such block partitioning exists.

Proof

For each triplet of integers such that and , Algorithm 1 calculates a value . We first claim that for each choice of , is equal to the minimum for which there exists a block partitioning of with , such that , i.e., the last block consists of columns to , and that if no such block partitioning exists.

We prove this claim by induction on . We start with the base case, . In this case, the only possible partitioning on is the one consisting of the single block . Thus should be equal to if , and otherwise. It can be seen that this is the value calculated by Algorithm 1, and thus the claim is correct for .

Now suppose that . If , then the only possible block partitioning is the one consisting of the single block . Thus should be equal to if , and otherwise. Again this is the value calculated by Algorithm 1. If , then we have that an optimal block partitioning consists of a -block partitioning for together with the block . Moreover, the last block in the block partitioning of must be for some . Thus in the case , the total value for an optimal block partitioning of  is equal to , for the choice of that minimizes this value. As this is exactly what the algorithm calculates, the claim is correct. This completes the inductive proof.

It remains to observe that any block partitioning of  into at most blocks must have exactly blocks for some , and the last block must be for some . It follows that the value of an optimal block partitioning can be found by taking the minimum value of for all choices of and . ∎

The following Lemma is clear from the structure of Algorithm 1 and is stated without proof.

Lemma 2

Given an alignment  with columns , an integer , and a value  for each block  in , Algorithm 1 has running time .

Although we do not give a full proof here, we observe that Algorithm 1 can easily be converted into an algorithm that returns a block partitioning of  into  blocks that minimizes . Indeed, an optimal block partitioning for with blocks can be constructed by finding the value for which is minimized, recursively finding an optimal block partitioning for with blocks (if ), and combining this block partitioning with the block . We therefore have the following lemma.

Lemma 3

Let be a function on blocks of such that the value of can be calculated in time for any block , and let be an integer. Then in time, we can calculate a block partitioning of  into  blocks such that is minimized.

The factor comes from the need to calculate for each of the blocks in .

Using Theorems 2.1 and 2.2 we can now prove the following theorem.

Theorem 3.1

For binary characters, the Total Homoplasy Score problem can be solved in time , where is is the maximum homoplasy of a block in the block partitioning, and is thus fixed-parameter tractable with respect to . For -state characters, the problem can be solved in time .

Proof

Recall that by Theorem 2.1, for binary characters it can be decided in whether an alignment has an -near perfect phylogeny (nearperfect). It follows that given an integer , the homoplasy score of a block can be found in time if this score is at most . Similarly by Theorem 2.2, for -state characters, the homoplasy score of a block can be found in time if this score is at most  (baca-lagergren).

So now, given an integer let be the function on blocks in such that is equal to the homoplasy score of block if this is at most , and otherwise. It remains to apply Lemma 3 using this function, with for binary characters and for -state characters. For binary characters, this gives a running time of (as we may assume ), and for -state characters a running time of (as ). ∎

We also observe that by letting  be the homoplasy ratio of a block , Algorithm 1 can be used to solve Total Homoplasy Ratio.

Data: Alignment  with columns , integer , a value  for each block  in  (for instance, is the homoplasy score of ).
Result: Minimum value for which there exists a block partitioning of  into  blocks such that , or if no such block partitioning exists.
for  do
       ;
       for  do
             ;
       end for
      for  do
             ;
             for  do
                   ;
                  
             end for
            
       end for
      
end for
return
Algorithm 1 Algorithm ToHoPar().

3.2 Minimizing homoplasy ratio per block

In this section, we describe an approach that can be used to solve the Max Homoplasy Ratio problem. Similarly to Total Homoplasy Score, the key observation is that after calculating the homoplasy score (and thus the homoplasy ratio) of each possible block in , an optimal block partitioning can be found using standard dynamic programming techniques. Algorithm 2 describes this dynamic programming technique, under the assumption that a value  has been calculated for each possible block . For the purposes of solving Max Homoplasy Ratio, we let  be the homoplasy score of block .

Data: Alignment  with columns , integer , a value  for each block  in  (for instance, is the homoplasy ratio of ).
Result: Minimum value for which there exists a block partitioning of  into  blocks such that , or if no such block partitioning exists.
for  do
       ;
       for  do
             ;
       end for
      for  do
             ;
             for  do
                   ;
                  
             end for
            
       end for
      
end for
return
Algorithm 2 Algorithm HoRaPar().
Lemma 4

Given an alignment  with columns , an integer , and a value  for each block  in , Algorithm 2 returns the minimum value for which there exists a block partitioning of  into  blocks such that , or if no such block partitioning exists.

Proof

Observe that Algorithm 2 is identical to Algorithm 1, except for line 9 which handles construction of in the case where and . Consequently, the proof of this lemma is identical to that of Lemma 1, except for the case and , and we omit the other cases.

As with Lemma 1, we prove by induction on that is equal to the minimum for which there exists a block partitioning of with , such that , and that if no such block partitioning exists. For the case and , we have that an optimal block partitioning consists of a -block partitioning for together with the block . Moreover, the last block in the block partitioning of must be for some . Thus the value for an optimal block partitioning of  is equal to the maximum of and , for the choice of that minimizes this value. As this is exactly what the algorithm calculates, the claim is correct, and we have completed the proof for this case. ∎

The following Lemma is clear from the structure of Algorithm 2 and is stated without proof.

Lemma 5

Given an alignment  with columns , an integer , and a value  for each block  in , Algorithm 2 has running time .

As with Algorithm 1, we observe that Algorithm 2 can be made to return a block partitioning using standard backtracking techniques, and that using existing homoplasy algorithms, a certain parameterization of Max Homoplasy Ratio for binary characters is fixed-parameter tractable.

Theorem 3.2

For binary characters, the Max Homoplasy Ratio problem can be solved in time , where is is the maximum homoplasy of a block in the block partitioning, and is thus fixed-parameter tractable with respect to . For -state characters, the problem can be solved in time .

We also observe that by letting  be the homoplasy score of a block , Algorithm 1 can be used to solve Max Homoplasy Score.

3.3 Minimizing number of blocks

In this section, we consider a variation of the Max Homoplasy Score problem, described below.

Fewest Parsimonious Blocks
Given: an alignment  and integer .
Find: a block partitioning , of  into a minimum number of blocks such that each block admits an -near perfect phylogeny.

Algorithm 3 solves the Fewest Parsimonious Blocks problem.

Data: Alignment  with columns  and an integer .
Result: Block partitioning of  into a minimum number of blocks, such that each block  admits an -near perfect phylogeny.
;
;
for  do
       if  does not admit an -near perfect phylogeny then
             ;
             ;
            
       end if
      
end for
;
return
Algorithm 3 Algorithm FewParBlo()
Theorem 3.3

Algorithm 3 solves the Fewest Parsimonious Blocks problem.

Proof

Let  be a block partitioning produced by Algorithm 3, and let  be a block partitioning into a minimum number of blocks such that the index of the first column where and differ (i.e., the minimum such that columnn appears in blocks and for some ) is as large as possible. Consider the smallest  for which . It is not possible that  has more columns than  since otherwise the algorithm would have extended  with another column. Hence,  has fewer columns than . However, then we can add a column to  and obtain a solution with the same number of blocks as where the first columns where it differs from is one larger. This contradicts the assumption that the first column where and differ is as large as possible. Hence, we conclude that is identical to and therefore optimal. ∎

The running time of Algorithm 3 is  times the running time of the -near perfect phylogeny algorithm, hence for binary characters and for general -state characters. Consequently, the Fewest Parsimonious Blocks problem is fixed-parameter tractable for binary characters.

Corollary 1

Fewest Parsimonious Blocks is fixed-parameter tractable when the parameter is  for binary characters.

If the number of blocks is known to be small (), the following algorithm may also be useful. It uses binary search to find the location of the recombination site between two blocks.

Data: Alignment  with columns  and an integer .
Result: Block partitioning of  into a minimum number of blocks, such that each block  admits an -near perfect phylogeny.
;
while  do
       ;
       ;
       while  do
             ;
             if  admits an -near perfect phylogeny then
                   ;
                  
            else
                   ;
                  
             end if
            
       end while
      ;
       ;
       ;
      
end while
return null;
Algorithm 4 Algorithm FewParBlo2()

The correctness of this algorithm follows from a similar argument as for the previous algorithm: this algorithm uses binary search instead of linear search to find the longest possible block, using and as lower and upper bounds on the last column of that block. The only extra observation needed is the following. If admits an -near perfect phylogeny, then so does ; and, similarly, if does not admit an -near perfect phylogeny, then neither does .

Note that the running time of the part within the while loop is dominated by the function checking whether an -near perfect phylogeny exists. As this while loop performs a binary search on a list of length , its contents are executed times. Finally, this while loop is contained in a for loop which runs times at most. Therefore the running time is bounded by times the running time of the -near perfect phylogeny algorithm, which theoretically gives a factor improvement over the previous algorithm. An important note here is that actual running time may also depend on implementation of the -near perfect phylogeny algorithm. If the worst case for this algorithm is only attained in NO cases, the improvement of the second algorithm may be less than expected as the first algorithm encounters exactly one NO case per block, and the second may encounter more such cases.

4 Combining blocks to non-contiguous blocks

We now consider the Block Combining to -Multiblocks problem and first show this problem to be NP-complete for  by reduction from the Bounded Coloring problem: given a graph , does there exist a coloring of the vertices of  with at most  colors such that each color is used at most  times and adjacent vertices always get different colors? This problem is NP-complete for each fixed  (boundedcolorings).

Theorem 4.1

For every integer , the Block Combining to -Multiblocks problem is NP-complete for .

Proof

Given an instance of the Bounded Coloring problem, that is an integer and a graph with , we build the following blocks of aligned sequences of length for each vertex of , illustrated in Figure 1:

  • for all (that is, the first row in consists of all ’s);

  • and for all in ;

  • for all distinct from such that adjacent with , and for all ;

  • for all distinct from such that not adjacent with , and for all .

0000 0000 0000 0000
1000 1000 0000 0000
1000 1000 1000 1000
0000 1000 1000 1000
0100 0100 0100 0100
0100 0100 0100 0100
0100 0000 0100 0100
0000 0010 0010 0010
0010 0010 0010 0010
0010 0010 0000 0010
0000 0001 0001 0001
0001 0001 0001 0001
0001 0001 0001 0000
Figure 1: An instance of the Block Combining to -Multiblocks problem built from an instance of the Bounded Coloring problem with at most vertices per color.

Recall from the four-gamete test that an alignment has homoplasy 0 if and only if no two characters are incompatible, where characters are incompatible if there exist rows such that , , and .

First note that by construction, the only characters which may be equal to 1 in each block are the -th character of the block on sequences , and , all other characters are equal to 0. Therefore, there is at most one character equal to 1 in each line of block , therefore contains no incompatible characters, so the block has homoplasy 0, therefore we have built a proper instance of the Block Combining to -multiblocks problem for .

Now, suppose that is a positive instance of the Bounded Coloring problem using at most colors used at most times, then there exist independent sets in . We claim that the corresponding -multiblocks have homoplasy 0.

To prove this claim, let us consider 2 characters and in two distinct blocks and in the same -multiblock, such that is the -th character of block and is the -th character of block . If , the two characters are compatible because the 1s in those two characters do not appear in the same line. Otherwise, by construction the only 1s may appear in the sequences , and for these 2 characters. The vertices and corresponding to blocks and are not adjacent because they are part of an independent set, therefore by construction the sequences , and all contain 1 for one of these two characters, therefore both characters are compatible. So in all cases each -multiblock has homoplasy 0.

To prove the reverse, let us assume that it is possible to merge the blocks of the instance of the Block Combining to -Multiblocks we have built so that each -multiblock has homoplasy 0. For each -multiblock , let us consider two vertices and of corresponding to two blocks and of . Suppose by contradiction that and are adjacent. Then we have , , and . Thus the -th character of block and of block are incompatible, and therefore the homoplasy cost of the block combination containing and is strictly greater than 0; a contradiction. Therefore, no pair of vertices corresponding to the blocks of the block combination are adjacent, thus each set of vertices corresponding to each of the block combinations is an independent set, so has a -coloring.

Thus, we have built an instance of the Block Combining to -Multibocks problem where a merge into non-contiguous blocks (each containing at most contiguous blocks) with total homoplasy 0 for each of these blocks is possible if and only if there is a -coloring of  where each color is used at most times, therefore Block Combining to -Multiblocks is NP-hard.

Given a merge into non-contiguous blocks, it is easy to check if all the characters of each of these blocks are compatible, and that each non-contiguous block contains at most contiguous blocks, so Block Combining to -Multiblocks is NP-complete. ∎

We now focus on the case , i.e., the problem Block Combining to 2-Multiblocks.

Data: Alignment , a block partitioning  of  with total homoplasy  such that and are not mergeable for any , and integer .
Result: At most  2-multiblocks, each of which is a union of blocks from , such that the total homoplasy of the 2-multiblocks is  (if such a solution exists).
Construct a graph  with a vertex for each block and an edge  if  and  are mergeable;
Find a maximum matching  in ;
for each edge  do
       merge  and  into a 2-multiblock;
end for
for each vertex  that is not covered by  do
      make  a 2-multiblock consisting of a single continuous part;
end for
if the number of obtained 2-multiblocks is at most  then
      return the obtained 2-multiblocks;
else
      return null;
end if
Algorithm 5 Algorithm BloCo2Mul()

We now argue correctness of the algorithm. Suppose that there exist  2-multiblocks with total homoplasy  and such that each of the 2-multiblocks is a union of blocks from . Each 2-multiblock that consists of exactly two contiguous parts corresponds to an edge of . Let  be the set of all such edges and observe that they form a matching in . Each 2-multiblock that consists of a single contiguous part corresponds to a vertex of . Moreover, each vertex of  that is not covered by  corresponds to such a 2-multiblock. Hence, . It follows that we get the smallest possible number  of 2-multiblocks by choosing a maximum cardinality matching , which Algorithm 5 does.

The following theorem follows directly, since it can be checked in polynomial time whether an alignment has homoplasy 0 (perfectphylogeny; fastperfectphylogeny) and for binary states this is fixed-parameter tractable with parameter  (nearperfect).

Theorem 4.2

Block Combining to -Multiblocks can be solved in polynomial time for  and  and it is fixed-parameter tractable with parameter  for binary characters and .

We now continue to the general Block Combining to -Multiblocks problem with . This problem can be solved by Algorithm 6. We say that a collection of blocks  is mergeable if the total homoplasy of  is equal to the homoplasy of the alignment obtained by combining all blocks from . Note that, in general, it may happen that a set of blocks is not mergeable even if the blocks in the set are pairwise mergeable.

Data: Alignment , a block partitioning  of  with total homoplasy  such that and are not mergeable for any , and integer .
Result: At most  -multiblocks, each of which is a union of blocks from , such that the total homoplasy of the -multiblocks is .
Construct a graph  with a vertex for each set  with  that is mergeable and an edge  if ;
Give each vertex  a weight equal to  (i.e., the number of blocks in minus one);
Find a maximum weight independent set  in ;
for each input block not in any element of  do
      add the vertex to ;
end for
for each vertex  do
       Merge the blocks in  into a -multiblock;
      
end for
if the number of obtained -multiblocks is at most  then
      return the obtained -multiblocks
else
      return null
end if
Algorithm 6 Algorithm BloCopMul()

Correctness of Algorithm 6 follows from the following argument. First we argue that each -multiblock partitioning of with total homoplasy and multiblocks corresponds to an independent set in containing all input blocks of weight . Then we show that each independent set of with weight gives a -multiblock partitioning of in multiblocks. We now prove this in detail.

Lemma 6

Algorithm 6 is correct.

Proof

Suppose we have a -multiblock partitioning of with total homoplasy into multiblocks, then each multiblock must consist of a mergeable set of input blocks. Indeed, the total homoplasy of each multiblock is at least the sum of the homoplasies of the contained blocks. So, if the total homoplasy of all multiblocks is at most , and the total homoplasy of is also , none of the multiblocks may have strictly greater total homoplasy than the sum of the homoplasy of the contained blocks, i.e., is not allowed. Hence for each multiblock , or in other words, each multiblock  is mergeable.

This means that every multiblock corresponds to a vertex of and because the multiblocks form a partition of , there are no edges between these vertices. Hence the nodes corresponding to the chosen multiblocks form an independent set in . The weight of this independent set is