A wide range of tasks in computational biology start by determining, for a given “query gene” in a species , one or all genes in another species that are most similar to . Conceptually, these best matches of the query genes are meant to approximate the set of genes in that are evolutionary most closely related to . Best matches can be identified by comparing evolutionary distances (Nei and Zhang, 2006), which in turn are usually obtained from sequence alignments (Chatzou et al., 2016). In practice, fast approximation algorithms such as blast and its more modern successors are often used for this purpose (Moreno-Hagelsieb and Latimer, 2008, Hernández-Salmerón and Moreno-Hagelsieb, 2020). Even if sequence similarity is measured perfectly, deviations from a common molecular clock, i.e., differences in the evolutionary rates of different genes, cause discrepancies between best hits (most similar sequences) and best matches (evolutionary most closely related sequences), see (Stadler et al., 2020) for a detailed discussion.
The idea of best matches in the sense of closest evolutionary relatedness pre-supposes an underlying tree that describes the phylogenetic relationships among the genes, which correspond to the leaves of , and a map assigning to each gene the species in which it resides. Given such a leaf-colored tree , the best match graph has as its vertex set the leaves of , i.e., the set of genes, and as (directed) arcs the best matches. The latter are defined as the pairs for which the last common ancestor of and is at least as close to as the last common ancestor of and any other gene from the same spaces . Best match graphs (BMGs), i.e., graphs that are derived from a leaf-labeled tree in this manner, form a very restrictive class of colored digraphs (Geiß et al., 2019, Schaller et al., 2021a). Empirically determined best hit data therefore will in general not satisfy the defining properties of BMGs. They can be corrected in part, however, by considering the problem of editing a given graph to the closest BMG. Schaller et al. (2021d)
showed that the arc modification problems for BMGs are NP-hard, but can be formulated as integer linear programs (ILPs) allowing practical solutions for small instances. However, in computational biology, applications to large gene families would be of particular interest, creating the need for faster, approximate solutions for BMG editing. Before embarking to develop software for a BMG-based analysis of large sequence data sets, we need to understand whether the editing problem for BMGs is tractable in practice with sufficient accuracy and for interestingly large instances. The purpose of this contributions is to establish that this is indeed the case.
Motivated by both theoretical and practical considerations, we are mainly interested in heuristics that are consistent in the following sense: Let be an algorithm that takes an arbitrary vertex-colored digraph as input and outputs a BMG . Then is consistent if whenever the input graph is a BMG.
A vertex-colored digraph is a BMG if and only if (a) a certain set of informative triples , which can easily be read off the input graph, is consistent and (b) the BMG of the corresponding so-called Aho tree coincides with (Schaller et al., 2021a). In general, the Aho tree of a compatible set of triples on a set is a least resolved super-tree of all the triples in . For a BMG, is the unique least resolved tree (LRT) for . These close connections between recognizing BMGs and constructing super-trees suggest to adapt ideas from heuristic algorithms for triple consistency problems and super-tree construction for BMG editing.
The simplest approach, therefore, is to extract a maximal consistent subset from and to use as an approximation, see Sec. 4. A more detailed analysis of arcs in that violate the property of being a BMG in Sec. 5, however, leads a notion of “unsatisfiable relations” (UR), which can be used to count the arc modifications associated with a partition of the vertex set of . It also gives raise to a top-down algorithm in which the vertex set of is recursively edited and partitioned. A large class of heuristics for BMG editing can be constructed depending on the construction of the partition in each recursion step. We shall see that the arc edit sets in different steps of the recursion are disjoint. A main result of this contribution, Thm. 3, links the partitions appearing in BMG editing algorithms to the auxiliary graphs appearing in the BUILD algorithm for supertree construction (Aho et al., 1981), see Sec. 2. This provides a guarantee that the BMG editing algorithms are consistent provided the choice of is such that it does not enforce edits whenever an alternative partition with and empty UR is available. For BMGs, this is in particular the case for the partitions appearing in the BUILD algorithm. In Sec. 7, we proceed to show by reduction from Set Splitting that finding a partition with a minimal number of unsatisfiable relations is NP-hard.
The theoretical results are complemented by computational experiments on BMGs with randomly perturbed arc sets in Sec. 8. We focus on a comparison of different algorithms to construct the partitions . Somewhat surprisingly, we find that minimizing the cardinality of the UR alone is not the best approach, since this tends to produce very unbalanced partitions and thus requires a large number of steps in the recursions whose costs add up. Instead, certain types of clustering or community detection approaches that favor more balanced partitions tend to perform well.
2 Notation and Preliminaries
is a partition of a set if (i) , (ii) and (iii) for . A partition is non-trivial if . Consider two partitions and of . If for every there is a such that , i.e., if every set in is completely contained in a set in , then is a refinement of , and is a coarse-graining of .
Mostly, we consider simple directed graphs (digraphs) with vertex set and arc set . We will frequently write and to explicitly refer to the graph . For a vertex , we say that is an in-arc and is an out-arc. The subgraph induced by a subset is denoted by . Undirected graphs can be identified with symmetric digraphs, i.e., the undirected graph underlying a digraph is obtained by dropping the direction of all arcs, or by symmetrizing the digraph, i.e., adding the arc to for every arc . When referring to an undirected graph , we write for and call an edge. The (weakly) connected components of are the maximal connected subgraphs of the undirected graph underlying or, equivalently, the maximal strongly connected components of the symmetrized digraph. Whenever the context is clear, we will also refer to the partition of formed by the vertex sets of the maximal connected subgraphs as the set of connected components.
A vertex coloring is a map , where is a non-empty set of colors. A vertex coloring of is proper if whenever . We write for a vertex-colored digraph and denote by the subset of vertices of a graph that have color . Moreover, we define for the subset of colors present in a set .
We write for the set of out-neighbors of a vertex and for the set of in-neighbors of . A graph is called sink-free if holds for all . We write for the symmetric difference of two sets and , and define, for a graph and arc set , the graph . Analogously, we write and .
Consider an undirected, rooted tree with leaf set and root . Its inner vertices are given by the set . The ancestor order on is defined such that if lies on the unique path from to the root , i.e., if is an ancestor of . For brevity we set if and . If is an edge in such that , then is the parent of and the child of . The set of children of a vertex is denoted by . A tree is phylogenetic if all its inner vertices have at least two children. All trees considered in this contribution will be phylogenetic. For a non-empty subset , we define , the last common ancestor of , to be the unique -minimal vertex of that is an ancestor of every . Following e.g. Bryant and Steel (1995), we denote by the restriction of to a subset , i.e. is obtained by identifying the (unique) minimal subtree of that connects all leaves in , and suppressing all vertices with degree two except possibly the root . We say that displays or is a refinement of a tree , in symbols , if can be obtained from a restriction of by a series of inner edge contractions. is a leaf-colored tree if is a map from the leaves of into a non-empty set of colors. We say that is displayed by if and for all .
A (rooted) triple is a tree on three leaves and with two inner vertices. We write for the triple on the leaves and if the path from to does not intersect the path from to the root in , i.e., if . In this case we say that displays . We write for the restriction of a triple set to a set of leaves. A set of triples is consistent if there is a tree with leaf set that displays every triple in . The polynomial-time algorithm BUILD decides for every triple set whether it is consistent, and if so, constructs a particular tree, the Aho tree , that displays every triple in (Aho et al., 1981). The algorithm relies on the construction of an (undirected) auxiliary graph, the Aho graph, for a given triple set on a set of leaves . This graph, denoted by , contains an edge if and only if for some .
3 Best match graphs
Let be a leaf-colored tree. A leaf is a best match of the leaf if and holds for all leaves of color .
Given , the graph with vertex set , vertex-coloring , and with arcs if and only if is a best match of w.r.t. is called the best match graph (BMG) of (Geiß et al., 2019).
An arbitrary vertex-colored graph is a best match graph (BMG) if there exists a leaf-colored tree such that . In this case, we say that explains .
We say that is an -BMG if . By construction, there is at least one best match of for every color :
For every vertex and every color in a BMG there is some vertex with . Equivalently, the subgraph induced by every pair of colors is sink-free.
In particular, therefore, BMGs are sink-free whenever they contain at least two colors. We formalize this basic property of BMGs for colored graphs in general:
Let be a colored graph. The coloring is sink-free if it is proper and, for every vertex and every color in , there is a vertex with . A graph with a sink-free coloring is called sf-colored.
Given a tree and an edge , we denote by the tree obtained from by contracting the edge . We call an edge in redundant (w.r.t. ) if both and explain .
A tree is least resolved for a BMG if (i) it explains and (ii) it does not contain any redundant edges w.r.t. .
Geiß et al. (2019, Thm. 8) showed that every BMG has a unique least resolved tree (LRT).
Let be a vertex-colored digraph. Then the set of informative triples is
|and the set of forbidden triples is|
|For the subclass of BMGs that can be explained by binary trees, we will furthermore need|
By definition, must be pairwise distinct whenever , , or .
We extend the notion of consistency to pairs of triple sets in
Let and be sets of triples. The pair is called consistent if there is a tree that displays all triples in but none of the triples in . In this case, we also say that agrees with .
It can be decided in polynomial time whether such a pair is consistent using the algorithm MTT, which was named after the corresponding mixed triplets problem restricted to trees and described by He et al. (2006).
We continue with two simple observations concerning the restriction of triple sets. Since informative and forbidden triples are only defined by the presence and absence of arcs in the subgraph of induced by , this leads to the following
(Schaller et al., 2021b) Let be a vertex-colored digraph and . Then holds for every .
Moreover, any pair of triples such that and for a consistent pair remains consistent since any tree that agrees with clearly displays all triples in and none of the triples in . Hence, we have
Let and for a consistent pair of triple sets . Then is consistent.
Let be a properly colored digraph with vertex set . Then the following three statements are equivalent:
is a BMG.
is consistent and .
is sf-colored and is consistent.
In this case, is the unique LRT for , and a leaf-colored tree on explains if and only if it agrees with .
It is important to note that both arc insertions and deletions may lead to creation and loss of informative triples. In particular, when starting from a BMG, both types of modifications have the potential to make the triple set inconsistent as the example in Fig. 2 shows. This is indeed often the case even for moderate disturbances of a BMG as we shall see in Sec. 8.
We expect that empirically estimated best match relations will typically contain errors that correspond to both arc insertions and deletions w.r.t. the unknown underlying “true” best match graph. This motivates the problem of editing a given vertex-colored digraph to a BMG:
Problem 1 (-BMG Editing).
|Input:||A properly -colored digraph and an integer .|
|Question:||Is there a subset such that and|
|is an -BMG?|
Natural variants are -BMG Completion and -BMG Deletion where is replaced by and , respectively, i.e., only addition or deletion of arcs is allowed. Both -BMG Editing and its variants are NP-complete (Schaller et al., 2021d).
The heuristic algorithms considered in this contribution can be thought of as maps on the set of finite vertex-colored digraphs such that is a BMG for every vertex-colored input graph . In particular, the following property of such algorithms is desirable:
A (BMG-editing) algorithm is consistent if whenever is a BMG.
4 A simple, triple-based heuristic
The triple-based characterization summarized by Prop. 1 suggests a simple heuristic for BMG editing that relies on replacing the consistency checks for triple sets by the extraction of maximal sets of consistent triples (see Alg. 1). Both MaxRTC, the problem of extracting from a given set of rooted triples a maximum-size consistent subset, and MinRTI, the problem of finding a minimum-size subset such that is consistent, are NP-hard (Jansson, 2001). Furthermore, MaxRTC is APX-hard and MinRTI is -inapproximable (Byrka et al., 2010). However, because of their practical importance in phylogenetics, a large number of practically useful heuristics have been devised, see e.g. (Gasieniec et al., 1999, Wu, 2004, Tazehkand et al., 2013).
As a consequence of Prop. 1, Alg. 1 is consistent, i.e., if and only if the input graph is a BMG, if a consistent heuristic is employed to solve MaxRTG/MinRTI, i.e., if consistent triple sets remain unchanged by the method approximating MaxRTG/MinRTI.
The heuristic Alg. 1 is not always optimal, even if MaxRTC/MinRTI is solved optimally. Fig. 3 shows an unconnected 2-colored graph on three vertices that is not a BMG and does not contain informative triples. The BMG produced by Alg. 1 introduces two arcs into . However, can also be edited to a BMG by inserting only one arc.
A simple improvement is to start by enforcing obvious arcs: If is the only vertex with color , then by definition there must be an arc for every vertex . The computation then starts from the sets of informative triples of the modified graph. We shall see below that these are the only arcs that can safely be added to without other additional knowledge or constraints (cf. Thm. 2 below).
5 Locally optimal splits
Finding an optimal BMG editing of a digraph is equivalent to finding a tree on that minimizes the cardinality of
We may ask, nevertheless, if trees on contain information about arcs and non-arcs in that are “unambiguously false” in the sense that they are contained in every edit set that converts into a BMG. Denote by the set of all phylogenetic trees on . The set of these “unambiguously false” (non-)arcs can then be expressed as
Since there are in general exponentially many trees on and thus, the problem of determining seems to be quite challenging on a first glance. We shall see in Thm. 2, however, that can be computed efficiently. We start with a conceptually simpler construction.
Let be a properly vertex-colored digraph and a partition of with . Moreover, let be the set of trees on that satisfy . The set of unsatisfiable relations (UR), denoted by , is defined as
The associated UR-cost is .
The set of (phylogenetic) trees is non-empty since in Def. 8. Moreover, by construction, if and only if
Intriguingly, the set , and thus the UR-cost , can be computed in polynomial time without any explicit knowledge of the possible trees to determine the set . To this end, we define the three sets
Let be a properly vertex-colored digraph and let be a partition of with . Then
The proof of Lemma 1 relates the possible cases between and the tree set in a straightforward manner. Since it is rather lengthy it is relegated to Appendix. Fig. 4 gives examples for all three types of unsatisfiable relations, i.e., for , , and . In particular, we have since it is an arc in but contains another red vertex . Moreover, since it is not an arc in but does not contain another green vertex. Finally, we have since it is not an arc in but is the only blue vertex in . In the example, the graph is already a BMG which, however, is not true in general.
The set can be computed in quadratic time.
We first compute all numbers of vertices in with a given color . This can be done in if we do not explicitly store the zero-entries. Now, , i.e. , can be checked in constant time, and thus, it can also be decided in constant time whether or not a pair is contained in or . Since, given , the condition is equivalent to , membership in
can also be decided in constant time. Checking all ordered pairsthus requires a total effort of . ∎∎
Our discussion so far suggests a recursive top-down approach, made precise in Alg. 2. In each step, one determines a “suitably chosen” partition and then recurses on the subgraphs of the edited graph . More details on such suitable partitions will be given in Thm. 3 below. The parts in the algorithm highlighted in color can be omitted. They are useful, however, if one is also interested in a tree that explains the editing result and to show that is indeed a BMG (see below). Alg. 2 is designed to accumulate the edit sets in each step, Line 2. In particular, the total edit cost and the scores are closely tied together, which follows from the following result:
All edit sets constructed in Alg. 2 are pairwise disjoint.
The edit cost of Alg. 2 is the sum of the UR-costs in each recursion step.
It is important to note that the edits must be applied immediately in each step (cf. Line 2 in Alg. 2). In particular, Lemma 2 and Cor. 2 pertain to the partitioning of the edited graph , not to the original graph .
Every pair of edited graph and tree produced as output by Alg. 2 satisfies . In particular, is a BMG.
By construction, the tree is phylogenetic and there is a one-to-one correspondence between the vertices and the recursion steps, which operate on the sets . If (or, equivalently, is an inner vertex of ), we furthermore have for the partition of chosen in that recursion step. In the following, we denote by the graph during the editing process, and by the input graph, i.e., as in Alg. 2. For brevity, we write for the arc set of the final edited graph and .
Let us assume, for contradiction, that there exists (a) , or (b) . In either case, we set and consider the recursion step on with the corresponding partition chosen for . Note that , and thus . Moreover, let be the child of such that , and .
Case (a): .
Since and by the definition of best matches, there must be a vertex of color such that , and thus . Moreover, we have , and . Two subcases need to be considered, depending on whether or not is an arc in at the beginning of the recursion step. In the first case, the arguments above imply that , and thus, by Lemma 1. Hence, we delete the arc in this step. In the second case, it is an easy task to verify that none of the definitions of , , and matches for . Since this step is clearly the last one in the recursion hierarchy that can affect the (non-)arc , it follows for both subcases that ; a contradiction.
Case (b): .
Since and by the definition of best matches, there cannot be a vertex of color such that , and thus . Moreover, we have , and . Again, two subcases need to be distinguished depending on whether or not is an arc in at the beginning of the recursion step. In the first case, the arguments above make it easy to verify that none of the definitions of , , and matches for . In the second case, we obtain , and thus, by Lemma 1. Hence, we insert the arc in this step. As before, the (non-)arc remains unaffected in any deeper recursion step. Therefore, we have in both subcases; a contradiction.
Finally, immediately implies that is a BMG. ∎∎
Cor. 2 suggests a greedy-like “local” approach. In each step, the partition is chosen to minimize the score in Line 2. The example in Fig. 5 shows, however, that the greedy-like choice of does not necessarily yield a globally optimal edit set.
In order to identify arcs that must be contained in every edit set, we first clarify the relationship between the partitions on and the partitions defined by the phylogenetic trees on .
Let be a set with . Let be the set of all partitions of with . Then the set of all phylogenetic trees with leaf set satisfies .
For every , is a set of phylogenetic trees on . Hence, we conclude . Conversely, assume that . Since (with root ) is a phylogenetic tree and has at least two leaves, we have . Together with , this implies . In particular, satisfies for some , and is therefore contained in . ∎∎
Using Lemma 3 and given that , we can express the set of relations that are unsatisfiable for every partition as follows
i.e., it coincides with the set of relations that are unsatisfiable for every phylogenetic tree, and thus part of every edit set. Note that is trivially empty if . We next show that can be computed without considering the partitions of explicitly.
Let be a properly vertex-colored digraph with then