The systematic comparison of hierarchies is a fundamental task in a wide range of domains. Entities in biology often stand in a hierarchical relationship to one another (Hainke et al. (2012), Dutkowski et al. (2013)). A phylogenetic tree, for example, groups all descendants of a common ancestor and matching metrics are used in order to evaluate the dissimilarity between such trees (Damian Bogdanowicz (2013)). The most common way to measure the distance between trees is the tree edit distance (Tai (1979), Yoshino and Hirata (2017)). In this work we consider unordered labeled trees in which the order among siblings is irrelevant. The tree edit distance between trees is defined by the minimal-cost sequence of node edit operations (insert, delete, relabel) that transforms one tree into another. It is shown in Tai (1979) that there is a strong relationship between a Tai mapping and a sequence of editing operations, i.e. given a Tai mapping the tree edit distance can easily be computed.
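As a rough illustration of the last point, the cost induced by a given Tai mapping can be evaluated in linear time: mapped pairs with differing labels are relabeled, unmapped nodes of the first tree are deleted, and unmapped nodes of the second tree are inserted. The sketch below assumes unit costs for all three operations and a dictionary-based label representation. The function name `mapping_cost` and the toy labels are illustrative and not taken from the paper, and the function does not check that its input is a valid Tai mapping.

```python
def mapping_cost(labels1, labels2, mapping):
    """Edit cost induced by a node mapping under unit costs (an assumption).

    labels1, labels2: dict node -> label for the two trees.
    mapping: iterable of (x, y) pairs, assumed to be a valid Tai mapping.
    """
    mapping = list(mapping)
    # Mapped pairs whose labels differ must be relabeled.
    relabels = sum(1 for x, y in mapping if labels1[x] != labels2[y])
    # Nodes of the first tree that are not mapped are deleted.
    deletions = len(labels1) - len({x for x, _ in mapping})
    # Nodes of the second tree that are not mapped are inserted.
    insertions = len(labels2) - len({y for _, y in mapping})
    return relabels + deletions + insertions


# Toy example: tree 1 has nodes 0..2, tree 2 has nodes 0..1.
labels1 = {0: "a", 1: "b", 2: "c"}
labels2 = {0: "a", 1: "d"}
cost = mapping_cost(labels1, labels2, [(0, 0), (1, 1)])
print(cost)
```

Minimizing this quantity over all Tai mappings gives the tree edit distance under the chosen cost model, which is why the remainder of the text works with mappings rather than with edit sequences.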
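A Tai mapping defines a one-to-one node mapping between trees that preserves ancestor relationships (see Figure 1(a)). We use $<$ and $\le$ to denote ancestor orders, i.e. $x < x'$ if $x$ is an ancestor of $x'$, and $x \le x'$ if $x < x'$ or $x = x'$. Formally, for any two trees $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$, a Tai mapping is defined as a mapping $M \subseteq V_1 \times V_2$ such that for any distinct $(x, y), (x', y') \in M$, we have that
$$(x = x' \iff y = y') \land (x < x' \iff y < y'). \tag{1}$$
Equation 1 combines the one-to-one property and the ancestor order property with a conjunction. Note that a Tai mapping can equivalently be defined by a single condition that merges both properties, i.e. for any distinct $(x, y), (x', y') \in M$ we have that
$$(x \le x' \iff y \le y') \land (x' \le x \iff y' \le y). \tag{2}$$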
Computing the tree edit distance between unordered labeled trees, and hence an optimal Tai mapping, is NP-hard (Zhang et al. (1992)) and in fact MAX SNP-hard (Zhang and Jiang (1994)). Hence, different integer linear programming (ILP) based algorithms have been proposed to compute such a mapping (Kondo et al. (2014), Do et al. (2019), Böcker et al. (2013), Hong et al. (2017)). A naive ILP formulation (Böcker et al. (2013)) contains a constraint for each pair of edges for which (2) does not hold, giving rise to $O(nm)$ variables and $O(n^2 m^2)$ constraints, where $n$ and $m$ are the numbers of nodes of the two input trees, and thus cannot solve even moderate-sized instances in practice (Kondo et al. (2014), Böcker et al. (2013)). In Hong et al. (2017) the authors therefore divide the problem into subproblems with fewer constraints each, utilizing a dynamic programming approach from Fukagawa et al. (2011). In Do et al. (2019), on the other hand, the authors introduced two classes of valid inequalities of the Tai mapping polytope, crossing edges clique constraints and semi-independent clique constraints, and used them as cutting planes in a branch-and-bound scheme. Their implementation achieved a 13-fold speed-up compared to the naive ILP in experiments on perturbed human single-cell data. Here, we derive a class of valid inequalities that generalizes both types of previously introduced clique constraints. We formalize them as special classes of anti Tai mappings defined as follows.
Anti Tai mapping
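A set of edges in which every two distinct edges violate (1) is called an anti Tai mapping (see Figure 1(b)). More formally, for trees $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$ with ancestor-or-equal order $\le$, an anti Tai mapping is defined as a binary relation $M \subseteq V_1 \times V_2$ such that for any distinct $(x, y), (x', y') \in M$ we have that
$$(x \le x' \iff y \not\le y') \lor (x' \le x \iff y' \not\le y). \tag{3}$$
(Note that an anti Tai mapping is not really a mapping any more. Hence, the term should be understood as anti "Tai mapping".)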
Let $G$ be a graph on vertex set $V_1 \times V_2$ such that there is an edge between $(x, y)$ and $(x', y')$ if and only if (3) holds. Then a maximum-weight independent set in $G$ corresponds to a maximum-weight Tai mapping, while a maximum-weight clique corresponds to a maximum-weight anti Tai mapping. Recall that the stable set polytope, denoted by $\mathrm{STAB}(G)$, is the convex hull of the incidence vectors of stable (independent) sets of $G$ and that it is contained in the fractional stable set polytope
$$\mathrm{QSTAB}(G) = \Big\{ z \in \mathbb{R}_{\ge 0}^{V_1 \times V_2} : \sum_{e \in Q} z_e \le 1 \ \text{for all cliques } Q \text{ of } G \Big\}, \tag{4}$$
see Grötschel et al. (1993). It is known that $\mathrm{STAB}(G) = \mathrm{QSTAB}(G)$ if and only if the graph $G$ is perfect, in which case linear optimization problems over $\mathrm{STAB}(G)$ can be solved in polynomial time (Grötschel et al. (1993)). As the problem of finding a maximum-weight Tai mapping is NP-hard (Zhang et al. (1992)), it follows that, unless P=NP, the graph $G$ defined above is not perfect in general and furthermore that $\mathrm{STAB}(G) \neq \mathrm{QSTAB}(G)$. Heuristically, one can still use the clique constraints in (4) in a cutting plane algorithm for solving an integer program for the maximum-weight Tai mapping problem, provided that the corresponding separation problem, which amounts to finding a maximum-weight anti Tai mapping, can be solved efficiently. While it is not known whether this separation problem is polynomially solvable in general, we answer this question in the affirmative when one of the two trees is a path. Specifically, we construct a dynamic program that computes a maximum-weight anti Tai mapping in polynomial time for the case where one tree is a path. Furthermore, we define a semi-independent antimatching (si-antimatching) as a restricted class of anti Tai mappings and use it to provide a polynomially computable lower bound on the maximum-weight anti Tai mapping in the case of two trees. More precisely, let $x \sim x'$ denote that $x$ and $x'$ are on the same root-to-leaf path (“comparable”), i.e. $x \sim x'$ if $x \le x'$ or $x' \le x$. Otherwise, we say that $x$ and $x'$ are incomparable and write $x \not\sim x'$. A mapping $M \subseteq V_1 \times V_2$ is a si-antimatching if for any distinct $(x, y), (x', y') \in M$:
$$(x \sim x' \land y \not\sim y') \lor (x \not\sim x' \land y \sim y'). \tag{5}$$
Note that any pair of edges satisfying (5) also satisfies (3), and hence the set of all such pairs forms a subgraph of the graph defined above. It follows that a maximum-weight si-antimatching can be used to provide a valid clique constraint for the Tai mapping problem. Motivated by this, we introduce a dynamic program that optimally solves the maximum-weight si-antimatching problem in polynomial time. We observe further that the same dynamic program can be combined with our optimal path-tree anti Tai mapping algorithm mentioned above (that is, when one of the two trees is a path) to obtain a polynomially computable lower bound on the maximum-weight anti Tai mapping in the general case.
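The graph view described above can be made concrete on toy inputs. The following sketch builds the conflict graph on all node pairs and checks whether a set of pairs is independent (a Tai mapping) or a clique (an anti Tai mapping, and hence the support of a clique constraint). It rests on several assumptions: trees are given as parent dictionaries, the compatibility test encodes our reading of the one-to-one, ancestor-preserving condition for Tai mappings, and all function names are hypothetical rather than part of any published implementation.

```python
from itertools import combinations


def leq(parent, a, b):
    """True if a is an ancestor of b or a == b (walks up from b to the root)."""
    while b is not None:
        if a == b:
            return True
        b = parent[b]
    return False


def tai_compatible(p1, p2, e, f):
    """Assumed Tai condition for two distinct edges e = (x, y), f = (x', y')."""
    (x, y), (xp, yp) = e, f
    return (leq(p1, x, xp) == leq(p2, y, yp)) and (leq(p1, xp, x) == leq(p2, yp, y))


def conflict_graph(p1, p2):
    """Vertices are node pairs; two pairs are adjacent iff they are not Tai compatible."""
    nodes = [(x, y) for x in p1 for y in p2]
    edges = {frozenset((e, f)) for e, f in combinations(nodes, 2)
             if not tai_compatible(p1, p2, e, f)}
    return nodes, edges


def is_independent(edges, subset):
    """Independent sets of the conflict graph are Tai mappings."""
    return all(frozenset((e, f)) not in edges for e, f in combinations(subset, 2))


def is_clique(edges, subset):
    """Cliques of the conflict graph are anti Tai mappings."""
    return all(frozenset((e, f)) in edges for e, f in combinations(subset, 2))


# Toy trees as parent dictionaries: a root with two children, and a path on two nodes.
p1 = {0: None, 1: 0, 2: 0}
p2 = {0: None, 1: 0}
nodes, edges = conflict_graph(p1, p2)
print(is_independent(edges, [(0, 0), (1, 1)]))    # root to root, child to child
print(is_clique(edges, [(1, 0), (2, 1), (0, 1)]))  # pairwise conflicting edges
```

In a cutting plane scheme, a clique such as the second set would yield the inequality stating that at most one of its edges may be selected. The quadratic enumeration of pairs of edges here is only meant for small examples.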
The rest of the paper is structured as follows. In Section 2, we study the properties of an si-antimatching and show that it induces a partition of the matched vertices in each tree into a path and an independent set such that the path in one tree is mapped (by the antimatching) to the independent set in the other and vice versa. We use this structural result to derive a dynamic program for computing a maximum-weight si-antimatching. In Section 3, we give a dynamic programming formulation for the anti Tai mapping problem between a tree and a path, and combine this result with our dynamic programming idea from Section 2 to give a dynamic program that provides a lower bound on the maximum-weight anti Tai mapping between two trees. We conclude in Section 4.
2 Si-antimatching problem
In this section, we present an efficient algorithm for this special case of anti Tai mapping. Surprisingly, the si-antimatching problem turns out to be solvable in polynomial time, and we provide a dynamic program that solves it directly on two unordered labeled trees $T_1$ and $T_2$. Our dynamic program relies on the decomposition theorem (Theorem 2.1), which states that every si-antimatching can be decomposed into an antichain and a path in $T_1$, and a path and an antichain in $T_2$, such that the antichain in $T_1$ maps into the path in $T_2$, and the path in $T_1$ maps into the antichain in $T_2$ (see Figure 3). We further argue that this decomposition admits an order, which in turn determines how the entries of the dynamic programming table are computed.
2.1 Decomposition theorem
We first show that every si-antimatching admits a clean decomposition, which forms the basis for our algorithm.
Let $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$ denote two trees. We say that two edges $(x, y), (x', y') \in V_1 \times V_2$ are a semi-independent pair of edges (si-edges) if (5) holds true.
Let $M$ be a set of pairwise semi-independent edges between trees $T_1$ and $T_2$. Since here we allow edges with common vertices (i.e. $M$ need not be a matching), we will refer to any such set as a semi-independent antimatching or, in short, si-antimatching. (Typically, in an antimatching any pair of distinct edges has a common endpoint. Note that we use the notion of antimatching in a slightly different way, i.e. any pair of distinct edges could have a common endpoint.)
For any and si-antimatching let . If is a singleton, e.g. , we will often omit the set notation and write . Analogously, for any let . Furthermore, let and denote all nodes in and that are incident to si-edges in . We say that root-to-leaf path in tree is maximal if no other root-to-leaf path contains more nodes of than . Let .
Let denote an si-antimatching and let denote a maximal path in . Then is an antichain, i.e. any two distinct nodes in are incomparable.
Suppose the opposite, i.e. there exist , , such that (see Figure 2). Since is maximal, there must exist such that , and , . Thus, nodes and must be comparable with nodes in and . Hence, it must be that , where the notation denotes a path in a tree with a given start and end node, while LCA stands for a lowest common ancestor of a given set of nodes. Which is a contradiction with the fact that and cannot lie on a common path in . ∎
Let be si-antimatching and be an antichain. Then either or , for some , lie on a single root-to-leaf path in .
Suppose nodes in do not lie on a path, i.e. there exist such that . Note that contains only a single node. Let denote that unique node. Since for all it follows that and for all . Thus, all nodes in must lie on the path . ∎
Let denote si-antimatching. Then there exist a maximal path such that all nodes in lie on a root-to-leaf path in .
Let be an arbitrary maximal path and an antichain (by Lemma 2.1). Suppose not all nodes in lie on a path, i.e. there exist such that nodes in do not all lie on a single path in . Then the path containing is also maximal since otherwise there would exist , , and , such that , which is a contradiction to the fact that . Hence, there exists a single node such that for all . Therefore, is also an antichain and by Lemma 2.2 it follows that is a path and the claim of the lemma follows. ∎
Now we are ready to state the main result of this section that immediately follows from previous lemmas:
Theorem 2.1 (Decomposition theorem).
For any si-antimatching there exists a partition of into sets , for some root-to-leaf path , and an antichain, such that is an antichain in and all lie on some root-to-leaf path in .
In the following, we will explain our algorithm for the si-antimatching problem based on the decomposition theorem 2.1. Given si-antimatching and for let denote the lowest ancestor of in . Analogously denote for . We start by partitioning antichains into equivalence classes.
Partition into equivalence classes such that belong to the same equivalence class if . Analogously partition into equivalence classes .
Without loss of generality we will assume that , for , and similarly , for (see Figure 4). Furthermore, for let . Given the above decomposition of our problem, with the following theorem we show that we have an order that would allow for dynamic programming to be used.
Let denote si-antimatching, and partitioning of and , respectively, and and equivalence classes of and , respectively. Then for
for all and .
Let’s assume that for some , there exists such that
for all , and . That implies that there must exist some such that
and . Note that it holds that , since we assumed that , which implies a contradiction with the fact that is si-antimatching.
Let for some and for all . But then there exists such that for , we have that . Note that it holds that , since we assumed that , which again implies a contradiction with the fact that is si-antimatching. ∎
We now see that any si-antimatching has the structure illustrated in Figure 5. Particularly, if we separate elements of into equivalence classes of relation then for from distinct equivalence classes we have for all and . Analogous claim holds for . In other words, images of distinct equivalence classes under are not interwoven, allowing for a dynamic programming approach to be used.
We will denote the set of vertices of the subtree rooted in the vertex with , the set of children of the vertex with , the root of a tree with and the unique parent of vertex with .
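For readers who want to experiment, the tree primitives used throughout (parent, children, subtree membership, and the ancestor and comparability tests) can be supported with one depth-first traversal that records entry and exit times. This is a standard technique rather than anything specific to the paper. The child-list representation and the names `preprocess`, `is_ancestor_or_equal` and `comparable` are assumptions made for this sketch.

```python
def preprocess(children, root):
    """Compute parent pointers and DFS entry/exit times with an explicit stack."""
    parent, tin, tout = {root: None}, {}, {}
    clock = 0
    stack = [(root, False)]
    while stack:
        v, done = stack.pop()
        if done:
            tout[v] = clock  # all descendants of v have entry times below this value
            continue
        tin[v] = clock
        clock += 1
        stack.append((v, True))
        for c in children.get(v, []):
            parent[c] = v
            stack.append((c, False))
    return parent, tin, tout


def is_ancestor_or_equal(tin, tout, a, b):
    """True if b lies in the subtree rooted at a (including a == b)."""
    return tin[a] <= tin[b] and tout[b] <= tout[a]


def comparable(tin, tout, a, b):
    """True if a and b lie on a common root-to-leaf path."""
    return is_ancestor_or_equal(tin, tout, a, b) or is_ancestor_or_equal(tin, tout, b, a)


# Toy tree: 0 is the root with children 1 and 2; node 3 is a child of 1.
children = {0: [1, 2], 1: [3]}
parent, tin, tout = preprocess(children, 0)
print(parent[3], comparable(tin, tout, 3, 1), comparable(tin, tout, 2, 3))
```

After the linear-time preprocessing, each ancestor or comparability query takes constant time, which is convenient when a dynamic program needs many such tests.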
Before stating the main result, note the following.
Let , , w.l.o.g. , . Then either or .
Assume both and . Then and which is a contradiction with the assumption that is an si-antimatching. ∎
Let be the weight of maximum weight antichain in the subtree rooted at vertex with weight on vertex . Then,
Let and . The union of antichains of any family of subtrees rooted in , is an antichain and so any subset of is an si-antimatching. Since we have for all , and for all , , inductively any set found by the dynamic program is an si-antimatching. Conversely, let be a non-empty si-antimatching. If , we have , where and are respectively the least and the greatest element of . If , w.l.o.g. . By lemma 2.4 it is not a loss of generality to assume . Partition where and . We have where and are respectively the least and the greatest element of . By proceeding analogously with the subtrees rooted in and and the si-antimatching , we obtain .
It should be noted that it is possible to implement the algorithm from Theorem 2.3 with running time . Specifically, for a fixed the maximum-weight antichains rooted in are independent of each other, allowing us to simply accumulate them while iterating through .
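The maximum-weight antichain values used as building blocks above can be computed bottom-up: in the subtree of a vertex, an antichain either consists of that vertex alone or is a union of antichains taken from the subtrees of its children. The sketch below implements this standard recurrence for fixed, non-negative vertex weights. In the setting of the text the weights would presumably be the edge weights towards a fixed node of the other tree, which is our reading and an assumption. The function name and the input format are illustrative.

```python
def max_weight_antichain(children, weight, root):
    """Return dict v -> weight of a maximum-weight antichain in the subtree of v.

    Assumes non-negative weights. Recurrence: best[v] = max(weight[v], sum of best[c]).
    """
    # Collect vertices so that every vertex appears after its parent.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    best = {}
    for v in reversed(order):  # children are processed before their parent
        below = sum(best[c] for c in children.get(v, []))
        best[v] = max(weight[v], below)
    return best


# Toy tree: 0 has children 1 and 2; 1 has children 3 and 4.
children = {0: [1, 2], 1: [3, 4]}
weight = {0: 5, 1: 2, 2: 1, 3: 2, 4: 3}
best = max_weight_antichain(children, weight, 0)
print(best[0])
```

Because the values of sibling subtrees are simply summed, a table of these values can be accumulated in one pass per fixed node of the other tree, in line with the remark above.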
3 Anti Tai mapping problem
We first present a quadratic algorithm for computing an optimal anti Tai mapping in the case when one of the trees consists of a single root-to-leaf path. It generalizes an algorithm from Do et al. (2019), which computes an optimal anti Tai mapping in the case when both trees consist of a single root-to-leaf path.
Optimal anti Tai mapping of path and tree rooted in satisfies
Let be edge selected at any point of the dynamic program. Then any element of forms an anti Tai mapping with . Inductively, any set found by the dynamic program is anti Tai mapping. Conversely, let be an anti Tai mapping. If then for all and for all so inductively there is a sequence of steps corresponding to . ∎
Running time of the algorithm presented by theorem 3.1 is . We see this by noting that the dynamic program has states and that for a fixed , computing all amounts to a single tree traversal.
The algorithm of corollary 1 can be implemented with asymptotic running time of , analogously to the optimal si-antimatching algorithm of theorem 2.3. We provide a sample implementation at Blažević (2021). It is also worth noting, for practical purposes, that those algorithms admit a straightforward parallelization. Specifically, all can be checked in parallel.
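When implementing dynamic programs of this kind, an exhaustive reference solver for tiny instances is a convenient correctness check. The sketch below enumerates all subsets of weighted node pairs and keeps the heaviest one in which every two edges violate the Tai condition. It is not the algorithm of the paper and runs in exponential time. The parent-dictionary input, the violation test (our reading of the anti Tai condition as the negation of the Tai condition) and the function names are assumptions made for illustration.

```python
from itertools import combinations


def leq(parent, a, b):
    """True if a is an ancestor of b or a == b."""
    while b is not None:
        if a == b:
            return True
        b = parent[b]
    return False


def violates(p1, p2, e, f):
    """True if edges e and f cannot be part of the same Tai mapping (assumed condition)."""
    (x, y), (xp, yp) = e, f
    return (leq(p1, x, xp) != leq(p2, y, yp)) or (leq(p1, xp, x) != leq(p2, yp, y))


def brute_force_anti_tai(p1, p2, weight):
    """Exhaustive maximum-weight anti Tai mapping; only feasible for tiny inputs.

    weight: dict (x, y) -> non-negative weight; pairs that are absent are never selected.
    """
    pairs = list(weight)
    best_w, best = 0, []
    for r in range(1, len(pairs) + 1):
        for subset in combinations(pairs, r):
            if all(violates(p1, p2, e, f) for e, f in combinations(subset, 2)):
                w = sum(weight[e] for e in subset)
                if w > best_w:
                    best_w, best = w, list(subset)
    return best_w, best


# Path on two nodes versus a root with two children, unit weights on all six pairs.
path = {0: None, 1: 0}
tree = {0: None, 1: 0, 2: 0}
unit = {(x, y): 1 for x in path for y in tree}
print(brute_force_anti_tai(path, tree, unit)[0])
```

Comparing the returned weight with the output of a polynomial-time implementation on many small random instances is a simple way to gain confidence in the latter.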
3.2 Anti Tai mapping on DAGs
In this section, we make two straightforward observations regarding generalizations of anti Tai mapping to directed acyclic graphs (DAGs). First note that Theorem 3.1 cannot be extended to DAGs since it depends on the sets of descendants of distinct children being disjoint. Observation 3.1 notes that we can, however, do somewhat better than falling back to computing anti Tai mappings on pairs of paths (as is done in Do et al. (2019)) by extending a path to an arbitrary topological ordering. Observation 3.2 relates the maximum-weight anti Tai mapping to a maximum-weight antichain in the graph constructed as the product of two DAGs, which solves the problem in polynomial time when one of the DAGs is a path.
Let and be directed acyclic graphs, , , arbitrary topological ordering of vertices of . Then optimal anti Tai mapping of and satisfies where
The observation follows from the fact that by definition of topological ordering we have incomparable with .
Let and be directed acyclic graphs, , . Let where . Then the weight of optimal anti Tai mapping of and is equal to the weight of a maximum weight antichain in .
The observation follows from the fact that is a transitively closed directed acyclic graph and is an anti Tai mapping if and only if .
4 Conclusion
In this paper we consider the problem of generating cutting planes for the natural integer programming formulation of the Tai mapping problem. Our cutting planes are based on finding a maximum-weight clique in a graph defined on the set of pairs of nodes of the two given trees; we call such a clique an anti Tai mapping. For the special class of si-antimatchings, we give a decomposition theorem that describes their precise structure and hence allows us to use dynamic programming to find a maximum-weight si-antimatching in polynomial time. Inspired by this result, we also obtain a dynamic program that provides a polynomially computable lower bound on the maximum-weight anti Tai mapping. Whether the latter problem is NP-hard remains an interesting open question.
References
- Tai (1979) K.-C. Tai, The tree-to-tree correction problem, J. ACM 26 (1979) 422–433.
- Zhang et al. (1992) K. Zhang, R. Statman, D. Shasha, On the editing distance between unordered labeled trees, Information Processing Letters 42 (1992) 133–139.
- Hainke et al. (2012) K. Hainke, J. Rahnenführer, R. Fried, Cumulative disease progression models for cross-sectional data: A review and comparison, Biometrical Journal 54 (2012) 617–640.
- Dutkowski et al. (2013) J. Dutkowski, M. Kramer, M. A. Surma, R. Balakrishnan, J. M. Cherry, N. J. Krogan, T. Ideker, A gene ontology inferred from molecular networks, Nature Biotechnology 31 (2013) 38–45.
- Damian Bogdanowicz (2013) K. G. Damian Bogdanowicz, On a matching distance between rooted phylogenetic trees, International Journal of Applied Mathematics and Computer Science 23 (2013) 669–684.
- Yoshino and Hirata (2017) T. Yoshino, K. Hirata, Tai mapping hierarchy for rooted labeled trees through common subforest, Theory Comput. Syst. 60 (2017) 759–783.
- Zhang and Jiang (1994) K. Zhang, T. Jiang, Some MAX SNP-hard results concerning unordered labeled trees, Information Processing Letters 49 (1994) 249–254.
- Kondo et al. (2014) S. Kondo, K. Otaki, M. Ikeda, A. Yamamoto, Fast computation of the tree edit distance between unordered trees using IP solvers, 2014, pp. 156–167. doi:10.1007/978-3-319-11812-3_14.
- Do et al. (2019) V. H. Do, M. Blazevic, P. Monteagudo, L. Borozan, K. M. Elbassioni, S. Laue, F. R. Ringeling, D. Matijevic, S. Canzar, Dynamic pseudo-time warping of complex single-cell trajectories, in: L. J. Cowen (Ed.), Research in Computational Molecular Biology, RECOMB 2019, Washington, DC, USA, May 5-8, 2019, Proceedings, volume 11467 of Lecture Notes in Computer Science, Springer, 2019, pp. 294–296.
- Böcker et al. (2013) S. Böcker, S. Canzar, G. W. Klau, The generalized Robinson-Foulds metric, in: A. Darling, J. Stoye (Eds.), Algorithms in Bioinformatics, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 156–169.
- Hong et al. (2017) E. Hong, Y. Kobayashi, A. Yamamoto, Improved methods for computing distances between unordered trees using integer programming, 2017. doi:10.1007/978-3-319-71147-8_4.
- Fukagawa et al. (2011) D. Fukagawa, T. Tamura, A. Takasu, E. Tomita, T. Akutsu, A clique-based method for the edit distance between unordered trees and its application to analysis of glycan structures, BMC Bioinformatics 12 (2011) S13.
- Grötschel et al. (1993) M. Grötschel, L. Lovász, A. Schrijver, Geometric Algorithms and Combinatorial Optimization, volume 2 of Algorithms and Combinatorics, second corrected ed., Springer, 1993.
- Blažević (2021) M. Blažević, An implementation of anti Tai mapping, https://github.com/krofna/antitai, 2021.