Largest Weight Common Subtree Embeddings with Distance Penalties

by Andre Droschinsky, et al.

The largest common embeddable subtree problem asks for the largest possible tree embeddable into two input trees and generalizes the classical maximum common subtree problem. Several variants of the problem in labeled and unlabeled rooted trees have been studied, e.g., for the comparison of evolutionary trees. We consider a generalization, where the sought embedding is maximal with regard to a weight function on pairs of labels. We support rooted and unrooted trees with vertex and edge labels as well as distance penalties for skipping vertices. This variant is important for many applications such as the comparison of chemical structures and evolutionary trees. Our algorithm computes the solution from a series of bipartite matching instances, which are solved efficiently by exploiting their structural relation and imbalance. Our analysis shows that our approach improves or matches the running time of the formally best algorithms for several problem variants. Specifically, we obtain a running time of O(|T| |T'|Δ) for two rooted or unrooted trees T and T', where Δ=min{Δ(T),Δ(T')} with Δ(X) the maximum degree of X. If the weights are integral and at most C, we obtain a running time of O(|T| |T'|√(Δ) log(C min{|T|,|T'|})) for rooted trees.


1 Introduction

The maximum common subgraph problem asks for a graph with a maximum number of vertices that is isomorphic to induced subgraphs of two input graphs. This problem arises in many domains where it is important to find the common parts of objects that can be represented as graphs. Chemical structures are one example, since they can be interpreted directly as labeled graphs. Therefore, the problem has been studied extensively in cheminformatics [5, 16, 17]. Although elaborate backtracking algorithms have been developed [16, 2], solving large instances in practice remains a great challenge. The maximum common subgraph problem is NP-hard and remains so even when the input graphs are restricted to trees [6]. However, in trees it becomes polynomial-time solvable when the common subgraph is required to be connected, i.e., it must be a tree itself. This problem is then referred to as the maximum common subtree problem, and the first algorithm solving it in polynomial time is attributed to J. Edmonds [12]. Also requiring the common subgraph to be connected (or even partially biconnected), several extensions to tree-like graphs have been proposed, primarily for applications in cheminformatics [19, 17, 3]. Some of these approaches are not suitable for practical applications due to high constants hidden in the polynomial running time. Other algorithms are efficient in practice, but restrict the search space to specific common subgraphs. Instead of developing maximum common subgraph algorithms for more general graph classes, which has proven difficult, a different approach is to represent molecules in simplified form as trees [15]. Then, vertices typically represent groups of atoms, and their comparison requires scoring the similarity of two vertices by a weight function. This, however, is often not supported by algorithms for tree comparison. Moreover, it may be desirable to map a path in one tree to a single edge in the other tree, skipping the inner vertices. Formally, this is achieved by graph homeomorphism instead of isomorphism.

Several variants for comparing trees have been proposed and investigated [18]. Most of them assume rooted trees, which may be ordered or unordered. Algorithms tailored to the comparison of evolutionary trees typically assume only the leaves to be labeled, while others support labels on all vertices or do not consider labels at all. The well-known agreement subtree problem, for example, considers the case where only the leaf nodes are labeled, with no label appearing more than once per tree [11]. We discuss the approaches most relevant for our work. Gupta and Nishimura [8] investigated the largest common embeddable subtree problem in unlabeled rooted trees. Their definition is based on topological embedding (or homeomorphism) and allows edges of the common subtree to be mapped to vertex-disjoint paths in the input trees. The algorithm uses the classical idea of decomposing the problem into subproblems for smaller trees, which are solved via bipartite matching. A solution for two trees with at most $n$ vertices is computed in time $O(n^{2.5}\log n)$ using a dynamic programming approach. Fig. 1 illustrates the difference between maximum common subgraph and largest common embeddable subtree. Lozano and Valiente [10] investigated the maximum common embedded subtree problem, which is based on edge contraction. In both cases the input graphs are rooted unlabeled trees. Note that the two problem definitions are not equivalent. The first is polynomial-time solvable, while the second is NP-hard for unordered trees, but polynomial-time solvable for ordered trees. Many algorithms do not support trees where both the leaves and the inner vertices have labels. A notable exception is the approach by Kao et al. [9], where only vertices with the same label may be mapped. This algorithm generalizes the approach by Gupta and Nishimura and improves its running time to $O(\sqrt{d}\,D\log\frac{2n}{d})$, where $D$ denotes the number of vertex pairs with the same label and $d$ the maximum degree of all vertices.

We consider the problem of finding a largest weight common subtree embedding (LaWeCSE), where matching vertices are not required to have the same label, but their degree of agreement is determined by a weight function. We build on the basic ideas of Gupta and Nishimura [8]. To prevent arbitrarily long paths from being mapped to a common edge, we study a linear distance penalty for paths of length greater than 1. Note that, by choosing a sufficiently high distance penalty, we solve the maximum common subtree (MCS) problem as a special case. By choosing weight 1 for equal labels and sufficiently small negative weights otherwise, we solve a problem equivalent to the one studied by Kao et al. [9].

Our contribution.

We propose and analyze algorithms for finding largest weight common subtree embeddings. Our method requires solving a series of bipartite matching instances as subproblems, which dominates the total running time. We build on recent results by Ramshaw and Tarjan [13] for unbalanced matchings. Let $T$ and $T'$ be labeled rooted trees with $|T|$ and $|T'|$ vertices, respectively, and $\Delta$ the smaller degree of the two input trees. For real-valued weight functions we prove a time bound of $O(|T||T'|\Delta)$. For integral weights bounded by a constant $C$ we prove a running time of $O(|T||T'|\sqrt{\Delta}\log(C\min\{|T|,|T'|\}))$. This is an improvement over the algorithm by Kao et al. [9] if there are only few labels and the maximum degree of one tree is much smaller than the maximum degree of the other. In addition, we support weights and a linear penalty for skipped vertices.

Moreover, the algorithm by Kao et al. [9] is designed for rooted trees only. A straightforward approach to solve the problem for unrooted trees is to try out all pairs of possible roots, which results in an additional $O(|T||T'|)$ factor. However, our algorithm exploits the fact that there are many similar matching instances, using techniques related to [1, 4]. This includes computing additional matchings of cardinality two. For unrooted trees and real-valued weight functions we prove the same time bound as for rooted trees. This improves on the time bound proven in [4] for the formally best algorithm for solving the maximum common subtree problem.

2 Preliminaries

We consider finite simple undirected graphs. Let $G=(V,E)$ be a graph; we refer to the set of vertices by $V(G)$ or $V_G$ and to the set of edges by $E(G)$ or $E_G$. An edge connecting two vertices $u,v\in V$ is denoted by $uv$ or $vu$. The order $|G|$ of a graph $G$ is its number of vertices. The neighbors of a vertex $v$ are defined as $N(v):=\{u\in V_G \mid vu\in E_G\}$. The degree of a vertex $v\in V_G$ is $\delta(v):=|N(v)|$; the degree $\Delta(G)$ of a graph $G$ is the maximum degree of its vertices.

A path $P$ is a sequence of pairwise distinct vertices connected through edges and denoted as $P=(v_0,e_1,v_1,\dots,e_l,v_l)$. We alternatively specify the vertices $(v_0,\dots,v_l)$ or edges $(e_1,\dots,e_l)$ only. The length of a path is its number of edges. A connected graph with a unique path between any two vertices is a tree. A tree $T$ with an explicit root vertex $r\in V(T)$ is called rooted tree, denoted by $T^r$. In a rooted tree $T^r$ we denote the set of children of a vertex $v$ by $C(v)$ and its parent by $p(v)$, where $p(r)=r$. For any tree $T$ and two vertices $u,v\in V(T)$ the rooted subtree $T^u_v$ is induced by the vertex $v$ and its descendants related to the tree $T^u$. If the root is clear from the context, we may abbreviate $T_v:=T^r_v$. We refer to the root of a rooted tree $T$ by $r(T)$.

If the vertices of a graph $G$ can be separated into exactly two disjoint sets $V,U$ such that $E(G)\subseteq V\times U$, then the graph is called bipartite. In many cases the disjoint sets are already given as part of the input. In this case we write $G=(V\sqcup U,E)$, where $E\subseteq V\times U$.

For a graph $G=(V,E)$ a matching $M\subseteq E$ is a set of edges such that no two edges share a vertex. For an edge $uv\in M$, the vertex $u$ is the partner of $v$ and vice versa. The cardinality of a matching $M$ is its number of edges $|M|$. A weighted graph is a graph endowed with a function $w:E\to\mathbb{R}$. The weight of a matching $M$ in a weighted graph is $W(M):=\sum_{e\in M}w(e)$. We call a matching $M$ of a weighted graph $G$ a maximum weight matching (MWM) if there is no other matching $M'$ of $G$ with $W(M')>W(M)$. A matching $M$ of $G$ is a MWM of cardinality $k$ (MWM$_k$) if $|M|=k$ and there is no other matching $M'$ of cardinality $k$ of $G$ with $W(M')>W(M)$.

For convenience we define the maximum of an empty set as $-\infty$.
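The matching notions above can be made concrete with a small brute-force sketch. It is only meant to illustrate the definitions of MWM and MWM of cardinality $k$ on tiny instances and is not the matching algorithm used in the paper; the function names and the dictionary-based graph representation are assumptions made for this example.

```python
from itertools import combinations

NEG_INF = float("-inf")  # maximum of an empty set, as in the convention above


def is_matching(edges):
    """True if no two edges (left, right) share an endpoint."""
    left = [u for u, _ in edges]
    right = [v for _, v in edges]
    return len(set(left)) == len(left) and len(set(right)) == len(right)


def mwm_k(weights, k):
    """Brute-force maximum weight matching with exactly k edges.

    `weights` maps bipartite edges (left, right) to real weights.
    Returns (weight, edges); the weight is -inf if no such matching exists.
    """
    best = (NEG_INF, ())
    for subset in combinations(weights, k):
        if is_matching(subset):
            total = sum(weights[e] for e in subset)
            if total > best[0]:
                best = (total, subset)
    return best


def mwm(weights):
    """Brute-force maximum weight matching over all cardinalities."""
    candidates = (mwm_k(weights, k) for k in range(len(weights) + 1))
    return max(candidates, key=lambda result: result[0])
```

Note that a MWM and a MWM of cardinality $k$ may differ: for the two disjoint edges with weights 5 and -1, the MWM has weight 5, while the only matching of cardinality 2 has weight 4.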

3 Gupta and Nishimura’s algorithm

In this section we formally define a Largest Common Subtree Embedding (LaCSE) and present a brief overview of Gupta and Nishimura’s algorithm to compute such an embedding. The following two definitions are based on [8].

[Topological Embedding] A rooted tree $T$ is topologically embeddable in a rooted tree $T'$ if there is an injective function $\psi:V(T)\to V(T')$, such that

  1. If $v$ is a child of $u$, then $\psi(v)$ is a descendant of $\psi(u)$.

  2. For distinct children $v_1,v_2$ of $u$, the paths from $\psi(u)$ to $\psi(v_1)$ and from $\psi(u)$ to $\psi(v_2)$ have exactly $\psi(u)$ in common.

$T$ is root-to-root topologically embeddable in $T'$, if $\psi(r(T))=r(T')$.

[(Largest) Common Subtree Embedding; (L)CSE] Let $T$ and $T'$ be rooted trees and $T''$ be topologically embeddable in both $T$ and $T'$. For such a $T''$ let $\psi:V(T'')\to V(T)$ and $\psi':V(T'')\to V(T')$ be topological embeddings.

  • Then $\varphi:=\psi'\circ\psi^{-1}$ is a Common Subtree Embedding.

  • If there is no other tree $T^*$ topologically embeddable in both $T$ and $T'$ with $|T^*|>|T''|$, then $T''$ is a Largest Common Embeddable Subtree and $\varphi$ is a Largest Common Subtree Embedding.

  • A CSE with $\varphi(r(T))=r(T')$ is a root-to-root CSE.

  • A root-to-root CSE is largest, if it is of largest weight among all root-to-root common subtree embeddings.
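To illustrate the two conditions of a topological embedding, the following sketch checks whether a given vertex map satisfies them: children must be mapped to descendants, and the paths to the images of distinct children may share only the image of their common parent. The representation of rooted trees as dictionaries from a vertex to its list of children and the function name are assumptions made for this example.

```python
def is_topological_embedding(children_t, children_u, psi):
    """Check whether psi: V(T) -> V(U) is a topological embedding.

    Trees are given as {vertex: [children]}; psi is a dict on V(T).
    """
    # psi must be defined on all of V(T) and be injective.
    if set(psi) != set(children_t) or len(set(psi.values())) != len(psi):
        return False
    parent_u = {c: v for v, kids in children_u.items() for c in kids}

    def path_below(top, bottom):
        """Vertices strictly below `top` on the path to `bottom`, or None."""
        path = []
        while bottom != top:
            path.append(bottom)
            if bottom not in parent_u:  # reached the root without meeting `top`
                return None
            bottom = parent_u[bottom]
        return path

    for u, kids in children_t.items():
        seen = set()
        for v in kids:
            path = path_below(psi[u], psi[v])
            if path is None:          # condition 1: image is not a descendant
                return False
            if seen & set(path):      # condition 2: paths share more than psi(u)
                return False
            seen |= set(path)
    return True
```

For example, a root with two children embeds into a tree 0 → 1 → 2 → {3, 4} by mapping the root to vertex 2, but not by mapping the root to vertex 0, because both paths to 3 and 4 would then share the vertices 1 and 2.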

Algorithm from Gupta and Nishimura.

Gupta and Nishimura [8] presented an algorithm to compute the size of a largest common embeddable subtree based on dynamic programming, which is similar to the computation of a largest common subtree described in, e.g., [12, 4]. Let $T$ and $T'$ be rooted trees and $L$ be a table of size $|T||T'|$. For each pair of vertices $u\in V(T)$, $v\in V(T')$ the value $L(u,v)$ stores the size of a LaCSE between the rooted subtrees $T_u$ and $T'_v$. Gupta and Nishimura proved that an entry $L(u,v)$ is determined by the maximum of the following three quantities.

  • $M_1=\max\{L(u,c)\mid c\in C(v)\}$

  • $M_2=\max\{L(b,v)\mid b\in C(u)\}$

  • $M_3=1+W(M)$, where $M$ is a MWM of the complete bipartite graph $(C(u)\sqcup C(v),\,C(u)\times C(v))$ with edge weight $w(bc)=L(b,c)$ for each pair $(b,c)\in C(u)\times C(v)$

Here, $M_1$ represents the case where the vertex $v$ is not mapped. To satisfy condition 2 of Def. 3, we may map at most one child $c\in C(v)$. $M_2$ represents the case where $u$ is not mapped and at most one child $b\in C(u)$ is allowed. $M_3$ represents the case $\varphi(u)=v$. To maximize the number of mapped descendants we compute a maximum weight matching, where the children of $u$ and $v$ are the vertex sets $C(u)$ and $C(v)$, respectively, of a bipartite graph. The edge weights are determined by the previously computed solutions, i.e., the LaCSEs between the children of $u$ and $v$ and their descendants, namely between $T_b$ and $T'_c$ for each pair of children $(b,c)\in C(u)\times C(v)$. The algorithm proceeds from the leaves to the roots. From the above recursive formula, we get $L(u,v)=1$ if $u$ or $v$ is a leaf, which was separately defined in [8].

A maximum value in the table yields the size of an LaCSE. We obtain the size of a root-to-root LaCSE from $M_3$ of the root vertices $r(T),r(T')$. Note that, by storing additional data, it is easy to obtain a (root-to-root) LaCSE $\varphi$.

[Gupta, Nishimura, [8]] Computing an LaCSE between two rooted trees of order at most $n$ is possible in time $O(n^{2.5}\log n)$.
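The recursion described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from [8]: trees are dictionaries from a vertex to its list of children, the matching subproblem is solved by exhaustive search (adequate only for small degrees), and all function names are hypothetical.

```python
from functools import lru_cache

NEG_INF = float("-inf")  # maximum of an empty set


def mwm_value(w):
    """Weight of a maximum weight matching for a weight matrix w (brute force)."""
    rows = len(w)
    cols = len(w[0]) if rows else 0

    @lru_cache(maxsize=None)
    def best(i, used):
        if i == rows:
            return 0
        result = best(i + 1, used)  # row i stays unmatched
        for j in range(cols):
            if j not in used:
                result = max(result, w[i][j] + best(i + 1, used | {j}))
        return result

    return best(0, frozenset())


def lacse_sizes(children_t, root_t, children_u, root_u):
    """Return (size of a LaCSE, size of a root-to-root LaCSE)."""

    @lru_cache(maxsize=None)
    def mapped(u, v):
        # u is mapped to v; combine the children via a matching.
        w = [[size(b, c) for c in children_u[v]] for b in children_t[u]]
        return 1 + mwm_value(w)

    @lru_cache(maxsize=None)
    def size(u, v):
        skip_v = max((size(u, c) for c in children_u[v]), default=NEG_INF)
        skip_u = max((size(b, v) for b in children_t[u]), default=NEG_INF)
        return max(skip_v, skip_u, mapped(u, v))

    return size(root_t, root_u), mapped(root_t, root_u)
```

For the trees 0 → 1 → 2 → {3, 4} and a root with two leaf children, the sketch yields a LaCSE of size 3 (the root of the second tree is mapped to vertex 2), while the root-to-root LaCSE only has size 2.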

4 Largest Weight Common Subtree Embeddings

First, we introduce weighted common subtree embeddings between labeled trees. Part of the input is an integral or real weight function on all pairs of the labels. Next, we consider a linear distance penalty for skipped vertices in the input trees. After formalizing the problem and presenting an algorithm, we prove new upper time bounds.

Vertex Labels.

In many application domains the vertices of the trees need to be distinguished. A common representation of a vertex labeled tree $T$ is $(V,E,\Sigma,l)$, where $l:V\to\Sigma$ with $\Sigma$ as a finite set of labels. Let $\omega:\Sigma\times\Sigma\to\mathbb{R}$ assign a weight to each pair of labels. Instead of maximizing the number of mapped vertices, we want to maximize the sum of the weights of all vertices mapped by a common subtree embedding $\varphi$. For simplicity, we will omit $\Sigma$ and $l$ for the rest of this paper and define $\omega(u,v):=\omega(l(u),l(v))$ for any two vertices $(u,v)\in V(T)\times V(T')$.

Edge Labels.

Although not as common as vertex labels, edge labels are useful to represent different bonds between atoms or relationships between individuals. In a common subtree embedding we do not map edges to edges but paths to paths. Since inner vertices on mapped paths do not contribute to the weight of an embedding, we treat edges on such paths the same way, i.e., both paths need to have length 1 for their edge labels to be considered. Here again, we want to maximize the weight of these edges mapped to each other (in addition to the weight of the mapped vertices).

Distance Penalties.

Depending on the application, it might be desirable that paths do not have arbitrary length. Here, we introduce a linear distance penalty for paths of length greater than 1, i.e., each inner vertex on a path corresponding to an edge of the common embeddable subtree lowers the weight by a given penalty $p$. By assigning the value $p=\infty$ we effectively compute a maximum common subtree. The following definition formalizes an LaCSE under a weight function $\omega$ and a distance penalty $p$.

[Largest Weight Common Subtree Embedding; LaWeCSE] Let $T$ and $T'$ be rooted vertex and/or edge labeled trees. Let $\varphi$ be a common subtree embedding from $T$ to $T'$. Let $\omega$ assign a weight to each pair of labels. Let $p\ge 0$ be a distance penalty. We refer to a path $P$ in the tree $T$ corresponding to a single edge in the common embeddable subtree as topological path. Let $P'$ be the corresponding path in $T'$. Then

  • $\omega(P,P')=\omega(e,e')$, if $P=(e)$ and $P'=(e')$, i.e., both paths have length 1, or

  • $\omega(P,P')=-p\cdot(|P|+|P'|-2)$, otherwise, where $|P|$ denotes the length of $P$.

  • The weight $W(\varphi)$ is the sum of the weights $\omega(u,\varphi(u))$ of all vertices $u$ mapped by $\varphi$ plus the weights $\omega(P,P')$ of all topological paths $P$.

  • If $\varphi$ is of largest weight among all common subtree embeddings, then $\varphi$ is a Largest Weight Common Subtree Embedding (LaWeCSE).

The definition of root-to-root LaWeCSE is analogous to Def. 3. A closer look at the definition of $\omega(P,P')$ reveals that each inner vertex (the skipped vertices) on a topological path or its mapped path subtracts $p$ from the embedding’s weight. Fig. 1 illustrates two weighted common subtree embeddings.
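The weight of an embedding as described above can be evaluated with a short helper. The sketch assumes that each topological path and its mapped path are given only by their lengths (numbers of edges, at least 1) together with the weight of the edge pair, which is used only when both lengths are 1. The function names and the concrete numbers below are illustrative assumptions, not values taken from the figures.

```python
def topological_path_weight(len_p, len_q, edge_weight, penalty):
    """Weight contributed by a topological path and its mapped path.

    Both paths of length 1: the edge labels count.
    Otherwise: every inner (skipped) vertex on either path costs `penalty`.
    """
    if len_p == 1 and len_q == 1:
        return edge_weight
    inner_vertices = (len_p - 1) + (len_q - 1)
    return -penalty * inner_vertices


def embedding_weight(vertex_weights, paths, penalty):
    """Sum of mapped vertex weights plus the weights of all topological paths.

    `paths` is a list of (len_p, len_q, edge_weight) triples.
    """
    total = sum(vertex_weights)
    for len_p, len_q, edge_weight in paths:
        total += topological_path_weight(len_p, len_q, edge_weight, penalty)
    return total
```

With two mapped vertices of weight 1 each, one skipped vertex and an assumed penalty of 0.3, the embedding has weight 1.7; the edge weight is ignored because one of the two paths has length 2.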

Figure 1: (a) Labeled MCS (green, dashed) and LaCSE (black, dotted) between the depicted trees. Although one pair of trees is ’intuitively’ more similar than the other, both MCSs have size 3. However, the LaCSE has 6 mapped vertices. (b) Two weighted embeddings; one with a skipped vertex, the other where the edge labels contribute to the weight. The black embedding has weight 1.7, since one vertex is skipped and therefore the penalty is applied; the weight between the edges is not added. The green embedding has weight 5: 2 from the vertices and 3 from the topological paths of length 1.

The dynamic programming approach.

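To compute a LaWeCSE, we need to store some additional data during the computation. In Gupta and Nishimura’s algorithm there is a table of size $|T||T'|$ to store the weight of LaCSEs between subtrees of the input trees. In our algorithm we need a table $L$ of size $2|T||T'|$. An entry $L(u,v,t)$ stores the weight of a LaWeCSE between the rooted subtrees $T_u$ and $T'_v$ of type $t\in\{\vee,\wr\}$. Type $\vee$ represents a root-to-root embedding between $T_u$ and $T'_v$; type $\wr$ an embedding where $u$ or $v$ is skipped. Skipped in the sense that at least one of $u,v$ will be an inner vertex when mapping some ancestor nodes during the dynamic programming. For type $\wr$ we subtract the penalty $p$ from the weight for the skipped vertices before storing it in our table. We obtain the weight of a LaWeCSE and a root-to-root LaWeCSE, respectively, from the maximum value of type $\vee$ and from $L(r(T),r(T'),\vee)$, respectively. The following lemma specifies the recursive computation of an entry $L(u,v,t)$.

Let $u\in V(T)$ and $v\in V(T')$. For $t\in\{\vee,\wr\}$ let $M_1^t=\max\{L(u,c,t)\mid c\in C(v)\}$ and $M_2^t=\max\{L(b,v,t)\mid b\in C(u)\}$. Then

  • $L(u,v,\wr)=\max\{M_1^\vee,M_1^\wr,M_2^\vee,M_2^\wr\}-p$.

Let $G=(C(u)\sqcup C(v),\,C(u)\times C(v))$ be a bipartite graph with edge weights $w(bc)=\max\{L(b,c,\vee)+\omega(ub,vc),\,L(b,c,\wr)\}$ for each pair $(b,c)\in C(u)\times C(v)$. Then

  • $L(u,v,\vee)=\omega(u,v)+W(M)$, where $M$ is a MWM on $G$.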


The skipped type $\wr$ represents the case of an embedding between $T_u$ and $T'_v$ which is not root-to-root. From the definition of $M_1$ the vertex $v$ is skipped and from the definition of $M_2$ the vertex $u$ is skipped, so it is indeed not root-to-root. Since either $u$ or $v$ was skipped, we subtract the penalty $p$. This ensures we have taken inner vertices of later steps of the dynamic programming into account.

The root-to-root type $\vee$ implies that $u$ is mapped to $v$. Each edge in $G$ represents the weight of a LaWeCSE from one child of $u$ to one child of $v$. A maximum weight matching yields the best combination which satisfies Def. 3. If a child $b$ of $u$ is mapped to a child $c$ of $v$, the paths $(u,b)$ and $(v,c)$ have length one. Then from Def. 4 we have to add the weight $\omega(ub,vc)$. Otherwise at least one path has length greater than one and we have to subtract the distance penalty for each inner vertex. We already did that while computing $L(b,c,\wr)$. ∎
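The two-type recursion described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: trees are dictionaries from a vertex to its list of children, `vertex_w(u, v)` and `edge_w(u, b, v, c)` are hypothetical callbacks for the vertex and edge weights, the boolean flag `skipped` plays the role of the table type, and the matching is solved by exhaustive search, which is only suitable for small degrees.

```python
from functools import lru_cache

NEG_INF = float("-inf")  # maximum of an empty set


def mwm_value(w):
    """Weight of a maximum weight matching for a weight matrix w (brute force)."""
    rows = len(w)
    cols = len(w[0]) if rows else 0

    @lru_cache(maxsize=None)
    def best(i, used):
        if i == rows:
            return 0
        result = best(i + 1, used)  # row i stays unmatched
        for j in range(cols):
            if j not in used:
                result = max(result, w[i][j] + best(i + 1, used | {j}))
        return result

    return best(0, frozenset())


def lawecse_weights(children_t, root_t, children_u, root_u, vertex_w, edge_w, p):
    """Return (weight of a LaWeCSE, weight of a root-to-root LaWeCSE)."""

    @lru_cache(maxsize=None)
    def entry(u, v, skipped):
        kids_u, kids_v = children_t[u], children_u[v]
        if skipped:
            # Either v or u becomes an inner vertex: pay the penalty once here.
            below = [entry(u, c, s) for c in kids_v for s in (False, True)]
            below += [entry(b, v, s) for b in kids_u for s in (False, True)]
            return max(below, default=NEG_INF) - p
        # u is mapped to v: edge labels count only if b is mapped directly to c.
        w = [[max(entry(b, c, False) + edge_w(u, b, v, c), entry(b, c, True))
              for c in kids_v] for b in kids_u]
        return vertex_w(u, v) + mwm_value(w)

    best = max(entry(u, v, False) for u in children_t for v in children_u)
    return best, entry(root_t, root_u, False)
```

As a usage example, consider a path 0 → 1 → 2 with labels a, b, c, an edge 0 → 1 with labels a, c, vertex weight 1 for equal labels and 0 otherwise, edge weight 0 and penalty 0.3. Skipping the vertex labeled b gives weight 1 + 1 - 0.3 = 1.7, whereas a large penalty makes skipping unattractive and the result drops to 1.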

Time and space complexity.

We next analyze upper time and space bounds. Thereby we distinguish between real- and integer-valued weight functions $\omega$. If we use dynamic programming starting from the leaves to the roots, we need to compute each value $L(u,v,t)$ only once.

Let $T$ and $T'$ be rooted vertex and/or edge labeled trees. Let $\omega$ be a weight function, $\Delta:=\min\{\Delta(T),\Delta(T')\}$, and $p$ be a distance penalty.

  • A LaWeCSE between $T$ and $T'$ can be computed in time $O(|T||T'|\Delta)$ and space $O(|T||T'|)$.

  • If the weights are integral and bounded by a constant $C$, a LaWeCSE can be computed in time $O(|T||T'|\sqrt{\Delta}\log(C\min\{|T|,|T'|\}))$.

We first need to provide two results regarding maximum weighted matchings.

[[14]] Let $G$ be a weighted bipartite graph containing $m$ edges and vertex sets of sizes $s$ and $t$. W.l.o.g. $s\le t$. We can compute a MWM on $G$ in time $O(ms+s^2\log s)$ and space $O(m)$. If $G$ is complete bipartite, we may simplify the time bound to $O(s^2t)$.

If the weights are integral and bounded by a constant, the following result for a minimum weight matching was shown in [7]. We solve MWM by multiplying the weights by -1. Let $G$ be as in Lemma 4 and the weights be integral and bounded by a constant $C$. We can compute a MWM on $G$ in time $O(m\sqrt{s}\log(sC))$. If $G$ is complete bipartite, we may simplify the time bound to $O(s^{1.5}t\log(sC))$. Unfortunately, there is no space bound given in [7]. However, since their algorithm is based on flows, the space bound is probably $O(m)$. If we assume this to be correct, the space bound in Theorem 4 applies to integral weights too.

Proof of Theorem 4.

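We observe that the entries of the root-to-root type $\vee$ in the table dominate the computation time. We assume the bipartite graphs on which we compute the MWMs to be complete. We further observe that each edge $bc$ representing the weight of the LaWeCSE between the subtrees $T_b$ and $T'_c$ is contained in exactly one of the matching graphs.

Let us assume real weights first. The table $L$ requires $O(|T||T'|)$ space. We may also compute each MWM within the same space bound. This proves the total space bound. From Lemma 4 the time to compute all the MWMs is bounded by

$$O\Big(\sum_{u\in V(T)}\sum_{v\in V(T')}|C(u)|\,|C(v)|\min\{|C(u)|,|C(v)|\}\Big)\subseteq O\Big(\Delta\sum_{u\in V(T)}|C(u)|\sum_{v\in V(T')}|C(v)|\Big)\subseteq O(|T||T'|\Delta).$$

Let us assume $\omega$ to be integral and bounded by a constant $C$ next. This implies a weight of each single matching edge of at most $C':=2C\min\{|T|,|T'|\}$, since no more than $2\min\{|T|,|T'|\}$ edges and vertices in total may contribute to the weight. Negative weight edges never contribute to a MWM and may safely be omitted. From Lemma 4 the time bound is

$$O\Big(\sum_{u\in V(T)}\sum_{v\in V(T')}|C(u)|\,|C(v)|\sqrt{\min\{|C(u)|,|C(v)|\}}\,\log\big(\min\{|C(u)|,|C(v)|\}\cdot C'\big)\Big)\subseteq O\big(|T||T'|\sqrt{\Delta}\log(C\min\{|T|,|T'|\})\big).$$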

5 Largest Weight Common Subtree Embeddings for Unrooted Trees

In this section we consider a LaWeCSE between unrooted trees. I.e., we want to find two root vertices , and a common subtree embedding between and such that there is no embedding between and , , , with . We abbreviate this as LaWeCSE. In Sect. 5.1 we present a basic algorithm and a first improvement by fixing the root of . In Sect. 5.2 we speed up the computation by exploiting similarities between the different chosen roots of . In each section we prove the correctness and upper time bounds of our algorithms.

Figure 2: The weight of a LaWeCSE between and is 2.8 (black, dotted), since we have one skipped vertex for penalty . The weight of a LaWeCSE between and is 3.6 (green, dashed) for two skipped vertices. The latter one is also a LaWeCSE.

5.1 Basic algorithm and fixing one root

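The basic idea is to compute for each pair of vertices $(r,s)\in V(T)\times V(T')$ a (rooted) LaWeCSE from $T^r$ to $T'^s$ and output a maximum solution. This is obviously correct, and with the bound for rooted trees from Theorem 4 the time bound is $O(|T|^2|T'|^2\Delta)$.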

In our previous work [4] we showed how to compute a maximum common subtree between unrooted trees by arbitrarily choosing one root vertex of and then computing MCSs between and for all . We may adapt this strategy to find a LaWeCSE between the input trees. However, Fig. 2 shows that this strategy sometimes fails. A LaWeCSE between and rooted at any vertex results in a weight of at most 2.8. However, rooting at results in a LaWeCSE of weight 3.6.

In a maximum common subtree between trees and , let be an arbitrarily chosen root of . If any two children of their parent node are mapped to vertices of , then is also mapped. This statement is independent from the chosen root node, since a common subtree is connected. If we want to compute a (rooted) LaWeCSE, the statement is also true (for the given root). This follows from Def. 3 ii). However, if we choose as root in a LaWeCSE, we may skip and map , forming the topological path . Whatever we do, if we skip vertex as an inner vertex of a topological path, this is the only path containing ; otherwise we violate Def. 3. We record this as a lemma. Let and be unrooted trees. Let be a LaWeCSE from to and be an inner vertex of a topological path with its neighbors . Then maps vertices from exactly two of the rooted subtrees to .

To compute a LaWeCSE , additionally to the strategy from [4], we need to cover the case, that there is no single vertex mapped by , such that all vertices mapped by are contained in , cf. Fig. 2 with as chosen root. In this case let be the unique inner vertex of a topological path , such that all vertices mapped by are contained in . An example is the yellow vertex in Fig. 2. Further, let be the topological path containing and with and .

Then there is a LaWeCSE between the rooted subtrees and containing and all its descendants mapped by . There is another LaWeCSE between the rooted subtrees and containing and all its descendants mapped by .

For any vertex let refer to the table corresponding to and . Then is the weight of the LaWeCSE minus the penalty for the inner vertices ; the penalty is 0, if . is the weight of the LaWeCSE minus the penalty for the inner vertices and . I.e., the penalty for each inner vertex on the paths excluding is included in . Therefore . Before summarizing this strategy in the following Lemma 5.1, we exemplify it on Fig. 2.

The LaWeCSE is depicted by green dashed lines with mapping and . The yellow inner vertex fulfills the condition that contains all vertices mapped by . We have paths and . Then for the mapping . Further for the mapping and the skipped vertex . Note, there is no inner vertex in . We obtain .

Let and be trees. Let be arbitrarily chosen. Let be the weight of a LaWeCSE from to and . Then the weight of a LaWeCSE is the maximum of the following two quantities.

Let the preconditions be as in Lemma 5.1. Then can be computed in time and in time ; both can be computed in space .


For any vertices consider the rooted subtree . Let be the parent vertex of with . Then for each vertex , i.e., is a vertex of , which is not contained in the rooted subtree . Therefore we can identify each table entry by . In other words, all the table entries needed to compute are determined first by a node , and second by either an edge or the root vertex . Therefore, the space needed to store all the table entries and thus compute is .

From Theorem 4 for each we can compute in time . Thus, the time for is bounded by . For any edge we observe that the only rooted subtrees from to consider are and . Let and , be the weight of a LaWeCSE from to and , respectively. Let be a bipartite graph with vertices and edges between these vertices with weights defined by and , respectively. Let be a MWM on . Then . This follows from the construction of . Note, a MWM contains exactly 2 edges. Let such that for any . I.e., the vertices are ordered, such that and have weight at least for all . We remove all edges incident to except and . Analog we remove all but the two edges of greatest weight incident to <