 # An Even Faster and More Unifying Algorithm for Comparing Trees via Unbalanced Bipartite Matchings

A widely used method for determining the similarity of two labeled trees is to compute a maximum agreement subtree of the two trees. Previous work on this similarity measure is only concerned with the comparison of labeled trees of two special kinds, namely, uniformly labeled trees (i.e., trees with all their nodes labeled by the same symbol) and evolutionary trees (i.e., leaf-labeled trees with distinct symbols for distinct leaves). This paper presents an algorithm for comparing trees that are labeled in an arbitrary manner. In addition to this generality, this algorithm is faster than the previous algorithms. Another contribution of this paper is on maximum weight bipartite matchings. We show how to speed up the best known matching algorithms when the input graphs are node-unbalanced or weight-unbalanced. Based on these enhancements, we obtain an efficient algorithm for a new matching problem called the hierarchical bipartite matching problem, which is at the core of our maximum agreement subtree algorithm.


## 1 Introduction

A labeled tree is a rooted tree with an arbitrary subset of nodes labeled with symbols. In recent years, many algorithms for comparing such trees have been developed for diverse application areas including biology [8, 19, 23], chemistry, linguistics [9, 21], and structured text databases [16, 17, 20].

A widely used measure of the similarity of two labeled trees is the notion of a maximum agreement subtree defined as follows. A labeled tree $R$ is a label-preserving homeomorphic subtree of another labeled tree $T$ if there exists a one-to-one mapping $f$ from the nodes of $R$ to those of $T$ such that for any nodes $u, v, w$ of $R$, (1) $u$ and $f(u)$ have the same label; and (2) $w$ is the least common ancestor of $u$ and $v$ if and only if $f(w)$ is the least common ancestor of $f(u)$ and $f(v)$. Let $T_1$ and $T_2$ be two labeled trees. An agreement subtree of $T_1$ and $T_2$ is a labeled tree which is also a label-preserving homeomorphic subtree of the two trees. A maximum agreement subtree is one which maximizes the number of labeled nodes. Let $\mathrm{mast}(T_1, T_2)$ denote the number of labeled nodes in a maximum agreement subtree of $T_1$ and $T_2$.
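The definition above can be checked mechanically for a candidate mapping. The following is a minimal sketch, not the paper's algorithm: it assumes trees are stored as parent-pointer dictionaries, and the names `lca` and `is_label_preserving_embedding` are hypothetical. Because the mapping is one-to-one, the "if and only if" condition on least common ancestors is equivalent to requiring that the image of the least common ancestor of every pair equals the least common ancestor of the images.

```python
from itertools import combinations


def ancestors(parent, v):
    """Return the path from v up to the root (v first)."""
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def lca(parent, u, v):
    """Least common ancestor of u and v in a tree given by parent pointers."""
    seen = set(ancestors(parent, u))
    for a in ancestors(parent, v):
        if a in seen:
            return a
    raise ValueError("nodes are not in the same tree")


def is_label_preserving_embedding(parent_r, label_r, parent_t, label_t, f):
    """Check that f maps the small tree into the large one as the definition requires.

    parent_*: dict node -> parent (None for the root).
    label_*:  dict node -> label (None for an unlabeled node).
    f:        dict from nodes of the small tree to nodes of the large tree.
    """
    nodes = list(parent_r)
    # (0) f must be one-to-one.
    if len({f[u] for u in nodes}) != len(nodes):
        return False
    # (1) u and f(u) carry the same label.
    if any(label_r[u] != label_t[f[u]] for u in nodes):
        return False
    # (2) f commutes with least common ancestors on every pair.
    return all(
        f[lca(parent_r, u, v)] == lca(parent_t, f[u], f[v])
        for u, v in combinations(nodes, 2)
    )
```

The check is quadratic in the size of the small tree and is meant only to make the definition concrete; it does not search for a mapping.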

In the literature, many algorithms for computing a maximum agreement subtree have been developed. These algorithms focus on the special cases where $T_1$ and $T_2$ are either (1) uniformly labeled trees, i.e., trees with all their nodes unlabeled, or equivalently, labeled with the same symbol, or (2) evolutionary trees, i.e., leaf-labeled trees with distinct symbols for distinct leaves.

We denote as the number of nodes in the labeled trees and , and as the maximum degree of and . For uniformly labeled trees and , Chung  gave an algorithm to determine whether is a label-preserving homeomorphic subtree of using time. Gupta and Nishimura  gave an algorithm which actually computes a maximum agreement subtree of and in time. For evolutionary trees, Steel and Warnow  gave the first polynomial-time algorithm for computing a maximum agreement subtree. Farach and Thorup  improved the time complexity from to . Faster algorithms for the case were also discovered. The algorithm of Farach, Przytycka and Thorup  runs in time, and that of Kao  takes time. Cole et al.  gave an -time algorithm for the case where and are binary trees. Przytycka  attempted to generalize the algorithm of Cole et al. so that the degree-2 restriction could be removed with the running time being .

For unrestricted labeled trees (i.e., trees where labels are not restricted to leaves and may not be distinct), little work has been reported, but they have applications in several contexts [1, 18]. For example, labeled trees are used to represent sentences in a structural text database [16, 17, 20] and querying such a database involves comparison of trees; an XML document can also be represented by a labeled tree . Instead of solving special cases, this paper gives an algorithm to compute where and are unrestricted labeled trees. As detailed below, our algorithm not only is more general but also uniformly improves or matches the previously best algorithms for subtree homeomorphism and evolutionary tree comparison.

Let (or simply when the context is clear) where if nodes and are labeled with the same symbol, and otherwise. Our algorithm computes in time. Thus, if and are uniformly labeled trees, then and the time complexity of our algorithm is , which is faster than the Gupta-Nishimura algorithm  for any . If and are evolutionary trees, then and the time complexity of our algorithm is , which is better than the bound claimed by Przytycka . In particular, our algorithm can attain the bound for binary trees . Also for general evolutionary trees, our algorithm runs in time since for any degree . This is faster than the time of the Farach-Thorup algorithm .
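To make the parameter in the running time concrete, the sketch below counts the pairs of nodes, one from each tree, that carry the same symbol. It assumes the stripped definition is this pair count, and that unlabeled nodes (represented here by `None`) contribute nothing; the function name `delta` is hypothetical. Under this reading, two evolutionary trees over the same leaf set give a count equal to the number of leaves, and two trees whose nodes all carry one symbol give the product of their sizes.

```python
from collections import Counter


def delta(labels1, labels2):
    """Number of pairs (u, v), u in the first tree and v in the second, with equal labels.

    labels1, labels2: iterables of node labels; None marks an unlabeled node (assumption).
    """
    count1 = Counter(s for s in labels1 if s is not None)
    count2 = Counter(s for s in labels2 if s is not None)
    # Each symbol contributes (occurrences in tree 1) * (occurrences in tree 2) pairs.
    return sum(count1[s] * count2[s] for s in count1)
```

Grouping by symbol avoids enumerating all pairs, so the count takes time linear in the two tree sizes.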

The efficiency achieved by our mast algorithm is based on improved algorithms for computing maximum weight matchings of bipartite graphs that satisfy some structural properties. Let be a bipartite graph with positive integer weights on its edges. Denote by , , , and the number of nodes, the number of edges, the maximum edge weight, and the total edge weight of , respectively. The best known algorithm for computing maximum weight bipartite matchings was given by Gabow and Tarjan , which takes time. For some applications where the total edge weight is small (say, ), Kao et al.  gave a slightly faster algorithm that runs in time. Intuitively, a bipartite graph is node-unbalanced if there are much fewer nodes on one side than the other. It is weight-unbalanced if its total weight is dominated by the edges incident to a few nodes; we call these nodes the dominating nodes. In this paper, we show how to enhance these two matching algorithms when the input graphs are either node-unbalanced or weight-unbalanced.
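The quantities named in this paragraph are easy to compute for a concrete graph. The sketch below is illustrative only: the dictionary keys `n`, `m`, `N`, `W` and `s` are assumed names for the number of nodes, the number of edges, the maximum edge weight, the total edge weight, and the size of the smaller side, and the helper `weight_covered_by` is hypothetical. It assumes the graph has no isolated nodes and that node names are distinct across the two sides.

```python
def graph_profile(edges):
    """Summarize a weighted bipartite graph given as {(x, y): weight}."""
    left = {x for x, _ in edges}
    right = {y for _, y in edges}
    weights = list(edges.values())
    return {
        "n": len(left) + len(right),        # number of nodes
        "m": len(edges),                    # number of edges
        "N": max(weights),                  # maximum edge weight
        "W": sum(weights),                  # total edge weight
        "s": min(len(left), len(right)),    # size of the smaller side
    }


def weight_covered_by(edges, nodes):
    """Total weight of the edges incident to at least one node in `nodes`."""
    nodes = set(nodes)
    return sum(w for (x, y), w in edges.items() if x in nodes or y in nodes)
```

A graph is node-unbalanced when `s` is much smaller than `n`, and weight-unbalanced when `weight_covered_by` for a few nodes is close to `W`.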

The node-unbalanced property has many practical applications (see, e.g., ) and has been exploited to improve various graph algorithms. For example, Ahuja et al.  adapted several bipartite network flow algorithms such that the running times depend on the number of nodes in the smaller side of the input bipartite graph instead of the total number of nodes. Tokuyama and Nakano used this property to reduce the time complexity of the minimum cost assignment problem  and the Hitchcock transportation problem . This paper presents similar improvements for maximum weight matching. Specifically, we show that the running time of the matching algorithms of Gabow and Tarjan  and Kao et al.  can be improved to and , respectively, where is the number of nodes in the smaller side of the input bipartite graph.

The weight-unbalanced property is exploited in another way. Given a weight-unbalanced bipartite graph , let be the subgraph of with its dominating nodes removed. Note that has a total weight much smaller than does. Based on the -time matching algorithm of Kao et al. , finding the maximum weight matching of is much faster than finding one of . To take advantage of this fact, we design an efficient algorithm that finds a maximum weight matching of from that of . This algorithm is substantially faster than applying directly the -time matching algorithm on .

These results for unbalanced graphs provide a basis for solving a new matching problem called the hierarchical bipartite matching problem. This matching problem is at the core of our mast algorithm and is defined as follows. Let be a rooted tree. Denote as the root of . Let denote the set of children of node . Every node of is associated with a positive integer and a weighted bipartite graph satisfying the following properties:

• .

• where . Each edge of has a positive integer weight, and there is no isolated node. For any node , the total weight of all the edges incident to is at most . Thus, the total weight of the edges in is at most .

See Figure 1 for an example. For any weighted bipartite graph , let denote a maximum weight matching of . The hierarchical matching problem is to compute for all internal nodes of . Let and . The problem can be solved by applying directly our results for node-unbalanced graphs; for example, it can be solved in time using the enhanced Gabow-Tarjan algorithm. However, this time complexity is not yet satisfactory. When comparing labeled trees, we often encounter instances of the hierarchical bipartite matching problem with being very large; in particular, is asymptotically much greater than . We further improve the running time to by making additional use of our technique for weight-unbalanced graphs and exploiting trade-offs between the size of the bipartite graphs involved and their total edge weight.

Figure 1: $T$ is an instance of the hierarchical bipartite matching problem. $G_u$ is the bipartite graph associated with the node $u$.

The rest of the paper is organized as follows. Section 2 details our techniques of speeding up the existing algorithms for unbalanced graphs. Section 3 gives an efficient algorithm for solving the hierarchical matching problem. Finally, Section 4 describes our algorithm for computing maximum agreement subtrees.

## 2 Maximum weight matching of unbalanced graphs

Throughout this section, let be a weighted bipartite graph with no isolated nodes. Let , , , be the largest edge weight, and be the total edge weight.

Suppose every edge of has a positive integer weight. Gabow and Tarjan  and Kao et al.  gave an -time algorithm and an -time one to compute , respectively.

### 2.1 Matchings of node-unbalanced graphs

The following theorem speeds up the computation of if is node-unbalanced.

###### Theorem 1.
1. can be computed in time.

2. can be computed in time.

3. can be computed in time.

###### Proof.

Without loss of generality, we assume . The statements are proved as follows.

Statement 1. For any node in , let be the number of edges incident to . Suppose where , , and . We partition into and . Note that except , every set has nodes.

For any , denote as the subgraph of induced by all the edges incident to . Suppose that is a maximum weight matching of . Let . Note that a maximum weight matching of is also one of . Therefore, we can compute using the following algorithm:

• Step 1. Compute a maximum weight matching of .

• Step 2. For to ,

• let = ;

• compute a maximum weight matching of .

• Step 3. Return .

The running time is analyzed below. Let be the total number of edges in . For , , and . Using the matching algorithm by Gabow and Tarjan , we can compute in time. Note that and . Hence, the whole algorithm uses time.

Statement 2. Since we suppose , any matching of contains at most edges. Thus, for every , we can discard the edges incident to that are not among the heaviest ones; the remaining edges must still contain a maximum weight matching of . Note that we can find these edges in time, and from Statement 1, we can compute from them in time. The total time taken is .
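The pruning step in Statement 2 can be tried on small graphs. The sketch below assumes the stripped symbols mean the following: the smaller side has `s` nodes, and for each node on that side only its `s` heaviest incident edges are kept, leaving at most `s * s` edges. The function names are hypothetical, and the brute-force matcher is there only to confirm on tiny inputs that pruning preserves the maximum matching weight; it is not one of the matching algorithms discussed in the text.

```python
def prune_to_heaviest(edges, s):
    """Keep, for every node x on the smaller side, only its s heaviest incident edges.

    edges: dict mapping (x, y) -> positive integer weight, with x on the smaller side.
    """
    by_left = {}
    for (x, y), w in edges.items():
        by_left.setdefault(x, []).append((w, y))
    kept = {}
    for x, incident in by_left.items():
        incident.sort(key=lambda wy: wy[0], reverse=True)  # heaviest first
        for w, y in incident[:s]:
            kept[(x, y)] = w
    return kept


def max_matching_weight(edges):
    """Brute-force maximum matching weight; exponential, for tiny graphs only."""
    left = sorted({x for x, _ in edges})

    def best(i, used):
        if i == len(left):
            return 0
        x = left[i]
        result = best(i + 1, used)  # option: leave x unmatched
        for (a, y), w in edges.items():
            if a == x and y not in used:
                result = max(result, w + best(i + 1, used | {y}))
        return result

    return best(0, frozenset())
```

The reason pruning is safe is an exchange argument: if some node is matched through an edge outside its `s` heaviest, then at most `s - 1` of its `s` heaviest neighbors are used by the other nodes of the smaller side, so one of them is free and at least as heavy.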

Statement 3. The algorithm in the proof of Statement 1 can be adapted to find in time by using the -time matching algorithm of Kao et al.  to compute each . For any , we redefine to be the total weight of edges incident to . Then we can use the same analysis to show that the adapted algorithm runs in = time. ∎

### 2.2 Matchings of weight-unbalanced graphs

We show how to speed up the matching algorithm of Kao et al.  when the input graph is weight-unbalanced. The key technique is stated in Lemma 2 below. The following example illustrates how this lemma can help. Suppose that has dominating nodes. Let be the subgraph of with the dominating nodes removed. Let be the total edge weight of . Since is weight-unbalanced, we further assume . To compute , we can first use Theorem 1(3) to compute in time and then use Lemma 2 to compute from in time. The total running time is , which is smaller than the running time of using the algorithm of Kao et al.  to find directly.

###### Lemma 2.

Let be a subset of nodes of . Let be the subgraph of constructed by removing the nodes in . Denote by the set of edges in . Given , we can compute in time.

###### Proof.

First, we show that using time, we can find a set of only edges such that still contains a maximum weight matching of . In the proof of Theorem 1(2), it has already been shown that we can find in time a set of edges that contains . Thus, it suffices to find in time another set of edges that contains ; is just the smaller of these two sets. Let be the subset of nodes of that are endpoints of . For any , we select, among the edges incident to , a subset of edges , which is the union of the following two sets:

• ;

• is among the heaviest edges with .

Observe that must contain a maximum weight matching of , and these edges can be found in time.

By discarding all unnecessary edges (i.e., edges neither in nor in ), we can assume that has only edges, while still containing and . This preprocessing requires an extra time for finding .

Below, we describe a procedure which, given any bipartite graph and any node of , finds from in time, where is the number of edges of . Then, starting from , we can apply this procedure repeatedly times to find from . Since is assumed to have only edges, this process takes time. The lemma follows.

Let and be a maximum weight matching of and , respectively; denote by the set of augmenting paths and cycles formed in , and let be the augmenting path in starting from . Note that the augmenting paths and cycles in cannot improve the matching ; otherwise, is not a maximum weight matching of . Thus, we can transform to using . Note that is indeed a maximum augmenting path starting from , which can be found in time . ∎

## 3 Hierarchical bipartite matching

Throughout this section, let $T$ be a rooted tree as defined in the definition of the hierarchical bipartite matching problem in §1. The root of $T$ is denoted by $r$. For each node $u$ of $T$, $w(u)$ and $G_u$ denote the weight and the bipartite graph associated with $u$, respectively. Furthermore, let , and .

In this section, we describe an algorithm for computing for all in time. Our algorithm is based on two crucial observations. One is that for any value , there are at most graphs with its second maximum edge weight greater than . The other is that most of these graphs have their total weight dominated by edges incident to a few nodes. For those graphs with a large second maximum edge weight, we compute their maximum weight matchings using a less weight-sensitive algorithm. As there are not many of them, the computation is efficient. For the other graphs, their weights are dominated by the edges incident to a few nodes. Thus, using Lemma 2 and a weight-efficient matching algorithm, we can compute the maximum weight matchings for these graphs efficiently. Details are as follows.

Consider any subset of nodes of . Let . We say that has a critical degree if for every , has at most children with weight at least . For any internal node , let , i.e., the value of the second largest over all the children of . Lemma 3 below shows the importance of and critical degrees. Lemma 3(1) shows that there are not many nodes with large ; for those nodes with small , they should not have a large critical degree, and Lemma 3(2) states that the maximum weight matchings associated with these nodes can be computed efficiently.
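The two notions in this paragraph can be computed directly from the node weights. The sketch below is hedged in two ways: the weight threshold used in the definition of critical degree is not recoverable from the text, so it appears as an explicit `threshold` parameter, and returning 0 for a node with fewer than two children is an assumption. Both function names are hypothetical.

```python
def second_largest_child_weight(children, weight, u):
    """Second largest weight among the children of u (0 if u has fewer than two children)."""
    ws = sorted((weight[x] for x in children.get(u, [])), reverse=True)
    return ws[1] if len(ws) >= 2 else 0


def has_critical_degree(children, weight, nodes, h, threshold):
    """True if every node in `nodes` has at most h children of weight >= threshold.

    `threshold` stands in for the bound elided in the text (assumption).
    """
    return all(
        sum(1 for x in children.get(u, []) if weight[x] >= threshold) <= h
        for u in nodes
    )
```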

###### Lemma 3.
1. Let be any positive number. Let be the set of nodes of with . Then .

2. Let be any set of nodes of . If has critical degree , then we can compute for all in time.

###### Proof.

The statements are proved as follows.

Statement 1. Let be the set of nodes in such that (1) and (2) either is a leaf or for all children of . Since the subtrees rooted at the nodes of are disjoint, . Thus, . Let be the tree in induced by , i.e., contains exactly the nodes of and the least common ancestor of every two nodes of . Note that has at most leaves and at most internal nodes. On the other hand, every node of is an internal node of ; thus, .

Statement 2. For every node , let be the set of ’s children that have a weight at least each. Let be the set of the rest of ’s children. Note that because has critical degree . Since the weight of is at most and , by Theorem 1 we can compute in time

$$O\Bigl(\sqrt{b}\sum_{x\in L(u)} w(x) + |E_u|\Bigr). \tag{1}$$

Since has at most edges and , by Lemma 2, we can compute from in time

$$O\Bigl(|E_u| + \Bigl(h^2\sum_{x\in L(u)} w(x) + h^3\Bigr)\log b\Bigr) = O\Bigl(h^3\log b\sum_{x\in L(u)} w(x) + |E_u|\Bigr). \tag{2}$$

From Equations (1) and (2), we can compute for all in time

$$O\Bigl(\sum_{u\in B}\Bigl(\bigl(\sqrt{b}+h^3\log b\bigr)\sum_{x\in L(u)} w(x) + |E_u|\Bigr)\Bigr).$$

Since the subtrees rooted at some node in are disjoint, . This statement follows. ∎

We are now ready to compute for all nodes of . We divide all the nodes in into two sets: and .

Every node has ; by Lemma 3(1), . Furthermore, by Theorem 1(2), the time for computing for all is = = . This time complexity is still far from our goal as may be much larger than . To improve the time complexity, we first note that using the technique for proving Lemma 2, we can compute in time depending only on . Then, with a better estimation of , we can reduce the time complexity to . Details are given in Lemma 4(1).

For , we can handle the nodes with easily. For nodes with , we apply Lemma 3(2) to compute . The basic idea is to partition the nodes into a constant number of sets according to such that every set has critical degree . This can ensure that the total time to compute all the is . Details are given in Lemma 4(2).

###### Lemma 4.
1. We can compute for all in time.

2. We can compute for all in time.

###### Proof.

The two statements are proved as follows.

Statement 1. Observe that for any , has at most edges relevant to the computation of , and they can be found in time. Let be this set of edges. Below, we assume that, for every , has only edges in . Otherwise, it costs extra time to find all and the assumption holds.

For every , let . Obviously, the nonempty sets form a partition of . Below, we show that for any nonempty , we can compute for all in time. Thus, the time for computing for all is , and Statement 1 follows.

We now give the details of computing for all . Let be the child of where is the largest over all children of . Since , every edge of has weight at most . By Theorem 1(2) and Lemma 2, and the fact , we can find in time. By Lemma 3(1), . Thus, we can compute for all in time

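$$O\Bigl(\sum_{u\in\Phi_k}\bigl(\sqrt{b}\,b^2\log(b\,2^k b^3) + |E'_u|\log b\bigr)\Bigr) = O\Bigl(\frac{w(r)\,b^{2.5}}{2^k b^3}(k+\log b) + \sum_{u\in\Phi_k}|E'_u|\log b\Bigr) = O\Bigl(\frac{w(r)\,k}{2^k} + \sum_{u\in\Phi_k}|E'_u|\log b\Bigr).$$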

Statement 2. We partition as follows. Let be the set of nodes in with weight greater than . For any , let Obviously, .

Since and for all nodes in , has critical degree one. By Lemma 3(2), we can compute for all nodes in using time.

Each is handled as follows. For every node , has at most children with weight at least ; otherwise and . Thus, has critical degree . By Lemma 3(2), we can compute for all in time. In summary, we can compute for all in time. ∎

###### Theorem 5.

We can compute for all nodes in time.

###### Proof.

It follows from Lemma 4 and the fact that and . ∎

## 4 Computing maximum agreement subtrees

By generalizing the work of Cole et al.  on binary evolutionary trees, we can easily derive an algorithm to compute a maximum agreement subtree of two labeled trees. There is, however, a bottleneck of computing the maximum weight matchings of a large number of bipartite graphs with nonconstant degrees. By using our result on the hierarchical bipartite matchings, we can eliminate this bottleneck and obtain the fastest known mast algorithm. Section 4.1 introduces basics of labeled trees. Section 4.2 uses our results on the hierarchical bipartite matchings to remove the bottleneck in our mast algorithm. Section 4.3 details our mast algorithm and analyzes its time complexity. Section 4.4 discusses the generalization of the work of Cole et al. .

Throughout this section, and denote two labeled trees with nodes and of degree . Let where if nodes and are labeled with the same symbol, and otherwise. Also, let denote .

### 4.1 Basics

For a rooted tree and any node of , let denote the subtree of that is rooted at . For any set of symbols, the restricted subtree of with respect to , denoted by , is the subtree of (1) whose nodes are the nodes with labels from and the least common ancestors of any two nodes with labels from and (2) whose edges preserve the ancestor-descendant relationship of . Figure 2 gives an example. Note that may contain nodes with labels outside . For any labeled tree , let denote the restricted subtree of with respect to the set of symbols used in .
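The restricted subtree can be built in one bottom-up pass. The sketch below assumes a tree stored as a `children` dictionary with a separate `label` dictionary; the function name `restricted_subtree` is hypothetical. A node is kept if its label is in the given symbol set, or if at least two of its child subtrees contain kept nodes, which is exactly when it is the least common ancestor of two nodes with labels from the set.

```python
def restricted_subtree(children, label, root, symbols):
    """Return (new_root, new_children) for the subtree restricted to `symbols`.

    children: dict node -> list of child nodes (leaves may be absent).
    label:    dict node -> symbol (unlabeled nodes may be absent).
    new_root is None when no node carries a label from `symbols`.
    """
    new_children = {}

    def build(v):
        # Topmost kept node in each child subtree that contains one.
        kept_below = []
        for c in children.get(v, []):
            top = build(c)
            if top is not None:
                kept_below.append(top)
        if label.get(v) in symbols or len(kept_below) >= 2:
            new_children[v] = kept_below   # v is kept; contracted paths become edges
            return v
        # v is dropped: pass up the single kept descendant, if any.
        return kept_below[0] if kept_below else None

    return build(root), new_children
```

As the text notes, a kept least common ancestor may itself carry a label outside the symbol set. The recursion is fine for small examples; very deep trees would need an iterative traversal.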

A centroid path decomposition  of a rooted tree is a partition of its nodes into disjoint paths as follows. For each internal node in , let denote the set of children of . Among the children of , one is chosen as the heavy child, denoted by , if the subtree of rooted at contains the largest number of nodes; the other children of are side children. We call the edge from to its heavy child a heavy edge. A centroid path is a maximal path formed by heavy edges; the root centroid path is the centroid path that contains the root of . See Figure 3 for an example.
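The decomposition described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the tree is a `children` dictionary, ties for the heavy child are broken by taking the first largest child, and the function name `centroid_paths` is hypothetical.

```python
def centroid_paths(children, root):
    """Partition the nodes of a rooted tree into centroid paths.

    children: dict node -> list of child nodes (leaves may be absent).
    Returns a list of paths; each path is a top-down list of nodes, and the
    first path returned is the root centroid path.
    """
    # Visit parents before children, then accumulate subtree sizes bottom-up.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    size = {}
    for v in reversed(order):
        size[v] = 1 + sum(size[c] for c in children.get(v, []))

    paths, starts = [], [root]
    while starts:
        v = starts.pop()
        path = [v]
        while children.get(v):
            heavy = max(children[v], key=lambda c: size[c])   # heavy child
            # Every side child is the root of another centroid path.
            starts.extend(c for c in children[v] if c != heavy)
            v = heavy
            path.append(v)
        paths.append(path)
    return paths
```

Both passes touch each node and edge a constant number of times, so the sketch runs in time linear in the size of the tree.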

Let denote the set of the centroid paths of . Note that can be constructed in time. For every , the root of , denoted , refers to the node on that is the closest to the root of , and denotes the set of the side children of the nodes on . For any node on , a subtree rooted at some side child of is called a side tree of , as well as a side tree of . Let be the set of side trees of . Note that for every , .

The following lemma states two useful properties of the centroid path decomposition.

###### Lemma 6.

Let and be two labeled trees.

1. .

2. .

###### Proof.

The two statements are proved as follows.

Statement 1. A centroid path is attached to another centroid path if the root of is the child of a node on . We define the level of a centroid path as follows. The root centroid path has level zero. A centroid path has level if it is attached to some centroid path with level . Note that any subtree attached to a centroid path with level has size at most . Thus, there are at most different levels. Moreover, subtrees attached to centroid paths with the same level are all disjoint.

For any , denote by the set of all centroid paths in with level . Then .

Statement 2. We divide the centroid paths into 2 groups. We first consider the centroid paths on level where . For any such ,

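$$\sum_{P\in D_i}\sqrt{\min\{d, |T_1^{r(P)}|\}}\;\Delta_{T_1^{r(P)},T_2} \le \sum_{P\in D_i}\sqrt{d}\,\Delta_{T_1^{r(P)},T_2} \le \sqrt{d}\,\Delta_{T_1,T_2}.$$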

Thus, .

Next, we consider the centroid paths on level where . Note that for a path on level , . Thus, is at most

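$$\sum_{i\ge 0}\;\sum_{P\in D_{\log\frac{2n}{d}+i}}\sqrt{d/2^{i+1}}\;\Delta_{T_1^{r(P)},T_2} \le \sum_{i\ge 0}\sqrt{d/2^{i+1}}\;\Delta_{T_1,T_2} = O\bigl(\sqrt{d}\,\Delta_{T_1,T_2}\bigr).$$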

The following notion captures which pairs of nodes of two labeled trees and are important. Consider any centroid paths and . For any node or , let be the set of symbols labeling and the nodes in the side trees of . Let be the set of node pairs with . Let .

### 4.2 Matchings

As explained later in §4.3, we can easily generalize the dynamic programming approach in  to compute for any two labeled trees and , but there is a bottleneck of computing the maximum weight matchings of a large number of bipartite graphs with nonconstant degrees. This section uses our results on hierarchical bipartite matchings to remove this bottleneck.

First of all, we identify the bipartite graphs for which maximum weight matchings are required. For any nodes and , define as the weighted bipartite graph between and where edge has weight . Furthermore, define as the graph constructed from by removing all the zero-weight edges and all the edges adjacent to the heavy child of or . Note that the total edge weight of can be significantly smaller than that of . Yet by Lemma