Top dags were introduced by Bille et al. [1] as a formalism for the compression of unranked node-labelled trees. Roughly speaking, the top dag for a tree $t$ is the dag-representation of an expression that evaluates to $t$, where the expression builds $t$ from edges using two merge operations (horizontal and vertical merge). In [1], a linear time algorithm is presented that constructs from a tree of size $n$ with node labels from the set $\Sigma$ a top dag of size $O(n / \log_\sigma^{0.19} n)$, where $\sigma = \max\{2, |\Sigma|\}$ (note that this definition of $\sigma$ avoids the base 1 in the logarithm). Later, in [5] this bound was improved to $O(n \cdot \log\log_\sigma n / \log_\sigma n)$ (for the same algorithm as in [1]). It is open whether this bound can be improved to $O(n / \log_\sigma n)$ for the construction from [1]. A simple counting argument shows that $\Omega(n / \log_\sigma n)$ is the information-theoretic lower bound for the size of the top dag. We present a new linear-time top dag construction that achieves this bound. In addition, our construction has two properties that are also true for the original construction of Bille et al. [1]: (i) the size of the constructed top dag is bounded by $O(d \cdot \log n)$, where $d$ is the size of the minimal dag of $t$, and (ii) the height of the top dag is bounded by $O(\log n)$. Concerning (i) it was shown in [2] that the $\log n$-factor is unavoidable. The logarithmic bound on the height of the top dag in (ii) is important in order to get the logarithmic time bounds for the querying operations (e.g., computing the label, parent node, first child, right sibling, depth, height, nearest common ancestor, etc. of nodes given by their preorder numbers) in [1].
Our construction is based on a modification of the algorithm BU-Shrink (bottom-up shrink) from [4], which constructs in linear time a TSLP (tree straight-line program) of size $O(n / \log_\sigma n)$ for a given binary tree. In fact, we construct the top dag in two phases: in the first phase we apply the modification of BU-Shrink, whereas in the second phase we apply the top dag construction of Bille et al.
Let $\Sigma$ be a finite alphabet. By $\mathcal{T}(\Sigma)$ we denote the set of ordered labelled trees with labels from $\Sigma$. Here, “ordered” means that the children of every node are linearly ordered. Also note that trees are unranked in the sense that the node label does not determine the number of children of a node, which is also called the degree of the node. For a tree $t \in \mathcal{T}(\Sigma)$ we denote the label of its root by $\lambda(t)$, the set of its nodes by $V(t)$ and the set of its edges by $E(t)$. By $\lambda_t(v)$ we denote the label of the node $v \in V(t)$ in $t$. For $v \in V(t)$ we denote with $t(v)$ the subtree of $t$ that is rooted in $v$.
A cluster of rank $0$ is just a tree $t \in \mathcal{T}(\Sigma)$. A cluster of rank $1$ consists of a tree $t \in \mathcal{T}(\Sigma)$ together with a distinguished leaf node that we call the bottom boundary node of $t$. In both cases, the root is called the top boundary node. Let $\mathcal{C}_i$ be the set of all clusters of rank $i \in \{0,1\}$ and let $\mathcal{C} = \mathcal{C}_0 \cup \mathcal{C}_1$. With $\mathrm{rank}(s)$ we denote the rank of the cluster $s$. An atomic cluster consists of a single edge, i.e., it is a tree with two nodes.
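We define two partial binary merge operations $⦶, ⊖ : \mathcal{C} \times \mathcal{C} \to \mathcal{C}$:
$s ⦶ t$ (the vertical merge of $s$ and $t$) is only defined if $\mathrm{rank}(s) = 1$ and the bottom boundary node $v$ of $s$ satisfies $\lambda_s(v) = \lambda(t)$. We obtain $s ⦶ t$ by taking the disjoint union of $s$ and $t$ and then merging $v$ with the root of $t$ (note that this is possible since the labels coincide). The rank of $s ⦶ t$ is $\mathrm{rank}(t)$ and if $\mathrm{rank}(t) = 1$, then the bottom boundary node of $s ⦶ t$ is the bottom boundary node of $t$.
$s ⊖ t$ (the horizontal merge of $s$ and $t$) is only defined if $\mathrm{rank}(s) + \mathrm{rank}(t) \le 1$ and $\lambda(s) = \lambda(t)$. We obtain $s ⊖ t$ by taking the disjoint union of $s$ and $t$ and then merging the root of $s$ with the root of $t$ (note that this is possible since the labels coincide). The rank of $s ⊖ t$ is $\mathrm{rank}(s) + \mathrm{rank}(t)$. In case $\mathrm{rank}(s) = 1$ (resp., $\mathrm{rank}(t) = 1$), the bottom boundary node of $s ⊖ t$ is the bottom boundary node of $s$ (resp., $t$).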
The (minimal) directed acyclic graph (dag) for a tree $t$ is obtained by keeping for every subtree $t'$ of $t$ only one occurrence of that subtree and replacing every edge that goes to a node in which a copy of $t'$ is rooted by an edge to the unique chosen occurrence of $t'$. We denote this dag as $\mathrm{dag}(t)$. Note that the number of nodes in $\mathrm{dag}(t)$ is the number of different subtrees that occur in $t$.
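The node count of the minimal dag can be illustrated with a short sketch. It assumes a hypothetical encoding of a tree as a pair `(label, children)` and uses a dictionary to assign one identifier per distinct subtree; this is only meant to show the definition and is not the linear-time method cited later in the text.

```python
def dag_size(tree):
    """Return the number of nodes of the minimal dag of an ordered labelled tree.

    A tree is encoded as a pair (label, children), where children is a list
    of trees (an assumption made for this sketch).
    """
    ids = {}  # maps (label, ids of the children) to the id of that subtree

    def visit(node):
        label, children = node
        key = (label, tuple(visit(child) for child in children))
        if key not in ids:
            ids[key] = len(ids)  # first occurrence of this subtree
        return ids[key]

    visit(tree)
    return len(ids)


leaf = ("a", [])
t = ("f", [("g", [leaf, leaf]), ("g", [leaf, leaf])])
print(dag_size(t))  # 3: the distinct subtrees are a, g(a,a) and f(g(a,a),g(a,a))
```

The tree in the example has seven nodes, but only three different subtrees occur in it, so its minimal dag has three nodes.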
We can now define top trees and top dags. A top tree is a binary node-labelled ordered tree, where every internal node is labelled with one of the two operations $⦶, ⊖$ and every leaf is labelled with an atomic cluster plus a bit for the rank of the cluster. The latter information can be represented by a triple $(a, b, i)$ with $a, b \in \Sigma$ and $i \in \{0,1\}$. A top tree $T$ can be evaluated to a tree $t \in \mathcal{T}(\Sigma)$ by recursively applying the merge operations at its inner nodes (this might lead to an undefined value since the merge operations are only partially defined). We say that $T$ is a top tree for $t$ and $\mathrm{dag}(T)$ is a top dag for $t$.
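The evaluation of a top tree can be sketched as follows. The encoding is an assumption made for illustration: a leaf is a triple `(a, b, i)`, an inner node is a triple `("V", left, right)` for the vertical merge or `("H", left, right)` for the horizontal merge, and a cluster is a pair of its root and its bottom boundary node (`None` for rank 0). Undefined merges raise `ValueError`.

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.children = []


def atom(a, b, rank):
    """Atomic cluster: a single edge from a node labelled a to a node labelled b."""
    root, child = Node(a), Node(b)
    root.children.append(child)
    return (root, child if rank == 1 else None)


def vertical(s, t):
    (s_root, s_bottom), (t_root, t_bottom) = s, t
    if s_bottom is None or s_bottom.label != t_root.label:
        raise ValueError("vertical merge undefined")
    s_bottom.children = t_root.children  # merge the root of t into the bottom of s
    return (s_root, t_bottom)


def horizontal(s, t):
    (s_root, s_bottom), (t_root, t_bottom) = s, t
    if (s_bottom and t_bottom) or s_root.label != t_root.label:
        raise ValueError("horizontal merge undefined")
    s_root.children = s_root.children + t_root.children  # merge the two roots
    return (s_root, s_bottom or t_bottom)


def evaluate(top):
    """Evaluate a top tree; fresh nodes are created for every leaf."""
    if isinstance(top[2], int):  # leaf (a, b, i)
        return atom(*top)
    op, left, right = top
    merge = vertical if op == "V" else horizontal
    return merge(evaluate(left), evaluate(right))


def term(node):
    """Render a tree as a term, e.g. f(a,g(b))."""
    if not node.children:
        return node.label
    return node.label + "(" + ",".join(term(c) for c in node.children) + ")"


top = ("H", ("f", "a", 0), ("V", ("f", "g", 1), ("g", "b", 0)))
root, bottom = evaluate(top)
print(term(root))  # f(a,g(b))
```

In the example the vertical merge first glues the edge from `g` to `b` below the edge from `f` to `g`, and the horizontal merge then adds the edge from `f` to `a` as the left sibling.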
Let $t \in \mathcal{T}(\Sigma)$ be a tree. A subcluster of $t$ of rank one is an induced subgraph of $t$ that is obtained as follows: Take a node $u \in V(t)$ with the ordered sequence of children $v_1, \ldots, v_l$ and let $1 \le i \le j \le l$. Let $v$ be a node that belongs to one of the subtrees $t(v_i), \ldots, t(v_j)$. Then the tree is induced by the nodes in $(\{u\} \cup V(t(v_i)) \cup \cdots \cup V(t(v_j))) \setminus (V(t(v)) \setminus \{v\})$. The node $u$ (resp., $v$) is the top (resp., bottom) boundary node of the cluster. A subcluster of $t$ of rank zero is obtained in the same way, except that we take the tree induced by the nodes in $\{u\} \cup V(t(v_i)) \cup \cdots \cup V(t(v_j))$. Its top boundary node is $u$. Note that every edge of $t$ is a subcluster of $t$. We identify a subcluster of $t$ with the set of edges of $t$ belonging to the subcluster. If $T$ is a top tree for $t$ then it follows easily by induction that every subtree of $T$ evaluates to an isomorphic copy of a subcluster of $t$.
3. Optimal worst-case compression
We can now state and prove the main result of this paper:
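Let $\sigma = \max\{2, |\Sigma|\}$. There is a linear time algorithm that computes from a given tree $t \in \mathcal{T}(\Sigma)$ with $n$ edges a top dag of height $O(\log n)$, whose size is bounded by $O(n / \log_\sigma n)$ and $O(|\mathrm{dag}(t)| \cdot \log n)$.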
We first prove the theorem without the bound $O(|\mathrm{dag}(t)| \cdot \log n)$ on the size of the constructed top dag. In a second step, we explain how to modify the algorithm in order to get the $O(|\mathrm{dag}(t)| \cdot \log n)$ bound.
Take a tree $t \in \mathcal{T}(\Sigma)$ with $n$ edges and let $\sigma = \max\{2, |\Sigma|\}$. We build from $t$ a sequence of trees $t_0, t_1, \ldots, t_m$, where every edge $(u,v) \in E(t_i)$ ($u$ is the parent node of $v$) is labelled with a subcluster $c^i_{u,v}$ of $t$. If $v$ is a leaf of $t_i$, then $c^i_{u,v}$ is a subcluster of rank $0$ with top boundary node $u$, otherwise $c^i_{u,v}$ is a subcluster of rank $1$ with top boundary node $u$ and bottom boundary node $v$. The number of edges in the subcluster $c^i_{u,v}$ is also called the weight $\gamma^i_{u,v}$ of the edge $(u,v)$.
Our algorithm does not have to store the subclusters $c^i_{u,v}$ explicitly but only their weights. Moreover, the algorithm builds for every edge $(u,v) \in E(t_i)$ a top tree $T^i_{u,v}$. The invariant of the algorithm is that $T^i_{u,v}$ evaluates to (an isomorphic copy of) the subcluster $c^i_{u,v}$. The top trees are stored as pointer structures, but below we write them for better readability as expressions using the operators $⦶$ and $⊖$.
The initial tree $t_0$ is the tree $t$, where $c^0_{u,v} = \{(u,v)\}$ for every edge $(u,v) \in E(t)$. This is a subcluster of rank $0$ if $v$ is a leaf, and of rank $1$ otherwise. Hence, we set $\gamma^0_{u,v} = 1$ and $T^0_{u,v} = (\lambda_t(u), \lambda_t(v), i)$, where $i$ is the rank of the subcluster $c^0_{u,v}$.
Let us now fix a number $k \le n$ that will be made precise later. Our algorithm proceeds as follows: Let $t_i$ be the current tree. We proceed by a case distinction. Ties between the following three cases can be broken in an arbitrary way. The updating for the subclusters $c^i_{u,v}$ is only shown to give a better intuition for the algorithm; it is not part of the algorithm.
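Case 1. There exist edges $(u,v), (v,w) \in E(t_i)$ of weight at most $k$ such that $w$ is the unique child of $v$. We obtain $t_{i+1}$ from $t_i$ by (i) removing the node $v$, and (ii) replacing the edges $(u,v), (v,w)$ by the edge $(u,w)$. Moreover, we set
$$c^{i+1}_{u,w} = c^i_{u,v} \cup c^i_{v,w}, \qquad \gamma^{i+1}_{u,w} = \gamma^i_{u,v} + \gamma^i_{v,w}, \qquad T^{i+1}_{u,w} = T^i_{u,v} ⦶ T^i_{v,w}.$$
For all edges $(x,y) \in E(t_i) \setminus \{(u,v), (v,w)\}$ we set $c^{i+1}_{x,y} = c^i_{x,y}$, $\gamma^{i+1}_{x,y} = \gamma^i_{x,y}$ and $T^{i+1}_{x,y} = T^i_{x,y}$.
Case 2. There exist edges $(u,v), (u,w) \in E(t_i)$ of weight at most $k$ such that $v$ is a leaf and the left sibling of $w$. Then $t_{i+1}$ is obtained from $t_i$ by removing the edge $(u,v)$. Moreover, we set
$$c^{i+1}_{u,w} = c^i_{u,v} \cup c^i_{u,w}, \qquad \gamma^{i+1}_{u,w} = \gamma^i_{u,v} + \gamma^i_{u,w}, \qquad T^{i+1}_{u,w} = T^i_{u,v} ⊖ T^i_{u,w}.$$
For all edges $(x,y) \in E(t_i) \setminus \{(u,v), (u,w)\}$ we set $c^{i+1}_{x,y} = c^i_{x,y}$, $\gamma^{i+1}_{x,y} = \gamma^i_{x,y}$ and $T^{i+1}_{x,y} = T^i_{x,y}$.
Case 3. There exist edges $(u,v), (u,w) \in E(t_i)$ of weight at most $k$ such that $v$ is a leaf and the right sibling of $w$. Then $t_{i+1}$ is obtained from $t_i$ by removing the edge $(u,v)$. Moreover, we set
$$c^{i+1}_{u,w} = c^i_{u,w} \cup c^i_{u,v}, \qquad \gamma^{i+1}_{u,w} = \gamma^i_{u,w} + \gamma^i_{u,v}, \qquad T^{i+1}_{u,w} = T^i_{u,w} ⊖ T^i_{u,v}.$$
For all edges $(x,y) \in E(t_i) \setminus \{(u,v), (u,w)\}$ we set $c^{i+1}_{x,y} = c^i_{x,y}$, $\gamma^{i+1}_{x,y} = \gamma^i_{x,y}$ and $T^{i+1}_{x,y} = T^i_{x,y}$.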
If none of the above three cases holds, then the algorithm stops. Let $t_m$ be the final tree that we computed. Note that every edge of $t_m$ has weight at most $2k$, since only edges of weight at most $k$ are merged. We now bound the number of edges of $t_m$:
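The three cases can be tried out with the following naive sketch. It is not the linear-time implementation discussed in the text: it simply rescans the tree after every merge. The encoding is an assumption for illustration: nodes are dictionary keys, every edge is identified by its child node, a leaf of a top tree is a triple `(a, b, i)`, and an inner node is `("V", left, right)` (vertical merge) or `("H", left, right)` (horizontal merge).

```python
def shrink(labels, children, k):
    """Naive sketch of the merging phase with weight threshold k.

    labels:   dict node -> label
    children: dict node -> ordered list of children (leaves may be omitted)
    Returns the final children lists and, for every remaining edge
    (identified by its child node), its weight and its top tree.
    """
    kids = {u: list(children.get(u, [])) for u in labels}
    weight, top = {}, {}
    for u in labels:
        for v in kids[u]:
            weight[v] = 1
            top[v] = (labels[u], labels[v], 1 if kids[v] else 0)

    def absorb(v, w, merged_top):
        # The edge of v disappears; its cluster is merged into the edge of w.
        weight[w] += weight[v]
        top[w] = merged_top
        del kids[v], weight[v], top[v]

    def step():
        for u in list(kids):
            row = kids[u]
            for pos, v in enumerate(row):
                if weight[v] > k:
                    continue
                if len(kids[v]) == 1 and weight[kids[v][0]] <= k:
                    w = kids[v][0]  # Case 1: w is the unique child of v
                    row[pos] = w
                    absorb(v, w, ("V", top[v], top[w]))
                    return True
                if kids[v]:
                    continue  # v is not a leaf, so Cases 2 and 3 do not apply
                if pos + 1 < len(row) and weight[row[pos + 1]] <= k:
                    w = row[pos + 1]  # Case 2: v is the left sibling of w
                    del row[pos]
                    absorb(v, w, ("H", top[v], top[w]))
                    return True
                if pos > 0 and weight[row[pos - 1]] <= k:
                    w = row[pos - 1]  # Case 3: v is the right sibling of w
                    del row[pos]
                    absorb(v, w, ("H", top[w], top[v]))
                    return True
        return False

    while step():
        pass
    return kids, weight, top


def num_leaves(top):
    """Number of atomic clusters in a top tree."""
    return 1 if isinstance(top[2], int) else num_leaves(top[1]) + num_leaves(top[2])


# A path with 64 edges, all nodes labelled "a".
n, k = 64, 16
labels = {i: "a" for i in range(n + 1)}
children = {i: [i + 1] for i in range(n)}
kids, weight, top = shrink(labels, children, k)
print(len(weight), max(weight.values()), sum(weight.values()))
```

On every input the weights of the remaining edges sum to the number of edges of the input tree, no weight exceeds `2 * k`, and the top tree of an edge has exactly as many leaves as the weight of the edge.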
Claim: The number of edges of $t_m$ is at most $8n/k$.
Let $\ell$ be the number of edges of $t_m$. Thus, $t_m$ has $\ell + 1$ many nodes. If $\ell = 1$ we are done, since $k \le n$. So, assume that $\ell \ge 2$. Let $U$ be the set of all nodes of $t_m$ of degree at most one except for the root node. We must have $|U| \ge \ell/2$. For every node $v \in U$, let $p(v)$ be its parent node. We now assign to certain edges of $t_m$ (possibly several) markings by doing the following for every $v \in U$: If the weight of the edge $(p(v), v)$ is larger than $k$ then we assign to $(p(v), v)$ a marking. Now assume that the weight of $(p(v), v)$ is at most $k$. If $v$ has degree one and $w$ is the unique child of $v$, then the weight of $(v, w)$ must be larger than $k$ (otherwise, we would merge the edges $(p(v), v)$ and $(v, w)$), and we assign a marking to $(v, w)$. Let us now assume that $v$ is a leaf. Since $t_m$ has at least two edges, one of the following three edges must exist:
$(p(v), w)$, where $w$ is the left sibling of $v$,
$(p(v), w)$, where $w$ is the right sibling of $v$,
$(p(p(v)), p(v))$, where $p(v)$ has degree one.
Moreover, one of these edges must have weight more than $k$ (otherwise one of the three cases would apply). We choose such an edge and assign a marking to it. The following then holds:
Markings are only assigned to edges of weight more than $k$.
Every edge of $t_m$ can get at most 4 markings.
In total, $t_m$ contains $|U| \ge \ell/2$ many markings.
Since the sum of all edge weights of $t_m$ is $n$, we obtain
$$n \ge \frac{|U|}{4} \cdot k \ge \frac{\ell \cdot k}{8}.$$
Thus, we have $\ell \le 8n/k$.
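We now build a top tree for $t$ as follows: Construct a top tree $T'$ for $t_m$ of height $O(\log n)$ using the algorithm from [1]. Consider a leaf $e$ of $T'$. It corresponds to an edge $(u,v) \in E(t_m)$. In the process of folding the cluster $c^m_{u,v}$ into the edge $(u,v)$ we have constructed the top tree $T^m_{u,v}$ that evaluates to the cluster $c^m_{u,v}$. Therefore, we obtain a top tree $T$ for $t$ by replacing every leaf $e$ of $T'$ by the top tree $T^m_{u,v}$. To bound the size of the minimal dag of $T$ we have to count the number of different subtrees of $T$. This number can be upper bounded by the number of nodes in $T'$ (which is in $O(n/k)$) plus the number of different top trees of size at most $4k$ (a cluster of weight at most $2k$ has a top tree with at most $2k$ leaves and hence fewer than $4k$ nodes). The latter number can be bounded as follows: A top tree for a tree from $\mathcal{T}(\Sigma)$ is a binary tree with at most $2\sigma^2 + 2$ many node labels (at most $2\sigma^2$ many different atomic clusters together with the bit for their rank and two labels for the two merge operations). The number of binary trees with $j$ nodes is bounded by $4^j$. Hence, we can bound the number of different top trees of size at most $4k$ by $\frac{4}{3} r^{4k}$ with $r = 4 \cdot (2\sigma^2 + 2)$. Note that $r \le \sigma^6$, since $\sigma \ge 2$. Take $k = \frac{1}{48} \log_\sigma n$ (we may assume $k \ge 1$, since otherwise $\log_\sigma n < 48$ and the trivial bound $O(n)$ on the size of $T$ suffices). We obtain the following upper bound on the number of non-isomorphic subtrees of $T$:
$$O(n/k) + \tfrac{4}{3} r^{4k} \le O(n / \log_\sigma n) + \tfrac{4}{3} \sigma^{24k} = O(n / \log_\sigma n) + \tfrac{4}{3} \sqrt{n} = O(n / \log_\sigma n).$$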
Moreover, the height of $T$ is in $O(\log n)$ since $T'$ and all $T^m_{u,v}$ have height $O(\log n)$.
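The counting step for node-labelled binary trees can be sanity-checked numerically. The sketch below assumes that binary trees with a given number of nodes are counted by the Catalan numbers, which are at most $4^j$ for $j$ nodes, and compares the exact number of labelled trees with a geometric-sum bound; the concrete values of `size` and `num_labels` are illustrative only and are not taken from the text.

```python
from math import comb


def binary_trees(j):
    """Number of binary trees with j nodes (the j-th Catalan number)."""
    return comb(2 * j, j) // (j + 1)


def labelled_trees_up_to(size, num_labels):
    """Number of node-labelled binary trees with at most `size` nodes."""
    return sum(binary_trees(j) * num_labels ** j for j in range(size + 1))


# The Catalan numbers are bounded by 4^j.
for j in range(30):
    assert binary_trees(j) <= 4 ** j

size, num_labels = 8, 10  # illustrative values only
exact = labelled_trees_up_to(size, num_labels)
# Geometric sum: sum_{j <= size} x^j <= (4/3) * x^size for x = 4 * num_labels >= 4.
bound = 4 * (4 * num_labels) ** size // 3
print(exact <= bound)  # True
```

The factor 4/3 comes from summing the geometric series with ratio at least 4, so the number of labelled trees of size at most `size` is dominated by the term for the largest size.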
It remains to argue that our algorithm can be implemented in linear time. The arguments are more or less the same as for the analysis of BU-Shrink in [4]: The algorithm maintains for every node of $t_i$ its degree, and for every edge $(u,v)$ its weight $\gamma^i_{u,v}$. Additionally, the algorithm maintains a queue that contains pointers to all edges $(u,v)$ of $t_i$ having weight at most $k$ and such that $v$ has degree one. Then every merging step can be done in constant time, and there are at most $n$ merging steps. Finally, the minimal dag of $T$ can be computed in linear time by [3].
We now explain the modification of the above algorithm such that the constructed top dag has size $O(|\mathrm{dag}(t)| \cdot \log n)$. The main idea is that we perform the first phase of the above algorithm (where the tree $t_m$ is constructed) on $\mathrm{dag}(t)$ instead of $t$ itself. Thus, the algorithm starts with the construction of $\mathrm{dag}(t)$ from $t$, which is possible in linear time [3]. We now build from $\mathrm{dag}(t)$ a sequence of dags $D_0, D_1, \ldots, D_m$, where analogously to the above construction every edge $(u,v)$ of $D_i$ is labelled with a weight $\gamma^i_{u,v}$ and a top tree $T^i_{u,v}$. Since we are working on the dag, we cannot assign a unique subcluster of $t$ to the dag-edge $(u,v)$. In fact, every edge of $D_i$ represents a set of isomorphic subclusters that are shared in the dag. The top tree $T^i_{u,v}$ evaluates to an isomorphic copy of these subclusters.
The initial dag $D_0$ is the dag $\mathrm{dag}(t)$ where every edge $(u,v)$ is labelled with the top tree that only consists of the leaf $(\lambda(u), \lambda(v), i)$ ($\lambda$ assigns to every node of the dag its label from $\Sigma$) where $i = 0$ if $v$ is a leaf of the dag and $i = 1$ otherwise. We take the same threshold value $k$ as before. Let $D_i$ be the current dag. Also the case distinction is the same as before:
Case 1. There exist edges $(u,v), (v,w)$ of weight at most $k$ such that $w$ is the unique child of $v$. We obtain $D_{i+1}$ from $D_i$ by replacing the edge $(u,v)$ by the edge $(u,w)$. If the node $v$ has no incoming edge after this modification, we can remove $v$ and the edge $(v,w)$ (although this is not necessary for the further arguments). The weights and the top trees are updated in exactly the same way as in the previous case 1.
Case 2. There exist edges $(u,v), (u,w)$ of weight at most $k$ such that $v$ is a leaf of $D_i$ (i.e., $v$ has no outgoing edge) and the left sibling of $w$. Then $D_{i+1}$ is obtained from $D_i$ by removing the edge $(u,v)$. If $v$ has no more incoming edges after this modification, then we can also remove $v$. The weights and the top trees are updated in exactly the same way as in the previous case 2.
Case 3. There exist edges $(u,v), (u,w)$ of weight at most $k$ such that $v$ is a leaf and the right sibling of $w$. Then $D_{i+1}$ is obtained from $D_i$ by removing the edge $(u,v)$. If $v$ has no more incoming edges after this modification, then we can remove $v$. The weights and the top trees are updated in exactly the same way as in the previous case 3.
If none of the above three cases holds, then the algorithm stops. Let $D_m$ be the final dag that we computed. We can unfold $D_m$ to a tree $t_m$. This tree is one of the potential outcomes of the above tree version of the algorithm. The rest of the construction is the same as before, i.e., we apply the algorithm of Bille et al. in order to get a top tree $T'$ for $t_m$, which is then combined with the top trees $T^m_{u,v}$ in order to get a top tree $T$ for $t$. Let $D$ be the minimal dag of $T$. The size bound $O(n / \log_\sigma n)$ and the height bound $O(\log n)$ for $D$ follow from our previous arguments. It remains to show that the size of $D$ is bounded by $O(|\mathrm{dag}(t)| \cdot \log n)$. Note that the size of the dag $D_m$ is bounded by $|\mathrm{dag}(t)|$ (the number of nodes and edges does not increase when constructing $D_{i+1}$ from $D_i$). Moreover, every top tree $T^m_{u,v}$ has size $O(k) \subseteq O(\log n)$. Therefore, the total size of all top trees $T^m_{u,v}$ is bounded by $O(|\mathrm{dag}(t)| \cdot \log n)$. Moreover, the construction of Bille et al. ensures that the size of the top dag for $t_m$ is bounded by $O(|D_m| \cdot \log n) \subseteq O(|\mathrm{dag}(t)| \cdot \log n)$ (since $D_m$ is a dag for $t_m$). This implies that the size of the final top dag $D$ is bounded by $O(|\mathrm{dag}(t)| \cdot \log n)$. ∎
In the following example we show three nodes of a dag $D_i$ (on the left) and a possible run of the merging algorithm up to $D_{i+2}$ (on the right). We call the three nodes $u$, $v$ and $w$: the node $u$ has an edge to $v$, and $v$ has two parallel edges to the leaf $w$.
In the first step we merge the two atomic clusters using $⊖$. We can do this by removing the left edge $(v,w)$ or the right edge $(v,w)$, since $w$ is a leaf. Next, we merge the resulting cluster with the cluster of the edge $(u,v)$ using $⦶$. This is done by removing the edge $(u,v)$ and replacing it with the edge $(u,w)$. Since $v$ now has no incoming edges, $v$ is removed. The weight of the edge $(u,w)$ is $3$ since its cluster is a tree with three edges. This means that these two merges can only be performed if $k \ge 2$; since $k$ grows only logarithmically in the size of the starting tree, the starting tree must be correspondingly large.
- [1] P. Bille, I. L. Gørtz, G. M. Landau, and O. Weimann. Tree compression with top trees. Inf. Comput., 243:166–177, 2015.
- [2] P. Bille, F. Fernstrøm, and I. L. Gørtz. Tight bounds for top tree compression. In Proc. SPIRE 2017, LNCS 10508, 97–102. Springer, 2017.
- [3] P. J. Downey, R. Sethi, and R. E. Tarjan. Variations on the common subexpression problem. J. ACM, 27(4):758–771, 1980.
- [4] M. Ganardi, D. Hucke, A. Jez, M. Lohrey, and E. Noeth. Constructing small tree grammars and small circuits for formulas. J. Comput. Syst. Sci., 86:136–158, 2017.
- [5] L. Hübschle-Schneider and R. Raman. Tree compression with top trees revisited. In Proc. SEA 2015, LNCS 9125, 15–27. Springer, 2015.