On the Collection of Fringe Subtrees in Random Binary Trees

A fringe subtree of a rooted tree is a subtree consisting of one of the nodes and all its descendants. In this paper, we are specifically interested in the number of non-isomorphic trees that appear in the collection of all fringe subtrees of a binary tree. This number is analysed under two different random models: uniformly random binary trees and random binary search trees. In the case of uniformly random binary trees, we show that the number of non-isomorphic fringe subtrees lies between c_1n/√(ln n)(1+o(1)) and c_2n/√(ln n)(1+o(1)) for two constants c_1 ≈ 1.0591261434 and c_2 ≈ 1.0761505454, both in expectation and with high probability, where n denotes the size (number of leaves) of the uniformly random binary tree. A similar result is proven for random binary search trees, but the order of magnitude is n/ln n in this case. Our proof technique can also be used to strengthen known results on the number of distinct fringe subtrees (distinct in the sense of ordered trees). This quantity is of the same order of magnitude in both cases, but with slightly different constants in the upper and lower bounds.

1 Introduction

A subtree of a rooted tree that consists of a node and all its descendants is called a fringe subtree. Fringe subtrees are a natural object of study in the context of random trees, and there are numerous results for various random tree models, see e.g. [3, 9, 11, 13].

Fringe subtrees are of particular interest in computer science: One of the most important and widely used lossless compression methods for rooted trees is to represent a tree as a directed acyclic graph, which is obtained by merging nodes that are roots of identical fringe subtrees. This compressed representation of the tree is often simply called the minimal DAG, and its size (number of nodes) is the number of distinct fringe subtrees occurring in the tree. Compression by minimal DAGs has found numerous applications in various areas of computer science, for example in compiler construction [2, Chapter 6.1 and 8.5], unification [25], symbolic model checking (binary decision diagrams) [7], information theory [21, 28] and XML compression and querying [8, 20].
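The merging step described above can be sketched in a few lines of Python. This is only an illustration under assumed conventions, not the construction used in the cited works: a leaf is encoded as the empty tuple, an inner node as a pair `(left, right)`, and the helper names `distinct_fringe_subtrees` and `node_count` are hypothetical.

```python
LEAF = ()  # assumed encoding: a leaf is (), an inner node is a pair (left, right)

def distinct_fringe_subtrees(tree):
    """Return the set of distinct ordered fringe subtrees of `tree`.

    The size of this set equals the number of nodes of the minimal DAG.
    """
    seen = set()
    stack = [tree]
    while stack:
        t = stack.pop()
        if t in seen:
            continue  # identical subtree already merged, so are all its descendants
        seen.add(t)
        if t != LEAF:
            stack.extend(t)  # visit both children
    return seen

def node_count(tree):
    """Number of nodes (inner nodes and leaves) of `tree`."""
    return 1 if tree == LEAF else 1 + node_count(tree[0]) + node_count(tree[1])

cherry = (LEAF, LEAF)
example = ((LEAF, cherry), (cherry, LEAF))
dag_size = len(distinct_fringe_subtrees(example))
```

For the example tree, the 11 nodes collapse to 5 distinct fringe subtrees: the leaf, the cherry, the two mirror-image trees with three leaves, and the whole tree.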

In this work, we investigate the number of fringe subtrees in random binary trees, i.e. random trees such that each node has either exactly two or no children. So far, this problem has mainly been studied with respect to ordered fringe subtrees in random ordered binary trees: A uniformly random ordered binary tree of size $n$ (with $n$ leaves) is a random tree whose probability distribution is the uniform probability distribution on the set of ordered binary trees of size $n$. In [19], Flajolet, Sipala and Steyaert proved that the expected number of distinct ordered fringe subtrees in a uniformly random ordered binary tree of size $n$ is asymptotically equal to $c \cdot n/\sqrt{\ln n}$, where $c$ is the constant $2\sqrt{\ln 4/\pi}$. This result of Flajolet et al. was extended to unranked labelled trees in [6] (for a different constant $c$). Moreover, an alternative proof of the result of Flajolet et al. was presented in [26] in the context of simply-generated families of trees.

Another important type of random trees are so-called random binary search trees: A random binary search tree of size $n$ is a binary search tree built by inserting the keys $\{1, \ldots, n-1\}$ according to a uniformly chosen random permutation on $\{1, \ldots, n-1\}$. Random binary search trees naturally arise in theoretical computer science, see e.g. [12]. In [17], Flajolet, Gourdon and Martinez proved that the expected number of distinct ordered fringe subtrees in a random binary search tree of size $n$ is $O(n/\ln n)$. This result was improved in [10] by Devroye, who showed that the asymptotics $\Theta(n/\ln n)$ holds. Moreover, the result of Devroye was generalized from random binary search trees to a broader class of random ordered binary trees in [27], where the problem of estimating the expected number of distinct ordered fringe subtrees in random binary trees was considered in the context of so-called leaf-centric binary tree sources, which were introduced in [23, 28] as a general framework for modeling probability distributions on the set of ordered binary trees of size $n$.

In this work, we focus on estimating the number of non-isomorphic fringe subtrees in random ordered binary trees, where we call two binary trees non-isomorphic if they are distinct as unordered binary trees. This question arises quite naturally for example in the context of XML compression: Here, one distinguishes between so-called document-centric XML, for which the corresponding XML document trees are ordered, and data-centric XML, for which the corresponding XML document trees are unordered. Understanding the interplay between ordered and unordered structures has thus received considerable attention in the context of XML (see, for example, [1, 5, 29]). In particular, in [24], it was investigated whether tree compression can benefit from unorderedness. For this reason, so-called unordered minimal DAGs were considered. An unordered minimal DAG of a binary tree is a directed acyclic graph obtained by merging nodes that are roots of isomorphic fringe subtrees, i.e. of fringe subtrees which are identical as unordered trees. From such an unordered minimal DAG, an unordered representation of the original tree can be uniquely retrieved. The size of this compressed representation is the number of non-isomorphic fringe subtrees occurring in the tree. So far, only some worst-case estimates comparing the size of a minimal DAG to the size of its corresponding unordered minimal DAG are known: Among other things, it was shown in [24] that the size of an unordered minimal DAG of a binary tree can be exponentially smaller than the size of the corresponding (ordered) minimal DAG.
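To make the notion of an unordered minimal DAG concrete, the following Python sketch counts non-isomorphic fringe subtrees by mapping every fringe subtree to a canonical form in which the two children of each inner node are sorted. The tuple encoding (leaf `()`, inner node `(left, right)`) and the function name `non_isomorphic_fringe_subtrees` are assumptions made for illustration only.

```python
LEAF = ()  # assumed encoding: a leaf is (), an inner node is a pair (left, right)

def non_isomorphic_fringe_subtrees(tree):
    """Return the set of canonical forms of all fringe subtrees of `tree`.

    Two fringe subtrees are isomorphic (equal as unordered trees) exactly
    when their canonical forms coincide, so the size of the returned set is
    the number of nodes of the unordered minimal DAG.
    """
    classes = set()

    def visit(t):
        if t == LEAF:
            form = LEAF
        else:
            # Sorting the children's canonical forms forgets the left/right order.
            form = tuple(sorted((visit(t[0]), visit(t[1]))))
        classes.add(form)
        return form

    visit(tree)
    return classes

cherry = (LEAF, LEAF)
example = ((LEAF, cherry), (cherry, LEAF))
unordered_dag_size = len(non_isomorphic_fringe_subtrees(example))
```

In the example, the two fringe subtrees with three leaves are mirror images of each other, so they are distinct as ordered trees but fall into a single isomorphism class.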

However, no average-case estimates comparing the size of the minimal DAG of a binary tree to the size of the corresponding unordered minimal DAG are known so far. In particular, in [24] it is stated as an open problem to estimate the expected number of non-isomorphic fringe subtrees in a uniformly random ordered binary tree of size $n$, and it is conjectured that this number asymptotically grows as $\Theta(n/\sqrt{\ln n})$.

In this work, as one of our main theorems, we settle this open conjecture by proving upper and lower bounds of order $n/\sqrt{\ln n}$ for the number of non-isomorphic fringe subtrees which hold both in expectation and with high probability (i.e., with probability tending to $1$ as $n \to \infty$). Our approach can also be used to obtain an analogous result for random binary search trees, though the order of magnitude changes to $n/\ln n$. Again, we have upper and lower bounds in expectation and with high probability. Our two main theorems read as follows.

Theorem 1

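Let $F_n$ be the total number of non-isomorphic fringe subtrees in a uniformly random ordered binary tree with $n$ leaves. For two constants $c_1 \approx 1.0591261434$ and $c_2 \approx 1.0761505454$, the following holds:

  1. $c_1 \frac{n}{\sqrt{\ln n}}(1+o(1)) \leq \mathbb{E}(F_n) \leq c_2 \frac{n}{\sqrt{\ln n}}(1+o(1))$,

  2. $c_1 \frac{n}{\sqrt{\ln n}}(1+o(1)) \leq F_n \leq c_2 \frac{n}{\sqrt{\ln n}}(1+o(1))$ with high probability.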

Theorem 2

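Let $G_n$ be the total number of non-isomorphic fringe subtrees in a random binary search tree with $n$ leaves. For two constants $c_3$ and $c_4$, the following holds:

  1. $c_3 \frac{n}{\ln n}(1+o(1)) \leq \mathbb{E}(G_n) \leq c_4 \frac{n}{\ln n}(1+o(1))$,

  2. $c_3 \frac{n}{\ln n}(1+o(1)) \leq G_n \leq c_4 \frac{n}{\ln n}(1+o(1))$ with high probability.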

To prove the above Theorems 1 and 2, we refine techniques from [26]. Our proof technique also applies to the problem of estimating the number of distinct ordered fringe subtrees in uniformly random binary trees or in random binary search trees. In this case, upper and lower bounds for the expected value have already been proven by other authors. Our new contribution is to show that they also hold with high probability.

Theorem 3

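Let $H_n$ denote the total number of distinct fringe subtrees in a uniformly random ordered binary tree with $n$ leaves. Then, for the constant $c = 2\sqrt{\ln 4/\pi}$, the following holds:

  1. $\mathbb{E}(H_n) = c \frac{n}{\sqrt{\ln n}}(1+o(1))$,

  2. $H_n = c \frac{n}{\sqrt{\ln n}}(1+o(1))$ with high probability.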

Here, the first part (i) was already shown in [19] and [26], part (ii) is new. Similarly, we are able to strengthen the results of [10] and [27]:

Theorem 4

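Let $J_n$ be the total number of distinct fringe subtrees in a random binary search tree with $n$ leaves. For two constants $c_5$ and $c_6$, the following holds:

  1. $c_5 \frac{n}{\ln n}(1+o(1)) \leq \mathbb{E}(J_n) \leq c_6 \frac{n}{\ln n}(1+o(1))$,

  2. $c_5 \frac{n}{\ln n}(1+o(1)) \leq J_n \leq c_6 \frac{n}{\ln n}(1+o(1))$ with high probability.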

The upper bound in part (i) can already be found in [17] and [10]. Moreover, a lower bound of the form was already shown in [10] for the constant and in [27] for the constant . So our new contributions are part (ii) and the improvement of the lower bound on .

2 Preliminaries

Let $\mathcal{T}$ denote the set of ordered binary trees, i.e. of ordered rooted trees such that each node has either exactly two or no children. We define the size $|t|$ of a binary tree $t \in \mathcal{T}$ as the number of leaves of $t$, and by $\mathcal{T}_k$ we denote the set of binary trees of size $k$ for every integer $k \geq 1$. It is well known that $|\mathcal{T}_k| = C_{k-1}$, where $C_k$ denotes the $k$-th Catalan number [18]: We have

$$C_k = \frac{1}{k+1}\binom{2k}{k} = \frac{4^k}{\sqrt{\pi}\,k^{3/2}}\left(1 + O(1/k)\right), \qquad (1)$$

where the asymptotic growth of the Catalan numbers follows from Stirling’s Formula [18]. Analogously, let $\mathcal{U}$ denote the set of unordered binary trees, i.e. of unordered rooted trees such that each node has either exactly two or no children. The size $|u|$ of an unordered tree $u \in \mathcal{U}$ is again the number of leaves of $u$, and by $\mathcal{U}_k$ we denote the set of unordered binary trees of size $k$. We have $|\mathcal{U}_k| = W_k$, where $W_k$ denotes the $k$-th Wedderburn-Etherington number. Their asymptotic growth is

$$W_k \sim A \cdot k^{-3/2} \cdot b^k \qquad (2)$$

for certain positive constants $A, b$ [4, 16]. In particular, we have $b \approx 2.48325$.
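Both counting sequences are easy to tabulate. The sketch below computes the number of ordered binary trees with $k$ leaves as the Catalan number with index $k-1$, and the Wedderburn-Etherington numbers via the standard recurrence that splits an unordered tree at its root into two unordered subtrees. The function names are chosen here for illustration.

```python
from math import comb

def catalan(m):
    """The m-th Catalan number, binom(2m, m) / (m + 1)."""
    return comb(2 * m, m) // (m + 1)

def wedderburn_etherington(limit):
    """Return [W_0, ..., W_limit], where W_k counts unordered binary trees with k leaves."""
    w = [0, 1]  # W_0 = 0 (no tree without leaves), W_1 = 1 (the single leaf)
    for k in range(2, limit + 1):
        # Root subtrees of different sizes i < k - i: unordered pair of sizes.
        total = sum(w[i] * w[k - i] for i in range(1, (k - 1) // 2 + 1))
        if k % 2 == 0:
            # Both root subtrees have k/2 leaves: multisets of size two.
            half = w[k // 2]
            total += half * (half + 1) // 2
        w.append(total)
    return w

ordered_counts = [catalan(k - 1) for k in range(1, 11)]   # sizes k = 1..10
unordered_counts = wedderburn_etherington(10)[1:]          # sizes k = 1..10
```

Already for ten leaves there are 4862 ordered but only 98 unordered binary trees, which reflects the gap between the two exponential growth rates.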

A fringe subtree of a binary tree is a subtree consisting of a node and all its descendants. For a binary tree $t$ and a given node $v$ of $t$, let $t(v)$ denote the fringe subtree of $t$ rooted at $v$. Two fringe subtrees are called distinct if they are distinct as ordered binary trees.

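Every ordered binary tree $t$ with $k$ leaves can be considered as an unordered binary tree with $k$ leaves by simply forgetting the ordering on $t$’s nodes. If two ordered binary trees $t_1, t_2$ correspond to the same unordered tree $u$, we call them isomorphic: Thus, we obtain a partition of the set of ordered binary trees with $k$ leaves into isomorphism classes. If two binary trees $t_1, t_2$ belong to the same isomorphism class, we can obtain $t_1$ from $t_2$ and vice versa by reordering the children of some of $t_1$’s (respectively, $t_2$’s) inner nodes. An inner node $v$ of an ordered or unordered binary tree $t$ is called a symmetrical node if the fringe subtrees rooted at $v$’s children are isomorphic. Let $\operatorname{sym}(t)$ denote the number of symmetrical nodes of $t$. The cardinality of the automorphism group of $t$ is given by $|\operatorname{Aut}(t)| = 2^{\operatorname{sym}(t)}$. Thus, by the orbit-stabilizer theorem, there are $2^{k-1-\operatorname{sym}(t)}$ many ordered binary trees in the isomorphism class of an ordered binary tree $t$ with $k$ leaves, and likewise $2^{k-1-\operatorname{sym}(u)}$ many ordered representations of an unordered binary tree $u$ with $k$ leaves.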
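The relation between symmetrical nodes and the size of an isomorphism class can be checked by brute force for small trees. The sketch below assumes that a tree with $k$ leaves and $s$ symmetrical nodes has $2^{k-1-s}$ ordered representations (all $2^{k-1}$ ways of swapping children at the $k-1$ inner nodes, divided by an automorphism group of size $2^s$). The tuple encoding and the names `ordered_trees`, `canonical`, `sym` and `class_size_formula_holds` are illustrative choices.

```python
from collections import Counter
from functools import lru_cache

LEAF = ()  # assumed encoding: a leaf is (), an inner node is a pair (left, right)

@lru_cache(maxsize=None)
def ordered_trees(k):
    """All ordered binary trees with k leaves."""
    if k == 1:
        return (LEAF,)
    return tuple((l, r)
                 for i in range(1, k)
                 for l in ordered_trees(i)
                 for r in ordered_trees(k - i))

def canonical(t):
    """Canonical form of the unordered tree underlying t (children sorted)."""
    if t == LEAF:
        return LEAF
    return tuple(sorted((canonical(t[0]), canonical(t[1]))))

def sym(t):
    """Number of symmetrical nodes: inner nodes whose two child subtrees are isomorphic."""
    if t == LEAF:
        return 0
    left, right = t
    here = 1 if canonical(left) == canonical(right) else 0
    return here + sym(left) + sym(right)

def class_size_formula_holds(k):
    """Check that every isomorphism class of size-k trees has 2^(k-1-sym) members."""
    class_sizes = Counter(canonical(t) for t in ordered_trees(k))
    return all(class_sizes[canonical(t)] == 2 ** (k - 1 - sym(t))
               for t in ordered_trees(k))
```

Calling `class_size_formula_holds(k)` for small `k` enumerates all ordered trees of that size, so it is only practical for modest values of `k`.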

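We consider two types of probability distributions on the set of ordered binary trees of size $n$:

  • The uniform probability distribution on this set, that is, every binary tree of size $n$ is assigned the same probability $1/C_{n-1}$, where $C_{n-1}$ is the $(n-1)$-th Catalan number. A random variable taking values in the set of ordered binary trees of size $n$ according to the uniform probability distribution is called a uniformly random (ordered) binary tree of size $n$.

  • The probability distribution induced by the so-called Binary Search Tree Model (see e.g. [12, 17]): The corresponding probability mass function $P_{\mathrm{bst}}$ is given by

    $$P_{\mathrm{bst}}(t) = \prod_{v} \frac{1}{|t(v)| - 1} \qquad (3)$$

    for every ordered binary tree $t$ of size $n$, where the product ranges over the inner nodes $v$ of $t$ and $|t(v)|$ denotes the number of leaves of the fringe subtree rooted at $v$. A random variable taking values in the set of ordered binary trees of size $n$ according to this probability mass function is called a random binary search tree of size $n$.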
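As a sanity check of the Binary Search Tree Model, the following sketch compares two descriptions of it for small sizes: the distribution of tree shapes obtained by inserting a uniformly random permutation of $n-1$ keys, and the usual product form in which every inner node contributes a factor $1/(\ell - 1)$, with $\ell$ the number of leaves below it. The product form is assumed here as the standard statement of the model; the tuple encoding and all function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

LEAF = ()  # assumed encoding: a leaf is (), an inner node is a pair (left, right)

def leaves(t):
    """Number of leaves of t."""
    return 1 if t == LEAF else leaves(t[0]) + leaves(t[1])

def p_bst(t):
    """Product over inner nodes v of 1 / (number of leaves below v, minus 1)."""
    if t == LEAF:
        return Fraction(1)
    return Fraction(1, leaves(t) - 1) * p_bst(t[0]) * p_bst(t[1])

def shape_from_keys(keys):
    """Shape of the binary search tree built by inserting `keys` in order.

    Keys become inner nodes; empty positions become leaves.
    """
    if not keys:
        return LEAF
    root = keys[0]
    smaller = [x for x in keys[1:] if x < root]
    larger = [x for x in keys[1:] if x > root]
    return (shape_from_keys(smaller), shape_from_keys(larger))

def permutation_distribution(n):
    """Exact distribution of shapes with n leaves under uniform random insertion order."""
    perms = list(permutations(range(1, n)))
    counts = Counter(shape_from_keys(list(p)) for p in perms)
    return {t: Fraction(c, len(perms)) for t, c in counts.items()}
```

For four leaves, for instance, the balanced shape arises from two of the six insertion orders, while each of the four remaining shapes arises from exactly one, so the model is visibly non-uniform.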

Before we start with proving our main results, we need two preliminary lemmas on the number of fringe subtrees in uniformly random ordered binary trees and in random binary search trees:

Lemma 1

Let be positive real numbers with . For every positive integer with , let be a set of ordered binary trees with leaves. We denote the cardinality of by . Let denote the (random) number of fringe subtrees with leaves in a uniformly random ordered binary tree with leaves that belong to . Moreover, let denote the (random) number of arbitrary fringe subtrees with more than leaves in a uniformly random ordered binary tree with leaves. We have

  1. for all with , the -constant being independent of ,

  2. for all with , again with an -constant that is independent of ,

  3. and

  4. with high probability, the following statements hold simultaneously:

    • for all with ,

    • .

We emphasize (since it will be important later) that the inequality in part (4), item (i), does not only hold with high probability for each individual , but that it is satisfied with high probability for all in the given range simultaneously.

Proof

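(1) Recall first that the number of ordered binary trees with $n$ leaves is the Catalan number $C_{n-1}$. Write $\mathcal{S}_k$ for the given set of trees with $k$ leaves and $s_k$ for its cardinality. We observe that every occurrence of a fringe subtree in $\mathcal{S}_k$ in a tree with $n$ leaves can be obtained by choosing an ordered tree with $n-k+1$ leaves, picking one of the $n-k+1$ leaves and replacing it by a tree in $\mathcal{S}_k$. Thus the total number of occurrences is

$$s_k \, (n-k+1) \, C_{n-k}.$$

Consequently, the average number is

$$\frac{s_k \, (n-k+1) \, C_{n-k}}{C_{n-1}} = s_k \, 4^{1-k} \, n \left(1 + O(k/n)\right)$$

by Stirling’s formula (the $O$-constant being independent of $k$ in the indicated range).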
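The counting argument in part (1) can be verified exhaustively for small parameters. The sketch below enumerates all ordered binary trees with $n$ leaves, counts the fringe subtrees that lie in a chosen set of trees with $k$ leaves, and compares the result with the product (size of the set) × $(n-k+1)$ × (number of ordered trees with $n-k+1$ leaves). The symbols and helper names are assumptions for this illustration.

```python
from functools import lru_cache
from math import comb

LEAF = ()  # assumed encoding: a leaf is (), an inner node is a pair (left, right)

def catalan(m):
    """The m-th Catalan number; ordered binary trees with m + 1 leaves."""
    return comb(2 * m, m) // (m + 1)

@lru_cache(maxsize=None)
def ordered_trees(k):
    """All ordered binary trees with k leaves."""
    if k == 1:
        return (LEAF,)
    return tuple((l, r)
                 for i in range(1, k)
                 for l in ordered_trees(i)
                 for r in ordered_trees(k - i))

def fringe_subtrees(t):
    """Yield the fringe subtree rooted at every node of t."""
    yield t
    if t != LEAF:
        yield from fringe_subtrees(t[0])
        yield from fringe_subtrees(t[1])

def total_occurrences(n, subset):
    """Occurrences of members of `subset` as fringe subtrees, over all trees with n leaves."""
    return sum(1 for t in ordered_trees(n)
               for s in fringe_subtrees(t) if s in subset)

def predicted_occurrences(n, k, subset_size):
    """Count from the bijection: tree with n-k+1 leaves, a marked leaf, a member of the set."""
    return subset_size * (n - k + 1) * catalan(n - k)
```

Dividing either count by `catalan(n - 1)` gives the exact expected number of such fringe subtrees in a uniformly random tree with `n` leaves.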

(2) The variance is determined in a similar fashion: we first count the total number of pairs of fringe subtrees in

that appear in the same ordered tree with leaves. Each such pair can be obtained as follows: take an ordered tree with leaves, pick two leaves, and replace them by fringe subtrees in . The total number is thus

giving us

again by Stirling’s formula. The second moment and the variance are now derived from this formula in a straightforward fashion: We find

and thus, as ,

(3) To obtain the estimate for , we observe that the average total number of fringe subtrees with leaves is

where the estimate follows from Stirling’s formula again for . Summing over all , we get


(4) For the second part, we apply Chebyshev’s inequality to obtain concentration of :

Hence, by the union bound, the probability that the stated inequality fails for any in the given range is only , proving that the first statement holds with high probability. Finally, Markov’s inequality implies that

showing that the second inequality holds with high probability as well.

For the number of fringe subtrees in random binary search trees, a very similar lemma holds:

Lemma 2

Let be positive real numbers with and let and denote positive integers. Moreover, for every , let be a set of ordered binary trees with leaves and let denote the probability that a random binary search tree is contained in , that is, , where the sum is taken over all binary trees in . Let denote the (random) number of fringe subtrees with leaves in a random binary search tree with leaves that belong to . Moreover, let denote the (random) number of arbitrary fringe subtrees with more than leaves in a random binary search tree with leaves. We have

  • for ,

  • for all with , where the -constant is independent of ,

  • and

  • with high probability, the following statements hold simultaneously:

    • for all with ,

    • .

Proof

(1) In order to estimate , we define as the (random) number of arbitrary fringe subtrees with leaves in a random binary search tree with leaves. That is, for . Applying the law of total expectation, we find

As conditioned on for some integer

is binomially distributed with parameters

and , we find and hence

With (see for example [14]), the statement follows.

(2) In order to estimate , we apply the law of total variance:

Again as conditioned on for some integer is binomially distributed with parameters and , we find and . Thus, we have

With and

(see for example [14]), this yields


(3) In order to estimate , first observe that

With for and , this yields

(4) For the second part of the statement, we apply Chebyshev’s inequality to obtain:

Hence, by the union bound, the probability that the stated inequality fails for any in the given range is , proving that the given statement holds with high probability. Furthermore, with Markov’s inequality, we find

Thus, the second inequality holds with high probability as well.

3 Fringe Subtrees in Uniformly Random Binary Trees

3.1 Ordered Fringe Subtrees

We provide the proof of Theorem 3 first, since it is simplest and provides us with a template for the other proofs. Basically, it is a refinement of the proof for the corresponding special case of Theorem 3.1 in [26]. In the following sections, we refine the argument further to prove Theorems 1, 2 and 4.

Proof (Proof of Theorem 3)

We prove the statement in two steps: In the first step, we show that the upper bound holds for both in expectation and with high probability. In the second step, we prove the corresponding lower bound.

The upper bound: Let . The number of distinct fringe subtrees in a uniformly random ordered binary tree with leaves equals (i) the number of such distinct fringe subtrees of size at most plus (ii) the number of such distinct fringe subtrees of size greater than . We upper-bound (i) by the number of all ordered binary trees of size at most (irrespective of their occurrence as fringe subtrees), which is

This upper bound holds deterministically. Furthermore, we upper-bound (ii) by the total number of fringe subtrees of size greater than occurring in the tree: We apply Lemma 1 with and and let denote the set , such that , to obtain:

in expectation and with high probability as well, as the estimate from Lemma 1 (part (4)) holds with high probability simultaneously for all in the given range. As we have

we can combine the two bounds to obtain the upper bound on stated in Theorem 3, both in expectation and with high probability.

The lower bound: Again, let and . From the first part of the proof, we find that the main contribution to the total number of fringe subtrees in a uniformly random binary tree of size comes from fringe subtrees of sizes with . Hence, in order to lower-bound the number of distinct fringe subtrees in a uniformly random binary tree with leaves, we only count distinct fringe subtrees of sizes with and show that we did not overcount too much in the first part of the proof by upper-bounding this number by the total number of fringe subtrees of sizes . To this end, let denote the number of pairs of identical fringe subtrees of size in a uniformly random ordered binary tree of size . Each such pair can be obtained as follows: Take an ordered tree with leaves, pick two leaves, and replace them by the same ordered binary tree of size . The total number of such pairs of identical fringe subtrees of size is thus

By dividing by , i.e. the total number of binary trees of size , we thus obtain the expected value:

Thus, we find

If a binary tree of size occurs times as a fringe subtree in a uniformly random binary tree of size , it contributes to the random variable . Since for all non-negative integers , we find that is a lower bound on the number of distinct fringe subtrees with leaves. Hence, we have

The second sum is in expectation and thus with high probability as well by the Markov inequality. As the first sum is both in expectation and with high probability by our estimate from the first part of the proof, the statement of Theorem 3 follows.
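The lower-bound step rests on an elementary inequality: if a tree occurs $m$ times as a fringe subtree, then $m - \binom{m}{2} \leq 1$, so the total number of fringe subtrees of a given size minus the number of pairs of identical ones never exceeds the number of distinct ones. A small Python check of this, with an assumed tuple encoding and illustrative function names:

```python
from collections import Counter
from functools import lru_cache
from math import comb

LEAF = ()  # assumed encoding: a leaf is (), an inner node is a pair (left, right)

@lru_cache(maxsize=None)
def ordered_trees(k):
    """All ordered binary trees with k leaves."""
    if k == 1:
        return (LEAF,)
    return tuple((l, r)
                 for i in range(1, k)
                 for l in ordered_trees(i)
                 for r in ordered_trees(k - i))

def fringe_counts_by_size(tree, k):
    """Multiset of the fringe subtrees of `tree` that have exactly k leaves."""
    counts = Counter()

    def visit(t):
        size = 1 if t == LEAF else visit(t[0]) + visit(t[1])
        if size == k:
            counts[t] += 1
        return size

    visit(tree)
    return counts

def total_minus_pairs(tree, k):
    """Return (total - identical pairs, number of distinct) for fringe subtrees of size k."""
    counts = fringe_counts_by_size(tree, k)
    total = sum(counts.values())
    pairs = sum(comb(m, 2) for m in counts.values())
    return total - pairs, len(counts)
```

The bound is tight when every fringe subtree of the given size occurs at most twice, and becomes very weak (even negative) for small sizes, where repetitions are frequent. This is why the argument only counts sizes above the cut-point.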

As the main idea of the proof is to split the number of distinct fringe subtrees into the number of distinct fringe subtrees of size at most $k_0$ plus the number of distinct fringe subtrees of size greater than $k_0$ for some suitably chosen integer $k_0$, this type of argument is called a cut-point argument and the integer $k_0$ is called the cut-point (see [17]). This basic technique has been applied in several previous papers to similar problems (see for instance [10], [17], [26], [27]). Moreover, we remark that the statement of Theorem 3 can be easily generalized to simply generated families of trees.

3.2 Unordered Fringe Subtrees

In this subsection, we prove Theorem 1. For this, we refine the cut-point argument we applied in the proof of Theorem 3: In particular, for the lower bound on , we need a result due to Bóna and Flajolet [4] on the number of automorphisms of a uniformly random ordered binary tree. It is stated for random phylogenetic trees in [4], but the two probabilistic models are equivalent.

Theorem 5 ([4], Theorem 2)

Consider a uniformly random ordered binary tree with leaves, and let

be the cardinality of its automorphism group. The logarithm of this random variable satisfies a central limit theorem: For certain positive constants

and , we have

for every real number . The numerical value of the constant is .

With Theorem 5, we are able to upper-bound the probability that two fringe subtrees of the same size are isomorphic in our proof of Theorem 1:

Proof (Proof of Theorem 1)

We prove the statement in two steps: First, we show that the upper bound on stated in Theorem 1 holds both in expectation and with high probability, then we prove the respective lower bound.

The upper bound: The proof for the upper bound in Theorem 1 exactly matches the first part of the proof of Theorem 3, except that we choose a different cut-point: Let , where is the constant in the asymptotic formula (2) for the Wedderburn-Etherington numbers. We then find

both in expectation and with high probability, where the estimates for and follow again from Lemma 1. We have .

The lower bound: As a consequence of Theorem 5, the probability that the cardinality of the automorphism group of a uniformly random binary tree of size satisfies tends to as . We define as the set of ordered trees with leaves that do not satisfy this inequality, so that . Our lower bound is based on counting only fringe subtrees in for suitable . The reason for this choice is that we have an upper bound on the number of ordered binary trees in the same isomorphism class for every tree in . Recall that the number of possible ordered representations of an unordered binary tree with leaves is given by by the orbit-stabiliser theorem. Hence, the number of ordered binary trees in the same isomorphism class as a tree is bounded above by .

Now set for some positive constant , and consider only fringe subtrees that belong to , where . By Lemma 1, the number of such fringe subtrees in a random ordered binary tree with leaves is

both in expectation and with high probability. Since , the number of fringe subtrees that belong to in a random ordered binary tree of size becomes . We show that most of these trees are the only representatives of their isomorphism classes as fringe subtrees. To this end, we consider all fringe subtrees in for some that satisfies . Let the sizes of the isomorphism classes of trees in be , so that . By definition of , we have for every . Let us condition on the event that their number is equal to for some . Each of these fringe subtrees

follows a uniform distribution among the elements of

, so the probability of being in an isomorphism class with elements is . Moreover, the fringe subtrees are also all independent. Let be the number of pairs of isomorphic trees among the fringe subtrees with leaves. We have

Since this holds for all , the law of total expectation yields

Since , we find that

Thus

As in the previous proof, we see that is a lower bound on the number of non-isomorphic fringe subtrees with leaves. This gives us

The second sum is negligible since it is in expectation and thus also with high probability by the Markov inequality. For the first sum, a calculation similar to that for the upper bound shows that it is

both in expectation and with high probability. Since is arbitrary, we can choose any constant smaller than for .
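The numerical constants quoted in the abstract can be reproduced from a simple heuristic reading of the cut-point argument. If the number of relevant classes of trees with $k$ leaves grows like $\beta^k$, then cutting at $\log_\beta n$ leads to a constant of the form $2\sqrt{\ln \beta/\pi}$. The closed forms below are a reconstruction under that assumption, not formulas quoted from the text: $\beta = 4$ for ordered trees, $\beta = b$ (the Wedderburn-Etherington growth rate) for the upper bound, and $\beta = 2^{1+\gamma}$ for the lower bound, where $\gamma$ is the assumed linear growth rate of $\log_2$ of the automorphism group size. The numerical values of `B_WEDDERBURN` and `GAMMA` are taken from the literature as assumptions.

```python
from math import log, pi, sqrt

# Assumed numerical inputs (not stated explicitly in the surrounding text):
B_WEDDERBURN = 2.4832535362  # exponential growth rate of the Wedderburn-Etherington numbers
GAMMA = 0.2710416936         # assumed mean of log2|Aut| per leaf in a uniformly random tree

def cut_point_constant(beta):
    """Heuristic constant 2 * sqrt(ln(beta) / pi) for classes growing like beta^k."""
    return 2 * sqrt(log(beta) / pi)

c_ordered = cut_point_constant(4)                # distinct ordered fringe subtrees
c_upper = cut_point_constant(B_WEDDERBURN)       # upper bound, non-isomorphic case
c_lower = cut_point_constant(2 ** (1 + GAMMA))   # lower bound, non-isomorphic case
```

Under these assumptions, `c_lower` and `c_upper` agree with the two constants stated in the abstract to within the precision of the assumed inputs, and both lie below `c_ordered`, as expected when isomorphic fringe subtrees are merged.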

4 Fringe Subtrees in Random Binary Search Trees

In this section, we prove our results presented in Theorem 2 and Theorem 4 on the number of distinct, respectively, non-isomorphic fringe subtrees in a random binary search tree. In order to show the respective lower bounds of Theorem 2 and Theorem 4, we need two theorems similar to Theorem 5: The first one shows that the logarithm of the random variable , where denotes a random binary search tree of size , satisfies a central limit theorem and is needed to estimate the probability that two fringe subtrees in a random binary search tree are identical. The second one transfers the statement of Theorem 5 from uniformly random binary trees to random binary search trees and is needed in order to estimate the probability that two fringe subtrees in a random binary search tree are isomorphic. The first of these two central limit theorems is shown in [15]:

Theorem 6 ([15], Theorem 4.1)

Consider a random binary search tree with leaves, and let . The logarithm of this random variable satisfies a central limit theorem: For certain positive constants and , we have

for every real number . The numerical value of the constant is

The second of these two central limit theorems follows from a general theorem devised by Holmgren and Janson [22]: Let denote a function mapping an ordered binary tree to a real number. Moreover, given such a mapping , define by

The theorem by Holmgren and Janson states:

Theorem 7 ([22], Theorem 1.14)

Let be a random binary search tree of size . If

then for certain constants and , we have

Moreover, if , then

for every real number . In particular, we have

Note that in [22], the equivalent binary search model is considered that allows binary trees to have unary nodes, so that the index of summation has to be shifted in the sum defining . Moreover, note that if we set for and otherwise, we have

by definition of in (3), and thus Theorem 6 follows as a special case of Theorem 7. This special case is also considered in Example 8.13 of [22].

As our main application of Theorem 7, we transfer the statement of Theorem 5 from uniformly random binary trees to random binary search trees, that is, we show that if the random number denotes the size of the automorphism group of a random binary search tree with leaves, then the logarithm of this random variable satisfies a central limit theorem as well. For this, we define the function in Theorem 7 by

We thus have

that is, evaluates to the number of symmetrical nodes in . Recall that equals the size of the automorphism group of . It is not difficult to check that satisfies the conditions of Theorem 7: As for every , we have and thus as well, so that the assumptions of Theorem 7 are satisfied. In order to determine the corresponding value , we start with estimating the expectation