Trees are data structures commonly used in mathematics and computer science. The problem of generating them uniformly at random has already been well studied [comparison], resulting in methods with potential applications in areas such as software testing, machine learning and statistics, where unbiased sampling is usually desired. The algorithms generating random trees can be characterized by the types of constraints they accept. These constraints usually define a class of objects from which we want to sample, giving each object an equal chance of being selected. Algorithms are known that randomly select objects from the following classes of trees:
unordered trees restricted by an expected number of nodes [comparison],
ordered trees restricted by an expected number of nodes [ordered],
ordered trees restricted by an expected number of nodes and their degree [korsh-kary],
binary trees restricted by an expected number of nodes [korsh] [remy],
and some more.
In this article we present a method for random selection of trees from a class restricted by a sequence of all outdegrees that occur in the tree. The method is heavily inspired by a binary tree generating algorithm proposed by Korsh [korsh]. Here we use a different encoding in order to represent nodes with various outdegrees. We also show that most of the desired properties of the Korsh method still hold.
It is worth noting that if the outdegrees are interpreted as arities, the algorithm can be used for generating random syntactic trees for arithmetical and logical expressions.
2 The algorithm
The input array of non-negative integers is expected to contain the outdegrees of all the nodes that will be present in the resulting tree. As explained in Section 3, the constraints enforced by the contents of the input array must be realistic, i.e. at least one tree that meets the requirements must exist. Lemma 1 provides a simple means of checking whether it does.
The convention used here is that the arrays are indexed starting with and the notation represents a subarray of starting from the element under index and ending with the element under the index , both inclusive. The operator used with arrays represents concatenation.
The result will be encoded in a prefix form. For instance, for an input a potential result represents the following tree:
The algorithm contains three components that are usually implemented as loops: the random shuffle (line 1), the search for the point of rotation (line 1) and the rotation itself (line 1). The other parts of the algorithm can be assumed to run in constant time. The search and the rotation are linear, and procedures that perform a random shuffle in linear time are also known [fisher-yates], so the overall complexity of the algorithm presented here is $O(n)$, where $n$ is the length of the input array.
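The three steps described above can be illustrated with a minimal Python sketch. It is not the original listing: the function name `random_tree` and the parameter `rng` are hypothetical, and the rotation point is located as the first position where the running sum of (outdegree − 1) reaches its minimum, which is an assumption about how the search may be carried out.

```python
import random


def random_tree(outdegrees, rng=random):
    """Return a random tree, in prefix form, built from the given outdegrees.

    Sketch only: `rng` is any object with a `shuffle` method
    (the `random` module itself or a `random.Random` instance).
    """
    seq = list(outdegrees)
    n = len(seq)
    # A tree exists only if (number of nodes) - (sum of outdegrees) == 1.
    if n == 0 or any(d < 0 for d in seq) or n - sum(seq) != 1:
        raise ValueError("no tree has exactly these outdegrees")

    # Step 1: random shuffle, linear time.
    rng.shuffle(seq)

    # Step 2: search for the point of rotation, linear time.
    # `balance` is the running sum of (outdegree - 1); the expression must
    # start right after the first position where it reaches its minimum.
    balance = 0
    lowest = 0
    start = 0
    for i, d in enumerate(seq):
        balance += d - 1
        if balance < lowest:
            lowest = balance
            start = i + 1
    start %= n  # a minimum at the very end means no rotation is needed

    # Step 3: the rotation itself, linear time.
    return seq[start:] + seq[:start]
```

For example, `random_tree([2, 2, 0, 0, 0])` returns one of the two prefix encodings of a binary tree with two internal nodes.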
Definition 1. Let $\Delta(N)$ denote the number of elements of a set of nodes $N$ minus the sum of their outdegrees, i.e. $\Delta(N) = |N| - \sum_{v \in N} \deg(v)$.
Lemma 1. Let $N$ be a finite set of nodes. A tree can be constructed using all elements of $N$ iff $\Delta(N) = 1$.
Proof. Let us use induction to prove the forward part. Let $T$ be a tree with a single node; obviously, if $N$ contains only that node, then $\Delta(N) = 1$.
Now let $T$ be a larger tree, $N$ the set of its nodes and $r$ the root of $T$. We assume that the statement holds for trees smaller than $T$. We know that $\Delta(N \setminus \{r\}) = \deg(r)$, because $r$ has $\deg(r)$ subtrees and the statement holds for all of them. Since $N$ is a union of $\{r\}$ and $N \setminus \{r\}$, we can see that
$$\Delta(N) = \Delta(\{r\}) + \Delta(N \setminus \{r\}) = (1 - \deg(r)) + \deg(r) = 1,$$
and so the statement holds also for $T$.
In the backward part, the proof also goes by induction. If $N$ is a minimal set which meets the sufficient condition, it contains only a single leaf node. In this case $|N| = 1$ and the node itself forms a tree, so the statement holds.
Now we are going to show that for $|N| > 1$, if the statement holds for all sets smaller than $N$, then it also holds for $N$. Let $m$ be an element of $N$ with a maximal outdegree. Since $\Delta(N) = 1$ and leaves are the only elements of $N$ that increase the value of $\Delta$, the set must contain at least $\deg(m)$ leaves, otherwise $\Delta(N)$ would not be a positive number. Let us create a tree $T'$ with $m$ as the root and $\deg(m)$ of the leaves as its children. Now we will treat $T'$ as a single leaf node and define a set $N'$ which consists of $T'$ and the elements of $N$ that were not used for $T'$. $N'$ contains $\deg(m)$ fewer nodes than $N$ and a sum of arities lowered by $\deg(m)$, so $\Delta(N')$ can be calculated as
$$\Delta(N') = \bigl(|N| - \deg(m)\bigr) - \Bigl(\sum_{v \in N} \deg(v) - \deg(m)\Bigr) = \Delta(N) = 1,$$
and since $N'$ is a smaller set and a tree for it can be trivially converted into a tree for $N$, the inductive step is established. ∎
Remark 1. Later we will use $\Delta$ in an analogous way also for sequences.
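Lemma 1 reduces the feasibility test for an input to a single subtraction: the number of nodes minus the sum of their outdegrees must equal 1. A small Python sketch of such a check follows; the names `delta` and `can_form_tree` are hypothetical and chosen only for this illustration.

```python
def delta(outdegrees):
    """Number of nodes minus the sum of their outdegrees."""
    return len(outdegrees) - sum(outdegrees)


def can_form_tree(outdegrees):
    """True iff some tree uses exactly the given outdegrees (Lemma 1)."""
    return (
        len(outdegrees) > 0
        and all(d >= 0 for d in outdegrees)
        and delta(outdegrees) == 1
    )
```

For instance, `can_form_tree([2, 0, 0])` holds, while `can_form_tree([2, 0])` does not, because the node of outdegree two would be left with only one child.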
Definition 2. A well-formed expression in Polish notation (or, for short, an expression) is either a symbol representing a variable or a constant, or a symbol representing an operator followed by a number of expressions equal to the arity of that operator.
Remark 2. It is worth noting that in Polish notation the only situation in which an expression is not well-formed is when some of its operators are followed by too few subexpressions. If an operator is followed by too many of them, then the whole string is not well-formed, but it has a prefix which is.
In this article we will use Polish notation for encoding rooted trees, where the leaves are represented by the constant 0 and the other nodes are operators whose operands are their children. We denote the nodes by digits that correspond to their outdegrees.
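To make the encoding concrete, the following Python sketch decodes a prefix sequence of outdegrees into a nested structure. The function name `parse_prefix` and the representation of a node as a pair `(outdegree, children)` are assumptions made for this illustration; the recursion is adequate for small examples only.

```python
def parse_prefix(degrees):
    """Decode a prefix (Polish notation) outdegree sequence into a tree.

    Each node is returned as a pair (outdegree, list_of_children).
    Raises ValueError if the sequence is not a well-formed expression.
    """
    def parse(pos):
        if pos >= len(degrees):
            raise ValueError("an operator is followed by too few subexpressions")
        d = degrees[pos]
        pos += 1
        children = []
        for _ in range(d):
            child, pos = parse(pos)
            children.append(child)
        return (d, children), pos

    tree, end = parse(0)
    if end != len(degrees):
        # Only a proper prefix is well-formed (compare the remark above).
        raise ValueError("symbols left over after a complete expression")
    return tree
```

For example, `parse_prefix([2, 0, 1, 0])` yields a root with two children: a leaf, and a node of outdegree one whose only child is a leaf.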
Definition 3. If a string $s = uv$ has a postfix $v$ of length $k$, then its $k$-rotation is the string $vu$.
Lemma 2. For a well-formed expression of length $n$, each of its non-identity rotations is not well-formed.
Proof. By Definition 2, every operator in a well-formed expression must be followed by an exact number of subexpressions, so rotating a postfix of the string to its beginning has to leave at least one of the operators without some of its operands. ∎
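The rotation and the claim of the lemma can be tried out on a small example. In the sketch below, `rotate` and `is_well_formed` are hypothetical helper names; the well-formedness test counts how many subexpressions are still expected while scanning the sequence from left to right, and the input is assumed to be non-empty.

```python
def rotate(seq, k):
    """k-rotation: move the postfix of length k to the beginning."""
    k %= len(seq)
    cut = len(seq) - k
    return seq[cut:] + seq[:cut]


def is_well_formed(seq):
    """Check whether a sequence of outdegrees is one well-formed expression."""
    expected = 1  # subexpressions still needed
    for d in seq:
        if expected == 0:
            return False  # the expression ended before the string did
        expected += d - 1
    return expected == 0


expr = [3, 0, 2, 0, 0, 1, 0]
# The expression itself is well-formed, none of its proper rotations is.
rotation_status = [is_well_formed(rotate(expr, k)) for k in range(len(expr))]
```

Here `rotation_status` is `True` only at index 0, the identity rotation.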
Lemma 3. Every string that is not a well-formed expression, but could be reordered into a well-formed expression, is a rotation of a well-formed expression.
Proof. First let us denote such a string by $s$ and notice that it always begins with a prefix containing $k$ disjoint well-formed expressions, and only the postfix $p$ that follows them is a single expression that is not well-formed. This is a direct consequence of Remark 2.
Now let us analyze $p$. First we define $d = \Delta(p)$, i.e. the number of symbols of $p$ minus the sum of their arities. We know that $d \le 0$, as the operators of $p$ lack operands to become a correct expression. We could fix $p$ if we placed $1 - d$ well-formed expressions after it. Now let us notice that $k$ must equal $1 - d$, as the assumption on the whole string is that we can build a single well-formed expression out of its symbols, meaning that $\Delta(s) = k + d = 1$. We have previously shown that $p$ is the only expression in $s$ that lacks operands, so it is guaranteed that $s$ contains exactly $1 - d$ correct expressions; therefore the $|p|$-rotation of $s$ is a well-formed expression. ∎
Theorem 1. Given a string $s$ that can become a well-formed expression representing a tree by having its characters rearranged, we can select one of these representations in a uniformly random way by first randomly selecting one of the permutations of $s$, and then fixing it by an appropriate rotation.
Proof. It follows from Lemma 3 that every incorrect expression in the set of permutations of $s$ is a rotation of a well-formed one, and Lemma 2 guarantees that every well-formed expression has the same number of incorrect rotations. Consequently, randomly choosing a permutation of $s$ and fixing it using the method from Lemma 3 guarantees a uniformly random selection of a tree. ∎
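For small inputs the uniformity claim can be checked exhaustively rather than statistically: we enumerate every permutation of the input, fix each one by rotation, and count how often each tree is produced. The sketch below is only an illustration; `fix_by_rotation` and `count_fixed_permutations` are hypothetical names, and the rotation point is found through the minimum of the running sum of (outdegree − 1), which is an assumption about the search procedure that yields the same unique well-formed rotation.

```python
from collections import Counter
from itertools import permutations


def fix_by_rotation(seq):
    """Rotate a sequence of outdegrees into its well-formed rotation."""
    seq = list(seq)
    balance, lowest, start = 0, 0, 0
    for i, d in enumerate(seq):
        balance += d - 1
        if balance < lowest:
            lowest = balance
            start = i + 1
    start %= len(seq)
    return seq[start:] + seq[:start]


def count_fixed_permutations(outdegrees):
    """Count, for every tree, how many permutations are fixed into it.

    `permutations` treats equal outdegrees at different positions as
    distinct, so all n! orderings are enumerated.
    """
    return Counter(
        tuple(fix_by_rotation(p)) for p in permutations(outdegrees)
    )


counts = count_fixed_permutations([2, 2, 0, 0, 0])
```

For this input `counts` contains the two binary trees `(2, 2, 0, 0, 0)` and `(2, 0, 2, 0, 0)`, each reached from 60 of the 120 permutations.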
The methods used in the proof of correctness allow us to make some side observations. For instance, every well-formed expression of length $n$ has $n - 1$ incorrect rotations, and every incorrect expression is a rotation of a well-formed one. This means that the chance of a randomly reordered well-formed expression being well-formed too is $1/n$.
This, in turn, can be used in a derivation of the Catalan numbers. Given a set of $k$ nodes of outdegree two, we can calculate the number of leaves $l$ that we need in order to build a proper tree using them. By Lemma 1 we need $(k + l) - 2k = 1$, hence $l = k + 1$.
Now, since all these $2k + 1$ nodes can be ordered in $(2k + 1)!$ ways, the number of permutations that are pairwise different can be calculated as $\frac{(2k+1)!}{k!\,(k+1)!}$, but only $\frac{1}{2k+1}$ of them are correct, giving us $\frac{(2k)!}{k!\,(k+1)!} = \frac{1}{k+1}\binom{2k}{k}$ binary trees of size $k$.
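The counting argument can be verified numerically. The following Python sketch follows the steps above for $k$ nodes of outdegree two and compares the result with the usual closed form of the Catalan numbers; the function names are hypothetical.

```python
from math import comb, factorial


def binary_tree_count(k):
    """Count binary trees with k nodes of outdegree two via rotations."""
    n = 2 * k + 1  # k internal nodes plus k + 1 leaves
    # Pairwise different orderings of k twos and k + 1 zeros.
    distinct = factorial(n) // (factorial(k) * factorial(k + 1))
    # Exactly one out of every n of them is well-formed.
    return distinct // n


def catalan(k):
    """Closed form of the k-th Catalan number."""
    return comb(2 * k, k) // (k + 1)


first_counts = [binary_tree_count(k) for k in range(6)]
```

Here `first_counts` equals `[1, 1, 2, 5, 14, 42]`, which agrees with `catalan(k)` for the same values of `k`.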