# Recursive PGFs for BSTs and DSTs

We review fundamentals underlying binary search trees and digital search trees, with (atypical) emphasis on recursive formulas for associated probability generating functions. Other topics include higher moments of BST search costs and combinatorics for a certain finite-key analog of DSTs.


## 1 Binary Search Trees

Consider the R program:

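    f <- function(x,V,k)
    {
      # empty subtree: the search stops without a comparison
      if(NROW(V)==0) k <- 0
      else {
        u <- V[1]                            # root of the current subtree
        if(x==u) k <- 1
        else if(x<u) k <- 1+f(x,V[V<u],k)    # descend into the left subtree
        else k <- 1+f(x,V[V>u],k)            # descend into the right subtree
      }
      k
    }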

where $V$ is a random permutation on $\{1, 3, 5, \ldots, 2n-1\}$ and $k$ is initially $0$. To model successful searches, let $x$ be a random odd integer satisfying $1 \le x \le 2n-1$. To model unsuccessful searches, let $x$ be a random even integer satisfying $0 \le x \le 2n$. This scenario is exactly as described in . It is assumed, of course, that $V$ and $x$ are drawn independently with uniform sampling. We begin with even $x$, because this case is simpler, followed by odd $x$.
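The search routine can be checked exhaustively for small trees. The following is a minimal Python sketch (a port, not the author's code), assuming the keys are the odd integers $1, 3, \ldots, 2n-1$, misses are the even integers $0, 2, \ldots, 2n$, and all insertion orders and probes are equally likely. The helper names `search_cost` and `exact_distribution` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def search_cost(x, keys):
    """Number of key comparisons when searching for x in the BST built by
    inserting `keys` in the given order (same logic as the R routine)."""
    cost = 0
    while keys:
        root = keys[0]
        cost += 1
        if x == root:
            break
        # Keep only later keys that fall on the same side of the root as x.
        keys = [v for v in keys[1:] if (v < root) == (x < root)]
    return cost


def exact_distribution(n, successful):
    """Exact cost distribution over all n! insertion orders and all probes
    (odd probes are hits, even probes are misses)."""
    keys = range(1, 2 * n, 2)
    probes = keys if successful else range(0, 2 * n + 1, 2)
    counts = Counter(
        search_cost(x, list(order))
        for order in permutations(keys)
        for x in probes
    )
    total = sum(counts.values())
    return {k: Fraction(c, total) for k, c in sorted(counts.items())}


print(exact_distribution(3, successful=False))
print(exact_distribution(3, successful=True))
```

Each dictionary maps a cost to its probability, that is, to the coefficient of the corresponding power of $z$ in the probability generating function.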

### 1.1 Unsuccessful Search

The probability generating function $f_n(z)$ for $K_n$, given $n$, obeys a recursion

$$f_n(z) = \frac{2z+n-1}{n+1}\, f_{n-1}(z), \quad n \ge 2; \qquad f_1(z) = z.$$

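Note that $f_n(1) = 1$ always. Differentiating with respect to $z$:

$$f_n'(z) = \frac{2}{n+1}\, f_{n-1}(z) + \frac{2z+n-1}{n+1}\, f_{n-1}'(z)$$

we have first moment

$$E(K_n) = f_n'(1) = \frac{2}{n+1} + f_{n-1}'(1)$$

that is,

$$g_n = \frac{2}{n+1} + g_{n-1}$$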

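where $g_n = f_n'(1)$ and $g_1 = 1$. Clearly $g_2 = 5/3$ and $g_3 = 13/6$. Differentiating again:

$$f_n''(z) = \frac{4}{n+1}\, f_{n-1}'(z) + \frac{2z+n-1}{n+1}\, f_{n-1}''(z)$$

we have second factorial moment

$$E(K_n(K_n-1)) = f_n''(1) = \frac{4}{n+1}\, f_{n-1}'(1) + f_{n-1}''(1),$$

that is,

$$h_n = \frac{4}{n+1}\, g_{n-1} + h_{n-1}$$

where $h_n = f_n''(1)$ and $h_1 = 0$. Clearly $h_2 = 4/3$ and $h_3 = 3$. Finally, we have variance

$$V(K_n) = h_n - g_n^2 + g_n$$

which is $2/9$ when $n=2$ and $17/36$ when $n=3$. From (more typical) harmonic number-based exact expressions, it can be proved that [2, 6, 7]

$$\begin{aligned}
E(K_n) &= 2\ln(n) + 2(\gamma-1) + \frac{3}{n} + o\!\left(\frac{1}{n}\right), \\
V(K_n) &= 2\ln(n) + 2\left(\gamma - \frac{\pi^2}{3} + 1\right) + \frac{7}{n} + o\!\left(\frac{1}{n}\right)
\end{aligned}$$

as $n \to \infty$.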
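The two moment recursions for unsuccessful search are easy to iterate exactly. The sketch below is an illustration only, assuming the starting values $g_1 = 1$ and $h_1 = 0$ implied by $f_1(z) = z$; the function name `unsuccessful_moments` is hypothetical. It also compares the exact values against the asymptotic expansions for a moderately large $n$.

```python
from fractions import Fraction
from math import log, pi

EULER_GAMMA = 0.5772156649015329


def unsuccessful_moments(n):
    """Return (mean, variance) of K_n as exact fractions by iterating
    g_m = 2/(m+1) + g_{m-1} and h_m = 4/(m+1) * g_{m-1} + h_{m-1}."""
    g, h = Fraction(1), Fraction(0)  # g_1 = f_1'(1), h_1 = f_1''(1)
    for m in range(2, n + 1):
        h += Fraction(4, m + 1) * g  # uses g_{m-1}, so update h first
        g += Fraction(2, m + 1)
    return g, h - g * g + g


def asymptotic_moments(n):
    """Leading terms of the expansions quoted in the text."""
    mean = 2 * log(n) + 2 * (EULER_GAMMA - 1) + 3 / n
    var = 2 * log(n) + 2 * (EULER_GAMMA - pi**2 / 3 + 1) + 7 / n
    return mean, var


for n in (2, 3):
    print(n, *unsuccessful_moments(n))

exact = unsuccessful_moments(1000)
print([float(v) for v in exact], asymptotic_moments(1000))
```

Using `Fraction` keeps the small cases exact, so they can be compared directly with hand calculations.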

### 1.2 Successful Search

The probability generating function $f_n(z)$ for $K_n$, given $n$, obeys a recursion

$$n^2 f_n(z) = (n-1)(2z+n-1)\, f_{n-1}(z) + z, \quad n \ge 2; \qquad f_1(z) = z.$$

Note that $f_n(1) = 1$ always. Differentiating with respect to $z$:

$$n^2 f_n'(z) = 2(n-1)\, f_{n-1}(z) + (n-1)(2z+n-1)\, f_{n-1}'(z) + 1$$

we have first moment

$$E(K_n) = f_n'(1) = \frac{2(n-1) + (n-1)(n+1)\, f_{n-1}'(1) + 1}{n^2}$$

that is,

$$g_n = \frac{(2n-1) + (n^2-1)\, g_{n-1}}{n^2}$$

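where $g_n = f_n'(1)$ and $g_1 = 1$. Clearly $g_2 = 3/2$ and $g_3 = 17/9$. Differentiating again:

$$n^2 f_n''(z) = 4(n-1)\, f_{n-1}'(z) + (n-1)(2z+n-1)\, f_{n-1}''(z)$$

we have second factorial moment

$$E(K_n(K_n-1)) = f_n''(1) = \frac{4(n-1)\, f_{n-1}'(1) + (n-1)(n+1)\, f_{n-1}''(1)}{n^2},$$

that is,

$$h_n = \frac{4(n-1)\, g_{n-1} + (n^2-1)\, h_{n-1}}{n^2}$$

where $h_n = f_n''(1)$ and $h_1 = 0$. Clearly $h_2 = 1$ and $h_3 = 20/9$. Finally, we have variance $V(K_n) = h_n - g_n^2 + g_n$ which is $1/4$ when $n=2$ and $44/81$ when $n=3$.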

It can be proved that [2, 5, 6, 8]

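$$\begin{aligned}
E(K_n) &= 2\ln(n) + (2\gamma-3) + \frac{2\ln(n)}{n} + \frac{2\gamma+1}{n} + o\!\left(\frac{1}{n}\right), \\
V(K_n) &= 2\ln(n) + 2\left(\gamma - \frac{\pi^2}{3} + 2\right) - \frac{4\ln(n)^2}{n} + \frac{2(5-4\gamma)\ln(n)}{n} \\
&\quad + \left(5 + 10\gamma - 4\gamma^2 - \frac{2\pi^2}{3}\right)\frac{1}{n} + o\!\left(\frac{1}{n}\right)
\end{aligned}$$

as $n \to \infty$.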
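The successful-search recursion can also be run at the level of polynomial coefficients, which yields the whole distribution rather than only its moments. A minimal sketch, assuming $f_1(z) = z$ and storing each PGF as a list of coefficients indexed by the power of $z$ (the function names are hypothetical):

```python
from fractions import Fraction


def successful_pgf(n):
    """Coefficients [p_0, ..., p_n] of f_n(z), built from
    m^2 f_m(z) = (m-1)(2z + m-1) f_{m-1}(z) + z with f_1(z) = z."""
    f = [Fraction(0), Fraction(1)]  # f_1(z) = z
    for m in range(2, n + 1):
        new = [Fraction(0)] * (m + 1)
        for k, c in enumerate(f):
            new[k] += (m - 1) * (m - 1) * c  # constant part (m-1) of the factor
            new[k + 1] += (m - 1) * 2 * c    # the 2z part of the factor
        new[1] += 1                          # the trailing "+ z"
        f = [c / (m * m) for c in new]
    return f


def mean_and_variance(pgf):
    """Mean g = f'(1) and variance h - g^2 + g with h = f''(1)."""
    g = sum(k * c for k, c in enumerate(pgf))
    h = sum(k * (k - 1) * c for k, c in enumerate(pgf))
    return g, h - g * g + g


for n in (2, 3, 4):
    pgf = successful_pgf(n)
    print(n, pgf, mean_and_variance(pgf))
```

The coefficients always sum to $1$, which gives a quick sanity check on the recursion.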

### 1.3 Total Path Length

The total (internal) path length $L_n$ is the sum of $K-1$ (the depth of $x$) taken over all odd integers $x$ from $1$ to $2n-1$. It is not surprising that calculations are more involved here than before. The probability generating function $f_n(z)$ for $L_n$, given $n$, obeys a recursion

$$f_n(z) = \frac{z^{n-1}}{n} \sum_{k=0}^{n-1} f_k(z)\, f_{n-1-k}(z), \quad n \ge 1; \qquad f_0(z) = 1.$$

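Note that $f_n(1) = 1$ always. Differentiating with respect to $z$:

$$f_n'(z) = \frac{(n-1) z^{n-2}}{n} \sum_{k=0}^{n-1} f_k(z)\, f_{n-1-k}(z) + \frac{z^{n-1}}{n} \sum_{k=0}^{n-1} \left[ f_k'(z)\, f_{n-1-k}(z) + f_k(z)\, f_{n-1-k}'(z) \right]$$

we have first moment

$$E(L_n) = f_n'(1) = \frac{n-1}{n} \cdot n + \frac{1}{n} \sum_{k=0}^{n-1} \left[ f_k'(1) + f_{n-1-k}'(1) \right] = n - 1 + \frac{2}{n} \sum_{k=0}^{n-1} f_k'(1),$$

that is,

$$g_n = n - 1 + \frac{2}{n} \sum_{k=0}^{n-1} g_k$$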

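where $g_n = f_n'(1)$ and $g_0 = 0$. Clearly $g_1 = 0$, $g_2 = 1$, and $g_3 = 8/3$. Differentiating again:

$$\begin{aligned}
f_n''(z) &= \frac{(n-1)(n-2) z^{n-3}}{n} \sum_{k=0}^{n-1} f_k(z)\, f_{n-1-k}(z) + \frac{2(n-1) z^{n-2}}{n} \sum_{k=0}^{n-1} \left[ f_k'(z)\, f_{n-1-k}(z) + f_k(z)\, f_{n-1-k}'(z) \right] \\
&\quad + \frac{z^{n-1}}{n} \sum_{k=0}^{n-1} \left[ f_k''(z)\, f_{n-1-k}(z) + 2 f_k'(z)\, f_{n-1-k}'(z) + f_k(z)\, f_{n-1-k}''(z) \right]
\end{aligned}$$

we have second factorial moment

$$E(L_n(L_n-1)) = f_n''(1) = (n-1)(n-2) + 2(n-1)\left[ f_n'(1) - n + 1 \right] + \frac{2}{n} \sum_{k=0}^{n-1} f_k'(1)\, f_{n-1-k}'(1) + \frac{2}{n} \sum_{k=0}^{n-1} f_k''(1),$$

that is,

$$h_n = -(n-1)n + 2(n-1)\, g_n + \frac{2}{n} \sum_{k=0}^{n-1} g_k\, g_{n-1-k} + \frac{2}{n} \sum_{k=0}^{n-1} h_k$$

where $h_n = f_n''(1)$ and $h_0 = 0$. Clearly $h_1 = 0$, $h_2 = 0$, and $h_3 = 14/3$. Finally, we have variance $V(L_n) = h_n - g_n^2 + g_n$ which is $2/9$ when $n=3$ and $29/36$ when $n=4$.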

It can be proved that [2, 5, 9]

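$$\begin{aligned}
E(L_n) &= 2n\ln(n) + 2(\gamma-2)n + 2\ln(n) + (2\gamma+1) + o(1), \\
V(L_n) &= \left(7 - \frac{2\pi^2}{3}\right) n^2 - 2n\ln(n) + \left(17 - 2\gamma - \frac{4\pi^2}{3}\right) n - 2\ln(n) + \left(5 - 2\gamma - \frac{2\pi^2}{3}\right) + o(1)
\end{aligned}$$

as $n \to \infty$.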
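The convolution recursion for the BST path length can be evaluated with plain polynomial arithmetic. The sketch below is illustrative only: it assumes $f_0(z) = 1$, stores PGFs as coefficient lists, and uses hypothetical function names.

```python
from fractions import Fraction


def bst_path_length_pgfs(n_max):
    """PGFs f_0, ..., f_{n_max} of the internal path length, from
    f_n(z) = (z^(n-1) / n) * sum_k f_k(z) f_{n-1-k}(z), f_0(z) = 1."""
    pgfs = [[Fraction(1)]]
    for n in range(1, n_max + 1):
        acc = {}
        for k in range(n):
            for i, a in enumerate(pgfs[k]):
                for j, b in enumerate(pgfs[n - 1 - k]):
                    e = i + j + n - 1  # exponent after the shift by z^(n-1)
                    acc[e] = acc.get(e, Fraction(0)) + a * b
        f = [Fraction(0)] * (max(acc) + 1)
        for e, c in acc.items():
            f[e] = c / n
        pgfs.append(f)
    return pgfs


def mean_and_variance(pgf):
    """Mean g = f'(1) and variance h - g^2 + g with h = f''(1)."""
    g = sum(k * c for k, c in enumerate(pgf))
    h = sum(k * (k - 1) * c for k, c in enumerate(pgf))
    return g, h - g * g + g


pgfs = bst_path_length_pgfs(4)
for n in (2, 3, 4):
    print(n, pgfs[n], mean_and_variance(pgfs[n]))
```

The cost grows quickly with $n$ because the degree of $f_n$ is quadratic in $n$, so this approach is meant for small cases only.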

### 1.4 Higher Moments

A third moment expression appears in  for successful search; analogous work for unsuccessful search remains undone. We focus on total (internal) path length for BSTs. The cumulants $\kappa_2$, $\kappa_3$, …, $\kappa_8$ of $L_n$ were exhaustively studied by Hennequin [11, 12]; these asymptotically satisfy

$$\kappa_s \sim \left[ a_s + (-1)^{s+1}\, 2^s\, (s-1)!\, \zeta(s) \right] n^s$$

as $n \to \infty$, where

$$\{a_s\}_{s=2}^{8} = \left\{ 7,\ -19,\ \frac{937}{9},\ -\frac{85981}{108},\ \frac{21096517}{2700},\ -\frac{7527245453}{81000},\ \frac{19281922400989}{14883750} \right\}.$$

Hoffman & Kuba [13] obtained a complicated recurrence for an associated sequence of rationals [14, 15]:

$$\{c_s\}_{s=2}^{8} = \left\{ 7,\ -19,\ \frac{2260}{9},\ -\frac{229621}{108},\ \frac{74250517}{2700},\ -\frac{30532750703}{81000},\ \frac{90558126238639}{14883750} \right\}$$

using what they called tiered binomial coefficients.  While they utilized notation , we adopt .  It suffices to say that and a rich theory about for awaits discovery.  We give Mathematica code for generating :

    (* generating function whose Taylor coefficients are the tiered binomial coefficients *)
    f[i_,x_,y_] := (1/(i+1-x-y)) (Binomial[i-x,i]/Binomial[i-x-y,i])
    T[i_,n_,m_] := If[n+m > 0, Coefficient[Normal[
     Series[f[i,x,y], {x,0,n}, {y,0,m}]], x^n y^m], 1/(1+i)]
    (* memoized recurrence for c[s] *)
    c[s_] := c[s] = ((s+1)/(s-1)) *
     Sum[Sum[Sum[If[k1+k2+k3 == s, Multinomial[k1,k2,k3] c[k1] c[k2] *
      Sum[Sum[Sum[If[n+m+p == k3,
       Sum[Multinomial[n,m,p] Binomial[m+k2,j] (-1)^j (-2)^(n+m) n! m! T[n+k1+j,n,m],
       {j,0,m+k2}], 0],
      {p,0,k3}], {m,0,k3}], {n,0,k3}], 0],
     {k3,0,s}], {k2,0,s-1}], {k1,0,s-1}]
    (* initial values *)
    c[0] = 1;
    c[1] = 0;

and code for generating $a_s$, given $c_1$, $c_2$, …, $c_s$:

    Sum[(-1)^(j-1) (j-1)! BellY[s, j, Table[c[i], {i,1,s-j+1}]], {j,1,s}]

This final line employs a well-known expression for cumulants in terms of partial (or incomplete) Bell polynomials of central moments.
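The passage from the $c_s$ to the $a_s$ can be cross-checked without Bell polynomials. The sketch below swaps in the standard moment-to-cumulant recursion $\kappa_s = m_s - \sum_{j=1}^{s-1} \binom{s-1}{j-1} \kappa_j m_{s-j}$ (an equivalent formulation, not the Mathematica expression above) and applies it to the rational constants listed in the text, taking $c_0 = 1$ and $c_1 = 0$. The names `C_VALUES` and `cumulants_from_moments` are hypothetical.

```python
from fractions import Fraction
from math import comb

# Rational constants c_0, ..., c_8 as listed in the text.
C_VALUES = {
    0: Fraction(1), 1: Fraction(0), 2: Fraction(7), 3: Fraction(-19),
    4: Fraction(2260, 9), 5: Fraction(-229621, 108),
    6: Fraction(74250517, 2700), 7: Fraction(-30532750703, 81000),
    8: Fraction(90558126238639, 14883750),
}


def cumulants_from_moments(m, s_max):
    """kappa_s = m_s - sum_{j=1}^{s-1} C(s-1, j-1) * kappa_j * m_{s-j}."""
    kappa = {}
    for s in range(1, s_max + 1):
        kappa[s] = m[s] - sum(
            comb(s - 1, j - 1) * kappa[j] * m[s - j] for j in range(1, s)
        )
    return kappa


a = cumulants_from_moments(C_VALUES, 8)
for s in range(2, 9):
    print(s, a[s])
```

For instance, $s = 4$ gives $c_4 - 3c_2^2 = 2260/9 - 147 = 937/9$, in agreement with the listed $a_4$.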

## 2 Digital Search Trees

Consider the R program:

    f <- function(x,M,p,k)
    {
      q <- NCOL(M)
      if(NROW(M)==0) k <- 0
      else {
        # compare x with the key stored at the current node
        if(all(x==matrix(M[1,],ncol=q))) k <- 1
        else {
          M <- matrix(M[-1,],ncol=q)
          # keep the keys that agree with x in bit p, then branch on the next bit
          M <- matrix(M[M[,p]==x[p],],ncol=q)
          k <- 1+f(x,M,p <- p+1,k)
        }
      }
      k
    }

where $M$ is a random binary $n \times q$ matrix with distinct rows, $p$ is initially $1$ and $k$ is initially $0$. It is usually assumed [2, 16] that $q = \infty$, from which the row-distinctness requirement follows almost surely (imagining the rows as binary expansions of independent Uniform$[0,1]$ numbers). If instead $q = n$, as exploratively specified in , then the matrix would need to be generated carefully to avoid duplicate keys. To model successful searches, let $x$ be a random row of $M$. To model unsuccessful searches, let $x$ be a random binary $q$-vector that is not a row of $M$.

### 2.1 Unsuccessful Search

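The probability generating function for $K_n$, given $n$, is

$$\begin{cases}
\frac{1}{2} z + \frac{1}{2} z^2 & \text{if } n=2, \\
\frac{1}{4} z + \frac{5}{8} z^2 + \frac{1}{8} z^3 & \text{if } n=3, \\
\frac{1}{8} z + \frac{19}{32} z^2 + \frac{17}{64} z^3 + \frac{1}{64} z^4 & \text{if } n=4, \\
\frac{1}{16} z + \frac{65}{128} z^2 + \frac{195}{512} z^3 + \frac{49}{1024} z^4 + \frac{1}{1024} z^5 & \text{if } n=5
\end{cases}$$

for $q = \infty$ and

$$\begin{cases}
\frac{2}{3} z + \frac{1}{3} z^2 & \text{if } n=2, \\
\frac{2}{7} z + \frac{2}{3} z^2 + \frac{1}{21} z^3 & \text{if } n=3, \\
\frac{8}{65} z + \frac{302}{455} z^2 + \frac{22}{105} z^3 + \frac{1}{273} z^4 & \text{if } n=4, \\
\frac{52}{899} z + \frac{7384}{13485} z^2 + \frac{34502}{94395} z^3 + \frac{26}{899} z^4 + \frac{1}{6293} z^5 & \text{if } n=5
\end{cases}$$

for $q = n$. A closed-form expression exists  for the PGF when $q = \infty$, but a corresponding simple recursive formula does not evidently materialize. Section 2.4 contains verification of these polynomial expressions.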
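For finite keys, the small cases can be confirmed by brute force. The sketch below is a Python port of the DST search routine (not the author's code) under a stated assumption: finite keys means $n$-bit keys, with the $n$ stored keys and the absent probe forming an ordered selection of $n+1$ distinct keys, every selection equally likely. Bit positions are 0-based here, and the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations, product


def dst_search_cost(x, rows, p=0):
    """Nodes visited when searching for bit-tuple x in the DST built by
    inserting `rows` in order, branching on bit p at the current level."""
    if not rows:
        return 0
    if rows[0] == x:
        return 1
    rest = [r for r in rows[1:] if r[p] == x[p]]
    return 1 + dst_search_cost(x, rest, p + 1)


def finite_key_unsuccessful_pgf(n):
    """Exact distribution of the unsuccessful search cost for n-bit keys."""
    keys = list(product((0, 1), repeat=n))
    counts = Counter(
        dst_search_cost(sel[n], list(sel[:n]))  # last entry is the probe
        for sel in permutations(keys, n + 1)
    )
    total = sum(counts.values())
    return {k: Fraction(c, total) for k, c in sorted(counts.items())}


print(finite_key_unsuccessful_pgf(2))
print(finite_key_unsuccessful_pgf(3))
```

The enumeration has $2^n!/(2^n-n-1)!$ cases, so it is practical only for very small $n$.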

### 2.2 Successful Search

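The probability generating function for $K_n$, given $n$, is

$$\begin{cases}
\frac{1}{2} z + \frac{1}{2} z^2 & \text{if } n=2, \\
\frac{1}{3} z + \frac{1}{2} z^2 + \frac{1}{6} z^3 & \text{if } n=3, \\
\frac{1}{4} z + \frac{7}{16} z^2 + \frac{9}{32} z^3 + \frac{1}{32} z^4 & \text{if } n=4, \\
\frac{1}{5} z + \frac{3}{8} z^2 + \frac{11}{32} z^3 + \frac{5}{64} z^4 + \frac{1}{320} z^5 & \text{if } n=5
\end{cases}$$

for $q = \infty$ and

$$\begin{cases}
\frac{1}{2} z + \frac{1}{2} z^2 & \text{if } n=2, \\
\frac{1}{3} z + \frac{11}{21} z^2 + \frac{1}{7} z^3 & \text{if } n=3, \\
\frac{1}{4} z + \frac{9}{20} z^2 + \frac{39}{140} z^3 + \frac{3}{140} z^4 & \text{if } n=4, \\
\frac{1}{5} z + \frac{1707}{4495} z^2 + \frac{23561}{67425} z^3 + \frac{4657}{67425} z^4 + \frac{39}{22475} z^5 & \text{if } n=5
\end{cases}$$

for $q = n$. A closed-form expression exists [1, 2] for the PGF when $q = \infty$, but a corresponding simple recursive formula again does not materialize. Means and variances for $q = \infty$ and those for $q = n$ unsurprisingly become closer as $n$ increases.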

### 2.3 Total Path Length

The total (internal) path length $L_n$ is the sum of $K-1$ (the depth of $x$) taken over all rows $x$ of $M$. It is not surprising that calculations are more involved here than before. Assume that $q = \infty$. The probability generating function $f_n(z)$ for $L_n$, given $n$, obeys a recursion

$$f_n(z) = z^{n-1}\, 2^{1-n} \sum_{k=0}^{n-1} \binom{n-1}{k} f_k(z)\, f_{n-1-k}(z), \quad n \ge 1; \qquad f_0(z) = 1.$$

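Note that $f_n(1) = 1$ always. Differentiating with respect to $z$:

$$\begin{aligned}
f_n'(z) &= (n-1) z^{n-2}\, 2^{1-n} \sum_{k=0}^{n-1} \binom{n-1}{k} f_k(z)\, f_{n-1-k}(z) \\
&\quad + z^{n-1}\, 2^{1-n} \sum_{k=0}^{n-1} \binom{n-1}{k} \left[ f_k'(z)\, f_{n-1-k}(z) + f_k(z)\, f_{n-1-k}'(z) \right]
\end{aligned}$$

we have first moment

$$E(L_n) = f_n'(1) = n - 1 + 2^{2-n} \sum_{k=0}^{n-1} \binom{n-1}{k} f_k'(1)$$

that is,

$$g_n = n - 1 + 2^{2-n} \sum_{k=0}^{n-1} \binom{n-1}{k} g_k$$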

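where $g_n = f_n'(1)$ and $g_0 = 0$. Clearly $g_1 = 0$, $g_2 = 1$, and $g_3 = 5/2$. Differentiating again:

$$\begin{aligned}
f_n''(z) &= (n-1)(n-2) z^{n-3}\, 2^{1-n} \sum_{k=0}^{n-1} \binom{n-1}{k} f_k(z)\, f_{n-1-k}(z) \\
&\quad + (n-1) z^{n-2}\, 2^{2-n} \sum_{k=0}^{n-1} \binom{n-1}{k} \left[ f_k'(z)\, f_{n-1-k}(z) + f_k(z)\, f_{n-1-k}'(z) \right] \\
&\quad + z^{n-1}\, 2^{1-n} \sum_{k=0}^{n-1} \binom{n-1}{k} \left[ f_k''(z)\, f_{n-1-k}(z) + 2 f_k'(z)\, f_{n-1-k}'(z) + f_k(z)\, f_{n-1-k}''(z) \right]
\end{aligned}$$

we have second factorial moment

$$\begin{aligned}
E(L_n(L_n-1)) = f_n''(1) &= (n-1)(n-2) + 2(n-1)\left[ f_n'(1) - n + 1 \right] \\
&\quad + 2^{2-n} \sum_{k=0}^{n-1} \binom{n-1}{k} f_k'(1)\, f_{n-1-k}'(1) + 2^{2-n} \sum_{k=0}^{n-1} \binom{n-1}{k} f_k''(1),
\end{aligned}$$

that is,

$$h_n = -(n-1)n + 2(n-1)\, g_n + 2^{2-n} \sum_{k=0}^{n-1} \binom{n-1}{k} g_k\, g_{n-1-k} + 2^{2-n} \sum_{k=0}^{n-1} \binom{n-1}{k} h_k$$

where $h_n = f_n''(1)$ and $h_0 = 0$. Clearly $h_1 = 0$, $h_2 = 0$, and $h_3 = 4$. Finally, we have variance $V(L_n) = h_n - g_n^2 + g_n$ which is $1/4$ when $n=3$ and $31/64$ when $n=4$.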
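The binomially weighted recursion for the infinite-key DST path length can be evaluated in the same coefficient-list style. A minimal sketch, assuming $f_0(z) = 1$; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def dst_path_length_pgfs(n_max):
    """PGFs f_0, ..., f_{n_max} of the DST internal path length, from
    f_n(z) = z^(n-1) 2^(1-n) sum_k C(n-1, k) f_k(z) f_{n-1-k}(z)."""
    pgfs = [[Fraction(1)]]
    for n in range(1, n_max + 1):
        acc = {}
        for k in range(n):
            w = comb(n - 1, k)
            for i, a in enumerate(pgfs[k]):
                for j, b in enumerate(pgfs[n - 1 - k]):
                    e = i + j + n - 1  # exponent after the shift by z^(n-1)
                    acc[e] = acc.get(e, Fraction(0)) + w * a * b
        scale = Fraction(1, 2 ** (n - 1))
        f = [Fraction(0)] * (max(acc) + 1)
        for e, c in acc.items():
            f[e] = c * scale
        pgfs.append(f)
    return pgfs


def mean_and_variance(pgf):
    """Mean g = f'(1) and variance h - g^2 + g with h = f''(1)."""
    g = sum(k * c for k, c in enumerate(pgf))
    h = sum(k * (k - 1) * c for k, c in enumerate(pgf))
    return g, h - g * g + g


pgfs = dst_path_length_pgfs(4)
for n in (2, 3, 4):
    print(n, pgfs[n], mean_and_variance(pgfs[n]))
```

The weight $\binom{n-1}{k} 2^{1-n}$ is the probability that $k$ of the $n-1$ non-root keys start with a given bit, which is what distinguishes this recursion from its BST counterpart.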

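Define constants

$$\alpha = \sum_{j=1}^{\infty} \frac{1}{2^j - 1}, \qquad \beta = \sum_{j=1}^{\infty} \frac{1}{(2^j - 1)^2}, \qquad Q = \prod_{j=1}^{\infty} \left( 1 - \frac{1}{2^j} \right).$$

Let $Q_m$ denote the $m^{\text{th}}$ partial product of $Q$ and

$$\varphi(x) = \begin{cases}
\dfrac{x - \ln(x) - 1}{(x-1)^2} & \text{if } x \ne 1, \\[1ex]
\dfrac{1}{2} & \text{if } x = 1.
\end{cases}$$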

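It can be proved that [18, 19]

$$\begin{aligned}
E(L_n) &= \frac{n\ln(n)}{\ln(2)} + n\left( \frac{\gamma-1}{\ln(2)} + \frac{1}{2} - \alpha + \delta_1(n) \right) + \frac{\ln(n)}{\ln(2)} \\
&\quad + \left( \frac{2\gamma-1}{2\ln(2)} + \frac{5}{2} - \alpha \right) + \delta_2(n) + O\!\left( \frac{\ln(n)}{n} \right), \\
V(L_n) &= n\left( C + \delta_3(n) \right) + O\!\left( \frac{\ln(n)^2}{n} \right)
\end{aligned}$$

as $n \to \infty$, where

$$C = \frac{Q}{\ln(2)} \sum_{j,k,l \ge 0} \frac{(-1)^j}{Q_j Q_k Q_l}\, 2^{-j(j+1)/2 - k - l}\, \varphi\!\left( 2^{-j-k} + 2^{-j-l} \right) = 0.2660036454....$$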

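This expression for $C$ is, needless to say, a stunning result.

Assuming instead that $q = n$, all we currently possess are PGFs for small $n$:

$$\begin{cases}
z & \text{if } n=2, \\
\frac{4}{7} z^2 + \frac{3}{7} z^3 & \text{if } n=3, \\
\frac{4}{5} z^4 + \frac{4}{35} z^5 + \frac{3}{35} z^6 & \text{if } n=4, \\
\frac{8984}{13485} z^6 + \frac{3136}{13485} z^7 + \frac{364}{4495} z^8 + \frac{52}{4495} z^9 + \frac{39}{4495} z^{10} & \text{if } n=5.
\end{cases}$$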

A deeper understanding of finite-key DSTs would be welcome.
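Small finite-key path-length PGFs can be reproduced by direct enumeration. The sketch below assumes that finite keys means $n$ distinct $n$-bit keys inserted in a uniformly random order, and it takes the depth of a key to be its search cost minus one. Function names are hypothetical and bit positions are 0-based.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations, product


def dst_search_cost(x, rows, p=0):
    """Nodes visited when searching for bit-tuple x in the DST built by
    inserting `rows` in order, branching on bit p at the current level."""
    if not rows:
        return 0
    if rows[0] == x:
        return 1
    rest = [r for r in rows[1:] if r[p] == x[p]]
    return 1 + dst_search_cost(x, rest, p + 1)


def finite_key_path_length_pgf(n):
    """Exact distribution of the internal path length for n-bit keys."""
    keys = list(product((0, 1), repeat=n))
    counts = Counter(
        sum(dst_search_cost(x, list(rows)) - 1 for x in rows)
        for rows in permutations(keys, n)
    )
    total = sum(counts.values())
    return {k: Fraction(c, total) for k, c in sorted(counts.items())}


for n in (2, 3, 4):
    print(n, finite_key_path_length_pgf(n))
```

For $n = 4$ this visits $16 \cdot 15 \cdot 14 \cdot 13 = 43680$ insertion orders; larger $n$ quickly becomes expensive.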

### 2.4 Some Combinatorics

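We focus on unsuccessful searches, for both infinite keys ($q = \infty$) and finite keys ($q = n$). Let us examine the coefficients of $z$ and $z^n$ for simplicity. The digital search trees appearing in Figure 1 for $n = 2$ proceed from matrices

$$\begin{pmatrix} M \\ x \end{pmatrix} = \begin{pmatrix} a & b \\ 1 & c \\ 1 & d \end{pmatrix},\ \begin{pmatrix} a & b \\ 0 & c \\ 0 & d \end{pmatrix},\ \begin{pmatrix} a & b \\ 0 & c \\ 1 & d \end{pmatrix},\ \begin{pmatrix} a & b \\ 1 & c \\ 0 & d \end{pmatrix}$$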

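respectively. When $q = \infty$, the indicated keys are merely abbreviations (two leading bits in an infinite sequence); hence the keys are automatically distinct; thus

$$P\{K_2 = 2\} = \frac{2 \cdot 2^4}{2^6} = \frac{1}{2}, \qquad P\{K_2 = 1\} = \frac{2 \cdot 2^4}{2^6} = \frac{1}{2}$$

where $2^6$ is the count of binary $3 \times 2$ matrices. When $q = n$, however, key-distinctness must be manually enforced. We obtain the condition

$$(a,b) \ne (1,c),\quad (a,b) \ne (1,d) \quad \& \quad (1,c) \ne (1,d)$$

which is equivalent to $a = 0$ & $c \ne d$ and gives $4$ possibilities; also the condition

$$(a,b) \ne (0,c),\quad (a,b) \ne (1,d) \quad \& \quad (0,c) \ne (1,d)$$

which is equivalent to ($a = 0$ & $b \ne c$) or ($a = 1$ & $b \ne d$) and gives $8$ possibilities; therefore

$$P\{\tilde{K}_2 = 2\} = \frac{2 \cdot 4}{4!/1!} = \frac{1}{3}, \qquad P\{\tilde{K}_2 = 1\} = \frac{2 \cdot 8}{4!/1!} = \frac{2}{3}$$

where $4!/1!$ is the count of permutations of $4$ objects, taken $3$ at a time.

Figure 1: Two linear cases and two triangular cases for $n=2$.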
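The two finite-key counts for $n = 2$ are small enough to confirm by listing all bit patterns. The sketch below is illustrative only: it assumes the three rows are $(a,b)$, then the second key, then the probe, and the helper name `count_distinct` is hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import perm


def count_distinct(second_bit, probe_bit):
    """Count bit choices (a, b, c, d) for which the rows (a, b),
    (second_bit, c) and (probe_bit, d) are pairwise distinct."""
    return sum(
        len({(a, b), (second_bit, c), (probe_bit, d)}) == 3
        for a, b, c, d in product((0, 1), repeat=4)
    )


linear = count_distinct(1, 1)      # second key and probe on the same side
triangular = count_distinct(0, 1)  # second key and probe on opposite sides
total = perm(4, 3)                 # ordered triples of distinct 2-bit keys

print(linear, triangular)
# The factor 2 accounts for the mirror-image case.
print(Fraction(2 * linear, total), Fraction(2 * triangular, total))
```

The same pattern extends to $n = 3$ and $n = 4$, with one `count_distinct`-style condition per tree shape.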

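For $K_3$ and $\tilde{K}_3$, using Figures 2 and 3, we have

$$P\{K_3 = 3\} = \frac{4 \cdot 2^7}{2^{12}} = \frac{1}{8}, \qquad P\{K_3 = 1\} = \frac{2 \cdot 2^9}{2^{12}} = \frac{1}{4}$$

$$P\{\tilde{K}_3 = 3\} = \frac{4 \cdot 20}{8!/4!} = \frac{1}{21}, \qquad P\{\tilde{K}_3 = 1\} = \frac{2 \cdot 240}{8!/4!} = \frac{2}{7}.$$

Figure 2: Four linear cases for $n=3$; note that two are reflections of the others.

Figure 3: Two triangular cases for $n=3$; note that one is a reflection of the other.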

For $K_4$ and $\tilde{K}_4$, using Figures 4, 5 and 6, we have

$$P\{K_4 = 4\} = \frac{8 \cdot 2^{11}}{2^{20}} = \frac{1}{64}, \qquad P\{K_4 = 1\} = \frac{4 \cdot 2^{14} + 4 \cdot 2^{14}}{2^{20}} = \frac{1}{8}$$

$$P\{\tilde{K}_4 = 4\} = \frac{8 \cdot 240}{16!/11!} = \frac{1}{273}, \qquad P\{\tilde{K}_4 = 1\} = \frac{4 \cdot 6912 + 4 \cdot 9216}{16!/11!} = \frac{8}{65}.$$

The emergence of bi-triangular cases at $n = 4$ complicates our study for $n \ge 5$. A similar argument for coefficients of $z^2$, …, $z^{n-1}$, as well as for successful searches, is possible.

Figure 4: Eight linear cases for $n=4$ (these four cases plus their reflections).

Figure 5: Four triangular cases for $n=4$ (these two cases plus their reflections).

Figure 6: Four bi-triangular cases for $n=4$ (these two cases plus their reflections).

Third and fourth moment expressions appear in  for unsuccessful search on infinite keys. The covariance between two random distinct successful search costs within the same tree is apparently $\sim D/n$ as $n \to \infty$, where

$$D = C - \frac{1}{12} - \frac{\pi^2}{6\ln(2)^2} + \alpha + \beta = -0.4970105417....$$
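The numerical value of $D$ is easy to reproduce once $\alpha$ and $\beta$ are summed. The sketch below is a floating-point check only: it takes the value of $C$ quoted in the text as given (it does not recompute the triple sum) and truncates the two series at 200 terms, which is far beyond double precision.

```python
from math import log, pi

C = 0.2660036454  # value quoted in the text, to the digits shown there

# Both series converge geometrically, so 200 terms is more than enough.
alpha = sum(1 / (2**j - 1) for j in range(1, 201))
beta = sum(1 / (2**j - 1) ** 2 for j in range(1, 201))

D = C - 1 / 12 - pi**2 / (6 * log(2) ** 2) + alpha + beta
print(alpha, beta, D)
```

Because $C$ is entered to ten decimal places, the computed $D$ can be trusted only to about the same precision.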

Verifying this interesting result via simulation remains open.  What can be said about the cost covariance for two distinct unsuccessful searches?  What can be said about the cost covariance given a successful search and an unsuccessful search?

## 3 Acknowledgements

I am grateful to Markus Kuba and Sumit Kumar Jha for helpful discussions, and to David Penman for providing  (which at one time was available at http://algo.inria.fr/).

## References

[1] G. Louchard, Exact and asymptotic distributions in digital and binary search trees, RAIRO Inform. Théor. Appl. 21 (1987) 479–495; MR0928772.

[2] H. M. Mahmoud, Evolution of Random Search Trees, Wiley, 1992, pp. 71–91, 260–285; MR1140708.

[3] S. R. Finch, Resolving conflicts and electing leaders, arXiv:1912.06545.

[4] S. R. Finch, Binary search tree constants, Mathematical Constants, Cambridge Univ. Press, 2003, pp. 349–354; MR2003519.

[5] R. Sedgewick and P. Flajolet, Introduction to the Analysis of Algorithms, Addison-Wesley, 1996, pp. 142, 162–163, 246–250.

[6] D. E. Knuth, The Art of Computer Programming, v. 3, Sorting and Searching, 2 ed., Addison-Wesley, 1998, pp. 430–431, 455, 709; MR3077154.

[7] W. C. Lynch, More combinatorial properties of certain trees, Computer J., v. 7 (1965) n. 4, 299–302; MR0172492.

[8] G. D. Knott, Variance of calculation, unpublished note (1973).

[9] P. F. Windley, Trees, forests and rearranging, Computer J., v. 3 (1960) n. 2, 84–88.

[10] H. M. Mahmoud and R. Neininger, Distribution of distances in random binary search trees, Annals Appl. Probab. 13 (2003) 253–276; MR1951999.

[11] P. Hennequin, Combinatorial analysis of quicksort algorithm, RAIRO Inform. Théor. Appl. 23 (1989) 317–333; MR1020477.

[12] P. Hennequin, Analyse en moyenne d’algorithmes, tri rapide et arbres de recherche, Ph.D. thesis, École Polytechnique Palaiseau, 1991; http://www.mit.edu/~sfinch/Hennequin-thesis.pdf.

[13] M. E. Hoffman and M. Kuba, Logarithmic integrals, zeta values, and tiered binomial coefficients, arXiv:1906.08347.

[14] M. Cramer, A note concerning the limit distribution of the quicksort algorithm, RAIRO Inform. Théor. Appl. 30 (1996) 195–207; MR1415828.

[15] S. B. Ekhad and D. Zeilberger, A detailed analysis of quicksort running time, arXiv:1903.03708; data output at http://sites.math.rutgers.edu/~zeilberg/tokhniot/oQuickSortAnalysis3.txt.

[16] D. E. Knuth, The Art of Computer Programming, v. 3, Sorting and Searching, 2 ed., Addison-Wesley, 1998, pp. 500–505, 509, 726; MR3077154.

[17] S. R. Finch, Digital search tree constants, Mathematical Constants, Cambridge Univ. Press, 2003, pp. 354–361; MR2003519.

[18] P. Kirschenhofer, H. Prodinger and W. Szpankowski, Digital search trees again revisited: the internal path length perspective, SIAM J. Comput. 23 (1994) 598–616; MR1274646 (95i:68034).

[19] H.-K. Hwang, M. Fuchs and V. Zacharovas, Asymptotic variance of random symmetric digital search trees, Discrete Math. Theor. Comput. Sci. 12 (2010) 103–165; MR2676668 (2012b:05232).

[20] G. Louchard and H. Prodinger, Approximate counting with counters: a probabilistic analysis, J. Algebra Combin. Discrete Struct. Appl. 2 (2015) 191–209; MR3400765.

Steven Finch
MIT Sloan School of Management
Cambridge, MA, USA
steven_finch@harvard.edu