A caterpillar tree is a tree in which every vertex is at distance at most 1 from a central path. The central path of a caterpillar tree is also called the spine of the tree, and it is obtained by removing all endpoint vertices from the tree. Endpoint vertices go by several names in the literature, for example terminal vertices, monovalent vertices, leaves, and legs. An explanatory example of a caterpillar tree is given in Figure 1, in which the central path consists of vertices, and there are respectively , , , , and leaves connected to each of them, from left to right.
We consider a caterpillar tree with (fixed) vertices which are enumerated (from left to right) on the central path. In this paper, the structure of a caterpillar tree is represented by , where is the number of leaves attached to the node labeled with on the central path, for . For example, we denote the caterpillar tree in Figure 1 as
. The caterpillar tree was probably first named by Arthur M. Hobbs in. Early works on caterpillar trees appeared in combinatorial graph theory [13, 17, 25]. The motivation of this paper originates from the fact that caterpillar trees have found applications in many scientific and applied disciplines. For instance, caterpillar graphs are used to uncover chemical and physical properties of benzenoid hydrocarbons in ; caterpillar trees are adopted to model information flow trees in ; an polynomial algorithm which determines the total interval number of caterpillar trees is developed in ; and leaf realization problems for caterpillar trees are investigated in .
Due to the surge of interest in random graphs and algorithms, we combine randomness with the caterpillar structure and look into random caterpillar trees (RCTs), which evolve in the following manner. At time , we start with a central path consisting of vertices, which are enumerated from left to right. At each time point , a leaf vertex joins the tree and is linked to one of the vertices on the central path via an (undirected) edge according to certain rules, which are introduced in detail in the sequel. Specifically, in this paper we investigate the degree profile of RCTs and propose a Gini-type index which quantitatively characterizes the evolution of RCTs.
The rest of the manuscript is organized as follows. We study the degree profile of RCTs in Section 2, which is divided into two subsections. In Section 2.1, we look into uniform RCTs, and find that the degree distribution is multinomial. In Section 2.2, we focus on preferential attachment RCTs. We develop stochastic recurrences to compute the first two moments of the degree variables exactly, and exploit a well-known probabilistic model, the Pólya urn model, to determine the exact degree distribution. We then show that the asymptotic joint distribution of those degree variables (after proper scaling) is Dirichlet. In Section 3, we propose a Gini-type index to characterize the evolution of the two classes of RCTs. We show via several simulation studies that the proposed Gini index (in contrast to another type of Gini index introduced in ) successfully distinguishes the two classes of RCTs. Finally, we add some concluding remarks and propose some future work in Section 4.
2 Degree profile of random caterpillar trees
In this section, we investigate the degree profile of RCTs. In graph theory, the degree of a vertex is the number of edges incident to the vertex. For each of the vertices on the central path of an RCT, its degree is composed of two parts: the number of attached leaves and the number of links on the central path. The former is random, while the latter is deterministic: 1 for the vertices at the two endpoints of the spine, and 2 for the rest. Let
be the random vector that represents the degree profile of the vertices on the central path of an RCT at time . The structure of an RCT is analogously represented by , where
are respectively the random variables which refer to the number of leaves attached to each of the vertices on the spine at time . These random variables are collected in a random vector . There is an immediate relationship between and for all :
We consider two types of RCTs: uniform random caterpillar trees () and preferential attachment random caterpillar trees (), which are distinguished by the features of their growth. The evolution of uniform RCTs is analogous to that of random recursive trees. At each time point , a vertex on the central path is chosen uniformly at random (all vertices being equally likely) and connected with a newcomer via an edge. Preferential attachment RCTs grow in a nonuniform way, inspired by the seminal paper . At time , the probability of a vertex on the central path being selected for a newcomer is proportional to its degree in the tree at time . Mathematically, given , the probability that the vertex labeled with , for , on the central path is chosen for the new leaf vertex (newcomer) at time is
where denotes an indicator function.
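The two growth rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulation code: the function names `initial_degrees` and `grow_rct`, the 0-based vertex labels, and the restriction to at least two spine vertices are assumptions made for the example.

```python
import random


def initial_degrees(m):
    """Spine degrees at time 0: 1 at the two endpoints, 2 elsewhere.

    Assumption of this sketch: the spine has at least two vertices.
    """
    if m < 2:
        raise ValueError("this sketch assumes at least two spine vertices")
    return [1 if i in (0, m - 1) else 2 for i in range(m)]


def grow_rct(m, n, rule="uniform", rng=None):
    """Attach n leaves to a spine of m vertices, one leaf per time step.

    rule="uniform": every spine vertex is equally likely to be chosen.
    rule="preferential": a spine vertex is chosen with probability
    proportional to its current degree.
    Returns (leaves, degrees), two lists indexed by spine position.
    """
    rng = rng or random.Random()
    degrees = initial_degrees(m)
    leaves = [0] * m
    for _ in range(n):
        if rule == "uniform":
            i = rng.randrange(m)
        elif rule == "preferential":
            i = rng.choices(range(m), weights=degrees, k=1)[0]
        else:
            raise ValueError("rule must be 'uniform' or 'preferential'")
        leaves[i] += 1   # the newcomer becomes a leaf of vertex i
        degrees[i] += 1  # and raises the degree of vertex i by one
    return leaves, degrees


if __name__ == "__main__":
    print(grow_rct(5, 50, "uniform", random.Random(0)))
    print(grow_rct(5, 50, "preferential", random.Random(0)))
```

In both rules the degree of a spine vertex equals its initial degree plus its number of leaves, which is the relationship between the degree vector and the leaf vector noted earlier in this section.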
2.1 Uniform random caterpillar trees
The degree profile of uniform RCTs follows directly from well-known results in elementary probability theory. The growth of uniform RCTs coincides with an experiment of independent trials, each of which leads to a choice of one of candidates, with every candidate having a fixed success rate . The associated distribution is a multinomial distribution with parameters and . We thus obtain the joint probability mass function of :
for nonnegative integers , and . According to the relation between and in Equation (1), we get the joint distribution of ; namely,
for integers , and .
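As a small numerical companion to the multinomial statement above, the following sketch evaluates the joint probability mass function of the leaf counts in a uniform RCT with exact rational arithmetic. The function name `uniform_leaf_pmf` is hypothetical, and the formula used is the standard equal-probability multinomial mass function; the degree version follows by shifting each leaf count by the deterministic spine contribution.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def uniform_leaf_pmf(counts):
    """P(leaf counts == counts) for a uniform RCT with len(counts) spine vertices.

    The time n is implied by sum(counts); each of the m spine vertices is
    chosen with probability 1/m at every step (multinomial model).
    """
    m = len(counts)
    n = sum(counts)
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)  # stays an integer at every step
    return Fraction(coef, m ** n)


if __name__ == "__main__":
    m, n = 3, 4
    total = sum(
        uniform_leaf_pmf(c)
        for c in product(range(n + 1), repeat=m)
        if sum(c) == n
    )
    print(total)  # the probabilities over all leaf configurations add up to 1
```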
Several limiting distributions of linear functions of multinomial distributed random variables are given in . Following the results in [7, Theorem 1], we obtain the limiting distribution of . As , we have
where , and the dispersion matrix is
Accordingly, the asymptotic distribution is normal after proper scaling; that is,
2.2 Preferential attachment random caterpillar trees
In contrast to uniform RCTs, preferential attachment RCTs evolve in such a way that vertices with higher degrees are more attractive to newcomers. The first consideration of preferential attachment seems to appear in , and one of the broadest applications of preferential attachment is to model the growth of the World Wide Web in . In sociology, the phenomenon of preferential attachment is reflected in a well-known saying: “the rich get richer and the poor get poorer.”
The recruiting candidates for newcomers in uniform RCTs are chosen independently from time to time, while the recruiting process in preferential attachment RCTs at each time point is dependent on the structure of the existing tree at the preceding time point. The strong dependency between the trees at two consecutive time points under the preferential attachment setting makes computation much more challenging.
In this section, we first compute the first two moments of the degree vector , which provide insight into the distribution of . Let be the σ-field that contains the history of the evolution of a preferential attachment RCT up to time (i.e., is the σ-field generated by ). At , the initial condition is
Let be a preferential attachment RCT at time , and let be the random vector that represents the degree profile of the vertices on the central path. The expectation of is
The dispersion matrix of , denoted by , is an square matrix such that
We look into each of the components in . For each and , there is an almost-sure relation between and :
where indicates the event of the vertex labeled with on the central path being selected for the newcomer at time . Taking expectations on both sides of Equation (3), we get
Taking another expectation, we obtain a recurrence relation for with respect to , i.e.,
Solving the recurrence relation with the initial condition given in Equation (2), we obtain the result for the first moment as stated in the proposition.
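The first-moment recurrence can be checked numerically. The sketch below iterates the recurrence with exact fractions and compares it with a closed form. Both are reconstructed here from the growth rule stated in Section 2 (selection probability proportional to degree, normalized over the spine vertices, whose degrees sum to 2(m - 1) + (n - 1) at time n - 1); they are not copied from the proposition. The function names are hypothetical, and the spine is assumed to have at least two vertices.

```python
from fractions import Fraction


def initial_degrees(m):
    """Spine degrees at time 0 (assumes m >= 2): 1 at the ends, 2 inside."""
    return [1 if i in (0, m - 1) else 2 for i in range(m)]


def expected_degrees_by_recurrence(m, n):
    """Iterate E[D_t] = E[D_{t-1}] * (1 + 1 / T_{t-1}) for t = 1, ..., n.

    T_{t-1} = 2(m - 1) + (t - 1) is the deterministic total spine degree
    at time t - 1, used as the normalizing constant of the attachment rule.
    """
    expected = [Fraction(d) for d in initial_degrees(m)]
    for t in range(1, n + 1):
        total = 2 * (m - 1) + (t - 1)
        expected = [e * (1 + Fraction(1, total)) for e in expected]
    return expected


def expected_degrees_closed_form(m, n):
    """Telescoped product of the recurrence: D_0 * (2(m - 1) + n) / (2(m - 1))."""
    scale = Fraction(2 * (m - 1) + n, 2 * (m - 1))
    return [d * scale for d in initial_degrees(m)]


if __name__ == "__main__":
    print(expected_degrees_by_recurrence(4, 10))
    print(expected_degrees_closed_form(4, 10))
```

Under these assumptions the expected degrees keep their initial proportions for all n, since every coordinate is multiplied by the same deterministic factor at each step.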
Towards the dispersion matrix of , we again appeal to the stochastic relation established in Equation (3) to compute the second moments of ’s for and the mixed moments of and for .
For each fixed , we square both sides of Equation (3) to get
The recurrence for is obtained by taking expectations on both sides of Equation (4) twice and plugging in the expectation of ,
Solving the stochastic recurrences with the initial condition (cf. Equation (2)), we get
Accordingly, we obtain the variances of the ’s, which form the diagonal of the variance-covariance matrix of :
To compute the covariances between and for , we need the mixed moments of and , i.e., . Recall the almost-sure relation between and in Equation (3). For , we have
In Equation (5), the term vanishes as the events and are mutually exclusive. In what follows, we obtain a recurrence for ; that is,
Solving the equation above recursively, we get
for and (or vice versa);
for and (or vice versa). Thus, we obtain other entries in the variance and covariance matrix of . For , we have
for and (or vice versa), we have
and for and (or vice versa), we have
Next, we look at the asymptotic distribution of for large . We exploit a martingale convergence theorem to prove that the limiting distribution of (after proper scaling) exists. We first say a few words about martingales. Martingales are a popular and powerful mathematical tool owing to their conceptual simplicity and versatility. A general definition of a martingale can be found in [14, Section 1.1] and is omitted here. Martingales have found applications in various research areas: theoretical probability theory , applied probability , stochastic processes  and financial modeling .
The σ-field sequence defined in Subsection 2.2 forms a filtration in our martingale setting. However, for each fixed , the random variables do not form a martingale sequence (with respect to ). We introduce a transformation of in the next lemma, and the transformed sequence is a martingale.
For each , the random variables
form a martingale sequence.
As , there exists a random vector such that
For each fixed , recall the martingale sequence established in Lemma 1. By the construction of and Proposition 1, it is clear that is -bounded. According to the Martingale Convergence Theorem [14, Theorem 2.5], we conclude that there exists a random variable , to which converges almost surely, as . For each , we set . The random vector is the limit as stated in the theorem. ∎
We prove the existence of the limiting distribution of in Theorem 1. However, the limiting distribution is not determined. Next, we introduce a probabilistic model—Pólya urn model—to characterize the dynamics of the degree variables, and thus find the exact distribution of , followed by the limiting distribution. We refer the interested readers to  for the history, definition, and applications of Pólya urn models. In this paper, we focus on a Pólya urn generalized from a classical model—the Pólya-Eggenberger urn .
Consider an urn containing different types of balls (e.g., different colors). Initially, the urn contains a total number of balls, of which there are balls of color , for , and . At each time point , a ball is chosen from the urn uniformly at random, its color is observed, and the ball is returned to the urn together with an additional ball of the same color. The dynamics of the urn scheme are governed by an replacement matrix:
where the rows are indexed with colors from top to bottom, and the columns are indexed with colors from left to right. The dynamics of the degree addition in a preferential attachment RCT are associated with an -color Pólya-Eggenberger urn with the initial condition: .
A remarkable property of the Pólya-Eggenberger urns is exchangeability, i.e., the probabilities of choosing balls of different colors in all -long sequences which have the same number of balls sampled for each color are identical, not depending on the order of those balls chosen in the sequence. The exact joint distribution of degree variables is given in the next theorem.
Let be a preferential attachment RCT at time , and let be the random vector that represents the degree profile of the vertices on the central path. Suppose that the balls of color are chosen times in the -long sampling sequence. Then we have
where , , , , and refers to the Pochhammer symbol of the rising factorial.
Consider an -color Pólya-Eggenberger urn starting with balls of color , for . A possible string (sequence) to obtain an urn containing balls of color is to sample balls of color in the first steps of the -long sequence, sample balls of color in the next steps, and continue sampling in this manner until the balls of color are selected in the last steps in the sequence. The probability of obtaining this particular sampling string is
There is a total of strings to achieve the outcome of the urn containing balls of color at time . By the property of exchangeability, we obtain the stated joint probability mass function of ’s. ∎
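The counting argument in the proof can be turned into a short computation. The sketch below multiplies the number of sampling strings by the probability of one string, written with rising factorials, which is the standard form of the multicolor Pólya-Eggenberger mass function. The names `rising` and `pa_leaf_pmf` are hypothetical, and the exact notation of the theorem is not reproduced; the initial ball counts are taken to be the initial spine degrees.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def rising(a, k):
    """Rising factorial (Pochhammer symbol): a (a + 1) ... (a + k - 1)."""
    out = 1
    for j in range(k):
        out *= a + j
    return out


def pa_leaf_pmf(counts, init):
    """P(color i is drawn counts[i] times in sum(counts) draws).

    init[i] is the initial number of balls of color i (here: the initial
    degree of spine vertex i). By exchangeability, every string with the
    same color counts has the same probability, so the result is
    (number of strings) * (probability of one string).
    """
    n = sum(counts)
    strings = factorial(n)
    for x in counts:
        strings //= factorial(x)  # multinomial coefficient, kept as an integer
    numerator = 1
    for a, x in zip(init, counts):
        numerator *= rising(a, x)
    return Fraction(strings * numerator, rising(sum(init), n))


if __name__ == "__main__":
    init = [1, 2, 1]  # spine of three vertices
    n = 3
    total = sum(
        pa_leaf_pmf(c, init)
        for c in product(range(n + 1), repeat=len(init))
        if sum(c) == n
    )
    print(total)  # sums to 1 over all leaf configurations
```

For a spine of two vertices (initial counts 1 and 1), this mass function is uniform over the possible splits of the leaves, the classical behavior of the two-color Pólya urn.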
As , we have
where are the parameters of (-dimensional) Dirichlet distribution.
We write the joint probability mass function of presented in Theorem 2 in terms of gamma functions; that is,
Noting that and , we define , and find that the support of ’s is such that . Replace in Equation (6) by for each . We then apply Stirling’s approximation [11, Equation (4.23)] to the ratio of gamma functions in Equation (6) as , and conclude that
, which is the probability density function of a Dirichlet distribution with parameters. ∎
3 Gini index of random caterpillar trees
In this section, we propose a Gini-type index to characterize the evolution of the two classes of RCTs considered in Section 2. The Gini index, named after the Italian statistician and sociologist Corrado Gini, arose from the problem of measuring the statistical dispersion of the wealth distribution of a nation’s residents in economics. In modern times, the Gini index has been extended to a commonly used measure of inequality of a distribution, which has found applications in medicine , public health , physics , chemistry  and complex networks , etc. Statisticians are committed to establishing and developing rigorous methods to calculate or estimate the Gini index; see representative papers such as [8, 21, 27]. Very recently, the Gini index was exploited to measure the sparsity of a network . One of the most effective ways to illustrate the Gini index may be to exploit a graphical representation, the Lorenz curve; see . In this paper, we propose a Gini-type index which quantifies the disparity within different classes of RCTs so as to characterize their evolution. We also compare the proposed Gini index with the one recently introduced in  via some numerical experiments. To distinguish the two Gini indices in the rest of the manuscript, we call the Gini index from  type I Gini index, and the one proposed herein type II Gini index.
3.1 Type I Gini index
The first type of Gini index (i.e., type I Gini index) that we look into is proposed in , the authors of which considered a Gini-type topological index for several classes of random rooted trees. In particular, they illustrate the estimation of their measure via a class of uniform RCTs.
To begin with, we briefly describe type I Gini index. Let be a class of rooted trees. For each vertex in an arbitrary tree , the geodesic distance between and the root (this measure is sometimes expressed as the depth of ) is the number of edges in a shortest path connecting them, denoted by . If we consider all the vertices in as our target population, and the “wealth” of each of them is represented by , then the associated Gini index, i.e., type I Gini index of , is given by
for all . The estimator of type I Gini index for an arbitrary class of rooted trees , denoted by , is developed on the procedures introduced by . Let be the cardinality of . The estimator of is given by
for all and an arbitrary .
According to the estimator in Equation (8), we calculate type I Gini indices of the classes of uniform RCTs and preferential attachment RCTs, respectively. Without loss of generality, we consider the vertex at the leftmost position on the central path as the root. For better readability, we only present the results in the main body of the paper, but more algebra can be found in the appendix (Section 5.1). The type I Gini index of uniform RCTs at time is given by
As goes to infinity, we see that converges to . For a large value of , this index is close to , which is consistent with the conclusion drawn in .
We verify our conclusion via a Monte-Carlo experiment, the graphical result of which is depicted in Figure 3. In the experiment, we simulate classes of uniform RCTs at time according to different values of : . For each class of uniform RCTs, the replication number is set at . Note that the size of an arbitrary RCT from any class is deterministic; that is, . For each simulated uniform RCT, we determine the depth of each vertex in our simulations, and compute type I Gini index via the formula in Equation (7). The estimate of this type of Gini index (for each class) is obtained by averaging over all type I Gini indices of the replications.
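The per-tree step of this experiment (depths first, then a Gini index of the depths) might look as follows. This is a sketch under stated assumptions: it uses the standard sample Gini coefficient in its sorted form, which may differ in normalization from Equation (7), and the function names are hypothetical. The root is the leftmost spine vertex, as in the text, so the spine vertex at 0-based position i has depth i and each of its leaves has depth i + 1.

```python
def vertex_depths(leaves):
    """Depths of all vertices of a caterpillar given its leaf counts.

    leaves[i] is the number of leaves attached to the spine vertex at
    0-based position i; the root is the spine vertex at position 0.
    """
    depths = list(range(len(leaves)))       # spine vertices
    for i, x in enumerate(leaves):
        depths.extend([i + 1] * x)          # leaves hang one level below
    return depths


def gini(values):
    """Standard sample Gini coefficient (assumed form, not Equation (7)).

    Uses the sorted-values identity
    G = sum_k (2k - N - 1) x_(k) / (N * sum(x)), with k = 1, ..., N.
    """
    xs = sorted(values)
    size = len(xs)
    total = sum(xs)
    if size == 0 or total == 0:
        return 0.0
    weighted = sum((2 * k - size - 1) * x for k, x in enumerate(xs, start=1))
    return weighted / (size * total)


def type_one_gini(leaves):
    """Gini index of the vertex depths of one caterpillar tree."""
    return gini(vertex_depths(leaves))


if __name__ == "__main__":
    print(type_one_gini([2, 0, 1, 3]))
```

Averaging `type_one_gini` over many simulated trees of the same class would give a Monte Carlo estimate in the spirit of the experiment described above.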
We next conduct an analogous analysis of . The analytic result of estimation is presented in Section 5.2. We find that approaches when is large. In what follows, also converges to for a large value of . This conclusion is also verified via a numerical experiment with the same parametric setting (as for the uniform case); see Figure 3.
According to our computations, type I Gini indices proposed in  are asymptotically identical for two classes of RCTs which grow in completely different manners, suggesting that this type of Gini index fails to distinguish the evolutionary behavior and structural features of the two models. This observation is further supported by four studies of Lorenz curves, depicted in Figure 4. We simulate uniform RCTs and preferential attachment RCTs at time for each of the four values of , which are (top left), (top right), (bottom left) and (bottom right), respectively. We can see that the Lorenz curves of uniform RCTs and preferential attachment RCTs are already close to each other for small values of , and they become indistinguishable for large values of . We thus conclude that type I Gini index cannot be used to quantify the inequality of the distribution of the distance measure of the vertices in different classes of RCTs.
3.2 Type II Gini index
Alternatively, we propose a new type of Gini index, called type II Gini index, which not only accounts for the structure of RCTs, but also precisely characterizes the evolution of the RCTs from different classes.
Instead of including all vertices in our target population, we only consider the vertices on the central path of a tree at time , i.e., . The “wealth” of vertex (for ) is represented by the number of leaves attached to it, i.e., . Thus, we define type II Gini index of (at time ) which is given by
We conduct analogous numerical experiments to calculate type II Gini indices for uniform and preferential attachment RCTs. For each class, we simulate RCTs at time for different values of , and compute for each simulated RCT according to Equation (9). Within each class, we take the average of copies of as the estimate of type II Gini index, and the results of uniform and preferential attachment RCTs are respectively depicted in Figures 6 and 6.
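For a single simulated tree, the index is computed from the leaf counts on the spine. The sketch below uses the standard sample Gini coefficient in its sorted form as a stand-in; whether Equation (9) uses exactly this normalization is an assumption, and the name `type_two_gini` is hypothetical.

```python
def type_two_gini(leaves):
    """Gini index of the leaf counts attached to the spine vertices.

    Assumed form (standard sample Gini coefficient):
    G = sum_k (2k - m - 1) x_(k) / (m * sum(x)), with x_(1) <= ... <= x_(m).
    Returns 0.0 for a tree without leaves.
    """
    xs = sorted(leaves)
    m = len(xs)
    total = sum(xs)
    if m == 0 or total == 0:
        return 0.0
    weighted = sum((2 * k - m - 1) * x for k, x in enumerate(xs, start=1))
    return weighted / (m * total)


if __name__ == "__main__":
    print(type_two_gini([5, 5, 5, 5]))   # perfectly even leaf distribution
    print(type_two_gini([0, 0, 0, 20]))  # all leaves on one spine vertex
```

An even leaf distribution gives an index of 0, while concentrating all leaves on one of m spine vertices gives (m - 1)/m under this normalization, so larger values indicate stronger inequality among the spine vertices.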
We discover that type II Gini indices of both classes of RCTs increase with respect to when time is large, and the speed of increase of type II Gini index of uniform RCTs is much higher than that of preferential attachment RCTs. We make pairwise comparisons and conclude that type II Gini index of preferential attachment RCTs is much larger than that of uniform RCTs for the same value of , which is also reflected in the Lorenz curves presented in Figure 7.
Our numerical results show that type II Gini index of uniform RCTs is small in general when . The phenomenon that type II Gini index of preferential attachment RCTs is larger than that of uniform RCTs (for fixed ) conforms to the evolution of these two classes of RCTs. The leaves are more likely to be evenly distributed among the vertices on the central path of uniform RCTs, whereas in preferential attachment RCTs the leaves tend to be connected to the vertices with higher degrees. The growth of preferential attachment RCTs thus strongly suggests inequality in the distribution of leaves, which is reflected in the larger value of the Gini index in our experiment. Therefore, we conclude that the proposed Gini index successfully characterizes the evolutionary behavior and distinguishes the structure of the two classes of RCTs considered in this paper, which makes it preferable to the one proposed in .
4 Concluding remarks
To sum up, we study several properties of RCTs in this manuscript. We consider RCTs which grow in both uniform and nonuniform ways. For a special type of nonuniform RCTs—preferential attachment RCTs, we exploit a generalized Pólya-Eggenberger urn model to determine the exact joint distribution of the degree variables, as well as the asymptotic joint distribution. Multicolor Pólya-Eggenberger urns have been well studied. Three versions of bivariate distributions generated from Pólya-Eggenberger urns are discussed in . The urn model I defined in [23, Section 3.1] is a special case of our model for . A general result of strong convergence in a multicolor Pólya urn model is given in , in which the asymptotic joint distribution for the proportions of different types of balls is determined. The asymptotic joint distribution in Corollary 1 also can be obtained by applying the result in [10, Theorem 3.1]. In addition, we would like to point out that we are able to determine the asymptotic marginal distributions for for each based on the fundamental property of Dirichlet distribution; that is,
For a special case of , and
both converge asymptotically to uniform distributions on (0, 1).
For the Gini index proposed in this manuscript, we are able to apply the estimation developed in  to get
where is the corresponding random variable with realization of , for . In our future work, we would like to develop some rigorous estimators for in Equation (10). This is feasible for the uniform case, as ’s jointly follow a multinomial distribution. It is well known that an -dimensional multinomial distribution can be approximated by an -dimensional multivariate normal. On the other hand, the exact joint distribution of ’s is not tractable for the preferential attachment model. Note that the asymptotic joint distribution of ’s can be determined by Corollary 1. One possible approach is to consider gamma representations of Dirichlet random variables and establish approximations from there. Further investigations will be conducted and the results will be presented elsewhere.