The disciplines of algebra and combinatorics offer critical approaches and constructs for conducting statistical network analysis. Of particular interest for this paper are algebraic maps from graphs to summary statistics, together with the inverse images of these maps over the set of all simple networks on a fixed number of vertices. These inverse images of singleton sets have been referred to as fibers in the algebraic statistics literature [petrovic2017survey]. We use the same symbol for a map and for the network property it computes. Fibers have been shown to be a useful concept in social network analysis; for example, they have been used as reference sets to conduct exact tests for goodness-of-fit of exponential random graph models [gross2017goodness]. The focus of this paper is on estimating the size of a fiber, that is, the number of graphs for which the network property equals a given value; this quantity has been referred to as a volume factor [Shalizi13].
Understanding how frequently a particular network feature occurs in a set of potential graphs is important for several types of social network analyses. For example, knowing the volume factor can be necessary to calculate the probability distribution on the set of graphs when network property values are known or can be estimated, but no data are available on individual edges. This scenario occurs in the analysis of sampled social network data, particularly in the setting of sexual contact networks. An example arose in the design of the Botswana Combination Prevention Program, a large cluster randomized trial of HIV prevention. For guiding the design, only estimates were available for crucial network properties, such as degree distribution (proportion of people with given numbers of sexual contacts) and proportion of sexual contacts that occur between residents of the same community [wang2014sample]. From knowledge of the volume factor, however, we are able to estimate the probability of any given graph as the reciprocal of the volume factor multiplied by the probability estimate for the corresponding property value. Such knowledge permits simulation of processes operating on networks and of the impact of interventions on them, which can be useful for investigating the suitability of different study designs.
In addition, the ability to estimate volume factors enables investigation of the diversity of graphs in a fiber. For example, a fiber defined only by the number of edges in a graph might contain a large number of unique degree distributions (high diversity) or few (low diversity). Knowledge about the size and diversity of a fiber can aid in understanding mixing times for MCMC schemes that sample graphs from the fiber (small size or low diversity can result in faster mixing). Knowledge about the diversity can also aid in assessing the generalizability of simulation studies based on networks sampled from a fiber (higher diversity may suggest greater generalizability of study results).
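The notions of fiber size and diversity can be illustrated by brute force on a very small example. The sketch below is illustrative only and is not the estimation method of this paper: it enumerates every labeled graph on 5 vertices with 4 edges (the fiber for the edge count) and counts how many distinct degree distributions occur in it. The function names and the choice of 5 vertices and 4 edges are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def degree_distribution(n, edges):
    """Return a tuple whose k-th entry is the number of vertices with degree k."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    counts = Counter(deg)
    return tuple(counts.get(k, 0) for k in range(n))

def edge_count_fiber(n, m):
    """Enumerate every labeled simple graph on n vertices with exactly m edges."""
    possible_edges = list(combinations(range(n), 2))
    return list(combinations(possible_edges, m))

n, m = 5, 4
fiber = edge_count_fiber(n, m)
by_distribution = Counter(degree_distribution(n, g) for g in fiber)

print("fiber size:", len(fiber))  # fiber size: 210
print("distinct degree distributions:", len(by_distribution))
for dist, count in sorted(by_distribution.items()):
    print(dist, count)            # size of each degree-distribution sub-fiber
```

Exhaustive enumeration of this kind is feasible only for a handful of vertices, which is why estimates of fiber sizes are needed for networks of realistic size.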
Graph enumeration is a well-established area of combinatorics concerned with counting networks that have a particular feature. Counting procedures fall into two categories based on whether vertices in the graph are labeled or unlabeled. In the former, the vertices of the graph are labeled in a way that makes them distinguishable from one another. In the latter, all permutations of the vertices are considered to form the same graph. In social network analysis, our area of interest, vertices are most often distinguishable from each other; hence, we focus on labeled graph enumeration.
Graph enumeration dates back at least to 1889, when Cayley provided an equation to count the number of labeled trees, that is, connected graphs that contain no cycles [cayley1889theorem]. Equations to calculate the number of labeled graphs with various characteristics have since been reported; these include rooted graphs, connected graphs, and directed graphs. Considerable research has been devoted to estimating the number of labeled graphs with a given degree sequence, a property important in social network analysis [read1959enumeration, bender1978asymptotic, bollobas1980probabilistic, liebenau2017asymptotic]. However, there has been little research focusing on other important properties in social network analysis, such as degree mixing and clustering. Harary and Palmer [harary2014graphical] provide an excellent introduction to graph enumeration.
This paper presents a general recursive formula to estimate the number of labeled graphs for given values of graph properties of relevance to social network analysis: number of edges (graph density), degree sequence, degree distribution, mixing by nodal covariates, and degree mixing. Section 2 presents the formula along with details for evaluating it for specific network properties. For settings in which alternative formulas exist, Section 3 presents simulation studies demonstrating the degree of similarity among these methods. In Section 4, we apply the proposed approach to estimate the number of labeled graphs associated with different values of degree distribution and degree mixing that arise from the Barabási–Albert model [BA99], in order to investigate the diversity within the fibers associated with this model. The paper concludes with a discussion and directions for further research.
2 Recursive formula for graph enumeration
We represent a network, , as an adjacency matrix with dimensions equal to the size of set . Therefore, has dimensions , where denotes the size of set . Let represent the number of vertices in , i.e., . Let denote the vertices in set . Let indicate that there is an edge between and , where , while indicates that there is no edge and denote the neighbors of as , i.e., .
Equation 1 provides a recursive formula to estimate the number of graphs, , with specific value(s), , for particular network properties, :
where is the ratio between the sizes of fibers and , i.e.,
Goyal et al. [goyal2014sampling] provide equations to calculate for a range of network properties including number of edges, mixing by nodal covariates, degree distribution, degree mixing, and clustering when and are specified such that there exist graphs and where:
and differ by the presence or absence of a single edge,
Although there is no constraint on , it is often useful to set equal to the specific value of the network properties associated with the empty graph; hence, typically, = 1.
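The structure of the recursion can be written as a telescoping product. The display below is a sketch in assumed notation, not a transcription of Equation 1: $\varphi$ stands for the network property, $F_\varphi(x)$ for the fiber at value $x$, $x_0, x_1, \dots, x_k$ for a chain of property values ending at the target $x_k$, and $r$ for the ratio of consecutive fiber sizes.

```latex
\[
  |F_\varphi(x_k)|
  \;=\; |F_\varphi(x_0)| \prod_{i=1}^{k} \frac{|F_\varphi(x_i)|}{|F_\varphi(x_{i-1})|}
  \;=\; |F_\varphi(x_0)| \prod_{i=1}^{k} r(x_{i-1}, x_i)
\]
```

When $x_0$ is the property value of the empty graph and only the empty graph attains it, the leading factor equals 1 and the fiber size reduces to the product of the ratios.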
In the sections below, we provide details for calculating for a given number of edges, degree distribution, degree sequence, mixing by nodal covariates, and degree mixing. We make use of for specifying and set equal to the specific value of the network properties associated with the empty graph.
2.1 Number of edges
Although there is a closed-form expression for calculating the number of labeled graphs with the number of edges equal to , we use Equation 1 for this calculation to illustrate its use in graph enumeration.
Let denote the network property for the number of edges. Specify as . As only the empty graph has edges, . To prove that this specification satisfies , let be a set of distinct edges. Let denote the network formed with the first edges from , i.e., contains edges . Based on the definition of , , , and and differ by a single edge. Since satisfies , we can use results from Goyal et al. [goyal2014sampling] to calculate as shown below:
Using Equation 3 along with the specification of as and , it is possible to calculate . Section 3.1 provides a comparison between the recursive formula and a previously established formula.
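For the number of edges, the chain of ratios can be sketched directly. The code below assumes the ratio between the number of graphs with $i$ edges and the number with $i-1$ edges takes the form $(N - i + 1)/i$, where $N$ is the number of vertex pairs; this is the standard binomial recurrence and may differ in presentation from Equation 3. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def count_graphs_with_m_edges(n, m):
    """Count labeled simple graphs on n vertices with m edges by chaining ratios."""
    total_pairs = comb(n, 2)   # number of possible edges
    size = Fraction(1)         # only the empty graph has 0 edges
    for i in range(1, m + 1):
        # Each graph with i - 1 edges extends to (total_pairs - i + 1) graphs
        # with i edges, and each graph with i edges is reached from i of them.
        size *= Fraction(total_pairs - i + 1, i)
    return int(size)

print(count_graphs_with_m_edges(5, 4))  # 210
```

Exact rational arithmetic is used here so that the result can be compared with the closed-form count without rounding; for large graphs the same product would normally be accumulated on the log scale.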
2.2 Degree distribution
The degree of vertex , denoted as , is the number of edges the vertex has with other vertices in ; therefore . Let
represent the vector of degrees for nodes in set, commonly referred to as a degree sequence. The degree distribution, denoted as , is a vector representing the number of these degrees over all vertices in set ; the entry represents the number of vertices having degree , i.e., . Let denote the network property for the degree distribution.
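As a concrete illustration of these definitions, the following sketch computes the degree sequence and degree distribution from an adjacency matrix. Indexing the degree distribution from degree 0 is an assumption made for the example, as are the function names.

```python
from collections import Counter

def degree_sequence(adj):
    """Row sums of a symmetric 0/1 adjacency matrix give each vertex's degree."""
    return [sum(row) for row in adj]

def degree_distribution(adj):
    """Entry k is the number of vertices with degree k, for k = 0, ..., n - 1."""
    n = len(adj)
    counts = Counter(degree_sequence(adj))
    return [counts.get(k, 0) for k in range(n)]

# Star graph: vertex 0 is joined to vertices 1, 2, and 3.
adj = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
print(degree_sequence(adj))      # [3, 1, 1, 1]
print(degree_distribution(adj))  # [0, 3, 0, 1]
```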
To calculate , the number of labeled graphs with degree distribution , we specify by leveraging the Havel–Hakimi algorithm. Let be any degree sequence that is consistent with . The Havel–Hakimi algorithm permits identification of a set of edges, denoted as , that can be used to construct a graph with degree sequence [SLH62, VH55]. Algorithm 1 provides a procedure to identify .
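A minimal sketch of a Havel–Hakimi construction is given below. It follows the textbook form of the algorithm (repeatedly join the vertex of largest remaining degree to the vertices with the next-largest remaining degrees) and is not necessarily identical to Algorithm 1; the function name and the tie-breaking order are assumptions.

```python
def havel_hakimi_edges(degrees):
    """Return an edge list realizing `degrees`, or None if it is not graphical."""
    if any(d < 0 for d in degrees):
        return None
    remaining = dict(enumerate(degrees))  # vertex -> degree still to be filled
    edges = []
    while remaining:
        # Take the vertex with the largest remaining degree.
        v = max(remaining, key=lambda u: remaining[u])
        d = remaining.pop(v)
        if d == 0:
            break  # every other remaining degree is also 0
        # Join it to the d other vertices with the largest remaining degrees.
        targets = sorted(remaining, key=lambda u: remaining[u], reverse=True)[:d]
        if len(targets) < d or remaining[targets[-1]] == 0:
            return None  # too few vertices with spare degree
        for u in targets:
            remaining[u] -= 1
            edges.append((min(u, v), max(u, v)))
    return edges

print(havel_hakimi_edges([3, 3, 2, 2, 1, 1]))
print(havel_hakimi_edges([3, 3, 1, 1]))  # None
```

The order in which edges are appended gives one possible ordering of the edge set, so that the graph built from the first few edges is well defined at every step.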
Let denote the network formed with the first edges from , i.e., contains edges . Let denote the degree distribution associated with network , i.e., . Based on this definition of , as only the empty graph has degree distribution . In addition, satisfies . Let be the single edge that differs between and . Based on results from Goyal et al. [goyal2014sampling]:
and based on Newman [MN02],
Using Equation 4 along with the specification of as defined above and , it is possible to calculate .
2.3 Degree sequence
Let denote the network property for the degree sequence. The number of graphs with degree sequence , , can be computed by dividing the number of labeled graphs with the degree distribution consistent with , denoted as , by the number of permutations of assigning vertices to degrees. Specifically,
Section 3.2 provides a comparison between the presented recursive formula and a formula by Liebenau et al. [liebenau2017asymptotic].
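The relationship between the two counts can be checked by brute force on a small example. The sketch below assumes that the number of assignments of vertices to degrees is the multinomial coefficient $n!/\prod_k c_k!$, where $c_k$ is the number of vertices with degree $k$; the example degree sequence and all names are chosen for illustration.

```python
from collections import Counter
from itertools import combinations
from math import factorial

def degrees_of(n, edges):
    """Degree of each vertex, in vertex order."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return tuple(deg)

def all_graphs(n):
    """Yield the edge set of every labeled simple graph on n vertices."""
    pairs = list(combinations(range(n), 2))
    for m in range(len(pairs) + 1):
        yield from combinations(pairs, m)

n = 5
target_sequence = (2, 2, 2, 1, 1)
target_multiset = sorted(target_sequence)

with_sequence = 0      # degrees match vertex by vertex
with_distribution = 0  # degrees match up to reordering of vertices
for g in all_graphs(n):
    deg = degrees_of(n, g)
    with_sequence += deg == target_sequence
    with_distribution += sorted(deg) == target_multiset

# Number of distinct ways to assign the degree values to labeled vertices.
arrangements = factorial(n)
for multiplicity in Counter(target_sequence).values():
    arrangements //= factorial(multiplicity)

print(with_distribution, arrangements, with_sequence)
```

The first printed value equals the product of the other two, which is the division described above.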
2.4 Mixing by nodal covariates
Let represent the vector of discrete characteristics for individual in network . Let be a vector containing the characteristics of all vertices. The characteristic distribution, denoted as , is a vector representing the number of individuals with these characteristics over all vertices; the entry represents the number of vertices having characteristic pattern , i.e., . Let be a matrix representing the mixing pattern of network . is a symmetric matrix, where is the number of distinct characteristic patterns. The entry is the total number of edges between a vertex with characteristic and vertex with characteristic . Let denote the network property for mixing by nodal covariates.
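The mixing matrix can be computed from an edge list and a vector of covariate patterns as sketched below. Patterns are coded as integers starting at 0, and each within-pattern edge is counted once on the diagonal; both conventions are assumptions for this example and may differ from those used in the paper.

```python
def mixing_matrix(edges, pattern_of, num_patterns):
    """Symmetric matrix whose (a, b) entry counts edges joining patterns a and b."""
    M = [[0] * num_patterns for _ in range(num_patterns)]
    for u, v in edges:
        a, b = pattern_of[u], pattern_of[v]
        M[a][b] += 1
        if a != b:
            M[b][a] += 1  # mirror off-diagonal counts to keep the matrix symmetric
    return M

# Vertices 0 and 1 have pattern 0; vertices 2 and 3 have pattern 1.
pattern_of = [0, 0, 1, 1]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
print(mixing_matrix(edges, pattern_of, 2))  # [[1, 2], [2, 1]]
```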
To calculate the number of labeled graphs with mixing matrix , specify as the following for ( is symmetric):
where is the number of distinct characteristic patterns. Therefore, as only the empty graph has for all entries of the mixing matrix. Also, this specification satisfies . To show this, let be a set of distinct edges where the first are between vertices with covariate pattern , the next are between vertices with covariate patterns and , and so on. Let denote the network formed with the first edges from , i.e., contains edges . Based on the definition of , , , and and differ by a single edge; let be that edge. Therefore, is the following:
2.5 Degree mixing
Let be a matrix representing the degree mixing pattern of network . The entry is the total number of edges between vertices of degree and . Let denote the network property for the degree mixing matrix.
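Because most entries of a degree mixing matrix are zero for sparse graphs, one convenient representation stores only the nonzero entries. The sketch below, with hypothetical names, maps each unordered pair of degrees to the number of edges joining vertices of those degrees.

```python
from collections import Counter

def degree_mixing(n, edges):
    """Map each degree pair (j, k), j <= k, to the number of edges joining them."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return Counter(tuple(sorted((deg[u], deg[v]))) for u, v in edges)

# Path graph 0 - 1 - 2 - 3: the end vertices have degree 1, the inner ones degree 2.
path_edges = [(0, 1), (1, 2), (2, 3)]
print(dict(degree_mixing(4, path_edges)))  # {(1, 2): 2, (2, 2): 1}
```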
To calculate the number of labeled graphs with degree mixing , we follow a similar approach as that for degree distribution. Specifically, we use a constructive proof for assessing whether a degree mixing matrix is graphical in order to specify a set of edges, , that can be used to construct a graph with degree mixing [goyal2014sampling]; Algorithm 2 provides a procedure to construct .
Let denote the network formed with the first edges from , i.e., contains edges . Let denote the degree mixing associated with network , i.e., . As with number of edges and degree distribution, satisfies conditions and takes on the value associated with the empty graph. Therefore, . Let be the single edge that differs between and . Based on results from Goyal et al. [goyal2014sampling]:
and based on concepts from Newman [MN02], if ,
where, if and if and and denote the number of vertices that are neighbors of and and equal to .
Using Equation 12 along with the specification of as defined above and , it is possible to calculate .
2.6 Additional network properties and bipartite graphs
The recursive formula and associated framework we propose can be used to calculate the number of labeled graphs for many additional network properties. In particular, Goyal et al. [goyal2014sampling] provide equations for for clustering (controlling for degree mixing) and mixing by nodal covariates (controlling for degree distribution). In addition, Goyal et al. [goyal2018inference] enable the calculation of to be extended to the setting of bipartite networks.
3 Comparison with previous research
3.1 Number of edges
The number of graphs of size with edges equals the following:
This holds because there are possible edges and of those edges are selected.
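Written out, with $n$ for the number of vertices, $m$ for the number of edges, and $N$ for the number of vertex pairs (symbols assumed here for illustration), the closed-form count described above is:

```latex
\[
  \binom{\binom{n}{2}}{m} \;=\; \frac{N!}{m!\,(N-m)!},
  \qquad N = \binom{n}{2} = \frac{n(n-1)}{2}
\]
```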
To illustrate the use and validity of the recursive formula, we compare the estimates of based on the proposed recursive formula to the formula in Equation 17; we consider values of ranging from for graphs of size .
For each value of , Table 1 provides the log values for and based on Equation 1 and Equation 17. The estimates obtained from the proposed recursive formula and the known formula are identical, as expected given that a closed-form equation for exists.
3.2 Degree sequence
Liebenau et al. [liebenau2017asymptotic] proved a general asymptotic formula, conjectured in 1990, for the number of graphs with a given degree sequence. They also provide a formula that converges to the number of d-regular graphs for any value of d as ; a d-regular graph is one in which each vertex has degree exactly . We compare estimates of , where is the degree sequence and for all , obtained from this asymptotic formula to those from the proposed recursive formula. Our estimates in this section are based on being the degree sequence .
Figure 1 shows log estimates for for the two approaches. The red bars depict estimates based on the recursive formula introduced in this manuscript; the blue bars are estimates based on Liebenau et al. [liebenau2017asymptotic]. Each plot in Figure 1 shows log estimates for networks of size 1000, 5000, and 10000. The log estimates differ by less than .
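For orientation, the sketch below evaluates, on the log scale, the form in which the asymptotic count of $d$-regular graphs is commonly stated: $\sqrt{2}\,e^{1/4}\,\bigl(\mu^{\mu}(1-\mu)^{1-\mu}\bigr)^{\binom{n}{2}}\binom{n-1}{d}^{n}$ with $\mu = d/(n-1)$. This form is an assumption here rather than a transcription from [liebenau2017asymptotic] and should be checked against that source before use; the function names are hypothetical.

```python
from math import exp, lgamma, log

def log_comb(a, b):
    """Natural log of the binomial coefficient C(a, b)."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def log_regular_graph_count_asymptotic(n, d):
    """Log of the assumed asymptotic count of labeled d-regular graphs on n vertices."""
    if not 0 < d < n - 1:
        raise ValueError("requires 0 < d < n - 1")
    if (n * d) % 2:
        raise ValueError("n * d must be even for a d-regular graph to exist")
    mu = d / (n - 1)            # edge density of a d-regular graph
    pairs = n * (n - 1) / 2     # number of vertex pairs
    return (0.5 * log(2) + 0.25
            + pairs * (mu * log(mu) + (1 - mu) * log(1 - mu))
            + n * log_comb(n - 1, d))

# There are exactly 70 labeled 2-regular graphs on 6 vertices
# (60 six-cycles and 10 pairs of disjoint triangles).
print(exp(log_regular_graph_count_asymptotic(6, 2)))
print(log_regular_graph_count_asymptotic(1000, 3))
```

Even at $n = 6$ the approximation lands within about ten percent of the exact count, and working with `lgamma` keeps the computation stable for the network sizes considered in Figure 1.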
4 Fiber sizes for the Barabási–Albert model
In this section, we estimate the fiber sizes associated with degree distributions and degree mixing matrices when graphs are generated using the Barabási–Albert (BA) model. In addition, we estimate fiber sizes of graphs that are generated under different models, specifically the Erdős–Rényi (ER) model and the configuration (CONF) model, but constrained to share specific values for network properties with the graphs from the BA model. Comparing these fiber sizes shows how large the fibers associated with BA graphs are relative to those of the other models. Finally, we estimate the diversity in degree mixing matrices consistent with degree distributions formed by the BA model. First, we provide details of the BA model.
The BA model can be initiated with a small seed graph that grows by the addition of new vertices one at a time. Each new vertex forms new edges with existing vertices based on preferential attachment rules. Vertices and edges, once introduced, are never deleted. The BA model fixes the number of (undirected) edges connected to each new vertex. (Note that the model can be modified in various ways; our focus here is conceptual, so we consider only the original versions of these models.) The BA model provides one mechanism to generate graphs with a fat-tailed degree distribution, specifically a power-law degree distribution, in which the probability that a vertex in the graph has a given degree decays as a power of that degree. By contrast, the degree distributions for the ER model follow a binomial distribution.
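The growth process can be sketched in a few lines. The version below is a minimal sketch under stated assumptions: the seed consists of $m$ isolated vertices, the first new vertex attaches to all of them, and every later vertex attaches to $m$ distinct existing vertices chosen with probability proportional to degree. The seed choice, the function name, and the parameter names are assumptions for this example.

```python
import random

def barabasi_albert_edges(n, m, seed=None):
    """Grow a graph to n vertices; each new vertex attaches to m existing ones."""
    if not 1 <= m < n:
        raise ValueError("requires 1 <= m < n")
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))  # the first new vertex attaches to the m seed vertices
    # Each vertex appears in `endpoints` once per incident edge, so a uniform
    # draw from this list selects a vertex with probability proportional to degree.
    endpoints = []
    for new in range(m, n):
        for t in targets:
            edges.append((t, new))
            endpoints.extend((t, new))
        # Choose m distinct attachment targets for the next vertex.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        targets = list(chosen)
    return edges

edges = barabasi_albert_edges(50, 2, seed=1)
print(len(edges))  # 96
```

Keeping a list of edge endpoints avoids recomputing degrees at every step, at the cost of memory proportional to the number of edges.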
4.1 Comparison of the BA and ER models
To compare the sizes of the fibers based on degree distribution associated with the BA and ER models, we generate 100 graphs using the BA model (), denoted as , and 100 graphs using the ER model, denoted as , such that .
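One way to obtain an ER-type graph matched to a BA graph on the number of edges is to draw the required number of vertex pairs uniformly without replacement. The sketch below assumes that edge count is the matching criterion; the function name is hypothetical.

```python
import random
from itertools import combinations

def random_graph_with_m_edges(n, m, seed=None):
    """Sample m distinct edges uniformly from all vertex pairs on n vertices."""
    rng = random.Random(seed)
    pairs = list(combinations(range(n), 2))
    return rng.sample(pairs, m)

er_edges = random_graph_with_m_edges(50, 96, seed=1)
print(len(er_edges))  # 96
```

Listing all vertex pairs is practical for graphs of this size; for much larger graphs a rejection scheme that draws pairs one at a time would avoid materializing the full list.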
Figure 2 shows density plots for the log estimates for in the first panel; , second panel; and their differences, third panel. Although and have the same number of edges, the number of graphs with a power-law degree distribution derived from a BA model is much smaller than the number of graphs whose degrees follow a binomial distribution derived from an ER model.
4.2 Comparison of the BA and CONF models
The BA model also produces non-random structure in other network properties besides degree distribution; these include correlations between the degrees of connected vertices [qu2015effects]. To compare the sizes of the fibers based on degree mixing between graphs generated using the BA model and graphs sampled uniformly with the same degree distribution, we generate 100 graphs using the configuration model, denoted as ; such that [MN10].
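A simple way to sample uniformly among simple graphs with a given degree sequence is stub matching with rejection: pair the edge stubs at random and discard the whole pairing if it contains a self-loop or a repeated edge. The sketch below illustrates this approach and is not necessarily the sampler used for the results in this section; the names and the retry limit are assumptions.

```python
import random

def configuration_model_simple(degrees, seed=None, max_tries=10000):
    """Sample a simple graph with the given degree sequence by stub matching."""
    if sum(degrees) % 2:
        raise ValueError("degree sum must be even")
    rng = random.Random(seed)
    # One stub per unit of degree; pairing stubs at random forms the edges.
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    for _ in range(max_tries):
        rng.shuffle(stubs)
        edges = set()
        simple = True
        for i in range(0, len(stubs), 2):
            u, v = stubs[i], stubs[i + 1]
            edge = (min(u, v), max(u, v))
            if u == v or edge in edges:
                simple = False  # self-loop or multi-edge: reject this pairing
                break
            edges.add(edge)
        if simple:
            return sorted(edges)
    raise RuntimeError("no simple graph found within max_tries pairings")

print(configuration_model_simple([2, 2, 2, 1, 1], seed=0))
```

Rejecting the entire pairing, rather than repairing it locally, is what keeps the accepted graphs uniformly distributed; the price is that acceptance becomes rare for heavy-tailed degree sequences such as those produced by the BA model.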
Figure 3 shows density plots for the log estimates for (first panel), (second panel), and their difference (third panel). The number of graphs with a degree mixing matrix derived from a BA model is similar on the log scale to the number of graphs with a degree mixing matrix derived from the configuration model.
4.3 Diversity of degree mixing matrices
From Section 4.1, the average number of labeled graphs associated with a degree distribution generated from the BA model was estimated as (exponential of the mean of the first panel in Figure 2). This section explores the diversity within this large collection of graphs. Specifically, we estimate the number of distinct degree mixing matrices associated with a degree distribution generated from the BA model.
Figure 4 shows a density plot for the log estimates for the number of distinct degree mixing matrices associated with a degree distribution generated from the BA model. The exponential of the mean gives an estimate of distinct degree mixing matrices.
This paper presents a general recursive formula to estimate the number of labeled graphs with specific values for graph properties of interest. We consider those with particular relevance for social network analysis: number of edges (graph density), degree sequence, degree distribution, mixing by nodal covariates, and degree mixing. The proposed method can easily be extended to additional network properties, including clustering, as well as to bipartite graphs; the formulas for Equation 2 are currently available. The proposed recursive formula differs from other available approaches for graph enumeration both in its overall approach and in the breadth of network properties that can be considered; it may be profitable to investigate the theoretical connections between the proposed method and other approaches.
This research is supported by a grant from the National Institutes of Health (R37 AI-51164).