ConnectedGraphSampler
Importance sampling for connected graphs with given degrees
view repo
We describe a new method for the random sampling of connected networks with a specified degree sequence. We consider both the case of simple graphs and that of loopless multigraphs. The constraints of fixed degrees and of connectedness are two of the most commonly needed ones when constructing null models for the practical analysis of physical or biological networks. Yet handling these constraints, let alone combining them, is non-trivial. Our method builds on a recently introduced novel sampling approach that constructs graphs with given degrees independently (unlike edge-switching Markov Chain Monte Carlo methods) and efficiently (unlike the configuration model), and extends it to incorporate the constraint of connectedness. Additionally, we present a simple and elegant algorithm for directly constructing a single connected realization of a degree sequence, either as a simple graph or a multigraph. Finally, we demonstrate our sampling method on a realistic scale-free example, as well as on degree sequences of connected real-world networks, and show that enforcing connectedness can significantly alter the properties of sampled networks.
READ FULL TEXT VIEW PDFImportance sampling for connected graphs with given degrees
From the active scaffolding of actomyosin in the cell’s cortex to the underlying gene expression machinery that regulates it, from the neighbourhood interactions of grains in a sand pile to those of the engineered struts and cables in a suspension bridge, and from the flow of virtual traffic on the internet to, critically in the time of COVID-19, the web of contacts that allow the spread of viral contagion, network structures underlie the vast majority of sufficiently complex real-world systems. Unsurprisingly, then, a great deal of focus has been placed on the furtherance of our understanding of how these network structures affect and ultimately determine the physical, biological, and social phenomena that play out over them. Indeed, the explosive growth of the fields of network science and complexity science in the last two decades is a direct consequence of this focus.
As is to be expected in such a young field, however, there remain fundamental challenges and difficulties with the approach. One such is the surprising difficulty in translating the simple concept of the null hypothesis into a network setting. Done directly such a translation would read something like: “there is no relationship between the network structure or properties and the observed or measured phenomena of interest.” But of course one cannot simply compare the case of phenomena potentially arising from some specific network structure with a case of
no network at all, forcing one to conclude that the correct operational statement of the null hypothesis in the complex network milieu must be: “there would be no difference in the observed or measured output phenomena if the specific underlying network were to be replaced by a generic network.” And herein lies the rub. What is a generic network? Surely one can demand that the generic network—or distribution of generic networks—satisfies some small set of constraints in order to ensure relevance to the biology, physics, or social dynamics under consideration. For example, in epidemiological viral spreading models it would be of no use to consider a heavily disconnected network with many small individual components to be among the generic networks. In fact, it is the unfortunate state of affairs that this simple issue is so tricky that many network and complexity science results are reported and accepted without reference to a test of the null hypothesis! But before we can fruitfully return to the question of membership amongst the relevant generic networks, we must first briefly discuss the generation of distributions of networks.Indeed, the so-called random graph models
are among the most powerful tools of network science. Essentially, a random graph model is simply a probability distribution defined over a set
of graphs. Often, such models are defined through an explicit stochastic graph construction process: the Watts–Strogatz model [1] and the preferential attachment model [2, 3] are some well-known examples. Usually, such constructive models are introduced and studied because the graphs they produce have some interesting or relevant property: The Watts–Strogatz model can produce graphs with the “small-world” property, which is famously present in social networks. The preferential attachment model can produce “scale-free” graphs, i.e. graphs with a power-law degree distribution, a much-studied property which many real-world networks possess [4, 5]. However, not all scale-free networks can be produced by the preferential attachment mechanism, and one cannot make general statements about all scale-free networks based only on those generated by a preferential attachment model. Therefore, for some purposes, it is useful to define random graph models not through a construction process, but by directly imposing a property of interest. The simplest way to define such a model (i.e. distribution) is to constrain its supportto include only those graphs that possess a given property, and assign the same probability to all elements of
. The graph ensemble obtained this way represents the property of interest the best. A related approach constrains the averages of some numerical graph properties and defines the distribution over to be the one with maximal entropy, which leads to exponential random graph models [6, 7, 8].Returning to the challenge of rendering the null model hypothesis in a network setting, constraint-based random graph models are particularly useful as null models. Null models are used to determine if an interesting observed feature of some empirical network can be explained by another simpler feature. The simpler feature is used as a constraint to define a random graph model, which is then compared to the empirical data. Another application is dealing with incomplete empirical data. Sometimes, it is not possible to fully map the connections of a real-world network, either due to practical limitations or, in the case of networks of people, due to privacy concerns [9]. In such cases, the known data can be incorporated as a constraint into a random graph model, and individual networks sampled from the model can be used as proxies for the (unknown) real network. Both applications require being able to computationally generate samples from the model. In the case of constraint-based models this means restricting the set of graphs to only those that satisfy the constraint, then performing uniform sampling. This is usually a difficult problem, as there are no general sampling methods that work with arbitrary constraints. Each constraint requires developing a sampling algorithm specific to it, and combining multiple constraints is a significant additional challenge.
In this paper we consider the problem of sampling connected graphs with a given degree sequence. Constraining the degrees has countless practical applications: It is a frequently used null model, for example when finding network motifs [10], detecting a so-called “rich-club structure” [11] or analysing the assortative structure of networks [12]. Degree-constrained random graphs are also useful as proxies when only the degrees of an empirical network are known, such as in the case of the famous web of sexual connections dataset [9]. The constraint of connectivity is a frequent additional requirement: Many real-world networks, such as vasculature, brain networks, or molecules (networks of atoms), are always connected. Commonly used network measures like closeness centrality are only meaningful for connected graphs. Processes such as epidemic spreading must be modelled on connected networks. In this work we present a novel method to handle these two constraints, degrees and connectivity, simultaneously.
The article is organized as follows: Section 2 introduces the mathematical background used in later parts of the article, and reviews existing sampling methods for graphs with constrained degrees. Section 3 presents a new and simple algorithm to construct a single connected graph with given degrees. Section 4 presents a recently introduced family of importance sampling methods for graphs with constrained degrees, and shows how to incorporate the additional constraint of connectivity. Finally, section 5 demonstrates the practical applicability of the method on both synthetic and real-world examples.
In this section we introduce the concepts, terminology and notations used in the rest of the work. We say that a graph is simple if it has no multi-edges or self-loops, i.e. if any two vertices have at most one connection between them, and no vertex is connected to itself. The degree of a vertex is the number of connections it has. The degree sequence of a graph on vertices is simply the collection of its vertex degrees, . If the degree sequence of a graph is , we say that the graph realizes the degree sequence .
Since each edge in a graph connects to a vertex at both of its endpoints, the sum of the degrees in a graph is twice the number of its edges, an even number. This statement is commonly known as the handshaking lemma. But not every even-sum sequence of integers is realizable as a simple graph. For example, can only be realized by either a graph that includes self-loops, , or one that includes multi-edges, .
A degree sequence is said to be graphical if there is a simple graph that realizes it.
The well-known Erdős–Gallai theorem provides a direct way to check if a degree sequence is graphical.
Let be a degree sequence. There is a simple graph that realizes this degree sequence if and only if is even and
(1) |
for every .
Tripathi and Vijay have shown that it is sufficient to check these inequalities for those where and for [14]. Using this stricter version of the theorem, it is possible to perform the checks in linear computational time. Király [15] and Cloteaux [16] describe two such linear-time algorithms for testing graphicality.
A degree sequence is said to be multigraphical if there is a graph, potentially containing multi-edges, but no self-loops, that realizes it. We refer to such a graph as a loopless multigraph.
Let be a degree sequence. There is a loopless multigraph that realizes if and only if is even and
(2) |
where denotes the largest degree in .
The proof of theorem 2 is given in Appendix A.1.
Not every graphical or multigraphical degree sequence has a connected realization. For example, the degree sequence is only realized by the non-connected graph .
A degree sequence is said to be potentially connected if it has a realization that is connected.
The concept of potential connectedness also applies to degree sequences which only have non-simple realizations. However, it can be shown that all potentially connected sequences that are graphical have connected realizations that are also simple.
In this paper we consider so-called labelled graphs, i.e. we consider the vertices to be distinguishable. Thus, the degree sequence is taken to have two isomorphic but distinct realizations as and .
There are two widely used approaches to uniformly sampling simple labelled graphs with a prescribed degree sequence: (1) “stub-matching” algorithms such as the configuration model and (2) Markov chain Monte Carlo sampling based on degree-preserving edge switches. We briefly review both families of methods, and consider how the additional constraint of connectivity can be incorporated into them.
The configuration model, also called the pairing model, is probably the simplest and most widely known approach to generating random graphs with a given degree sequence. The sampling algorithm proceeds as follows: Let us consider each vertex with as many unconnected stubs as its degree, as shown in figure 1. Then repeatedly pick two not-yet-connected stubs uniformly at random and connect them, until there are no unconnected stubs left. This algorithm may clearly produce graphs that are not simple (i.e. they have multi-edges or self-loops). Such graphs are simply rejected, and the generation procedure is restarted.
The configuration model’s algorithm produces each simple realization of the degree sequence with the same probability (although the same is not true for non-simple ones) [17]. Therefore, by rejecting the non-simple outcomes, the simple realizations can be sampled uniformly. It is important to note that if the outcome is non-simple, the generation procedure must be restarted from the beginning. It is not sufficient to merely reject any stub pairing that creates a non-simple edge and choose another one instead. Doing this would no longer produce each realization with the same probability.
The configuration model works well for sparse graphs that have small degrees. However, as the graph gets denser, the probability that the algorithm generates a non-simple graph, which must be rejected, increases quickly. For dense graphs, the rejection rate becomes too high for this sampling method to be computationally feasible. The same is true for degree sequences of sparse graphs that have a few very high degree vertices, such as scale-free and other heavy tail degree distributions, which are commonly observed in real-world networks [4, 5]. Therefore, the configuration model is only practical in some limited situations.
The constraint of connectivity can be incorporated trivially into the configuration model: simply reject any non-connected outcomes along with the non-simple ones. However, usually, most realizations of a sparse degree sequence are not connected, increasing the rejection rate further. This makes the connected version of the configuration model unfeasible for sparse graphs as well.
Edge-switching Markov chain Monte Carlo (MCMC) methods work by first building a single realization of the degree sequence, then repeatedly modifying the graph using random, degree sequence preserving edge switches like those shown in figure 2. It can be shown that even though not all pairs of edges can be switched without creating a non-simple graph, all simple realizations of a degree sequence can be reached with permissible edge switches. Consequently, a Markov chain constructed using edge switches is irreducible. It can be shown that if the edges to be switched are chosen uniformly at random, and the switch is simply not performed when it would create a multi-edge, then the stationary distribution of the Markov chain will be uniform. Details are given in Appendix B. Sampling can be performed as usual with MCMC, by recording states from the chain at certain intervals.
Incorporating the connectivity constraint into such a sampler is more involved than in the case of the configuration model. The Markov chain is still irreducible if edge switches that would disconnect the graph are forbidden [18]. However, testing whether an edge switch disconnects the graph takes computational time proportional to the size of the graph. Performing this test after every edge switch would make the method impractically slow. While there are published algorithms that make use of information from previous connectivity tests to achieve an average polylogarithmic complexity when a series of incremental changes are made to the graph [19, 20, 21], these algorithms are complicated and their implementation is involved. It is unclear if they would perform sufficiently well in practice. We are not aware of any MCMC-based graph sampler implementation that makes use of them. More practical approaches perform multiple edges switches between connectivity checks [22, 23]
. Frequent connectivity checks would result in bad computational performance, while an insufficient number of checks makes it more likely that the graph becomes disconnected, and therefore the last few edge switches must be reverted. These methods use heuristics to find an optimal number of switches to perform between connectivity checks, and maximize performance.
The disadvantage of MCMC-based methods is that there are no general theoretical results on the mixing time of this Markov chain [24]. Therefore, one must use heuristics to determine how many switches to perform between recording samples to ensure that the successive samples will be sufficiently statistically independent. In this sense, these algorithms are not exact.
In this section we present a new simple and elegant algorithm to build a connected realization of a degree sequence, if one exists. Constructing such a graph is the first step of any edge-switching MCMC sampling algorithm. We will show two versions of the construction process: to build either a simple graph, or a loopless multigraph.
Let us first consider constructing an arbitrary, not-necessarily-connected simple realization of a degree sequence. The Erdős–Gallai theorem provides a quick way to check whether a degree sequence is graphical, but not to construct a corresponding graph. To build a realization of the degree sequence, we can use the well-known Havel–Hakimi theorem.
The degree sequence is graphical if and only if after connecting vertex 1 to the largest-degree vertices, the remaining degree sequence is also graphical.
This theorem can be understood as an algorithm to construct a simple graph: As with the configuration model, we consider the vertices of the graph with as many stubs as their degrees (figure 1). In each step of the algorithm, we select an arbitrary vertex (the “hub”), and connect all of its stubs to the other vertices that have the most unconnected stubs left (highest remaining degree). The hub is then dropped from the degree sequence, along with any other vertices that have no remaining stubs. This step is repeated until no more degrees remain, or until no stubs can be connected without forming a non-simple graph. The theorem states that a degree sequence is graphical if and only if after performing a single step of the algorithm on it, the remaining degree sequence formed by the yet-unconnected stubs is also graphical. Thus, the algorithm will succeed in connecting up all the stubs if and only if the original degree sequence was graphical to begin with. This provides a way to both check the graphicality of a degree sequence and to build one of its realizations at the same time.
The Havel–Hakimi algorithm can construct a realization of a degree sequence, but how can we construct a connected realization? Previously, this has been done by first constructing an arbitrary, not necessarily connected realization, then using appropriately chosen edge switches (figure 2) to connect together the components of the graph [23]. This method is complicated and cumbersome to implement. Here we propose a simple and elegant alternative.
Note that the Havel–Hakimi theorem does not specify which vertex to choose as the hub in each step: any of them will do. Let us refer to choosing the vertex with the smallest remaining degree as a “HH*-step”.
Given a graphical degree sequence, the smallest-first Havel–Hakimi algorithm (i.e. consisting of HH*-steps) will produce a connected graph if and only if the starting degree sequence was potentially connected.
The key to the proof is to show that if the starting degree sequence is potentially connected, then every HH*-step reduces the number of vertices having non-zero remaining degree precisely by one, except in the very last step, when two vertices with remaining degree 1 each are connected to each other to complete the graph. Reversing the order of the steps would then correspond to building a graph by adding one vertex at a time and connecting it to some existing vertices. This clearly results in a connected graph.
Let us think about what kind of degree sequence we must apply a HH*-step to in order to reduce the number of vertices by more than one. The hub vertex is always removed. Additional vertices will only be removed if they only had one remaining stub (i.e. they had degree 1), which was then connected up to the hub vertex. Since we always choose a smallest-degree vertex as the hub, and connect it to the other vertices with the highest degrees, this situation is only possible when both the smallest and largest degree is 1. For example, the degree sequence is transformed to after one HH*-step, i.e. it decreases in size by 2.
Except for , such degree sequences consisting solely of 1s are not potentially connected. In the following, we will show that one HH*-step transforms any potentially connected degree sequence into another potentially connected one, and therefore never removes any other vertex from the degree sequence than the hub, except when connecting two degree-1 vertices at the very last step. Note that with an arbitrary graph construction process, it is not necessary to maintain the potential connectivity of intermediate degree sequences in order to arrive to a connected graph. Maintaining potential connectivity at intermediate stages is a sufficient, but not a necessary condition of obtaining a connected graph. To show that HH* maintains connectivity, we invoke the following lemma:
Let be the degree sequence of a (not necessarily simple) graph. There is a connected realization of this degree sequence if and only if and , or if .
The proof is given in Appendix A.2.
Will the inequality required for potential connectedness in lemma 5 stay valid after modifying the degree sequence with a HH*-step? The right-hand side will decrease by from to . If the selected hub vertex had degree 1, then the left-hand-side also decreases by 1, thus the inequality is maintained.
If the hub vertex had degree , then the sum of degrees is at least , . After one HH* step, the sum of degrees decreases by , thus we only need to show that , which is obviously true for . The inequality is maintained again. ∎
Let us now consider the case of loopless multigraphs, which may be constructed with a procedure analogous to the Havel–Hakimi algorithm.
The degree sequence is multigraphical if and only if after connecting vertex 1 to vertex 2 with a single edge, the remaining degree sequence is also multigraphical.
In simpler terms, in order to construct a loopless multigraph, we may simply select an arbitrary vertex and connect it to a largest-degree one among the other vertices. Repeating this step results in a loopless multigraph if and only if the starting degree sequence was multigraphical. Unlike in the case of the Havel–Hakimi theorem, connections are made one edge at a time.
Clearly, if is multigraphical, then so is . Thus we need only show that the multigraphicality condition of theorem 2, , is maintained after adding a connection between a maximal degree vertex and another vertex. Adding one connection decreases the left-hand-side of the inequality by 1. For the right-hand-side, there are two cases: (1) If only one vertex had maximal degree, or if precisely two vertices had maximal degree and they were connected to each other, then the right-hand-side (i.e. ) also decreases by 1, and the inequality is maintained. (2) If there is more than one maximal degree vertex and the connection was made between a maximal degree and a non-maximal-degree vertex, then does not decrease. However, in this case, the sum of degrees in includes twice, and at at least one more positive term due to the non-maximal-degree vertex. Therefore, , so decreasing the left-hand-side by 1 will not violate the inequality. ∎
We can also formulate the analogue of the theorem 4 for the loopless multigraph case:
Let be a multigraphical degree sequence, and let us repeatedly select the largest remaining degree vertex and the smallest non-zero remaining degree vertex, and connect them with a single edge. This construction procedure results in a connected graph if and only if was potentially connected.
The proof is completely analogous to that of theorem 4, and proceeds in three steps: (1) We will show that after applying a single step of the construction process, the remaining degree sequence stays potentially connected. (2) Therefore, when applying a single step of the construction process to a potentially connected degree sequence, the number of non-zero remaining degrees decreases by no more than one, except in the very last step. (3) Consequently, reversing the order of steps constructs a connected graph.
To show that a single construction step keeps the degree sequence potentially connected, we must prove that the condition of lemma 5, , is maintained. Since adding a single connection decreases the left-hand-side by 1, this inequality could only be violated if and (the number of non-zero degrees), does not decrease after a connection step. But since a smallest-degree vertex is always connected, this could only happen if none of the degrees are 1, i.e. , which would imply that . ∎
There is a simple intuition behind the statements of theorems 4 and 7. If we were to always choose the highest degree vertex as the hub, and connect it to other highest-degree vertices, it would quickly use up the available stubs. There would be insufficient stubs left towards the end of the procedure to connect all components together. Indeed, choosing highest-degree vertices as the hub tends to create graphs with multiple dense components (see figure 3). Conversely, choosing smallest-degree vertices as the hub and connecting them to highest-degree vertices leaves free stubs available. The same intuition raises the question: does the largest-first variant of the algorithm always build a non-connected realization, if one exists? The answer turns out to be no. A counterexample is , which can be split into two graphical degree sequences and , therefore it has a non-connected realization. Yet the largest-first Havel–Hakimi algorithm can only construct a connected one as it must connect the vertex of degree 3 to three degree-2 vertices. To the best of our knowledge, finding the computational complexity of deciding whether a degree sequence has a non-connected realization as a simple graph, i.e. whether it is forcibly connected, is still an open problem. We are not aware of any polynomial-time solutions. An exponential time algorithm was given by Wang [27].
We have contributed an implementation of the construction algorithms for connected simple graphs and connected loopless multigraphs to the igraph C library [28] as igraph_realize_degree_sequence(), and made it conveniently accessible through igraph’s Mathematica interface, IGraph/M [29], as the IGRealizeDegreeSequence function.
Recently, a new family of stub-matching sampling methods was proposed [30, 31, 32, 33]
, which construct each sample directly and independently (unlike edge-switching MCMC methods) and work efficiently in polynomial time (unlike the configuration model). These algorithms do not sample uniformly, but they can compute the exact probability of obtaining a sample at the same time as generating that sample. This makes it possible to “unbias” the samples and estimate any property that characterizes the entire set of realizations of a degree sequence, such as the averages of various graph metrics, similarly to how one might do with uniform sampling. Let
be the set of generated samples, and let denote some numerical property of the graph , such as its diameter, assortativity, clustering coefficient, etc. If the sampling is uniform, we can estimate the average of over all realizations as(3) |
If the sampling is biased, i.e. some graphs are generated with a higher probability than others, then we can re-weight them with to estimate as
(4) |
The same formula can be used if we do not have normalized probability values, but merely sampling weights which are proportional to the probabilities. This is the same principle as the one used in importance sampling.
To illustrate how this class of sampling methods works, let us consider the configuration model again, which pairs the stubs randomly. Along the same lines, we can exhaustively generate all realizations of a degree sequence by connecting up the stubs in all possible ways. This procedure can be thought of as a tree of decisions, like the one shown in figure 4: If there are stubs in total, there will be choices for connecting up the first stub. This is represented by the branches of the tree starting from its root. In the next step (corresponding to the next level of the tree), there will be choices, then
, and so on. The leaves of the decision tree represent the fully constructed graphs.
The configuration model’s algorithm can be thought of as traversing the decision tree randomly, starting at its root, choosing branches uniformly at random at each branching point, and finally arriving at a leaf. This decision tree is symmetric: all tree nodes steps away from the root (i.e. at level of the tree) have the same number of branches, . Therefore, each leaf is reached with the same probability , where denotes the double factorial. While each labelled graph appears as more than one tree leaf, all simple realizations appear the same number of times, with multiplicity . This explains why the configuration model samples uniformly if non-simple outcomes are rejected. If we admit loopless multigraph outcomes as well, then the number of leaves that a graph appears as decreases by a factor of , where denotes the number of edges between vertices and [17].
The part of the decision tree that leads to simple graphs is highlighted in red in figure 4. The core idea behind this new class of sampling methods is to traverse only this feasible subtree. The feasible subtree is not, in general, symmetric, therefore its leaves will not be sampled uniformly. However, the inverse sampling weight of a leaf can be computed by multiplying the number of feasible branches at each branching point on the path going from the tree root to the leaf. If not all graphs appear as the same number of leaves (as is the case with multigraphs), then the sampling weights used in equation (4) must be divided by the appropriate multiplicity. Through this approach, it is straightforward to take any algorithm that systematically generates all realizations of a degree sequence, and convert it into a random sampling algorithm. Instead of traversing all branches in its decision tree, simply pick a random branch to follow at every step. In order for such an algorithm to be efficient and practical, the following requirements must be met: (1) the multiplicity of each graph, i.e. the number of leaves that correspond to it, must be computable (2) it must be possible to count the feasible branches at each branching point, and select one of them efficiently. We note that depending on the exhaustive generation algorithm that the sampling is based on, it may be the case that some leaves of the decision tree that correspond to the same graph will have different sampling weights. However, equation (4) is still valid for estimating population averages.
A natural generalization of this method is to choose the branches of the decision tree non-uniformly. This gives an opportunity to reduce the bias of the sampling. If each branch were chosen with probability proportional to the number of leaves it contains, then the sampling would be uniform. While computing the exact number of leaves is a difficult combinatorial problem that may not be efficiently solvable, the sampling can be improved through heuristic choices of the branch probabilities. This idea is explored in more detail in [34]. For all the numerical examples discussed in section 5, we weighted the branches of the decision tree using a simple heuristic that is described in Appendix D.
Here, we choose to work with the decision tree of the exhaustive generation algorithm described above and illustrated in figure 4: take the stubs one-by-one, in order, processing all stubs of a vertex before moving on to the next, and consider all possible ways each stub can be connected. This decision tree has branches at each branching point of each level, where denotes the number of edges in the constructed graph. Since the stubs of a vertex are indistinguishable, only of these branches are distinct. Thus, enumerating each individual branch and testing it for feasibility becomes computationally tractable. Since there are levels in the tree, the sampling algorithm performs feasibility checks during the construction of a graph. For each branch, we must perform two checks: one of graphicality (or multigraphicality) and one of potential connectedness. In the following, we show that both of these checks can be done in constant computational time on average. Therefore, in summary, the computational time required to generate one sample is , where is the number of vertices and is the number of edges of the generated graph.
The constraint of graphicality. When examining the feasibility of a branch, first we must determine if it leads to any simple graphs. This check is similar to the usual graphicality test, with an important difference: Suppose that some stubs of vertex (the “hub vertex”) have already been connected to vertices , but it still has some free stubs. In order to obtain a simple graph, a second connection is not allowed to the vertices in the set . This restriction is referred to as a star constraint on , as the connections from to the elements of form a star graph. To check graphicality under this constraint, we use the following theorem:
Let be a degree sequence and let , with , be an “exclusion set” of vertices to which we forbid connections from vertex 1. Let us connect all stubs of vertex 1 to the largest-degree vertices not present in , obtaining the remaining degree sequence . The degree sequence can be realized by a simple graph respecting the exclusion set if and only if is graphical.
Notice that this is a generalization of the Havel–Hakimi theorem, which corresponds to the special case of , i.e. no exclusion. The graphicality of can be tested using the Erdős–Gallai theorem, making the entire test possible in computational time. In principle, theorem 8 could be used to test each branch of the decision tree separately, but this would not be efficient. A more sophisticated method is presented in [31], where it is shown that there exists a threshold degree that separates feasible branches from non-feasible ones. Connecting to a vertex with degree preserves graphicality while connecting to one with does not. may be determined in time, thus testing the graphicality of individual branches becomes constant time on average. For a detailed description of this testing procedure, we refer the reader to [31].
The constraint of multigraphicality. If we wish to sample loopless multigraphs instead of simple graphs, theorem 2 can be used directly. This requires computing the degree sum, as well as the maximum degree. Instead of recomputing these quantities at each step, their values can be updated incrementally in amortized constant time after the addition of each new edge.
The constraint of connectivity. In order to incorporate the constraint of connectivity, we must find a way to detect branches of the decision tree which do not lead to any connected graphs. In other words, we must detect when adding a specific connection would make it impossible to build a connected graph. We do this by tracking the groups of vertices (components) which have so far been connected (figure 5). These components can also be thought of as the nodes of a multigraph, which we term the “supergraph”. We refer to the components as “supernodes”. Then the construction process can be completed to a connected graph if and only if the supergraph is potentially connected.
Let be a degree sequence of supernodes, and let be the degree sequence of the vertices making up these supernodes. If is graphical (resp. multigraphical), the number of edges satisfies and or , then there is a simple graph (resp. loopless multigraph) realization of in which the supernodes form a connected graph.
The proof is given in Appendix A.2.
Note that there are two ways in which the potential connectivity of the supergraph can be broken: (1) there may no longer be a sufficient number of edges left to make the graph connected, i.e. , or (2) one of the supernodes (components) may become “closed”, i.e. its degree may become zero before the graph is fully constructed. To check whether adding a connection would give rise to either of these two conditions, we must consider several cases: If there is only one supernode, then the graph is already connected, therefore all connections are allowed. Otherwise, if , then only connections between different supernodes are allowed. Two supernodes with degree 1 each may not connect to each other, and a supernode with degree 2 may not connect to itself, except as the very last step that completed the graph. To check for these cases, we must determine if the two vertices to be connected are within the same supernode. This can be done in constant amortized time, as described in Appendix C.
To demonstrate the practical applicability of our proposed sampling method for connected graphs, we performed numerical experiments on degree sequences sampled from a power-law distribution. Networks with similarly heavy-tailed degree distributions commonly occur in the real world [4, 5]. The exponent of the power-law distribution was adjusted so as to obtain a degree sequence which, while potentially connected, has overwhelmingly many non-connected realizations. Sampling its connected realizations is therefore not feasible at all with the configuration model: in practice it never generates any connected samples. Thus, we compare results with MCMC samplers.
Figure 6a shows one typical simple connected realization of such a degree sequence. This degree sequence was used to generate the results shown in the subsequent panels of the same figure. We chose assortativity, a measure of degree correlations [12], as the graph property to study. Figure 6b illustrates how the value of this measure develops while running an edge-switching MCMC sampler for simple connected graphs. Two trajectories are shown: one starting with a high- and one with a low-assortativity graph. In this experiment, at least 1500-2000 edge switches were needed before the two trajectories converged, an indicator of reaching statistical independence. Based on this, in the following numerical experiments 2500 steps were performed between taking samples from the Markov chain. In general, the number of steps which are required to guarantee a given level of independence cannot be determined exactly—this is precisely the problem that the biased stub-matching sampler introduced in this work is meant to overcome.
Figure 6c compares the distribution of assortativity estimated using the MCMC sampler (blue curves) with the one obtained using the biased stub-matching sampler (yellow/brown curves), and demonstrates that both methods produce the same result. This validates our implementation of the method. The histogram of a biased sample is formed not by counting the number of data points in each bin, but by adding up their inverse sampling weights. The result shown in panel figure 6c comes from three separate experiments: In the first, only connected realizations were sampled. In the second, connectivity was not constrained. In the third, connectivity was also not constrained, but assortativity was measured only on the largest connected component (the “giant component”) of the graph. We included the third case because retaining only the giant component is often used as a practical substitute for incorporating a connectivity constraint into random graph models. The assortativity distributions are markedly different for all three cases, demonstrating the importance of taking connectivity into account when the problem at hand demands it. We note that with some degree distributions, simply taking the giant component of non-connected samples produces results similar to enforcing connectedness. However, as figure 6c demonstrates, with some other degree sequences there can be a significant difference.
Connected realizations | All realizations | |||
---|---|---|---|---|
MCMC | biased | MCMC | biased | |
mean | ||||
std. dev. | ||||
skewness | ||||
kurtosis |
The first four statistical moments of the assortativity distributions shown in
figure 6c, as estimated with MCMC and with the biased stub-matching sampler. Standard errors obtained with bootstrapping and are indicated in parentheses.
Estimates of four statistical moments of the distributions—their mean, standard deviation, skewness and kurtosis—are reported in
table 1 along with their standard errors. We note that the number of samples required for an accurate estimate of statistical quantities is larger when using biased sampling than with uniform sampling. This is not dissimilar from how the effective sample size of the correlated output of an MCMC sampler is also smaller than the number of generated data points. Therefore, when generating the histograms in figure 6c, we took 10 000 samples from the Markov chain (at intervals of 2500 steps) and 100 000 samples from the biased sampler. In the case of the biased sampler described here, the distribution of sample weights is typically bell-shaped on a logarithmic scale, as shown in figure 6d. This is expected, since sample weights are the inverse products of the number of feasible branches encountered at each level while traversing the decision tree. If the number of branches were random, the distribution of weights would be log-normal according to the central limit theorem.
Finally, as an example application of the method, we investigate the properties of two connected real-world networks by comparing them to a null model with degree and connectivity constraints (figure 7). Both of these networks are sufficiently sparse so that most realizations of their degree sequences are disconnected. Therefore, the connectivity constraint cannot be handled with simple rejection. The first network is the equivalenced representation of the Western US power network, from the Harwell–Boeing sparse matrix collection [35]. As a power grid, it is naturally connected. We investigate its global efficiency, defined as the average of the inverse of pairwise shortest path lengths between its vertices [36]. Figure 7a shows that the efficiency of this network is significantly lower than that of typical realizations of its degree sequence. This hints at the existence of another dominant constraint, which we surmise to be the spatially embedded nature of the network. Typical connected realizations have higher efficiency than non-connected ones. As the second example, we investigate the degree assortativity in the largest connected component of the protein-protein interaction network of the yeast Saccharomyces cerevisiae [37, 38]. Random networks with the same degrees become more disassortative (have higher negative assortativity) when forced to be connected, but still not as disassortative as the empirical network. This shows that high disassortativity is a special property of this network.
In this paper we considered the problem of constructing a single realization of a connected graph with a given degree sequence, as well as random sampling from the set of all connected realizations. We tackled both the case of simple graphs, as well as loopless multigraphs. The main contribution of this work is incorporating the constraint of connectivity.
Building a not-necessarily-connected realization of a degree sequence as a simple graph can be accomplished using the well-known Havel–Hakimi construction. Until now, the usual method to construct a connected realization was to first build an arbitrary realization, then rewire its edges to make it connected. This method is complicated and cumbersome to implement. With theorem 4, we show that a specific variant of the Havel–Hakimi construction is guaranteed to produce a connected realization, if one exists. Furthermore, in theorem 7
we generalize the construction to the case of loopless multigraphs. This provides a simple and elegant algorithm for building connected graphs with given degrees. We contributed an implementation of these algorithms to the open-source igraph library
[28] and its Mathematica interface, IGraph/M [29].We have also extended a new family of biased-sampling stub matching methods so that they incorporate the constraint of connectivity without a performance penalty, allowing for fast, efficient rendering of null models and random sampling. Indeed, our approach is significantly faster than the configuration model, which is simply infeasible to use in some regimes of degree sequences. Unlike edge-switching MCMC methods, the mixing time of which are not currently known, our method suffers no uncertainty or ambiguity in the independence of the samples. This is of particular importance, again, for the rendering of reliable null models that faithfully represent generic networks of a certain type. An implementation of our sampling method is made freely available at https://github.com/szhorvat/ConnectedGraphSampler. Finally, we have demonstrated these methods both on generated scale-free degree sequences, as well as on degree sequences of real-world networks. The connected realizations of all of these are markedly different from the non-connected ones, illustrating the relevance of the connectivity constraint. In all these examples, the use of the configuration model would have been simply infeasible.
We reiterate that these approaches are crucially important due to the pressing need for efficient, appropriate null models across the network and complexity sciences. While the general problem of multi-constraint null model construction and random sampling in random graph models remains open, connectedness is such a ubiquitous feature of real networks and graphs of potential interest that we hope our simple and powerful approach to building connected null models and performing random sampling will find wide applicability. Ultimately, reaching a state in which validation of new findings against numeric control “experiments” is the standard must be a critical goal for the field as a whole, and further progress in multi-constraint sampling is the only way forward.
In order to avoid self-loops, no degree may be larger than the sum of all other degrees: . This condition can be rewritten as . Therefore, is a necessary condition of multigraphicality. Now we show it is also sufficient, assuming that the degree sequence has an even sum. Any even-sum degree sequence can be realized as a graph that contains self-loops. Let us consider such a loopy realization and assume that vertex , with degree , has at least one self-loop. This self-loop may be eliminated by performing a degree-preserving edge switch (figure 2) between it and another edge that is not incident to . Such a non-incident edge exists because the total number of edges in the graph is , while has at most incident edges due to its self-loop. Therefore, all self-loops can be eliminiated using a sequence of degree-preserving edge switches. ∎
The condition of potential connectedness and its proof are well-known, and generally found in graph theory textbooks. We reproduce the proof here because of its relevance to lemma 9, which is a generalization of lemma 5. The requirement that there be no zero-degree (i.e. isolated) vertices is trivial, so from here on we assume that all degrees are positive. The number of edges in a graph with degree sequence is , thus we must show that potential connectivity is equivalent to having edges. First we show necessity. The smallest connected graph on vertices is a tree, and has precisely edges. This is because any tree with can be extended from a smaller tree by connecting a vertex to it with a single edge. Similarly, it is easy to see that a connected graph with more than edges must contain an edge whose removal does not disconnect it, called a cycle edge.
Now we show that any non-connected graph with at least edges can be transformed into a connected one by a sequence of degree-preserving edge switches (figure 2). Let us assume that we have a realization of with two connected components, having vertex counts and edge counts . Since , and since both components are connected implying and , one of the two must satisfy or . Thus, one of the two components has a cycle edge. Switching this cycle edge with an arbitrary edge in the other component will connect the two components together, without creating multi-edges. Therefore, given a non-connected realization of a degree sequence satisfying , we can connect its components together pair-by-pair, using degree-preserving edge switches. This concludes the proof of sufficiency. ∎
Remark. Potential connectedness is independent from graphicality and multigraphicality. These two properties do not constrain each other in the sense that the existence of a connected realization is the same regardless of whether we consider simple graphs, loopless multigraphs, or loopy multigraphs. This is not true for the related concept of forcibly connectedness: a degree sequence that only has connected simple realizations may have non-connected multigraph realizations. An example is , which is forcibly connected over simple graphs, but not over multigraphs.
Lemma 9 is completely analogous to lemma 5 with the difference that connectedness is considered over the supernodes (with degrees ), while graphicality (or multigraphicality) is considered over the vertices (with degrees ). The proof is analogous as well: we consider a simple graph (or loopless multigraph) realization of . This is also a realization of , although usually a non-simple one. It can be shown that there is a sequence of edge switches that preserve both and and make the supergraph connected. ∎
In this appendix we formulate an edge-switching MCMC method for sampling graphs, and show that it samples uniformly. One must take great care when formulating such a method, as uniformity hinges on seemingly small details of the algorithm. In our experience, similar methods are frequently implemented incorrectly, resulting in non-uniform sampling.
The algorithm for propagating the Markov chain is as follows:
Choose two distinct edges and of the graph , each uniformly at random.
Switch the edge-pair into either , or (figure 2), with probability each, obtaining the graph . The edge switch may potentially create self-loops (if the edges were not disjoint) or multi-edges.
If satisfies our chosen constraints (e.g. simple graph, loopless multigraph, connected graph, etc.), choose it as the next state of the Markov chain. Otherwise, the next state will be , the same as the current one.
In order for sampling to be uniform, it is necessary that the Markov chain be irreducible, i.e. that all graphs are reachable with edge switches that preserve our chosen constraints. This is true for simple graphs, loopless multigraphs, as well as for the connected version of both [18]. Notice that detailed balance is fulfilled because edge switches are reversible. Finally, the number of available edge-pair choices in step (1) is the same in all states, therefore each choice in each state is made with the same probability. This implies that the transition probability matrix of the Markov chain is symmetric, ensuring that its stationary distribution is uniform.
We note that if in step (1) only those edge-pairs are selected whose switching would maintain the constraints (i.e. keep the graph simple or loopless), preventing the Markov chain from staying in the same state for more than one step, then the sampling may not be uniform. For example, one may be tempted to only choose disjoint pairs of edges, as this would avoid the creation of self-loops and speed up the procedure. It turns out that with this choice, the sampling of simple graphs would still be uniform, but the sampling of loopless multigraphs would not. This is because the number of disjoint edge-pairs in a simple graph only depends on its degree sequence, and can be computed as the total number of edge-pairs minus the number of non-disjoint edge-pairs at each vertex:
However, this is no longer true for loopless multigraphs, in which two distinct vertices may be incident to the same (parallel) edge-pair. In a loopless multigraph, the number of disjoint edge-pairs depends not only on the graph’s degree sequence, but also on its specific structure. The transition matrix is no longer symmetric in general, and the sampling is not uniform. If in step (1) we entirely avoid edge-switches that would create a non-simple graph, then the sampling of simple graphs would not be uniform either.
During the process of graph construction that we use for sampling connected graphs, we must track the evolution of connected components, i.e. that of the “supernodes” and their degrees. Here we describe a possible efficient implementation of connectivity tracking.
To maintain the supergraph’s degree sequence, we need to be able to efficiently check if two vertices belong to the same supernode, i.e. if they are in the same connected component, at any point during the graph construction. This information can be encoded into a rooted forest whose nodes are identical to the vertices of our graph. Each tree in the forest corresponds to a component. To determine if two vertices are in the same component, we can simply find the roots of their trees and check if they are the same. This way, each supernode will be represented by a tree root, and any associated information (such as the supernode’s degree) can be stored directly in the root. When two vertices of the graph are connected to each other, the forest is updated as follows: If the two vertices were in the same component, nothing needs to be done. If they were in different components, then the two corresponding trees are joined by connecting their roots, and designating one of them as the root of the newly created larger tree.
Finding which component (supernode) a vertex belongs to requires walking up its tree until the root is reached. This takes as many operations as the distance of that vertex from the root. To speed up subsequent queries, we take all vertices on the path from the original vertex to the root, and make them direct descendants of the root. This makes checking if two vertices are in the same component a constant-time operation on average. To see why, consider that while building the graph, the component of each vertex will be checked before adding a connection. This reduces the depth of each tree in the forest to one. The subsequent joining of trees can thus never create a tree depth greater than two, i.e. no component check will take more than two operations.
We employ two simple heuristics to reduce the bias of the sampling distribution: (1) re-ordering the degree sequence and (2) choosing branches of the decision tree non-uniformly.
Note that the structure of the decision tree depends on the order in which vertices are connected up, i.e. the ordering of the degree sequence. We observed empirically that when using the connectivity constraint, an increasing ordering of degrees produces a narrower weight distribution, i.e. results in “more uniform” sampling. This is consistent with the intuition described in section 3: connecting small degree vertices to larger degree ones favours creating a connected graph.
In the most basic version of the sampling algorithm, each feasible branch of the decision tree is chosen with the same probability, i.e. allowed stubs are picked uniformly. This is equivalent to picking vertices with probability proportional to their degrees, . We introduce a simple one-parameter heuristic to choose branches non-uniformly: pick vertices with probability proportional to , or equivalently, pick stubs with probability proportional to . The parameter effectively tunes the affinity of connecting to high- versus low-degree vertices. corresponds to uniform stub choice. This choice of weighting the branches of the decision tree is purely heuristic, and is motivated both by its simplicity and the observation that both graphicality and connectivity are affected by a preference to choose larger or small degrees (see sections 3 and 4). For a more detailed exploration of branch weighting, see [34].
The parameter must be adjusted to reduce the bias of the sampler as much as possible. We do this based on the observation that the bias manifests itself in two important ways. First, the distribution of the sampling weights (figure 6d
) has a large variance. If sampling were uniform, its variance would be zero. Therefore,
could be chosen so as to minimize the variance of the sampling weight distribution. Second, when measuring a certain graph property such as assortativity, the biased sampler may produce property values that should be common with a vanishingly low probability. Figure A8a shows the biased distributions of assortativity values obtained with various different choices of (blue, yellow, red) and compares it to the values obtained with a non-biased MCMC sampler (grey). Notice that the biased distribution obtained with (blue) overlaps with the non-biased one only partially, and, for the sample size used here, includes almost no values lower than . Therefore, the bias cannot be effectively corrected without increasing the sample size significantly. However, the range of values frequently produced by the biased sampler may be adjusted through : increasing shifts assortativity values to a lower range (figure A8a, red and yellow). In the spirit of importance sampling, we choose to sample “important” values with high probability, i.e. maximize the overlap of the biased distribution with the non-biased one, and thus minimize the amount of bias correction that is necessary. How can this be achieved without knowing the non-biased distribution a priori? Notice that bias correction will cause a shift in the range of values only if there is a correlation between the values and the sampling weights. Figure A8b shows their joint distributions: the correlation is negative for , positive for and mostly vanishes for . When the distributions are unimodal, as is typically the case, the lowest correlation can be achieved by minimizing the mean logarithmic sampling weight, i.e. finding the minimum of the black dashed curve in figure A8b. Notice that this may be done without reference to any particular graph measure, such as assortativity. In the examples considered here, we observed that minimizing the mean of the logarithmic sampling weights also reduced their variance. In summary, minimizing either the variance or the mean of the logarithmic sampling weight distribution are practical ways to improve the performance of the sampling method.For all examples presented here, we used the Kiefer–Wolfowitz stochastic approximation algorithm to find the optimal . The values used for figure 6 were when sampling from the connected realizations of the degree sequence and when sampling from all realizations. The degree sequence was ordered increasingly in both cases.
Comments
There are no comments yet.