Connectivity in Random Annulus Graphs and the Geometric Block Model

04/12/2018 ∙ by Sainyam Galhotra, et al. ∙ University of Massachusetts Amherst

Random geometric graphs are the simplest, and perhaps the earliest, random graph model of spatial networks, introduced by Gilbert in 1961. In the most basic setting, a random geometric graph G(n,r) has n vertices. Each vertex of the graph is assigned a real number in [0,1] randomly and uniformly. There is an edge between two vertices if the corresponding two random numbers differ by at most r (to mitigate the boundary effect, let us consider the Lee distance here, d_L(u,v) = min{|u-v|, 1-|u-v|}). It is well-known that the connectivity threshold regime for random geometric graphs is at r ≈ log n/n. In particular, if r = a log n/n, then a random geometric graph is connected with high probability if and only if a > 1. Consider G(n, (1+ϵ) log n/n) for any ϵ > 0 to satisfy the connectivity requirement and delete half of its edges which have distance at most log n/(2n). It is natural to believe that the resultant graph will be disconnected. Surprisingly, we show that the graph still remains connected! Formally, generalizing random geometric graphs, we define a random annulus graph G(n, [r_1, r_2]), r_1 < r_2, with n vertices. Each vertex of the graph is assigned a real number in [0,1] randomly and uniformly as before. There is an edge between two vertices if the Lee distance between the corresponding two random numbers is between r_1 and r_2, 0 < r_1 < r_2. Let us assume r_1 = b log n/n and r_2 = a log n/n, 0 < b < a. We show that this graph is connected with high probability if and only if a - b > 1/2 and a > 1. That is, G(n, [0, 0.99 log n/n]) is not connected but G(n, [0.50 log n/n, (1+ϵ) log n/n]) is. This result is then used to give improved lower and upper bounds on the recovery threshold of the geometric block model.


1 Introduction

Models of random graphs are ubiquitous, with Erdős-Rényi graphs (Erdős and Rényi, 1959; Gilbert, 1959) at the forefront. Studies of the properties of random graphs have led to many fundamental theoretical observations as well as many engineering applications. In an Erdős-Rényi graph $G(n,p)$, the randomness lies in how the edges are chosen: each possible pair of vertices forms an edge independently with probability $p$. It is also possible to consider models of graphs where randomness lies in the vertices.

Vertex Random Graphs.

Keeping up with the simplicity of the Erdős-Rényi model, let us define a vertex-random graph (VRG) in the following way. Given two reals $0 \le r_1 \le r_2 \le 1/2$, the vertex-random graph $\mathrm{VRG}(n,[r_1,r_2])$ is a random graph with $n$ vertices. Each vertex $u$ is assigned a random number $X_u$ selected randomly and uniformly from $[0,1]$. Two vertices $u$ and $v$ are connected by an edge if and only if $r_1 \le d_L(X_u,X_v) \le r_2$, where $d_L(X_u,X_v)$ can be taken to be the absolute difference $|X_u - X_v|$; however, to curtail the boundary effect we define $d_L(X_u,X_v) \equiv \min\{|X_u - X_v|,\, 1 - |X_u - X_v|\}$ here.

This definition is by no means new. For the case of $r_1 = 0$, this is the random geometric graph (RGG) in one dimension. Random geometric graphs were first defined by Gilbert (1961) and constitute the first and simplest model of spatial networks. The definition of VRG has been previously mentioned in (Dettmann and Georgiou, 2016). The interval $[r_1, r_2]$ is called the connectivity interval in VRG. Random geometric graphs have several desirable properties that model real human social networks, such as vertices with high modularity and the degree associativity property (high degree nodes tend to connect). This has led RGGs to be used as models of disease outbreak in social networks (Eubank et al., 2004) and flow of opinions (Zhang et al., 2014). RGGs are an excellent model for wireless (ad-hoc) communication networks (Dettmann and Georgiou, 2016; Haenggi et al., 2009). From a more mathematical standpoint, RGGs act as a bridge between the theory of classical random graphs and that of percolation (Bollobás, 2001, 2006). Recent works on RGGs also include hypothesis testing between an Erdős-Rényi graph and a random geometric graph (Bubeck et al., 2016).

Threshold properties of Erdős-Rényi graphs have been at the center of much theoretical interest, and in particular it is known that many graph properties exhibit sharp phase transition phenomena (Friedgut and Kalai, 1996). Random geometric graphs also exhibit similar threshold properties (Penrose, 2003).

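Consider a $\mathrm{VRG}(n,[r_1,r_2])$ defined above with $r_1 = 0$ and $r_2 = a\log n/n$. It is known that this graph is connected with high probability if and only if $a > 1$ (that is, $\mathrm{VRG}(n,[0,(1+\epsilon)\log n/n])$ is connected for any $\epsilon > 0$; we will ignore this $\epsilon$ and just mention the connectivity threshold as $\log n/n$). Now let us consider the graph $\mathrm{VRG}(n,[\delta\log n/n, \log n/n])$ for some $\delta > 0$. Clearly this graph has fewer edges than $\mathrm{VRG}(n,[0,\log n/n])$. Is this graph still connected? Surprisingly, we show that the above modified graph remains connected as long as $\delta \le 0.5$. Note that, on the other hand, $\mathrm{VRG}(n,[0,a\log n/n])$ is not connected for any $a < 1$.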

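To elaborate, consider a $\mathrm{VRG}(n,[r_1,r_2])$ when $r_1 = b\log n/n$ and $r_2 = a\log n/n$. We show that when $b > 0$, the vertex-random graph is connected with high probability if and only if $a - b > 0.5$ and $a > 1$. For example, $\mathrm{VRG}(n,[0, 0.99\log n/n])$ is not connected with high probability, whereas $\mathrm{VRG}(n,[0.50\log n/n, (1+\epsilon)\log n/n])$ is connected. For a depiction of the connectivity regime for the vertex-random graph see Figure 1.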

Figure 1: The shaded area in the $a$-$b$ plot shows the regime where a VRG $\mathrm{VRG}(n,[b\log n/n, a\log n/n])$ is connected with high probability.

Can we explain this seemingly curious shift in connectivity interval, when one goes from $b = 0$ to $b > 0$? Compare the $\mathrm{VRG}(n,[b\log n/n, a\log n/n])$, $b > 0$, with the $\mathrm{VRG}(n,[0, a\log n/n])$. The former can be thought of as being obtained by deleting all the ‘short-distance’ edges from the latter. It turns out the ‘long-distance’ edges are sufficient to maintain connectivity, because they can connect points over multiple hops in the graph. Another possible explanation is that the connectivity threshold for VRG is not dictated by isolated nodes as is the case in Erdős-Rényi graphs. Thus, after the connectivity threshold has been achieved, removing certain short edges still retains connectivity.

The Geometric Block Model.

We are motivated to study the threshold phenomena of vertex-random graphs because they appear naturally in the analysis of the geometric block model (GBM) (Galhotra et al., 2018). The geometric block model is a probabilistic generative model of communities in a variety of networks and is a spatial analogue to the popular stochastic block model (SBM) (Holland et al., 1983; Dyer and Frieze, 1989; Decelle et al., 2011; Abbe and Sandon, 2015; Abbe et al., 2016; Hajek et al., 2015; Chin et al., 2015; Mossel et al., 2015). The SBM generalizes the Erdős-Rényi graphs in the following way. Consider a graph $G(V,E)$, where $V = V_1 \sqcup V_2 \sqcup \dots \sqcup V_k$ is a disjoint union of $k$ clusters denoted by $V_1, \dots, V_k$. The edges of the graph are drawn randomly: there is an edge between $u \in V_i$ and $v \in V_j$ with probability $q_{i,j}$, $1 \le i,j \le k$. Given the adjacency matrix of such a graph, the task is to find exactly (or approximately) the partition of $V$.

This model has been incredibly popular both in theoretical and practical domains of community detection. Recent theoretical works focus on characterizing the sharp threshold of recovering the partition in the SBM. For example, when there are only two communities of exactly equal sizes, the inter-cluster edge probability is $b\log n/n$ and the intra-cluster edge probability is $a\log n/n$, it is known that exact recovery is possible if and only if $\sqrt{a} - \sqrt{b} \ge \sqrt{2}$ (Abbe et al., 2016; Mossel et al., 2015). The regime of the probabilities being $\Theta(\log n/n)$ has been put forward as one of the most interesting ones, because in an Erdős-Rényi random graph this is the threshold for graph connectivity (Bollobás, 1998). Note that the results are not only of theoretical interest: many real-world networks exhibit a “sparsely connected” community feature (Leskovec et al., 2008), and any efficient recovery algorithm for sparse SBM has many potential applications.

While SBM is a popular model (because of its apparent simplicity), many aspects of real social networks, such as the “transitivity rule” (‘friends having common friends’) inherent to many social and other community structures, are not accounted for in SBM. Defining a block model over a random geometric graph, the geometric block model (GBM), circumvents this, since GBM naturally inherits the transitivity property of a random geometric graph. In a previous work (Galhotra et al., 2018), we showed that GBM models community structures better than an SBM in many real world networks (e.g. DBLP, Amazon purchase network etc.). The GBM depends on the basic definition of the random geometric graph in the same way the SBM depends on Erdős-Rényi graphs. The two-cluster GBM with vertex set $V = V_1 \sqcup V_2$ is a random graph defined in the following way. Suppose $r_s > r_d$ are two real numbers. For each vertex $u \in V$, randomly and independently choose a number $X_u \in [0,1]$ according to the uniform distribution. There will be an edge between $u$ and $v$ if and only if

$$d_L(X_u, X_v) \le \begin{cases} r_s & \text{if } u \text{ and } v \text{ are in the same part},\\ r_d & \text{otherwise}. \end{cases}$$

Let us denote this random graph as $\mathrm{GBM}(r_s, r_d)$. Given this graph, the main problem of community detection is to find the parts $V_1$ and $V_2$. It has been shown in (Galhotra et al., 2018) that GBM accurately represents (more so than SBM) many real world networks. Given a geometric random graph, our main objective is to recover the partition (i.e., $V_1$ and $V_2$).

Motivated by SBM literature, we here also look at GBM in the connectivity regime, i.e., when $r_s = a\log n/n$ and $r_d = b\log n/n$. Our first contribution in this part is to provide a lower bound that shows that it is impossible to recover the parts from $\mathrm{GBM}(a\log n/n, b\log n/n)$ when $a - b < 1/2$. We also derive a relation between $a$ and $b$ that defines a sufficient condition of recovery in $\mathrm{GBM}(a\log n/n, b\log n/n)$ (see Theorem 4). To analyze the proposed algorithm, we crucially use the results obtained for the connectivity of vertex-random graphs.

It is possible to generalize the GBM to include different distributions, different metric spaces and multiple parts. It is also possible to construct other types of spatial block models, such as the one very recently put forward in (Sankararaman and Baccelli, 2018), which rely on the random dot product graphs (Young and Scheinerman, 2007). In (Sankararaman and Baccelli, 2018), edges are drawn between vertices randomly and independently as a function of the distance between the corresponding vertex random variables. In contrast, in GBM edges are drawn deterministically given the vertex random variables, and edges are dependent unconditionally.

(Sankararaman and Baccelli, 2018) also considers the recovery scenario where, in addition to the graph, the values of the vertex random variables are provided. In GBM, we only observe the graph. In particular, it will be clear later that if we are given the random variables (locations) corresponding to the vertices in addition to the graph, then recovery of the partition in $\mathrm{GBM}(a\log n/n, b\log n/n)$ is possible if and only if $a - b > 0.5$ and $a > 1$.

VRG in Higher Dimension: The Random Annulus Graphs.

It is natural to ask similar questions of connectivity for VRGs in higher dimension. In a VRG at dimension $t$, we may assign $t$-dimensional random vectors to each of the vertices, and use a standard metric such as the Euclidean distance to decide whether there should be an edge between two vertices. Formally, let us define the $t$-dimensional sphere as $S^t \equiv \{x \in \mathbb{R}^{t+1} : \|x\|_2 = 1\}$. Given two reals $0 \le r_1 \le r_2 \le 2$, the random annulus graph $\mathrm{RAG}_t(n,[r_1,r_2])$ is a random graph with $n$ vertices. Each vertex $u$ is assigned a random vector $X_u$ selected randomly and uniformly from $S^t$. Two vertices $u$ and $v$ are connected by an edge if and only if $r_1 \le \|X_u - X_v\|_2 \le r_2$. Note that, for $t = 1$, an $\mathrm{RAG}_1(n,[r_1,r_2])$ is nothing but a VRG as defined above, where we need to convert the Euclidean distance to the geodesic distance and scale the probabilities by a factor of $2\pi$. The $\mathrm{RAG}_t(n,[0,r])$ gives the standard definition of random geometric graphs in $t$ dimensions (for example, see (Bubeck et al., 2016) or (Penrose, 2003)).

We give the name random annulus graph (RAG) because two vertices are connected if one is within an ‘annulus’ centered at the other. For the high dimensional random annulus graphs we extend our connectivity results of $t = 1$ to general $t$. In particular we show that there exists an isolated vertex in the $\mathrm{RAG}_t(n,[b(\log n/n)^{1/t}, a(\log n/n)^{1/t}])$ with high probability if and only if

$$a^t - b^t < \sqrt{\pi}\,(t+1)\,\frac{\Gamma\left(\frac{t+2}{2}\right)}{\Gamma\left(\frac{t+3}{2}\right)},$$

where $\Gamma(\cdot)$ is the gamma function. Computing the connectivity threshold of RAG exactly is highly challenging, and we have to use several approximations of high dimensional geometry. Our arguments crucially rely on VC dimensions of sets of geometric objects such as intersections of high dimensional annuluses and hyperplanes. Overall we find a sufficient condition under which this random annulus graph is connected with high probability (see Theorem 3).

Using the connectivity result for $\mathrm{RAG}_t$, the results for the geometric block model can be extended to high dimensions. The latent feature space of nodes in most networks is high-dimensional. For example, road networks are two-dimensional, whereas the number of features used in a social network may be much higher. In a ‘high-dimensional’ GBM, for any $t > 1$, instead of assigning a random variable from $[0,1]$ we assign a random vector $X_u \in S^t$ to each vertex $u$; two vertices in the same part are connected if and only if their Euclidean distance is less than $r_s$, whereas two vertices from different parts are connected if and only if their distance is less than $r_d$. We show that the algorithm developed for one dimension extends to higher dimensions as well, with nearly tight lower and upper bounds.

In this paper, we consistently refer to the $t = 1$ case for RAG as the vertex-random graph.

The paper is organized as follows. In Section 2, we provide the main results of the paper formally. In Section 3, the sharp connectivity phase transition results for vertex-random graphs are proven (details in Section 6). In Section 4, the connectivity results are proven for high dimensional random annulus graphs (details in Section 7). Finally, in Section 5, a lower bound for the geometric block model as well as the main recovery algorithm are presented (details in Section 8).

2 Main Results

We formally define the random graph models, and state our results here.

Definition 1 (Vertex-Random Graph).

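A vertex-random graph $\mathrm{VRG}(n,[r_1,r_2])$ on $n$ vertices has parameters $n$ and a pair of real numbers $r_1, r_2 \in [0,1/2]$, $r_1 \le r_2$. It is defined by assigning a number $X_i \in \mathbb{R}$ to vertex $i$, $1 \le i \le n$, where the $X_i$s are independent and identical random variables uniformly distributed in $[0,1]$. There will be an edge between vertices $i$ and $j$, $i \ne j$, if and only if $r_1 \le d_L(X_i,X_j) \le r_2$, where $d_L(X_i,X_j) \equiv \min\{|X_i - X_j|,\, 1 - |X_i - X_j|\}$.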
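To make the definition concrete, here is a minimal sketch of sampling a vertex-random graph and checking its connectivity. The function names (`lee_distance`, `build_vrg`, `is_connected`) are illustrative choices, not taken from the paper, and the quadratic-time edge construction is only meant for small experiments.

```python
import math
import random


def lee_distance(x, y):
    """Circular (Lee) distance between two points of [0, 1]."""
    d = abs(x - y)
    return min(d, 1.0 - d)


def build_vrg(points, r1, r2):
    """Adjacency lists: edge (i, j) iff r1 <= d_L(points[i], points[j]) <= r2."""
    n = len(points)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if r1 <= lee_distance(points[i], points[j]) <= r2:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def is_connected(adj):
    """Depth-first search from vertex 0; True iff every vertex is reached."""
    if not adj:
        return True
    seen = {0}
    stack = [0]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)


def sample_vrg(n, a, b, seed=None):
    """Sample n uniform points and use r1 = b log n / n, r2 = a log n / n."""
    rng = random.Random(seed)
    points = [rng.random() for _ in range(n)]
    scale = math.log(n) / n
    return build_vrg(points, b * scale, a * scale)
```

Calling `is_connected(sample_vrg(n, a, b, seed))` for several seeds gives an empirical feel for the connectivity regime, although for moderate `n` the outcome near the threshold is noisy.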

One can think of the random variables $X_i$, $1 \le i \le n$, as uniformly distributed on the perimeter of a circle with radius $\frac{1}{2\pi}$, and of the distance $d_L(\cdot,\cdot)$ as the geodesic distance. It will be helpful to consider vertices as just random points on $[0,1]$. Note that every point has a natural left direction (if we think of them as points on a circle then this is the counterclockwise direction) and a right direction. As a shorthand, for any two vertices $u, v$, let $d(u,v)$ denote $d_L(X_u,X_v)$, where $X_u, X_v$ are the random values corresponding to the vertices $u, v$ respectively. We can extend this notion naturally to denote the distance $d(u,y)$ between a vertex $u$ (or the embedding of that vertex in $[0,1]$) and a point $y \in [0,1]$.

Our main result regarding vertex-random graphs is given in the following theorem. The base of the logarithm is $e$ here and everywhere else in the paper unless otherwise mentioned.

Theorem 1 (Connectivity threshold of vertex-random graphs).

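The $\mathrm{VRG}(n,[b\log n/n, a\log n/n])$ is connected with probability $1 - o(1)$ if $a > 1$ and $a - b > 0.5$. On the other hand, the $\mathrm{VRG}(n,[b\log n/n, a\log n/n])$ is not connected with probability $1 - o(1)$ if $a < 1$ or $a - b < 0.5$.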
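As a small illustration of the two conditions stated in the abstract and in this theorem (connectivity if and only if a - b > 1/2 and a > 1), the following sketch classifies a parameter pair. The function name `vrg_regime` and the "boundary" label for the cases the asymptotic statement leaves open are assumptions made for this example.

```python
def vrg_regime(a, b):
    """Asymptotic regime of VRG(n, [b log n / n, a log n / n]).

    Returns "connected" or "disconnected" (both with high probability),
    or "boundary" when the pair lies exactly on a threshold.
    """
    if not 0 <= b < a:
        raise ValueError("need 0 <= b < a")
    if a > 1 and a - b > 0.5:
        return "connected"
    if a < 1 or a - b < 0.5:
        return "disconnected"
    return "boundary"
```

For instance, the pair (a, b) = (0.99, 0) from the abstract falls in the disconnected regime, while (1.01, 0.5) falls in the connected one.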

For the special case of $b = 0$, the result was known (Muthukrishnan and Pandurangan, 2005; Penrose, 2003); see also (Penrose, 2016). However, note that the case of $b > 0$ is neither a straightforward generalization (i.e., the connectivity region is not defined by $a - b > 1$) nor intuitive.

Definition 2 (The Random Annulus Graph).

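Let us define the $t$-dimensional unit sphere as $S^t \equiv \{x \in \mathbb{R}^{t+1} : \|x\|_2 = 1\}$. A random annulus graph $\mathrm{RAG}_t(n,[r_1,r_2])$ on $n$ vertices has parameters $n$, $t$, and a pair of real numbers $r_1, r_2 \in [0,2]$, $r_1 \le r_2$. It is defined by assigning a vector $X_i \in S^t$ to vertex $i$, $1 \le i \le n$, where the $X_i$s are independent and identical random vectors uniformly distributed in $S^t$. There will be an edge between vertices $i$ and $j$, $i \ne j$, if and only if $r_1 \le \|X_i - X_j\|_2 \le r_2$, where $\|\cdot\|_2$ denotes the $\ell_2$ norm.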
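The definition can be sketched in code as follows. Sampling a uniform point on the sphere by normalizing a standard Gaussian vector is a standard technique and is an assumption of this example rather than something the paper prescribes; the function names are likewise illustrative.

```python
import math
import random


def sample_sphere(t, rng):
    """Uniform point on S^t (a unit vector in R^(t+1)) via a normalized Gaussian."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(t + 1)]
        norm = math.sqrt(sum(c * c for c in g))
        if norm > 0.0:  # guard against the (measure-zero) all-zero draw
            return [c / norm for c in g]


def euclidean(x, y):
    """Euclidean (l2) distance between two vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def rag_edges(points, r1, r2):
    """Edges (i, j), i < j, with r1 <= ||X_i - X_j||_2 <= r2."""
    n = len(points)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if r1 <= euclidean(points[i], points[j]) <= r2
    ]
```

A graph on `n` vertices in dimension `t` is then `rag_edges([sample_sphere(t, rng) for _ in range(n)], r1, r2)` with `rng = random.Random(seed)`.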

When it is clear from the context that we are in high dimensions, we use $d(u,v)$ to denote $\|X_u - X_v\|_2$, or just the $\ell_2$ distance between the arguments.

If we substitute $t = 1$, then $\mathrm{RAG}_1(n,[r_1,r_2])$ is a random graph where each vertex is associated with a random variable uniformly distributed on the unit circle. The distance between two vertices is the length of the chord connecting the random variables corresponding to the two vertices. If the length of the chord is $r$, then the length of the corresponding (smaller) arc is $2\arcsin(r/2)$. If we normalize the circumference of the circle by $2\pi$, we obtain a random graph model that is equivalent to our definition of the vertex-random graphs. Since handling geodesic distances is more cumbersome in higher dimensions, we resort to the Euclidean distance.
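The chord-to-arc conversion on the unit circle is elementary geometry: a chord of length r subtends the smaller arc of length 2 arcsin(r/2). A minimal sketch, with function names chosen only for illustration:

```python
import math


def chord_to_arc(r):
    """Length of the smaller arc of the unit circle cut off by a chord of length r."""
    if not 0.0 <= r <= 2.0:
        raise ValueError("a chord of the unit circle has length in [0, 2]")
    return 2.0 * math.asin(r / 2.0)


def chord_to_normalized_arc(r):
    """Same arc length after rescaling the circumference from 2*pi to 1."""
    return chord_to_arc(r) / (2.0 * math.pi)
```

For small r the arc and the chord nearly coincide, which is why the two models agree in the sparse regime up to the factor 2π.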

We derived the following results about the existence of isolated vertices in random annulus graphs.

Theorem 2 (Zero-One law for Isolated Vertex in RAG).

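For a random annulus graph $\mathrm{RAG}_t(n,[r_1,r_2])$ where $r_1 = b(\log n/n)^{1/t}$ and $r_2 = a(\log n/n)^{1/t}$, there exist isolated nodes with probability $1 - o(1)$ if

$$a^t - b^t < \sqrt{\pi}\,(t+1)\,\frac{\Gamma\left(\frac{t+2}{2}\right)}{\Gamma\left(\frac{t+3}{2}\right)},$$

where $\Gamma(\cdot)$ is the gamma function, and there does not exist an isolated vertex with probability $1 - o(1)$ if $a^t - b^t$ exceeds the same quantity.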
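The following sketch evaluates a candidate isolated-vertex threshold numerically. It is a first-moment heuristic under stated assumptions, not the paper's derivation: assume $r_1 = b(\log n/n)^{1/t}$ and $r_2 = a(\log n/n)^{1/t}$, approximate a small spherical cap of radius $r$ on $S^t$ by a flat $t$-ball of volume $\pi^{t/2} r^t/\Gamma(t/2+1)$, and ask when the expected number of isolated vertices stops growing. This gives the constant $\sqrt{\pi}(t+1)\Gamma(\frac{t+2}{2})/\Gamma(\frac{t+3}{2})$; the function name is hypothetical.

```python
import math


def isolated_vertex_threshold(t):
    """Heuristic threshold on a**t - b**t below which isolated vertices are expected.

    Derived from flat-cap volume estimates on the unit sphere S^t; treat it as
    an assumption of this sketch rather than a statement quoted from the paper.
    """
    if t < 1:
        raise ValueError("dimension t must be at least 1")
    return math.sqrt(math.pi) * (t + 1) * math.gamma((t + 2) / 2) / math.gamma((t + 3) / 2)
```

For `t = 1` the value is π, which agrees with the one-dimensional condition a - b < 1/2 once the circumference 2π of the unit circle is normalized to 1.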

An obvious deduction from this theorem is that the random annulus graph is not connected with probability $1 - o(1)$ whenever the isolated-vertex condition of Theorem 2 holds. Our main result here gives a condition that guarantees connectivity in this regime.

Theorem 3.

A dimensional random annulus graph is connected with probability if

All these connectivity results find immediate application in analyzing the algorithm that we propose for the geometric block model (GBM). A GBM is a generative model for networks (graphs) with underlying community structure.

Definition 3 (Geometric Block Model).

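Given $V = V_1 \sqcup V_2$, choose a random variable $X_u$ uniformly distributed in $[0,1]$ for all $u \in V$. The geometric block model $\mathrm{GBM}(r_s, r_d)$ with parameters $r_s > r_d$ is a random graph where an edge exists between vertices $u$ and $v$ if and only if

$$d_L(X_u, X_v) \le \begin{cases} r_s & \text{if } u \text{ and } v \text{ are in the same part},\\ r_d & \text{otherwise}. \end{cases}$$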
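A minimal generator for the two-cluster model can be sketched as follows. The parameter names `r_s` (same-part radius) and `r_d` (different-part radius) and the function names are assumptions of this example; the edge rule is the one described in the introduction, where same-part pairs use the larger radius.

```python
import random


def lee_distance(x, y):
    """Circular (Lee) distance between two points of [0, 1]."""
    d = abs(x - y)
    return min(d, 1.0 - d)


def gbm_edges(points, labels, r_s, r_d):
    """Edges (u, v), u < v: same-part pairs need d_L <= r_s, others d_L <= r_d."""
    n = len(points)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            radius = r_s if labels[u] == labels[v] else r_d
            if lee_distance(points[u], points[v]) <= radius:
                edges.append((u, v))
    return edges


def sample_gbm(n, r_s, r_d, seed=None):
    """Sample locations, split the vertices into two halves, and build the graph."""
    rng = random.Random(seed)
    points = [rng.random() for _ in range(n)]
    labels = [0 if u < n // 2 else 1 for u in range(n)]
    return points, labels, gbm_edges(points, labels, r_s, r_d)
```

The recovery problem then hands only the edge list to the algorithm; `points` and `labels` are kept here just so that a recovered partition can be checked.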

As a consequence of the connectivity lower bound on VRG, we are able to show that recovery of the partition is not possible with high probability in $\mathrm{GBM}(a\log n/n, b\log n/n)$ by any means whenever $a - b < 0.5$ or $a < 1$ (see Theorem 10). Another consequence of the vertex-random graph results is that if, in addition to a GBM graph, all the locations of the vertices are also provided, then recovery is possible if and only if $a - b > 0.5$ and $a > 1$ (formal statement in Theorem 11).

Coming back to the actual recovery problem, our main contribution for GBM is to provide a simple and efficient algorithm that performs well in the sparse regime (see, Algorithm 8.2).

Theorem 4 (Recovery algorithm for GBM).

Suppose we have the graph generated according to . Define

Then there exists an efficient algorithm which will recover the correct partition in the GBM with probability if OR .

Some examples of parameters for which the proposed algorithm (Algorithm 8.2) can successfully recover the partition are given in Table 1.

$b$                   0.01   1      2      3      4      5      6      7
Minimum value of $a$  3.18   8.96   12.63  15.9   18.98  21.93  24.78  27.57
Table 1: Minimum value of $a$, given $b$, for which Algorithm 8.2 resolves clusters correctly in $\mathrm{GBM}(a\log n/n, b\log n/n)$.

As can be anticipated, the connectivity results for RAG apply to the ‘high dimensional’ geometric block model.

Definition 4 (The GBM in High Dimensions).

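Given $V = V_1 \sqcup V_2$, choose a random vector $X_u$ independently and uniformly distributed in $S^t$ for all $u \in V$. The geometric block model $\mathrm{GBM}_t(r_s, r_d)$ with parameters $r_s > r_d$ is a random graph where an edge exists between vertices $u$ and $v$ if and only if

$$\|X_u - X_v\|_2 \le \begin{cases} r_s & \text{if } u \text{ and } v \text{ are in the same part},\\ r_d & \text{otherwise}. \end{cases}$$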

We extend the algorithmic results to high dimensions.

Theorem 5.

There exists a polynomial time efficient algorithm that recovers the partition from with probability if and . Moreover, any algorithm fails to recover the parts with probability at least if or .

3 Connectivity of Vertex-Random Graphs

In this section we give a sketch of the proof of the sufficient condition for connectivity of VRG. The full details, along with the proof of the necessary condition, are given in Section 6.

3.1 Sufficient condition for connectivity of VRG

Theorem 6.

The vertex-random graph $\mathrm{VRG}(n,[b\log n/n, a\log n/n])$ is connected with probability $1 - o(1)$ if $a > 1$ and $a - b > 0.5$.

To prove this theorem we use two main technical lemmas that show two different events happen with high probability simultaneously.

Figure 2: Each vertex having two neighbors on either direction implies the graph is a union of cycles. The cycles can be interleaving in $[0,1]$.

Lemma 1.

A set of vertices is called a cover of , if for any point in there exists a vertex such that . A is a union of cycles such that every cycle forms a cover of (see Figure 2) as long as and with probability .

This lemma also shows effectively the fact that ‘long-edges’ are able to connect vertices over multiple hops. Note that the statement of Lemma 1 would be easier to prove if the condition were $a - b > 1$. In that case what we prove is that every vertex has neighbors (in the VRG) in both the left and right directions. To see this, for each vertex $u$ assign two indicator $\{0,1\}$-random variables $A_u^l$ and $A_u^r$, with $A_u^l = 1$ if and only if there is no node $v$ to the left of node $u$ such that $d(u,v) \in [b\log n/n, a\log n/n]$. Similarly, let $A_u^r = 1$ if and only if there is no node $v$ to the right of node $u$ such that $d(u,v) \in [b\log n/n, a\log n/n]$. Now define $A = \sum_u (A_u^l + A_u^r)$. We have,

$$\Pr(A_u^l = 1) = \Pr(A_u^r = 1) = \Big(1 - \frac{(a-b)\log n}{n}\Big)^{n-1},$$

and,

$$\mathbb{E}[A] = 2n\Big(1 - \frac{(a-b)\log n}{n}\Big)^{n-1} \le 2n^{1-(a-b)}\, e^{(a-b)\log n/n}.$$

If $a - b > 1$ then $\mathbb{E}[A] = o(1)$, which implies, by invoking the Markov inequality, that with high probability every node will have neighbors (connected by an edge in the VRG) on either side. This results in the interesting conclusion that every vertex will lie in a cycle that covers $[0,1]$. This is true for every vertex, hence the graph is simply a union of cycles, each of which is a cover of $[0,1]$. The main technical challenge is to show that this conclusion remains valid even when $a - b > 0.5$, which is proved in Lemma 1 in Section 6.

Lemma 2.

Set two real numbers and . In an , with probability there exists a vertex and nodes to the right of such that and nodes to the right of such that , for . The arrangement of the vertices is shown in Figure 3 (pg. 15).

With the help of these two lemmas, we are in a position to prove Theorem 6. The proofs of the two lemmas are given in Section 6 and contain the technical essence of this section.

Proof of Theorem 6.

We have shown that the two events mentioned in Lemmas 1 and 2 happen with high probability. Therefore they simultaneously happen under the condition and . Now we will show that these events together imply that the graph is connected. To see this, consider the vertices and that satisfy the conditions of Lemma 2. We can observe that each vertex has an edge with and , . This is because (see Figure 3 for a depiction)

Similarly,

This implies that is connected to and for all . The first event implies that the connected components are cycles spanning the entire line . Now consider two such disconnected components, one of which consists of the nodes and . There must exist a node in the other component (cycle) such that is on the right of and . If , (see Figure 8). When , we can calculate the distance between and as

and

Therefore is connected to when If then is already connected to . Therefore the two components (cycles) in question are connected.This is true for all cycles and hence there is only a single component in the entire graph. Indeed, if we consider the cycles to be disjoint super-nodes, then we have shown that there must be a star configuration. ∎

4 Connectivity of High Dimensional Random Annulus Graphs: Proof of Theorem 3

In this section we show a proof sketch of Theorem 3 to establish the sufficient condition of connectivity of random annulus graphs. The details of the proof and the necessary conditions are provided in Section 7.

Note, here and . We show the upper bound for connectivity of a Random Annulus Graphs in dimension as shown in Theorem 3. For this we first define a pole as a vertex which is connected to all vertices within a distance of from itself. In order to prove Theorem 3, we first show the existence of a pole with high probability in Lemma 3.

Lemma 3.

In a , with probability there exists a pole.

Next, Lemma 4 shows that for every vertex and every hyperplane passing through and not too close to the tangent hyperplane at , there will be a neighbor of on either side of the plane. Therefore, there should be a neighbor towards the direction of the pole. In order to formalize this, let us define a few regions associated with a node and a hyperplane passing through .

Informally, and represents the partition of the annulus on either side of the hyperplane and represents the region on the sphere lying on .

Lemma 4.

If we sample nodes from according to , then for every node and every hyperplane passing through such that is not all within distance of , node has a neighbor on both sides of the hyperplane with probability at least provided and .

We found the proof of this lemma to be challenging. Since, we do not know the location of the pole, we need to show that every point has a neighbor on both sides of the plane no matter what the orientation of the plane. Since the number of possible orientations is uncountably infinite, we cannot use a union-bound type argument. To show this we have to rely on the VC Dimension of the family of sets for all hyperplanes (which can be shown to be less than ). We rely on the celebrated result of Haussler and Welzl (1987) (we derived a continuous version of it), see Theorem 9, to deduce our conclusion.

For a node , define the particular hyperplane which is normal to the line joining and the origin and passes through . We now need one more lemma that will help us prove Theorem 3.

Lemma 5.

For a particular node and corresponding hyperplane , if every point in is within distance from , then must be within of .

For now, we assume that the Lemmas 3, 4 and 5 are true and show why these lemmas together imply the proof of Theorem 3.

Proof of Theorem 3.

We consider an alternate (rotated but not shifted) coordinate system by multiplying every vector by a orthonormal matrix such that the new position of the pole is the -dimensional vector where only the first co-ordinate is non-zero. Let the dimensional vector describing a node in this new coordinate system be . Now consider the hyperplane and if is not connected to the pole already, then by Lemma 4 and Lemma 5 the node has a neighbor which has a higher first coordinate. The same analysis applies for and hence we have a path where the first coordinate of every node is higher than the previous node. Since the number of nodes is finite, this path cannot go on indefinitely and at some point, one of the nodes is going to be within of the pole and will be connected to the pole. Therefore every node is going to be connected to the pole and hence our theorem is proved. ∎

5 The Geometric Block Model

In this section, we prove the necessary condition for exact cluster recovery of GBM and give an efficient algorithm that matches that within a constant factor. The details are provided in Section 8.

5.1 Immediate consequence of VRG connectivity

The following lower bound for GBM can be obtained as a consequence of Theorem 8.

Theorem 7 (Impossibility in GBM).

Any algorithm to recover the partition in will give incorrect output with probability if or .

Proof.

Consider the scenario that not only the geometric block model graph was provided to us, but also the random values for all vertex in the graph were provided. We will show that we will still not be able to recover the correct partition of the vertex set with probability at least (with respect to choices of and any randomness in the algorithm).

In this situation, the edge where does not give any new information than . However the edges where are informative, as existence of such an edge will imply that and are in the same part. These edges constitute a vertex-random graph . But if there are more than two components in this vertex-random graph, then it is impossible to separate out the vertices into the correct two parts, as the connected components can be assigned to any of the two parts and the VRG along with the location values () will still be consistent.

What remains to be seen that will have components with high probability if or . This is certainly true when as we have seen in Theorem 8, there can indeed be isolated nodes with high probability. On the other hand, when , just by using an analogous argument it is possible to show that there are vertices that do not have any neighbors on the left direction (counterclockwise). We delegate the proof of this claim as Lemma 23 in the appendix. If there are such vertices, there must be at least disjoint candidates.This completes the proof. ∎

Indeed, when the locations associated with every vertex are provided, it is also possible to recover the partition exactly when $a - b > 0.5$ and $a > 1$, matching the above lower bound exactly (see Theorem 11).

A similar impossibility result extends to the higher dimensional GBM from the necessary condition on connectivity of RAG.

5.2 A recovery algorithm for GBM

We now turn our attention to an efficient recovery algorithm for GBM. Intriguingly, we show that a simple triangle counting algorithm works well for GBM and recovers the clusters in the sparsity regime. Triangle counting algorithms are popular heuristics applied to social networks for clustering (Easley et al., 2012); however, they fail in SBM. Hence, this serves as another validation of why GBMs are well suited to model community structures in social networks.

The algorithm is as follows. Given a graph $G = (V,E)$ with two disjoint parts, generated according to $\mathrm{GBM}(r_s, r_d)$, the algorithm (see Algorithm 8.2) goes over all edges $(u,v) \in E$. It counts the number of triangles containing the edge $(u,v)$ and leaves the edge intact if and only if the number of triangles is not between two specified thresholds. It then returns the connected components of the redacted graph. Having two thresholds is somewhat non-intuitive. We show that two vertices in different parts can only have a number of common neighbors between these two thresholds, and thus all those edges get removed during the first iteration. In this process, some intra-cluster edges also get removed, but using the connectivity property of VRG, we are able to show that the clusters still remain connected.
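The procedure described above can be sketched as follows. This is not the paper's Algorithm 8.2: the threshold values are derived in the paper and are simply passed in here as the hypothetical parameters `low` and `high`, and treating the band as inclusive on both ends is an assumption of this example.

```python
def recover_by_triangle_counts(n, edges, low, high):
    """Drop every edge whose triangle count lies in [low, high]; return components.

    n      -- number of vertices, labelled 0 .. n-1
    edges  -- iterable of undirected edges (u, v)
    low, high -- hypothetical thresholds; an edge is removed iff the number of
                 common neighbors of its endpoints is between them (inclusive)
    """
    edges = list(edges)
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Triangles through (u, v) correspond to common neighbors of u and v,
    # counted in the ORIGINAL graph so that removals do not affect each other.
    kept = [set() for _ in range(n)]
    for u, v in edges:
        common = len(adj[u] & adj[v])
        if not (low <= common <= high):
            kept[u].add(v)
            kept[v].add(u)

    # Connected components of the redacted graph via depth-first search.
    seen = [False] * n
    components = []
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack, comp = [s], []
        while stack:
            x = stack.pop()
            comp.append(x)
            for y in kept[x]:
                if not seen[y]:
                    seen[y] = True
                    stack.append(y)
        components.append(sorted(comp))
    return components
```

On two triangles joined by a single bridge edge, the bridge has no common neighbors while every triangle edge has one, so choosing `low = high = 0` removes only the bridge and returns the two triangles.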

The same algorithm extends to higher dimensions as well, showing irrespective of underlying dimensionality, there exists a good algorithm. The proof here relies on the connectivity of random annulus graphs.

6 Connectivity of Vertex-Random Graphs: Details

In this section, we prove the necessary and sufficient conditions for connectivity of VRG in full detail.

6.1 Necessary condition for connectivity of VRG

Theorem 8 (VRG connectivity lower bound).

The $\mathrm{VRG}(n,[b\log n/n, a\log n/n])$ is not connected with probability $1 - o(1)$ if $a < 1$ or $a - b < 0.5$.

Proof.

First of all, it is known that $\mathrm{VRG}(n,[0, a\log n/n])$ is not connected with high probability when $a < 1$ (Muthukrishnan and Pandurangan, 2005; Penrose, 2003). Therefore $\mathrm{VRG}(n,[b\log n/n, a\log n/n])$ must not be connected with high probability when $a < 1$, as the connectivity interval is a strict subset of the previous case, and it can be obtained from $\mathrm{VRG}(n,[0, a\log n/n])$ by deleting all the edges that have the two corresponding random variables separated by distance less than $b\log n/n$.

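Next we will show that if $a - b < 0.5$ then there exists an isolated vertex with high probability. It would be easier to think of each vertex as a uniform random point in $[0,1]$. Define an indicator variable $A_u$ for every node $u$ which is $1$ when node $u$ is isolated and $0$ otherwise. We have,

$$\Pr(A_u = 1) = \Big(1 - \frac{2(a-b)\log n}{n}\Big)^{n-1}.$$

Define $A = \sum_u A_u$, and hence

$$\mathbb{E}[A] = n\Big(1 - \frac{2(a-b)\log n}{n}\Big)^{n-1} = n^{1 - 2(a-b) - o(1)}.$$

Therefore, when $a - b < 0.5$, $\mathbb{E}[A] \to \infty$. To prove this statement with high probability we can show that the variance of $A$ is bounded. Since $A$ is a sum of indicator random variables, we have that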