The generalised random dot product graph

09/16/2017
by Patrick Rubin-Delanchy, et al.

This paper introduces a latent position network model, called the generalised random dot product graph, comprising as special cases the stochastic blockmodel, mixed membership stochastic blockmodel, and random dot product graph. In this model, nodes are represented as random vectors in R^d, and the probability of an edge between nodes i and j is given by the bilinear form X_i^T I_p,q X_j, where I_p,q = diag(1, ..., 1, -1, ..., -1) has p ones and q minus ones on its diagonal, with p+q=d. As we show, this provides the only possible representation of nodes in R^d such that mixed membership is encoded as the corresponding convex combination of latent positions. The positions are identifiable only up to transformation in the indefinite orthogonal group O(p,q), and we discuss some consequences for typical follow-on inference tasks, such as clustering and prediction.


1 Introduction

While graphs are long-established objects of study in Mathematics and Computer Science, the contemporary proliferation of observable networks has made their analysis relevant in almost every branch of academia, government and industry. Yet translating graph theory into principled statistical procedures has been far from straightforward, producing challenges that have required a number of new insights.

An example pertinent to this paper is the spectral clustering algorithm (Von Luxburg, 2007). Broadly speaking, given an undirected graph, this algorithm first computes the spectral decomposition of the corresponding adjacency or normalised Laplacian matrix. Next, the graph is spectrally embedded into R^d by picking out the d main eigenvectors — in our case scaled according to their corresponding eigenvalue — to obtain a d-dimensional vector representation of each node. Finally, these points are input into a clustering algorithm such as k-means (Steinhaus, 1956) to obtain communities. The most popular justification for this algorithm, put forward by Shi and Malik (2000) based on earlier work by Donath and Hoffman (1973) and Fiedler (1973), is that it solves a convex relaxation of the normalised cut problem. A more principled statistical justification was finally found by Rohe et al. (2011), see also Lei and Rinaldo (2015), showing that the spectral clustering algorithm provides a consistent estimate of the stochastic block model. Their proof, however, required a simple but substantial modification of the algorithm — to use eigenvectors from both the high and the low ends of the spectrum — of great relevance here.

The present paper provides a finer understanding of spectral embedding as a standalone procedure, i.e. without the subsequent clustering step, in terms of a statistical model. We employ a latent position model structure (Hoff et al., 2002), meaning that each node is mapped to a vector X_i in some space, and two nodes i and j connect with probability given by a function f(X_i, X_j), sometimes called a kernel. Aside from its connection to spectral embedding, our proposal can be motivated as making the following geometric interpretation of latent space possible: that a probabilistic mixture of connectivity behaviours should be represented as the corresponding convex combination of latent positions. We find that the only way to achieve this in R^d, up to affine transformation, is to let f(x, y) = x^T I_{p,q} y, where I_{p,q} = diag(1, ..., 1, -1, ..., -1), with p ones followed by q minus ones on its diagonal, and where p and q are two integers satisfying p + q = d. A generalised random dot product graph (GRDPG) model is a latent position model with this choice of kernel.

The vector representations of nodes obtained by spectral embedding can be interpreted as estimates of the latent positions of a GRDPG. We will show that, subject to an unidentifiable transformation described below, the error of any individual latent position estimate obtained by spectral embedding is asymptotically Gaussian with elliptical contours (a central limit theorem), and the maximum error over the full node set is bounded with high probability (a strong consistency theorem). This has immediate methodological consequences on the estimation of both the mixed membership (Airoldi et al., 2008) and standard stochastic block models (Holland et al., 1983), two currently popular network models. This is because either can be written as a GRDPG model via a judicious choice of vectors v_1, ..., v_K representing the K communities, for appropriate d, p and q. Under the stochastic block model, each latent position is equal to one of these vectors (reflecting the node’s community membership), whereas under mixed membership each lives inside their convex hull (reflecting the node’s mixed membership). Community identification via spectral embedding therefore reduces to a clustering problem under the standard stochastic block model, and a support estimation problem under mixed membership. Our central limit theorem shows that, under the stochastic block model, fitting a Gaussian mixture model with elliptical components should be preferred over applying k-means, as is commonly recommended in the spectral clustering algorithm. Our strong consistency theorem serves to prove that, under the mixed membership stochastic block model, the minimum volume enclosing convex K-polytope provides a consistent estimate of the support. It is thereafter straightforward to obtain consistent estimates for the full parameter set of either model.

The strong consistency and central limit theorems hold after the spectrally embedded nodes are jointly transformed according to an unidentifiable matrix Q in the indefinite orthogonal group O(p, q). The presence of this matrix creates the initially perturbing complication that inter-point distance is not identifiable in general. The application of distance-based inference procedures, including k-means, to the spectrally embedded nodes of a GRDPG is unsound, since two equivalent point clouds, i.e. equal up to indefinite orthogonal transformation, could yield different conclusions. This might at first glance also cast doubt over the use of the Gaussian clustering and minimum volume enclosing procedures suggested above. All such concerns are dispelled, mainly by appealing to simple statistical insights on the effect of this (full-rank) linear transformation on volumes and Gaussian contours, ultimately leaving all inferentially relevant quantities invariant. A more technical point is to ensure that the matrix Q does not blow up, and we show that its spectral norm is bounded with high probability. There are therefore in effect two arguments against using k-means within the spectral clustering algorithm: first, the clusters are asymptotically Gaussian with elliptical and not circular contours; second, if we accept that the specific point configuration obtained by spectral embedding is not special among equivalent point clouds for producing higher quality clusters, the algorithm is in a sense giving an arbitrary answer.

1.1 Related models

Hoff et al. (2002) considered a number of latent position models corresponding to different kernels, including perhaps the most natural choice, based on the distance ‖x − y‖. Hoff (2008) later considered the kernel f(x, y) = x^T Λ y, where Λ is a diagonal matrix, showing that this so-called eigenmodel generalises the stochastic block model and, in a weak sense, the afore-mentioned distance-based model. The eigenmodel can be made identical to the GRDPG model by rescaling the axes. Our apparently new results on spectral embedding and reproducing mixtures of connectivity behaviours (including mixed community membership) in latent space are therefore highly relevant. When estimation is discussed, including the matter of identifiability, the eigenmodel and our proposal diverge. Hoff (2009) restricts the latent positions so that the matrix whose rows are X_1, ..., X_n has orthonormal columns. Next, within a Bayesian treatment, this matrix is assumed a priori to follow a uniform distribution on the Stiefel manifold. In our estimation setup, X_1, ..., X_n are independent and identically distributed (i.i.d.) according to an unknown but estimable distribution.

Our model is named after the Random Dot Product Graph (RDPG) (Nickel, 2006; Young and Scheinerman, 2007; Athreya et al., 2016), the special case where q = 0, so that I_{p,q} is the identity and the kernel is the standard Euclidean inner product. What the GRDPG model adds is the possibility of modelling disassortative connectivity behaviour (Khor, 2010), e.g. where ‘opposites attract’. It is worth noting more generally that in community- and latent-position-based approaches to network data analysis, the possible presence of disassortative connectivity behaviour is often implicitly ruled out, e.g. within the distance-based latent position model mentioned above, the highly referenced tutorial on spectral clustering (Von Luxburg, 2007), and the notion of modularity (Newman, 2006). Yet, in our (collectively diverse) experience of real networks, disassortativity is not rare and is often somewhat recklessly overlooked. The presence of a single significant negative eigenvalue in the spectrum of the adjacency matrix is reason to reject the RDPG model in favour of the GRDPG, but in fact real networks abound where the numbers of positive and negative eigenvalues are of the same order, as many simple theoretical arguments about the spectrum of a random matrix would predict. Reasons why disassortativity occurs, and consequent opportunities for improvements in link prediction, are later demonstrated in a cyber-security example (Section 5).

In a contemporaneously written paper, Lei (2018) proposes the kernel f(x, y) = ⟨x_+, y_+⟩_+ − ⟨x_−, y_−⟩_−, where x = (x_+, x_−) and y = (y_+, y_−) live on the direct sum of two Hilbert spaces with respective inner products ⟨·,·⟩_+ and ⟨·,·⟩_−. The GRDPG model is a special case where the Hilbert spaces are R^p and R^q, equipped with the usual inner product. Adjacency spectral embedding, as defined here (Definition 1), is there shown to provide consistent latent position estimates according to a form of Wasserstein distance. Our suggested methodological improvements for inference based on spectral embedding rely heavily on our strong consistency and central limit theorems, neither of which appears to be implied.

If the latent positions of the GRDPG are independent and identically distributed (i.i.d.), as will be assumed when we discuss estimation via spectral embedding, the model also admits an Aldous-Hoover representation (Aldous, 1981; Hoover, 1979), whereby each node is instead independently assigned a uniform latent position on the unit interval, and connections occur conditionally independently according to an appropriately modified kernel f: [0,1] × [0,1] → [0,1], which is often called a graphon (Lovász, 2012). Conversely, under continuity assumptions on f, any Aldous-Hoover graph model can be approximated up to any precision by a GRDPG model with sufficiently large d. This is not true of the RDPG or distance-based latent position models mentioned above, generally speaking, when f is not positive definite.

1.2 Relation to prior work

A central limit theorem was earlier derived for the spectrally embedded nodes of an RDPG obtained from both the adjacency (Athreya et al., 2016) and Laplacian (Tang and Priebe, 2016) matrices, and strong consistency results, phrased in terms of a certain two-to-infinity norm, are available in Lyzinski et al. (2017); Cape et al. (2017, 2018).

Broadly speaking, extending prior results on adjacency spectral embedding from the RDPG to the GRDPG requires new methods of analysis that together represent the main technical contribution of this paper (mainly Theorems 5 and 7). Further extending results to the case of Laplacian spectral embedding, while mathematically involved, follows mutatis mutandis the machinery developed in Tang and Priebe (2016). Analogous Laplacian-based results (Theorems 6 and 8) are therefore stated without proof.

We will therefore discuss primarily adjacency spectral embedding, which has the added benefit of allowing us to treat estimation of the mixed membership and standard stochastic block models as two alternative statistical analysis procedures, respectively support estimation and clustering, applicable to the same spectrally embedded nodes. Our discussion surrounding the stochastic block model, particularly relating to the importance of fitting a Gaussian mixture model over k-means, is just as valid when the nodes are embedded using the Laplacian.

The connection between the mixed membership stochastic block model and the RDPG was identified by Rubin-Delanchy et al. (2017) and used to prove that adjacency spectral embedding, followed by fitting the minimum volume enclosing convex K-polytope, leads to a consistent estimate of the mixed membership stochastic block model when the inter-community link probability matrix is non-negative definite. The (very common) non-definite case requires a two-to-infinity norm bound for the GRDPG as well as a bound on the spectral norm of Q, both derived here. With those two points established, the consistency of the enclosing polytope and resulting mixed membership stochastic block model parameter estimates follows by arguments analogous to those of Rubin-Delanchy et al. (2017).

The rest of this article is organised as follows. Section 2 illustrates the methodological implications of our theory using two simple network model examples. Next, in Section 3, we define the generalised random dot product graph and discuss several of its properties, including special cases of interest, the representation of mixtures of connectivity behaviour as convex combinations in latent space, and identifiability. Section 4 presents asymptotic theory supporting the interpretation of spectral embedding as estimating the latent positions of a GRDPG, and methodological implications for the estimation of the mixed membership and standard stochastic block models. Section 5 provides a real data example from a cyber-security application, and Section 6 concludes.

2 Methodological applications

The spectral embedding procedures considered in this paper are:

Definition 1 (Adjacency and Laplacian spectral embeddings).

Given an undirected graph with (symmetric) adjacency matrix A, consider the truncated spectral decomposition Û Ŝ Û^T of A, where Ŝ is a diagonal matrix containing the d largest eigenvalues of A in magnitude, and Û contains the corresponding orthonormal eigenvectors. Define the adjacency spectral embedding of the graph into R^d by X̂ = [X̂_1 | ... | X̂_n]^T = Û |Ŝ|^{1/2}.

Similarly, let L = D^{-1/2} A D^{-1/2} denote the (normalised) Laplacian of the graph, where D is the degree matrix, a diagonal matrix whose ith diagonal entry contains the degree of the ith node. Now consider the truncated spectral decomposition Ũ S̃ Ũ^T of L, where S̃ is a diagonal matrix containing the d largest eigenvalues of L in magnitude, and Ũ contains the corresponding orthonormal eigenvectors. Define the Laplacian spectral embedding of the graph into R^d by X̃ = [X̃_1 | ... | X̃_n]^T = Ũ |S̃|^{1/2}.

In the above and hereafter, |M| and |M|^{1/2} denote the element-wise absolute value and square root of a diagonal matrix M.
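As a concrete illustration of Definition 1, the following minimal Python sketch (our own; the paper supplies no code, and the function names are ours) computes both embeddings with numpy by selecting the d eigenvalues largest in magnitude and scaling the corresponding eigenvectors by the square roots of their absolute values.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Rows of the returned matrix are the points X_hat_i in R^d."""
    evals, evecs = np.linalg.eigh(A)              # A is symmetric
    idx = np.argsort(np.abs(evals))[::-1][:d]     # d largest eigenvalues in magnitude
    S, U = evals[idx], evecs[:, idx]
    return U * np.sqrt(np.abs(S))                 # U |S|^{1/2}

def laplacian_spectral_embedding(A, d):
    """Same construction applied to the normalised Laplacian D^{-1/2} A D^{-1/2}."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))   # guard against isolated nodes
    L = A * np.outer(d_inv_sqrt, d_inv_sqrt)
    evals, evecs = np.linalg.eigh(L)
    idx = np.argsort(np.abs(evals))[::-1][:d]
    S, U = evals[idx], evecs[:, idx]
    return U * np.sqrt(np.abs(S))
```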

The interpretation of these spectral embedding procedures as estimating the latent positions of a GRDPG model is the subject of this paper. Methodological implications are now illustrated using two very simple network model examples. The object of each is to motivate a different theoretical aim: in the first, to prove a central limit theorem on individual latent position estimates, and in the second, to bound the maximum error over the full set. Ultimately this leads to two algorithms, which are presented and analysed in more generality in Section 4.

2.1 A two-community stochastic block model

We first consider an undirected random graph from a two-community stochastic block model on n nodes. Every node is independently assigned to the first or second community, with probabilities π_1 and π_2 = 1 − π_1 respectively. With this assignment held fixed, an edge between the ith and jth node occurs independently with probability B_{z_i z_j}, where z_i, z_j ∈ {1, 2} denote the communities of the two corresponding nodes, and B is a symmetric 2 × 2 matrix of inter-community link probabilities which has one positive and one negative eigenvalue. The adjacency matrix of this graph is a symmetric matrix A ∈ {0, 1}^{n × n}, where A_{ij} = 1 if there is an edge between the ith and jth node, and is zero otherwise.

Figure 1: Spectral embedding and analysis of simulated graphs from the mixed membership (MM — right) and standard (S — left) stochastic block models. Detailed discussion in Section 2.

The point cloud shown in Figure 1a) was generated by adjacency spectral embedding a simulated graph from the stochastic block model described above into R^2. The effect of this procedure is to map the ith node of the graph to a two-dimensional vector, denoted X̂_i, given by the transpose of the ith row of X̂.
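For concreteness, the following sketch generates a point cloud of this kind. The block matrix B and community probabilities are illustrative assumptions only (the values used for Figure 1 are not reproduced above), with B chosen to have one positive and one negative eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
pi = np.array([0.5, 0.5])                 # assumed community probabilities
B = np.array([[0.3, 0.7],
              [0.7, 0.3]])                # assumed; eigenvalues 1.0 and -0.4

z = rng.choice(2, size=n, p=pi)           # community assignments z_i
P = B[np.ix_(z, z)]                       # edge probability matrix
A = (rng.uniform(size=(n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                               # symmetric, hollow

# adjacency spectral embedding into R^2 (see the sketch after Definition 1)
evals, evecs = np.linalg.eigh(A)
idx = np.argsort(np.abs(evals))[::-1][:2]
X_hat = evecs[:, idx] * np.sqrt(np.abs(evals[idx]))   # ith row is X_hat_i
```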

While the points appear to separate into two clusters, the distribution of each is visibly non-circular. The implication is that applying k-means could give spurious results, and an obvious potential improvement would be to instead fit a mixture of two (non-circular) Gaussian distributions. This is implemented in Figure 1b) using the MCLUST algorithm (Fraley and Raftery, 1999). The estimated cluster assignment of each point is indicated by colouring. The empirical cluster centres are shown as small circles and corresponding empirical 95% level curves in dashed lines.

What we will discover is that the clusters are approximately Gaussian, but the precise sense in which this is true is quite subtle, and gives additional reason to be wary of k-means. Choose v_1, v_2 ∈ R^2 to satisfy v_k^T I_{1,1} v_l = B_{kl}, for k, l ∈ {1, 2}; knowledge of v_1 and v_2 implies knowledge of B, but not vice-versa. Without loss of generality, restrict attention to the first m nodes. In the next statement, the community membership of those nodes is held fixed while the number of nodes, n, goes to infinity.

The sense in which the clusters are ‘approximately Gaussian’ is that there exists a sequence of random matrices Q_n in the indefinite orthogonal group O(1, 1), such that the vectors Q_n X̂_1, ..., Q_n X̂_m are asymptotically independent and Gaussian as n → ∞, where Q_n X̂_i has approximate mean v_{z_i} and covariance Σ(v_{z_i})/n, for i = 1, ..., m. The function Σ(·), if somewhat complicated, is easily computable given B and the community membership probabilities. This central limit theorem, formally given in Theorem 7, is illustrated in Figure 1c). There is, incidentally, a non-obvious way of generating Q_n given A and X. In this example, as is the case in general, Q_n is not distance preserving. Analogous plots such as Figure 1 of Tang and Priebe (2016) had relied on finding an orthogonal matrix by Procrustes super-imposition, a technique not available here. In Figure 1c) the points, empirical centres (small circles) and empirical level curves (dashed lines) have been transformed according to Q_n, for comparison with the asymptotic centres (crosses) and 95% level curves (solid lines) predicted by the theory.

The important complication that is added by the presence of this indefinite orthogonal transformation in the central limit theorem is that Q_n is unidentifiable and materially affects inter-point distances (see Figure 3). Translating theory into methodology therefore runs into the following apparent issue. Asymptotics aside, instead of observing two Gaussian clusters centred about v_1 and v_2, the two clusters are observed only after they have together been distorted by a transformation that, when only A is observed, cannot be identified and therefore undone. How can we then meaningfully cluster the points?

The issue is resolved with a simple observation: if fitting a Gaussian mixture model with components of varying volume, shape, and orientation, this linear pre-transformation of the data is (in principle) immaterial.

This is because under a Gaussian mixture model the value of the likelihood is unchanged if, while the component weights are held fixed, the data, component means and covariances are respectively transformed as x ↦ Mx, μ ↦ Mμ, Σ ↦ MΣM^T, where M is a matrix, such as Q_n or indeed any element of O(1, 1), satisfying |det M| = 1. Fitting a two-component Gaussian mixture model to the data by maximum likelihood therefore results in cluster mean vector estimates μ̂_1, μ̂_2 such that Q_n μ̂_1, Q_n μ̂_2 are the mean vectors that would have been estimated had we instead been able to fit the model to Q_n X̂_1, ..., Q_n X̂_n. But Q_n X̂_i is asymptotically centred about v_{z_i}, and so Q_n μ̂_k consistently estimates v_k for k = 1, 2, so that the fitted means recover v_1 and v_2 up to a common indefinite orthogonal transformation. Similarly, estimated component memberships (giving the nodes’ estimated community memberships) are also invariant. In practice, regularisation parameters in the clustering method may give results that are not invariant to indefinite transformation, but such effects should be small for large n, especially taken alongside an additional result, in Lemma 9, that the spectral norm of Q_n is with high probability bounded. The use of k-means, on top of being suboptimal, is also unsound since its output is dependent on what is arguably an arbitrary choice of point configuration.
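This invariance is easy to check numerically. The sketch below (our own illustration, not taken from the paper) evaluates a two-component Gaussian mixture log-likelihood before and after transforming the data, means and covariances by a matrix M with |det M| = 1, here a hyperbolic rotation lying in O(1, 1); all parameter values are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

def gmm_loglik(X, weights, means, covs):
    dens = sum(w * multivariate_normal.pdf(X, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))
    return np.log(dens).sum()

# toy data and mixture parameters (illustrative only)
X = rng.normal(size=(500, 2))
weights = [0.4, 0.6]
means = [np.array([1.0, 0.0]), np.array([-1.0, 0.5])]
covs = [np.diag([0.5, 0.2]), np.array([[0.3, 0.1], [0.1, 0.4]])]

# an indefinite orthogonal transformation: M^T I_{1,1} M = I_{1,1}, |det M| = 1
t = 0.7
M = np.array([[np.cosh(t), np.sinh(t)], [np.sinh(t), np.cosh(t)]])

ll_before = gmm_loglik(X, weights, means, covs)
ll_after = gmm_loglik(X @ M.T, weights, [M @ m for m in means],
                      [M @ c @ M.T for c in covs])
print(np.isclose(ll_before, ll_after))   # True: the likelihood is unchanged
```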

2.2 A three-community mixed membership stochastic block model

We now simulate an undirected graph from a three-community mixed membership stochastic block model on n nodes. Every node is first independently assigned a 3-dimensional probability vector π_i, drawn from a Dirichlet distribution, for i = 1, ..., n, representing its community membership preferences. With this assignment held fixed, an edge between the ith and jth node now occurs independently with probability π_i^T B π_j, where B is a symmetric 3 × 3 matrix of inter-community link probabilities which has one positive and two negative eigenvalues. The point cloud shown in Figure 1d) was generated by adjacency spectral embedding a simulated graph from this model into R^3. As before, the ith node of the graph is mapped to a vector, denoted X̂_i, given by the transpose of the ith row of X̂.

Again the point cloud shows a significant pattern, this time resembling a ‘noisy simplex’. The positioning of a point within the simplex might be expected to reflect the corresponding node’s community membership preferences, with the simplex vertices representing communities, and this intuition is now made formal.

Choose v_1, v_2, v_3 ∈ R^3 to satisfy v_k^T I_{1,2} v_l = B_{kl}, for k, l ∈ {1, 2, 3}, and assign to the ith node the latent position X_i = π_i(1) v_1 + π_i(2) v_2 + π_i(3) v_3, for i = 1, ..., n. Analogously to Section 2.1, hold π_1, ..., π_m fixed while the number of nodes, n, goes to infinity. When applied to this model, our central limit theorem (Theorem 7) guarantees the existence of a sequence of random matrices Q_n ∈ O(1, 2), such that the vectors Q_n X̂_1, ..., Q_n X̂_m are asymptotically independent and Gaussian, where Q_n X̂_i has approximate mean X_i and covariance Σ(X_i)/n, for i = 1, ..., m. Those transformed points are shown in Figure 1f), with the simplex about v_1, v_2, v_3, i.e. the support of X_1, ..., X_n, shown as a solid line.

While useful for interpretation, the central limit theorem does not in this case provide an obvious practical estimation procedure. For example, recycling an earlier idea (Rubin-Delanchy et al., 2017), we might have hoped to estimate v_1, v_2, v_3 (and therefore B) by fitting the minimum volume 2-simplex (MVS) enclosing the two principal components (PC) of the points X̂_1, ..., X̂_n. The resulting simplex is shown in Figure 1e) in dashed lines. By the theory developed in Rubin-Delanchy et al. (2017), it would converge and allow consistent parameter estimation if B were positive definite.

In fact, the procedure remains consistent in the present indefinite case, but extending the proof involves two non-trivial challenges. Since, technicalities aside, the simplex enclosing Q_n X̂_1, ..., Q_n X̂_n clearly converges to the v_1–v_2–v_3 simplex, the first challenge is to control the asymptotic worst-case deviation of any of the n latent position estimates from its true value, rather than the fixed finite subset considered in the central limit theorem. This is guaranteed by a second, strong consistency result, given in Theorem 5, showing that max_i ‖Q_n X̂_i − X_i‖ converges to zero, in Euclidean norm, with high probability.

The argument presented in Rubin-Delanchy et al. (2017) would then prove that this estimation procedure was consistent if it was instead applied to the (unobservable) Q_n X̂_1, ..., Q_n X̂_n. The second challenge is to determine whether the consistency properties of the actual procedure, i.e. applied to X̂_1, ..., X̂_n, are affected by the indefinite transformation. The minimum volume simplex fitting step is unaffected because |det Q_n| = 1, so that all relevant volumes are unchanged. Less trivially, the principal components are also estimated consistently because the spectral norm of Q_n is bounded (Lemma 9).

3 The generalised random dot product graph

The theory supporting the discussion of Section 2 is based on the following latent position network model. In the remainder of this article, the variable d will always refer to a positive integer indicating dimension, and p, q to two non-negative integers such that p + q = d.

Definition 2 (Generalised random dot product graph model).

Let 𝒳 be a subset of R^d such that x^T I_{p,q} y ∈ [0, 1] for all x, y ∈ 𝒳, and F a joint distribution on 𝒳^n. We say that (A, X_1, ..., X_n) ~ GRDPG_{p,q}(F), with signature (p, q), if the following hold. First, let (X_1, ..., X_n) ~ F with X = [X_1 | ... | X_n]^T. Then, A ∈ {0, 1}^{n × n} is a symmetric hollow matrix such that, conditional on X, for all i < j, independently,

A_{ij} ~ Bernoulli(X_i^T I_{p,q} X_j).    (1)

The graphs generated by this model are undirected with no self-loops. Allowing the latter, i.e. extending the range of indices above to i ≤ j, makes no difference to the asymptotic theory. The extension to directed graphs, however, is a larger endeavour not attempted here.
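A minimal simulator for Definition 2 might look as follows (our own sketch, with our own function name; it is the caller's responsibility to supply latent positions whose bilinear forms all lie in [0, 1]).

```python
import numpy as np

def simulate_grdpg(X, p, q, rng=None):
    """Simulate A ~ GRDPG given latent positions X (n x d, rows X_i) and signature (p, q)."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    assert p + q == d
    I_pq = np.diag(np.concatenate([np.ones(p), -np.ones(q)]))
    P = X @ I_pq @ X.T                     # P_ij = X_i^T I_{p,q} X_j
    assert (P >= 0).all() and (P <= 1).all(), "latent positions must give valid probabilities"
    A = (rng.uniform(size=(n, n)) < P).astype(float)
    A = np.triu(A, 1)                      # keep i < j only: symmetric, hollow
    return A + A.T
```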

As we next show, the mixed membership and standard stochastic block models are special cases of Definition 2.

3.1 Special case 1: the stochastic block model

Generalising the example of Section 2.1 to K communities, an undirected graph with adjacency matrix A follows a stochastic block model if there is a partition of the nodes into K communities, conditional upon which A_{ij} ~ Bernoulli(B_{z_i z_j}), for i < j, where B ∈ [0, 1]^{K × K} is symmetric and z_i ∈ {1, ..., K} is an index denoting the community of the ith node.

To represent this as a GRDPG model, let p and q denote the number of strictly positive and strictly negative eigenvalues of B respectively, put d = p + q, and choose v_1, ..., v_K ∈ R^d such that v_k^T I_{p,q} v_l = B_{kl}, for k, l ∈ {1, ..., K}. One choice is to use the rows of U|Λ|^{1/2}, where B = UΛU^T is the spectral decomposition of B restricted to its non-zero eigenvalues. It will help to remember that d ≤ K.
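In code, this construction of community representatives from B might read as follows (a sketch under the conventions above, with our own function name; eigenvalues are ordered positive-first so that their signs line up with I_{p,q}).

```python
import numpy as np

def sbm_to_grdpg_vertices(B, tol=1e-10):
    """Return v_1, ..., v_K (rows) and the signature (p, q) with v_k^T I_{p,q} v_l = B_kl."""
    evals, evecs = np.linalg.eigh(B)
    keep = np.abs(evals) > tol                 # restrict to non-zero eigenvalues
    evals, evecs = evals[keep], evecs[:, keep]
    order = np.argsort(-evals)                 # positive eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    p, q = int((evals > 0).sum()), int((evals < 0).sum())
    V = evecs * np.sqrt(np.abs(evals))         # K x d matrix; rows are v_1, ..., v_K
    I_pq = np.diag(np.concatenate([np.ones(p), -np.ones(q)]))
    assert np.allclose(V @ I_pq @ V.T, B)      # v_k^T I_{p,q} v_l = B_kl
    return V, p, q
```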

If F is then restricted so that, with probability one, X_i ∈ {v_1, ..., v_K} for i = 1, ..., n, we have a stochastic block model. For example, in the two-community model of Section 2.1, we have F = (π_1 δ_{v_1} + π_2 δ_{v_2})^n, where δ_v denotes the probability distribution placing all mass on v.

3.2 Special case 2: the mixed membership stochastic block model

Now, instead of fixing communities, assign (at random or otherwise) to the ith node a probability vector π_i ∈ Δ_{K−1}, where Δ_{K−1} denotes the standard (K−1)-simplex. Conditional on this assignment, let

A_{ij} ~ Bernoulli(B_{z_{ij} z_{ji}}), where z_{ij} ~ Categorical(π_i) and z_{ji} ~ Categorical(π_j) independently,

for i < j. The resulting graph is said to follow a mixed membership stochastic block model.

Averaging over z_{ij} and z_{ji}, we can equivalently write that, conditional on π_1, ..., π_n,

A_{ij} ~ Bernoulli(π_i^T B π_j), independently for i < j.

But if v_1, ..., v_K and I_{p,q} are as defined previously, then π_i^T B π_j = X_i^T I_{p,q} X_j, where X_i = Σ_k π_i(k) v_k, for i = 1, ..., n. Therefore, conditional on X_1, ..., X_n, Equation (1) holds, and the graph follows a GRDPG_{p,q}(F) model, with F implicitly determined by the construction of π_1, ..., π_n and supported within the convex hull of v_1, ..., v_K. If B is full-rank (d = K), this support is a (K−1)-simplex.
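Continuing in the same vein, the identity π_i^T B π_j = X_i^T I_{p,q} X_j with X_i = Σ_k π_i(k) v_k can be verified numerically; the block matrix and Dirichlet memberships below are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(2)
B = np.array([[0.7, 0.3, 0.3],
              [0.3, 0.1, 0.5],
              [0.3, 0.5, 0.1]])              # assumed K = 3 block matrix (full rank)

# community representatives v_1, ..., v_K (as in the sketch of Section 3.1)
evals, evecs = np.linalg.eigh(B)
order = np.argsort(-evals)                   # positive eigenvalues first
evals, evecs = evals[order], evecs[:, order]
p, q = int((evals > 0).sum()), int((evals < 0).sum())
V = evecs * np.sqrt(np.abs(evals))
I_pq = np.diag(np.concatenate([np.ones(p), -np.ones(q)]))

Pi = rng.dirichlet(np.ones(3), size=100)     # assumed membership probability vectors
X = Pi @ V                                   # X_i = sum_k pi_i(k) v_k

# pi_i^T B pi_j equals X_i^T I_{p,q} X_j for all pairs
print(np.allclose(Pi @ B @ Pi.T, X @ I_pq @ X.T))   # True
```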

Figure 2: Illustration of the mixed membership (MM) and standard (S) stochastic block models as special cases of the GRDPG model (Definition 2). The points v_1, ..., v_K represent communities. Under mixed membership, if the ith node has a community membership probability vector π_i, then its position in latent space, X_i, is the corresponding convex combination of v_1, ..., v_K. Under the standard stochastic block model, the ith node is assigned to a single community z_i, so that X_i = v_{z_i}.

The GRDPG model therefore gives the mixed membership and standard stochastic block models a natural spatial representation whereby v_1, ..., v_K represent communities, and latent positions in between them represent nodes with mixed membership. This is illustrated in Figure 2.

Originally, Airoldi et al. (2008) set π_i ~ Dirichlet(α) for some α, just as in Section 2.2. The corresponding latent positions X_1, ..., X_n are then a) also i.i.d., and b) fully supported on the convex hull of v_1, ..., v_K. The proof of consistency of our spectral estimation procedure, given in Algorithm 2 (Section 4) and illustrated in Figure 1e), relies on these two points only, allowing other distributions than the Dirichlet.

3.3 Uniqueness

There are a number of reasonable alternative latent position models which, broadly described, assign the nodes to elements x_1, ..., x_n of a set 𝒳 and, with this assignment held fixed, set

A_{ij} ~ Bernoulli(f(x_i, x_j)), independently,

for i < j, where f: 𝒳 × 𝒳 → [0, 1] is some symmetric function. For example, Hoff et al. (2002) considered a choice based on the distance ‖x − y‖. What is special about the GRDPG?

One argument for considering the GRDPG as a practical model, and not only a theoretical device for studying spectral embedding, is that it provides essentially the only way of faithfully reproducing mixtures of connectivity behaviour as convex combinations in latent space. This idea is now made formal.

Property 3 (Reproducing mixtures of connectivity behaviour).

Suppose that 𝒳 is a convex subset of a real vector space, and that 𝒳_0 is a subset of 𝒳 whose convex hull is 𝒳. We say that a symmetric function f: 𝒳 × 𝒳 → [0, 1] reproduces mixtures of connectivity behaviours from 𝒳_0 if, whenever x = Σ_i α_i u_i, where u_i ∈ 𝒳_0, α_i ≥ 0 and Σ_i α_i = 1, we have

f(x, y) = Σ_i α_i f(u_i, y),

for any y in 𝒳.

This property helps interpretation of latent space. For example, suppose x = (u_1 + u_2)/2, with u_1, u_2 ∈ 𝒳_0 and y ∈ 𝒳. In a latent position model where f satisfies the above, we can either think of the corresponding edge as being directly generated with probability f(x, y), or by first flipping a coin, and generating an edge with probability f(u_1, y) if it comes up heads, or with probability f(u_2, y) otherwise.

In choosing a latent position model to represent the mixed membership stochastic block model, it would be natural to restrict attention to kernels satisfying Property 3, since they allow the simplex representation illustrated in Figure 2, with vertices representing communities and latent positions within it reflecting the nodes’ community membership preferences.

We now find that in finite dimension, any such choice amounts to a GRDPG model in at most one extra dimension:

Theorem 4.

Suppose 𝒳 is a subset of R^m, for some m ≥ 1. The function f reproduces mixtures of connectivity behaviours if and only if there exist integers p, q, d with p + q = d ≤ m + 1, a matrix M ∈ R^{d × m}, and a vector c ∈ R^d so that f(x, y) = (Mx + c)^T I_{p,q} (My + c), for all x, y ∈ 𝒳.

The mixed membership stochastic block model is an example where this additional dimension is required: in Figure 2 the model is represented as a GRDPG model in K dimensions, but the latent positions live on a (K−1)-dimensional subset. The proof of Theorem 4 is relegated to the appendix.

3.4 Identifiability

In the definition of the GRDPG, it is clear that the conditional distribution of A given X_1, ..., X_n would be unchanged if each X_i was replaced by QX_i, for any Q in the indefinite orthogonal group O(p, q) = {Q ∈ R^{d × d} : Q^T I_{p,q} Q = I_{p,q}}. The vectors X_1, ..., X_n are therefore identifiable from A only up to such transformation.

The property of identifiability up to orthogonal transformation is encountered in many statistical applications and corresponds to the case q = 0. This unidentifiability will often turn out to be irrelevant because in that case inter-point distances are invariant. This ceases to be true when both p and q are non-zero.

Figure 3 illustrates the distance distorting effect of indefinite orthogonal transformations on the latent positions of a GRDPG. The group O(p, q) contains ordinary rotation matrices, but also hyperbolic rotations, as can be verified analytically. A rotation is applied to the points to get from the top-left to the top-right figure. Hyperbolic rotations (with parameters chosen arbitrarily) take the points from the top-left to the bottom-left and from the top-right to the bottom-right figures, respectively. The colour of each point is preserved across the figures.

While the shapes on the bottom row look symmetric, the inter-point distances are in fact materially altered. On the left, the blue vertex is closer to the green; on the right it is closer to the red; whereas all three vertices are equidistant in the top row.

This inter-point distance non-identifiability implies that, for example, when using spectral embedding to estimate latent positions for subsequent inference, distance-based inference procedures such as classical k-means are to be avoided.

Figure 3: Identifiability of the latent positions of a GRDPG. In each figure, the three coloured points represent latent positions v_1, v_2 and v_3. Transformations in the group O(p, q) include some rotations (e.g. that used to go from the top-left to the top-right triangle), but also hyperbolic rotations (e.g. the two shown going from top-left to bottom-left and top-right to bottom-right). There are therefore group elements which change inter-point distances. On the left, the blue position is closer to the green, whereas on the right it is closer to the red; all three positions are equidistant in the top row. Further details in main text.
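The following sketch (a two-dimensional illustration with signature (1, 1), chosen for brevity; the positions are our own illustrative values) verifies that a hyperbolic rotation belongs to the indefinite orthogonal group, preserves every bilinear form x^T I_{1,1} y, and yet changes inter-point Euclidean distances.

```python
import numpy as np

I_11 = np.diag([1.0, -1.0])
t = 0.8
Q = np.array([[np.cosh(t), np.sinh(t)],
              [np.sinh(t), np.cosh(t)]])           # hyperbolic rotation

# membership of O(1,1): Q^T I_{1,1} Q = I_{1,1}
print(np.allclose(Q.T @ I_11 @ Q, I_11))           # True

# three example latent positions (rows); values are illustrative only
V = np.array([[0.8, 0.2],
              [0.6, -0.3],
              [0.9, 0.5]])
W = V @ Q.T                                         # transformed positions Q v_k

# edge probabilities v_k^T I_{1,1} v_l are preserved ...
print(np.allclose(V @ I_11 @ V.T, W @ I_11 @ W.T))  # True
# ... but inter-point Euclidean distances are not
print(np.linalg.norm(V[0] - V[1]), np.linalg.norm(W[0] - W[1]))  # differ
```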

4 Estimation via spectral embedding

This section describes the asymptotic statistical properties of GRDPG latent position estimates obtained via spectral embedding. When restricted to special cases that are of current popular interest, these results suggest and formally justify the use of the following algorithms.

1: input adjacency matrix A, dimension d, number of communities K
2: compute spectral embedding X̂_1, ..., X̂_n of the graph into R^d (see Definition 1)
3: fit a Gaussian mixture model (ellipsoidal, varying volume, shape, and orientation) with K components
4: return cluster centres v̂_1, ..., v̂_K and node memberships ẑ_1, ..., ẑ_n
Algorithm 1 Spectral estimation of the stochastic block model (spectral clustering)

To accomplish step 3 we have been employing the MCLUST algorithm (Fraley and Raftery, 1999), which has a user-friendly R package. In step 2, either adjacency or Laplacian spectral embedding can be used (see Definition 1). If the latter, the resulting node memberships can be interpreted as alternative estimates of the community assignments, but note that the output cluster centres do not estimate v_1, ..., v_K directly. Where this algorithm differs most significantly from Rohe et al. (2011) is in the use of a Gaussian mixture model over k-means.
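A compact Python rendering of Algorithm 1 might look as follows; this sketch substitutes scikit-learn's GaussianMixture with full covariance matrices for MCLUST's ellipsoidal model, which is our choice rather than the paper's implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def spectral_estimate_sbm(A, d, K, random_state=0):
    """Algorithm 1 sketch: adjacency spectral embedding + Gaussian mixture clustering."""
    evals, evecs = np.linalg.eigh(A)
    idx = np.argsort(np.abs(evals))[::-1][:d]           # d largest-in-magnitude eigenvalues
    X_hat = evecs[:, idx] * np.sqrt(np.abs(evals[idx]))
    gmm = GaussianMixture(n_components=K, covariance_type="full",
                          random_state=random_state).fit(X_hat)
    memberships = gmm.predict(X_hat)                     # estimated communities z_hat_i
    centres = gmm.means_                                 # estimated cluster centres
    return centres, memberships
```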

1: input adjacency matrix A, dimension d, number of communities K
2: compute adjacency spectral embedding X̂_1, ..., X̂_n of the graph into R^d (see Definition 1)
3: compute the K − 1 principal components, giving transformed points X̂_1^{PC}, ..., X̂_n^{PC}, and fit the minimum volume enclosing convex K-polytope, with vertices û_1, ..., û_K
4: obtain convex combinations π̂_1, ..., π̂_n such that X̂_i^{PC} ≈ Σ_k π̂_i(k) û_k, for i = 1, ..., n
5: return vertices v̂_1, ..., v̂_K of the reconstructed convex K-polytope in R^d, and community membership probability vectors π̂_1, ..., π̂_n
Algorithm 2 Spectral estimation of the mixed membership stochastic block model

To fit the minimum volume enclosing convex K-polytope in step 3, we have been using the hyperplane-based algorithm by Lin et al. (2016) and are grateful to the authors for providing code.

4.1 Asymptotics

Let ξ be a random vector distributed according to F, where F is a distribution supported on a subset of R^d with an invertible second moment matrix Δ = E(ξ ξ^T). Here d is viewed as fixed and constant, so for simplicity we suppress d-dependent factors in the statements of our theorems. Our proofs, however, keep track of d.

We will characterise the asymptotic latent position estimation error under the assumption that, for each n, the latent positions X_1, ..., X_n are independent replicates of the random vector ρ_n^{1/2} ξ, where either ρ_n = 1 for all n or ρ_n → 0. The generic joint distribution occurring in Definition 2 is therefore assumed to factorise into a product of n identical marginal distributions, each equal to the distribution of ξ up to scaling. Since the average degree of the graph grows as nρ_n, the cases ρ_n = 1 and ρ_n → 0 can be thought to respectively produce dense and sparse regimes, and ρ_n is called a sparsity factor.

Remark 1 (Probabilistic convention).

For ease of presentation, many bounds in this paper are said to hold “with high probability”. We say that a random variable X_n is O(f(n)) with high probability if, for any positive constant c there exists an integer n_0 and a constant C > 0 (both of which possibly depend on c) such that, for all n ≥ n_0, |X_n| ≤ C f(n) with probability at least 1 − n^{−c}. In addition, we write that the random variable X_n is o(f(n)) with high probability if, for any positive constant c and any ε > 0, there exists an n_0 such that, for all n ≥ n_0, |X_n| ≤ ε f(n) with probability at least 1 − n^{−c}. This notational convention is upheld when, for example, we specify the norm of a random vector or of a random matrix.

Theorem 5 (Adjacency spectral embedding two-to-infinity norm bound).

Consider the generalised random dot product graph with signature (p, q). There exists a universal constant c ≥ 1 such that, provided the sparsity factor satisfies nρ_n = ω(log^c n), there exists a random matrix Q_n ∈ O(p, q) such that

max_{i ∈ {1, ..., n}} ‖Q_n X̂_i − X_i‖ = o(1) with high probability.    (2)
Theorem 6 (Laplacian spectral embedding two-to-infinity norm bound).

Consider the generalised random dot product graph with signature (p, q). There exists a universal constant c ≥ 1 such that, provided the sparsity factor satisfies nρ_n = ω(log^c n), there exists a random matrix Q_n ∈ O(p, q) such that

max_{i ∈ {1, ..., n}} ‖Q_n X̃_i − X_i/(Σ_j X_i^T I_{p,q} X_j)^{1/2}‖ = o(1) with high probability,    (3)

where X̃_1, ..., X̃_n denote the Laplacian spectral embedding of Definition 1.

Let Φ(z, Σ) denote the multivariate normal cumulative distribution function with mean zero and covariance matrix Σ, evaluated at the vector z.

Theorem 7 (Adjacency spectral embedding central limit theorem).

Consider the sequence of generalised random dot product graphs with signature (p, q), where the sparsity factor satisfies nρ_n = ω(log^c n) for the universal constant c as in Theorem 5. For any integer m ≥ 1, choose points x_1, ..., x_m in the support of the latent position distribution, and points z_1, ..., z_m ∈ R^d. There exists a sequence of random matrices Q_n ∈ O(p, q) so that, conditional on X_i = ρ_n^{1/2} x_i for i = 1, ..., m,

P( n^{1/2}(Q_n X̂_i − X_i) ≤ z_i, for i = 1, ..., m ) → ∏_{i=1}^{m} Φ(z_i, Σ(x_i)),    (4)

where Σ(x) is an explicitly computable covariance matrix depending on x, the second moment matrix Δ and the signature (p, q).

Theorem 8 (Laplacian spectral embedding central limit theorem).

Consider the sequence of generalised random dot product graphs with signature (p, q), where the sparsity factor satisfies nρ_n = ω(log^c n) for the universal constant c as in Theorem 6. For any integer m ≥ 1, choose points x_1, ..., x_m in the support of the latent position distribution, and points z_1, ..., z_m ∈ R^d. There exists a sequence of random matrices Q_n ∈ O(p, q) so that the correspondingly scaled errors of the Laplacian spectral embedding, Q_n X̃_i about the degree-normalised latent positions of Theorem 6, converge jointly in distribution to independent multivariate normal vectors with mean zero and covariance matrices Σ̃(x_1), ..., Σ̃(x_m), where Σ̃(x) is an explicitly computable covariance matrix analogous to Σ(x) in Theorem 7.

Remark 2 (GRDPG proof overview).

Theorems 5 and 7 are proved in succession within a unified framework. Within the proof, we consider the edge probability matrix P = X I_{p,q} X^T and its (low-rank) spectral decomposition representation given by P = U S U^T, where U ∈ R^{n × d} has orthonormal columns and S is a diagonal matrix of the d non-zero eigenvalues. By the underlying GRDPG model unidentifiability with respect to indefinite orthogonal transformations, there exists Q_X ∈ O(p, q) such that U|S|^{1/2} = X Q_X. The proof of Theorem 5 begins with a collection of matrix perturbation decompositions which eventually yield a relation expressing the embedding X̂, up to a specified transformation W in the orthogonal group O(d), as a leading linear term in the noise A − P plus a residual matrix R. Appropriately manipulating this relation subsequently yields an important identity from which the matrix Q_n that appears in Theorems 5 and 7 is constructed. Theorem 5 is then established by bounding the maximum Euclidean row norm (equivalently, the two-to-infinity norm (Cape et al., 2017)) of the right-hand side of that identity sufficiently tightly. Theorem 7 is established with respect to the same transformation by showing that, conditional on the ith latent position, i.e. the ith row of X, the classical multivariate central limit theorem can be invoked for the ith row of the leading term, whereas the remaining residual term vanishes in probability as n → ∞. The technical tools involved include a careful matrix perturbation analysis involving an infinite matrix series expansion via Eq. (11), probabilistic concentration bounds for the spectral norm of A − P, delicately passing between norms, and indefinite orthogonal matrix group considerations.

The joint proof of Theorems 5 and 7 captures the novel techniques and necessary additional considerations for moving beyond random dot product graphs considered in previous work to generalised random dot product graphs. The proofs of Theorem 6 and Theorem 8 (for Laplacian spectral embedding), while laborious, follow mutatis mutandis by applying the aforementioned proof considerations within the earlier work and context of the Laplacian spectral embedding limit theorems proven in Tang and Priebe (2016). For this reason, we elect to state those theorems without proof.

The condition on the sparsity factor implies that nρ_n → ∞. Moreover, it is possible to construct sequences of matrices in O(p, q) whose spectral norms diverge as n → ∞. In light of this, the following technical lemma is needed to ensure that the indefinite orthogonal transformation of interest is well-behaved, i.e., does not grow arbitrarily in spectral norm.

Lemma 9.

Let (A, X_1, ..., X_n) be a generalised random dot product graph with signature (p, q) and sparsity factor satisfying the condition of Theorem 5, and define X̂ and Q_n as above. The matrix Q_n has almost surely bounded spectral norm.

Proof of Lemma 9.

The matrices S and X I_{p,q} X^T have common non-zero spectrum by definition, which is further equivalent to the non-zero spectrum of I_{p,q} X^T X, since for any conformable matrices M and N, the spectra of MN and NM coincide, excluding zero-valued eigenvalues. By the law of large numbers,