Statistical inference on random dot product graphs: a survey

Avanti Athreya et al. (Johns Hopkins University), September 16, 2017

The random dot product graph (RDPG) is an independent-edge random graph that is analytically tractable and, simultaneously, either encompasses or can successfully approximate a wide range of random graphs, from relatively simple stochastic block models to complex latent position graphs. In this survey paper, we describe a comprehensive paradigm for statistical inference on random dot product graphs, a paradigm centered on spectral embeddings of adjacency and Laplacian matrices. We examine the analogues, in graph inference, of several canonical tenets of classical Euclidean inference: in particular, we summarize a body of existing results on the consistency and asymptotic normality of the adjacency and Laplacian spectral embeddings, and the role these spectral embeddings can play in the construction of single- and multi-sample hypothesis tests for graph data. We investigate several real-world applications, including community detection and classification in large social networks and the determination of functional and biologically relevant network properties from an exploratory data analysis of the Drosophila connectome. We outline requisite background and current open problems in spectral graph inference.


1 Introduction

Random graph inference is an active, interdisciplinary area of current research, bridging combinatorics, probability, statistical theory, and machine learning, as well as a wide spectrum of application domains from neuroscience to sociology. Statistical inference on random graphs and networks, in particular, has witnessed extraordinary growth over the last decade: for example, [41, 51] discuss the considerable applications of several canonical random graph models in recent network science.

Of course, combinatorial graph theory itself is centuries old—indeed, in his resolution to the problem of the bridges of Königsberg, Leonhard Euler first formalized graphs as mathematical objects consisting of vertices and edges. The notion of a random graph, however, and the modern theory of inference on such graphs, is comparatively new, and owes much to the pioneering work of Erdős, Rényi, and others in the late 1950s. E.N. Gilbert's short 1959 paper [40] considered a random graph for which the presence of an edge between any two vertices is an independent Bernoulli random variable with common probability $p$; roughly concurrently, Erdős and Rényi provided the first detailed analysis of the probabilities of the emergence of certain types of subgraphs within such graphs [30], and today, graphs in which the edges arise independently and with common probability are known as Erdős-Rényi (or ER) graphs.

The Erdős-Rényi (ER) model is one of the simplest generative models for random graphs, but this simplicity belies astonishingly rich behavior ([6], [15]). Nevertheless, in many applications, the requirement of a common connection probability is too stringent: graph vertices often represent heterogeneous entities, such as different people in a social network or cities in a transportation graph, and the connection probability between vertices $i$ and $j$ may well change with $i$ and $j$ or depend on underlying attributes of the vertices. Moreover, these heterogeneous vertex attributes may not be observable; for example, given the adjacency matrix of a Facebook community, the specific interests of the individuals may remain hidden. To more effectively model such real-world networks, we consider latent position random graphs [43]. In a latent position graph, to each vertex $i$ in the graph there is associated an element $X_i$ of the so-called latent space $\mathcal{X}$, and the probability of connection between any two vertices $i$ and $j$ is given by a link or kernel function $\kappa: \mathcal{X} \times \mathcal{X} \to [0,1]$. That is, the edges are generated independently (so the graph is an independent-edge graph) and $\mathbb{P}[A_{ij} = 1] = \kappa(X_i, X_j)$.

The random dot product graph (RDPG) of Young and Scheinerman [106] is an especially tractable latent position graph; here, the latent space is an appropriately constrained subspace of Euclidean space $\mathbb{R}^d$, and the link function is simply the dot or inner product of the pair of $d$-dimensional latent positions. Thus, in a $d$-dimensional random dot product graph with $n$ vertices, the latent positions associated to the vertices can be represented by an $n \times d$ matrix $X$ whose rows are the latent positions, and the matrix of connection probabilities is given by $P = XX^{\top}$. Conditional on this matrix $P$, the RDPG has an adjacency matrix $A$ whose entries $A_{ij}$ are independent Bernoulli random variables with success probabilities $P_{ij}$. For simplicity, we will typically consider symmetric, hollow RDPG graphs; that is, undirected, unweighted graphs in which $A_{ij} = A_{ji}$ and $A_{ii} = 0$, so there are no self-edges. In our real data analysis of a neural connectome in Section 6.3, however, we describe how to adapt our results to weighted and directed graphs.
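To make this generative model concrete, here is a minimal numpy sketch (our own illustration, not code from the original paper; the function name and the choice of latent-position distribution are ours) that samples a symmetric, hollow RDPG adjacency matrix from a given matrix of latent positions:

```python
import numpy as np

def sample_rdpg(X, rng=None):
    """Sample a symmetric, hollow RDPG adjacency matrix.

    X : (n, d) array whose rows are latent positions; every inner product
        X[i] @ X[j] is assumed to lie in [0, 1].
    """
    rng = np.random.default_rng(rng)
    P = X @ X.T                                    # edge-probability matrix P = X X^T
    n = P.shape[0]
    upper = np.triu(rng.random((n, n)) < P, k=1)   # independent Bernoulli(P_ij) draws, i < j
    A = upper.astype(int)
    return A + A.T                                 # symmetrize; the diagonal stays zero

# Example: latent positions drawn i.i.d. from a (hypothetical) distribution on [0.2, 0.8]
rng = np.random.default_rng(0)
X = rng.uniform(0.2, 0.8, size=(200, 1))           # d = 1, so every X_i^T X_j lies in [0, 1]
A = sample_rdpg(X, rng)
```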

In any latent position graph, the latent positions associated to graph vertices can themselves be random; for instance, the latent positions may be independent, identically distributed random variables with some distribution $F$ on $\mathbb{R}^d$. The well-known stochastic blockmodel (SBM), in which each vertex belongs to one of $K$ subsets known as blocks, with connection probabilities determined solely by block membership [44], can be represented as a random dot product graph in which all the vertices in a given block have the same latent positions (or, in the case of random latent positions, an RDPG for which the distribution $F$ is supported on a finite set). Despite their structural simplicity, stochastic block models are the building blocks for all independent-edge random graphs; [105] demonstrates that any independent-edge random graph can be well-approximated by a stochastic block model with a sufficiently large number of blocks. Since stochastic block models can themselves be viewed as random dot product graphs, we see that suitably high-dimensional random dot product graphs can provide accurate approximations of latent position graphs [99], and, in turn, independent-edge graphs. Thus, the architectural simplicity of the random dot product graph makes it particularly amenable to analysis, and its near-universality in graph approximation renders it expansively applicable. In addition, the cornerstone of our analysis of random dot product graphs is a set of classical probabilistic and linear algebraic techniques that are useful in much broader settings, such as random matrix theory. As such, the random dot product graph is both a rich and interesting object of study in its own right and a natural point of departure for wider graph inference.

A classical inference task for Euclidean data is to estimate, from sample data, certain underlying distributional parameters. Similarly, for a latent position graph, a classical graph inference task is to infer the graph parameters from an observation of the adjacency matrix $A$. Indeed, our overall paradigm for random graph inference is inspired by the fundamental tenets of classical statistical inference for Euclidean data. Namely, our goal is to construct methods and estimators of graph parameters or graph distributions; and, for these estimators, to analyze their (1) consistency; (2) asymptotic distributions; (3) asymptotic relative efficiency; (4) robustness to model misspecification; and (5) implications for subsequent inference, including one- and multi-sample hypothesis testing. In this paper, we summarize and synthesize a considerable body of work on spectral methods for inference in random dot product graphs, all of which not only advance fundamental tenets of this paradigm, but do so within a unified and parsimonious framework. The random graph estimators and test statistics we discuss all exploit the adjacency spectral embedding (ASE) or the Laplacian spectral embedding (LSE), which are eigendecompositions of the adjacency matrix $A$ and the normalized Laplacian matrix $L = D^{-1/2} A D^{-1/2}$, where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$.

The ambition and scope of our approach to graph inference means that mere upper bounds on discrepancies between parameters and their estimates will not suffice. Such bounds are legion. In our proofs of consistency, we improve several bounds of this type, and in some cases improve them so drastically that concentration inequalities and asymptotic limit distributions emerge in their wake. We stress that aside from specific cases (see [39], [102], [56]), limiting distributions for eigenvalues and eigenvectors of random graphs are notably elusive. For the adjacency and Laplacian spectral embeddings, we discuss not only consistency, but also asymptotic normality, robustness, and the use of the adjacency spectral embedding in the nascent field of multi-graph hypothesis testing. We illustrate how our techniques can be meaningfully applied to thorny and very sizable real data, improving on previously state-of-the-art methods for inference tasks such as community detection and classification in networks. What is more, as we now show, spectral graph embeddings are relevant to many complex and seemingly disparate aspects of graph inference.

A bird's-eye view of our methodology might well start with the stochastic blockmodel, where, for an SBM with a finite number of blocks of stochastically equivalent vertices, [90] and [34] show that $K$-means clustering of the rows of the adjacency spectral embedding accurately partitions the vertices into the correct blocks, even when the embedding dimension is misspecified or the number of blocks is unknown. Furthermore, [67] and [68] give a significant improvement in the misclassification rate, by exhibiting an almost-surely perfect clustering in which, in the limit, no vertices whatsoever are misclassified. For random dot product graphs more generally, [92] shows that the latent positions are consistently estimated by the embedding, which then allows for accurate learning in a supervised vertex classification framework. In [99] these results are extended to more general latent position models, establishing a powerful universal consistency result for vertex classification in general latent position graphs, and also exhibiting an efficient embedding of vertices which were not observed in the original graph. In [8] and [98], the authors supply distributional results, akin to a central limit theorem, for the adjacency and Laplacian spectral embeddings, respectively; the former leads to a nontrivially superior algorithm for the estimation of block memberships in a stochastic block model ([94]), and the latter resolves, through an elegant comparison of Chernoff information, a long-standing open question of the relative merits of the adjacency and Laplacian graph representations.

Moreover, graph embedding plays a central role in foundational work of Tang et al. [96] and [97] on two-sample graph comparison: these papers provide theoretically justified, valid and consistent hypothesis tests for the semiparametric problem of determining whether two random dot product graphs have the same latent positions and the nonparametric problem of determining whether two random dot product graphs have the same underlying distributions. This, then, yields a systematic framework for determining statistical similarity across graphs, which in turn underpins yet another provably consistent algorithm for the decomposition of random graphs with a hierarchical structure [68]. In [58], distributional results are given for an omnibus embedding of multiple random dot product graphs on the same vertex set, and this embedding performs well both for latent position estimation and for multi-sample graph testing. For the critical inference task of vertex nomination, in which the inference goal is to produce an ordering of vertices of interest (see, for instance, [24]), Fishkind and coauthors introduce in [32] an array of principled vertex nomination algorithms (the canonical, maximum likelihood, and spectral vertex nomination schemes) and demonstrate the algorithms' effectiveness on both synthetic and real data. In [66] the consistency of the maximum likelihood vertex nomination scheme is established, a scalable restricted version of the algorithm is introduced, and the algorithms are adapted to incorporate general vertex features.

Overall, we stress that these principled techniques for random dot product graphs exploit the Euclidean nature of graph embeddings but are general enough to yield meaningful results for a wide variety of random graphs. Because our focus is, in part, on spectral methods, and because the adjacency matrix of an independent-edge graph can be regarded as a noisy version of the matrix of probabilities $P$ [76], we rely on several classical results on matrix perturbations, most prominently the Davis-Kahan Theorem (see [9] for the theorem itself, [81] for an illustration of its role in graph inference, and [107] for a very useful variant). We also depend on the aforementioned spectral bounds of Oliveira in [76] and a more recent sharpening due to Lu and Peng [62]. We leverage probabilistic concentration inequalities, such as those of Hoeffding and Bernstein [103]. Finally, several of our results do require suitable eigengaps for $P$ and lower bounds on graph density, as measured by the maximum degree and the size of the smallest nonzero eigenvalue of $P$. It is important to point out that in our analysis, we assume that the embedding dimension $d$ of our graphs is known and fixed. In real data applications, such an embedding dimension is not known, and in Section 6.3, we discuss approaches (see [18] and [109]) to estimating the embedding dimension. Robustness of our procedures to errors in embedding dimension is a problem of current investigation.

In the study of stochastic blockmodels, there has been a recent push to understand the fundamental information-theoretic limits for community detection and graph partitioning [2, 72, 1, 71]. These bounds are typically algorithm-free and focus on stochastic blockmodels with constant or logarithmic average degree, in which differences between vertices in different blocks are assumed to be at the boundary of detectability. Our efforts have a somewhat different flavor, in that we seek to understand the precise behavior of a widely applicable procedure in a more general model. Additionally, we treat sparsity as a secondary concern, and typically do not broach the question of the exact limits of our procedures. Our spectral methods may not be optimal for stochastic models [52, 50] but they are very useful, in that they rely on well-optimized computational methods, can be implemented quickly in many standard languages, extend readily to other models, and serve as a foundation for more complex analyses.

Finally, we would be remiss not to point out that while spectral decompositions and clusterings of the adjacency matrix are appropriate for graph inference, they are also of considerable import in combinatorial graph theory: readers may recall, for instance, the combinatorial ratio-cut problem, whose objective is to partition the vertex set of a graph into two disjoint sets in a way that minimizes the number of edges between vertices in the two sets. The minimizer of a relaxation to the ratio-cut problem [31] is the eigenvector associated to the second smallest eigenvalue of the graph Laplacian $D - A$. While we do not pursue more specific combinatorial applications of spectral methods here, we note that [22] provides a comprehensive overview, and [63] gives an accessible tutorial on spectral methods.

We organize the paper as follows. In Section 2, we define random dot product graphs and the adjacency spectral embedding, and we recall important linear algebraic background. In Section 3, we review the core probabilistic and linear algebraic tools behind our proofs. In Sections 4 and 5, we discuss consistency, asymptotic normality, and hypothesis testing, as well as inference for hierarchical models. In Section 6, we discuss applications of these results to real data. Finally, in Section 7 we discuss current theoretical and computational difficulties and open questions, including issues of optimal embedding dimension, model limitations, robustness to errorful observations, and joint graph inference.

2 Definitions, notation, and background

2.1 Preliminaries and notation

We begin by establishing notation. For a positive integer $n$, we let $[n] = \{1, 2, \dots, n\}$. For a vector $v \in \mathbb{R}^d$, we let $\|v\|$ denote the Euclidean norm of $v$. We denote the identity matrix, the zero matrix, and the square matrix of all ones by $I$, $0$, and $J$, respectively. We use $\otimes$ to denote the Kronecker product. For a matrix $M$, we let $M_{ij}$ denote its $(i,j)$-th entry; we denote by $M_{\cdot j}$ the column vector formed by the $j$-th column of $M$; and we denote by $M_{i \cdot}$ the row vector formed by the $i$-th row of $M$. For a slight abuse of notation, we also let $M_i$ denote the column vector formed by transposing the $i$-th row of $M$. That is, $M_i = (M_{i \cdot})^{\top}$. Given any suitably specified ordering on the eigenvalues of a square matrix $M$, we let $\lambda_i(M)$ denote the $i$-th eigenvalue (under such an ordering) of $M$ and $\sigma_i(M)$ the $i$-th singular value of $M$. We let $\|M\|$ denote the spectral norm of $M$ and $\|M\|_F$ denote the Frobenius norm of $M$. We let $\|M\|_{2 \to \infty}$ denote the maximum of the Euclidean norms of the rows of $M$, i.e. $\|M\|_{2 \to \infty} = \max_i \|M_{i \cdot}\|$. We denote the trace of a matrix $M$ by $\mathrm{tr}(M)$. For an $n \times n$ symmetric matrix $M$ whose entries are all non-negative and whose rank is $d$, we will frequently have to account for terms related to matrix sparsity, and we define $\delta(M)$ and $\gamma(M)$ as follows:

$$\delta(M) = \max_{i \in [n]} \sum_{j=1}^{n} M_{ij}, \qquad \gamma(M) = \frac{\lambda_d(M)}{\delta(M)}. \qquad (1)$$

In a number of cases, we need to consider a sequence of matrices. We will denote such a sequence by $\{M^{(n)}\}_{n \geq 1}$, where $n$ is typically used to denote the index of the sequence. The distinction between a particular element in a sequence of matrices and a particular row of a matrix will be clear from context, and our convention is typically to use $n$ to denote the index of a sequence and $i$ or $j$ to denote a particular row of a matrix. In the case where we need to consider the $i$-th row of a matrix that is itself the $n$-th element of a sequence, we will use the notation $M^{(n)}_i$.

We define a graph $G$ to be an ordered pair $G = (V, E)$, where $V$ is the so-called vertex or node set, and $E$, the set of edges, is a subset of the Cartesian product $V \times V$. In a graph whose vertex set has cardinality $n$, we will usually represent $V$ as $V = [n] = \{1, 2, \dots, n\}$, and we say there is an edge between $i$ and $j$ if $(i, j) \in E$. The adjacency matrix $A$ provides a compact representation of such a graph:

$$A_{ij} = 1 \ \text{if}\ (i, j) \in E, \qquad A_{ij} = 0 \ \text{otherwise.}$$

Where there is no danger of confusion, we will often refer to a graph and its adjacency matrix interchangeably.

Our focus is random graphs, and thus we will let $(\Omega, \mathcal{F}, \mathbb{P})$ denote our probability space, with sample space $\Omega$, $\sigma$-algebra $\mathcal{F}$ of subsets of $\Omega$, and probability measure $\mathbb{P}$. We will denote the expectation of a (potentially multi-dimensional) random variable with respect to this measure by $\mathbb{E}$. Given an event $B \in \mathcal{F}$, we denote its complement by $B^c$, and we let $\mathbb{P}(B)$ denote the probability of $B$. As we will see, in many cases we can choose $\Omega$ to be a subset of Euclidean space. Because we are interested in large-graph inference, we will frequently need to demonstrate that probabilities of certain events decay at specified rates. This motivates the following definition.

Definition 1 (Convergence asymptotically almost surely and convergence with high probability).

Given a sequence of events $(E_n)$, where $n = 1, 2, \dots$, we say that $E_n$ occurs asymptotically almost surely if $\mathbb{P}(E_n) \to 1$ as $n \to \infty$. We say that $E_n$ occurs with high probability, and write $E_n$ w.h.p., if for any $c > 0$, there exists a finite positive constant $n_0$ depending on $c$ such that $\mathbb{P}(E_n^c) \leq n^{-c}$ for all $n \geq n_0$. We note that $E_n$ occurring w.h.p. is stronger than $E_n$ occurring asymptotically almost surely. Moreover, $E_n$ occurring with high probability implies, by the Borel-Cantelli Lemma [23], that with probability $1$ there exists an $n_1$ such that $E_n$ holds for all $n \geq n_1$.

Moreover, since our goal is often to understand large-graph inference, we need to consider asymptotics as a function of the graph size $n$. As such, we recall familiar asymptotic notation:

Definition 2 (Asymptotic notation).

If $f_n$ is a quantity depending on $n$, we will say that $f_n$ is of order $g_n$ and use the notation $f_n = \Theta(g_n)$ to denote that there exist positive constants $c_1$ and $c_2$ such that for $n$ sufficiently large,

$$c_1 g_n \leq f_n \leq c_2 g_n.$$

When the quantity $f_n$ is clear and $f_n = \Theta(g_n)$, we sometimes simply write "$f_n$ is of order $g_n$". We write $f_n = O(g_n)$ if there exists a constant $C > 0$ such that for $n$ sufficiently large, $f_n \leq C g_n$. We write $f_n = o(g_n)$ if $f_n / g_n \to 0$ as $n \to \infty$, and $f_n = \omega(g_n)$ if $f_n / g_n \to \infty$ as $n \to \infty$. We write $f_n = \Omega(g_n)$ if there exists a constant $c > 0$ such that for all $n$ sufficiently large, $f_n \geq c g_n$.

Throughout, we will use $C > 0$ to denote a constant, not depending on $n$, which may vary from one line to another.

2.2 Models

Since our focus is on $d$-dimensional random dot product graphs, we first define an inner product distribution as a probability distribution over a suitable subset of $\mathbb{R}^d$, as follows:

Definition 3 ($d$-dimensional inner product distribution).

Let $F$ be a probability distribution whose support is given by $\mathcal{X} = \mathrm{supp}(F) \subseteq \mathbb{R}^d$. We say that $F$ is a $d$-dimensional inner product distribution on $\mathbb{R}^d$ if for all $x, y \in \mathcal{X}$, we have $x^{\top} y \in [0, 1]$.

Next, we define a random dot product graph as an independent-edge random graph for which the edge probabilities are given by the dot products of the latent positions associated to the vertices. We restrict our attention here to graphs that are undirected and in which no vertex has an edge to itself.

Definition 4 (Random dot product graph with distribution $F$).

Let $F$ be a $d$-dimensional inner product distribution with $X_1, X_2, \dots, X_n \overset{\mathrm{i.i.d.}}{\sim} F$, collected in the rows of the matrix $X = [X_1, X_2, \dots, X_n]^{\top} \in \mathbb{R}^{n \times d}$. Suppose $A$ is a random adjacency matrix given by

$$\mathbb{P}[A \mid X] = \prod_{i < j} (X_i^{\top} X_j)^{A_{ij}} (1 - X_i^{\top} X_j)^{1 - A_{ij}}. \qquad (2)$$

We then write $(A, X) \sim \mathrm{RDPG}(F)$ and say that $A$ is the adjacency matrix of a random dot product graph of dimension (or rank) at most $d$ and with latent positions given by the rows of $X$. If $P = X X^{\top}$ is, in fact, a rank $d$ matrix, we say $A$ is the adjacency matrix of a rank $d$ random dot product graph.

While our notation for a random dot product graph with distribution $F$ is $(A, X) \sim \mathrm{RDPG}(F)$, we emphasize that in this paper the latent positions $X$ are always assumed to be unobserved. An almost identical definition holds for random dot product graphs with fixed but unobserved latent positions:

Definition 5 (RDPG with fixed latent positions).

In Definition 4 given above, the latent positions are themselves random. If, instead, the latent positions are given by a fixed matrix $X \in \mathbb{R}^{n \times d}$ and, given this matrix, the graph is generated according to Eq. (2), we say that $A$ is a realization of a random dot product graph with latent positions $X$, and we write $A \sim \mathrm{RDPG}(X)$.

Remark 1 (Nonidentifiability).

Given a graph distributed as an RDPG, the natural task is to recover the latent positions that gave rise to the observed graph. However, the RDPG model has an inherent nonidentifiability: let $X \in \mathbb{R}^{n \times d}$ be a matrix of latent positions and let $W \in \mathbb{R}^{d \times d}$ be an orthogonal matrix. Since $(XW)(XW)^{\top} = X X^{\top}$, it is clear that the latent positions $X$ and $XW$ give rise to the same distribution over graphs in Equation (2). Note that most latent position models, as defined below, also suffer from similar types of non-identifiability, as edge probabilities may be invariant to various transformations of the latent positions.

As we mentioned, the random dot product graph is a specific instance of the more general latent position random graph with link or kernel function $\kappa$. Indeed, the latent positions themselves need not belong to Euclidean space per se, and the link function need not be an inner product.

Definition 6 (Latent position random graph with kernel $\kappa$).

Let $\mathcal{X}$ be a set and $\kappa: \mathcal{X} \times \mathcal{X} \to [0, 1]$ a symmetric function. Suppose to each $i \in [n]$ there is associated a point $x_i \in \mathcal{X}$. Given $x_1, \dots, x_n$, consider the graph with adjacency matrix $A$ defined by

$$\mathbb{P}[A \mid x_1, \dots, x_n] = \prod_{i < j} \kappa(x_i, x_j)^{A_{ij}} \bigl(1 - \kappa(x_i, x_j)\bigr)^{1 - A_{ij}}. \qquad (3)$$

Then $A$ is the adjacency matrix of a latent position random graph with latent positions $x_1, \dots, x_n$ and link function $\kappa$.

Similarly, we can define independent edge graphs for which latent positions need not play a role.

Definition 7 (Independent-edge graphs).

For an $n \times n$ symmetric matrix $P$ of probabilities, we say that $A$ is distributed as an independent-edge graph with probabilities $P$ if

$$\mathbb{P}[A \mid P] = \prod_{i < j} P_{ij}^{A_{ij}} (1 - P_{ij})^{1 - A_{ij}}. \qquad (4)$$

By their very structure, latent position random graphs, for fixed latent positions, are independent-edge random graphs. In general, for any latent position graph the matrix of edge probabilities is given by $P_{ij} = \kappa(x_i, x_j)$. Of course, in the case of a random dot product graph with latent position matrix $X$, the probability of observing an edge between vertex $i$ and vertex $j$ is simply $X_i^{\top} X_j$. Thus, for an RDPG with latent positions $X$, the matrix $P$ is given by $P = X X^{\top}$.

In order to more carefully relate latent position models and RDPGs, we can consider the set of positive semidefinite latent position graphs. Namely, we will say that a latent position random graph is positive semidefinite if the matrix $P = [\kappa(x_i, x_j)]$ is positive semidefinite. In this case, we note that an RDPG can be used to approximate the latent position random graph distribution. The best rank-$d$ approximation of $P$, in terms of the Frobenius norm [28], will correspond to an RDPG with $d$-dimensional latent positions. In this sense, by allowing $d$ to be as large as necessary, any positive semidefinite latent position random graph distribution can be approximated by an RDPG distribution to arbitrary precision [100].
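As a concrete illustration of this approximation step (a sketch of our own, using a hypothetical positive semidefinite kernel matrix), the following extracts $d$-dimensional latent positions from a positive semidefinite matrix $P$ via a truncated eigendecomposition, so that the product of the result with its transpose is the best rank-$d$ Frobenius-norm approximation of $P$:

```python
import numpy as np

def latent_positions_from_P(P, d):
    """Return an (n, d) matrix Xd such that Xd @ Xd.T is the best rank-d
    (Frobenius-norm) approximation of the positive semidefinite matrix P."""
    evals, evecs = np.linalg.eigh(P)             # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:d]            # indices of the d largest eigenvalues
    top = np.clip(evals[idx], 0.0, None)         # guard against tiny negative round-off
    return evecs[:, idx] * np.sqrt(top)

# Example with a hypothetical positive semidefinite kernel matrix
rng = np.random.default_rng(1)
Z = rng.uniform(0.1, 0.9, size=(100, 3))
P = Z @ Z.T / 3.0                                # PSD, with entries in [0, 1]
X3 = latent_positions_from_P(P, d=3)
print(np.linalg.norm(P - X3 @ X3.T))             # ~0 here, since P already has rank 3
```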

While latent position models generalize the random dot product graph, RDPGs can be easily related to the more limited stochastic blockmodel graph [44]. The stochastic blockmodel is also an independent-edge random graph whose vertex set is partitioned into $K$ groups, called blocks, and it is typically parameterized by (1) a $K \times K$ matrix $B$ of probabilities of adjacencies between vertices in each of the blocks, and (2) a block-assignment vector $\tau: [n] \to [K]$ which assigns each vertex to its block. That is, for any two vertices $i, j \in [n]$, the probability of their connection is $P_{ij} = B_{\tau(i), \tau(j)}$, and we typically write $A \sim \mathrm{SBM}(B, \tau)$. Here we present an alternative definition in terms of the RDPG model.

Definition 8 (Positive semidefinite $K$-block stochastic block model).

We say an RDPG with latent positions $X$ is an SBM with $K$ blocks if the number of distinct rows in $X$ is $K$; we denote the distinct rows by $\nu_1, \dots, \nu_K \in \mathbb{R}^d$. In this case, we define the block membership function $\tau: [n] \to [K]$ to be a function such that $\tau(i) = \tau(j)$ if and only if $X_i = X_j$. We then write $A \sim \mathrm{SBM}(B, \tau)$, where $B_{k\ell} = \nu_k^{\top} \nu_\ell$.

In addition, we also consider the case of a stochastic block model in which the block membership of each vertex is randomly assigned. More precisely, let $\pi \in [0,1]^K$ with $\sum_{k=1}^K \pi_k = 1$, and suppose that the block memberships $\tau(1), \dots, \tau(n)$ are now i.i.d. random variables with distribution $\pi$, i.e., $\mathbb{P}[\tau(i) = k] = \pi_k$ for all $k \in [K]$. Then we say $A$ is an SBM with i.i.d. block memberships, and we write $A \sim \mathrm{SBM}(B, \pi)$.
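To connect the two parameterizations concretely, the following sketch (our own, with an illustrative choice of $B$ and $\pi$ rather than any values from the paper) converts a positive semidefinite block probability matrix into block latent positions via its eigendecomposition and assigns each vertex the latent position of its block; sampling then proceeds exactly as for any RDPG:

```python
import numpy as np

def sbm_to_rdpg_latent(B, tau):
    """Map a positive semidefinite K x K block matrix B and block labels tau
    (values in 0..K-1) to RDPG latent positions with d = rank(B)."""
    evals, evecs = np.linalg.eigh(B)
    keep = evals > 1e-12                         # nonzero eigenvalues of the PSD matrix B
    nu = evecs[:, keep] * np.sqrt(evals[keep])   # row k of nu is the latent position of block k
    return nu[tau]                               # each vertex inherits its block's position

# Illustrative two-block example (the values of B and pi here are hypothetical)
B = np.array([[0.5, 0.2],
              [0.2, 0.4]])                       # positive semidefinite block probabilities
pi = np.array([0.6, 0.4])
rng = np.random.default_rng(2)
tau = rng.choice(2, size=300, p=pi)              # i.i.d. block memberships
X = sbm_to_rdpg_latent(B, tau)                   # now X @ X.T has entries B[tau[i], tau[j]]
```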

We also consider the degree-corrected stochastic block model:

Definition 9 (Degree-Corrected Stochastic Blockmodel (DCSBM) [48]).

We say an RDPG is a DCSBM with $K$ blocks if there exist unit vectors $\nu_1, \dots, \nu_K \in \mathbb{R}^d$ such that for each $i \in [n]$, there exist $k \in [K]$ and $c_i \in (0, 1]$ such that $X_i = c_i \nu_k$.

Remark 2.

The degree-corrected stochastic blockmodel is inherently more flexible than the standard SBM because it allows for vertices within each block/community to have different expected degrees. This flexibility has made it a popular choice for modeling network data [48].

Definition 10 (Mixed Membership Stochastic Blockmodel (MMSBM) [3]).

We say an RDPG is an MMSBM with $K$ blocks if there exist unit vectors $\nu_1, \dots, \nu_K \in \mathbb{R}^d$ such that for each $i \in [n]$, there exists $\alpha_i \in [0, 1]^K$ such that $\sum_{k=1}^K \alpha_{ik} = 1$ and $X_i = \sum_{k=1}^K \alpha_{ik} \nu_k$.

Remark 3.

The mixed membership SBM is again more general than the SBM, allowing each vertex to belong to a mixture of different blocks. Additionally, note that every RDPG is an MMSBM for some choice of the parameters.

Our next theorem summarizes the relationship between these models.

Theorem 1.

Considered as statistical models for graphs, i.e. sets of probability distributions on graphs, the positive-semidefinite $K$-block SBM is a subset of the $K$-block DCSBM and the $K$-block MMSBM. Both the positive semidefinite $K$-block DCSBM and $K$-block MMSBM are subsets of the RDPG model with $d$-dimensional latent positions. Finally, the union of all possible RDPG models, without restriction of latent position dimension, is dense in the set of positive semidefinite latent position models.

2.3 Embeddings

Since we rely on spectral decompositions, we begin by describing the notation for the spectral decomposition of the rank-$d$ positive semidefinite matrix $P = X X^{\top}$.

Definition 11 (Spectral Decomposition of $P$).

Since $P = X X^{\top}$ is symmetric and positive semidefinite of rank $d$, let $P = U_P S_P U_P^{\top}$ denote its (truncated) spectral decomposition, with $U_P \in \mathbb{R}^{n \times d}$ having orthonormal columns and $S_P \in \mathbb{R}^{d \times d}$ a diagonal matrix with nonincreasing entries $(S_P)_{11} \geq (S_P)_{22} \geq \dots \geq (S_P)_{dd} > 0$.

As with the spectral decomposition of the matrix $P$, given an adjacency matrix $A$, we define its adjacency spectral embedding as follows:

Definition 12 (Adjacency spectral embedding (ASE)).

Given a positive integer $d \geq 1$, the adjacency spectral embedding (ASE) of $A$ into $\mathbb{R}^d$ is given by $\hat{X} = U_A S_A^{1/2}$, where $S_A \in \mathbb{R}^{d \times d}$ is the diagonal matrix of the $d$ largest eigenvalues of $A$ (in nonincreasing order) and $U_A \in \mathbb{R}^{n \times d}$ is the matrix whose columns are the corresponding orthonormal eigenvectors.
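A minimal numpy sketch of Definition 12 (our own implementation, assuming a symmetric adjacency matrix stored as a dense array): keep the $d$ largest eigenvalues of $A$ and scale the corresponding eigenvectors by their square roots.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """ASE of a symmetric adjacency matrix A into R^d: Xhat = U_A S_A^{1/2},
    built from the d largest eigenvalues of A and their eigenvectors."""
    evals, evecs = np.linalg.eigh(A)             # A is symmetric, so eigh applies
    idx = np.argsort(evals)[::-1][:d]            # the d largest (most positive) eigenvalues
    S = np.clip(evals[idx], 0.0, None)           # nonnegative with high probability (Remark 5)
    return evecs[:, idx] * np.sqrt(S)
```

Up to an orthogonal transformation of $\mathbb{R}^d$, the rows of the returned matrix serve as estimates of the latent positions $X_i$.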

Remark 4.

The intuition behind the notion of adjacency spectral embedding is as follows. Given the goal of estimating $X$, had we observed $P$, then the spectral embedding of $P$, given by $U_P S_P^{1/2}$, would be an orthogonal transformation of $X$. Of course, $P$ is not observed; instead we observe $A$, a noisy version of $P$. The ASE will be a good estimate of $X$ provided that the noise $A - P$ does not greatly impact the embedding. As we will see shortly, one can show that, with high probability, $\|A - P\|$ is of order $\sqrt{\delta(P)}$, up to logarithmic factors [76, 62, 103, 57]. That is to say, $A$ can be viewed as a "small" perturbation of $P$. Weyl's inequality or the Kato-Temple inequality [16, 49] then yield that the eigenvalues of $A$ are "close" to the eigenvalues of $P$. In addition, by the Davis-Kahan theorem [26], the subspace spanned by the top $d$ eigenvectors of $A$ is well-approximated by the subspace spanned by the top $d$ eigenvectors of $P$.

We also define the analogous Laplacian spectral embedding which uses the spectral decomposition of the normalized Laplacian matrix.

Definition 13 (Laplacian spectral embedding (LSE)).

Let $L = D^{-1/2} A D^{-1/2}$ denote the normalized Laplacian of $A$, where $D$ is the diagonal matrix whose diagonal entries are $D_{ii} = \sum_j A_{ij}$. Given a positive integer $d \geq 1$, the Laplacian spectral embedding (LSE) of $A$ into $\mathbb{R}^d$ is given by $\breve{X} = U_L S_L^{1/2}$, where $S_L \in \mathbb{R}^{d \times d}$ is the diagonal matrix containing the $d$ largest eigenvalues of $L$ on the diagonal and $U_L \in \mathbb{R}^{n \times d}$ is the matrix whose columns are the corresponding orthonormal eigenvectors.
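The LSE admits an equally short sketch (again our own, mirroring the ASE code above); the only change is that the normalized Laplacian $D^{-1/2} A D^{-1/2}$ is decomposed in place of $A$:

```python
import numpy as np

def laplacian_spectral_embedding(A, d):
    """LSE of a symmetric adjacency matrix A into R^d via L = D^{-1/2} A D^{-1/2}."""
    deg = A.sum(axis=1).astype(float)
    inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    inv_sqrt[nz] = 1.0 / np.sqrt(deg[nz])        # isolated vertices are mapped to zero
    L = inv_sqrt[:, None] * A * inv_sqrt[None, :]
    evals, evecs = np.linalg.eigh(L)
    idx = np.argsort(evals)[::-1][:d]
    return evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0.0, None))
```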

Finally, there are a variety of other matrices to which spectral decompositions may be applied to yield an embedding of the graph [54]. These are often dubbed regularized embeddings, and they seek to improve the stability of spectral methods in order to accommodate sparser graphs. While we do not analyze these embeddings directly, many of our approaches can be adapted to them.

3 Core proof techniques: probabilistic and linear algebraic bounds

In this section, we give an overview of the core background results used in our proofs. Several of our results on consistency and normality of the adjacency spectral embedding depend on a triumvirate of tools: matrix concentration inequalities, the Davis-Kahan Theorem, and detailed bounds derived via the power method.

3.1 Concentration inequalities

Concentration inequalities for real- and matrix-valued data are a critical component to our proofs of consistency for spectral estimates. We make use of classical inequalities, such as Hoeffding’s inequality, for real-valued random variables, and we also exploit more recent work on the concentration of sums of random matrices and matrix martingales around their expectation. For a careful study of several important matrix concentration inequalities, see [103].

We begin by recalling Hoeffding’s inequality, which bounds the deviations between a sample mean of independent random variables and the expected value of that sample mean.

Theorem 2 (Hoeffding's inequality).

Let $Y_1, \dots, Y_n$ be independent, bounded random variables defined on some probability space, and suppose $a_i \leq b_i$ are real numbers such that $a_i \leq Y_i \leq b_i$ almost surely. Let $\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i$ be their sample mean. Then for any $t > 0$,

$$\mathbb{P}\bigl[\bar{Y} - \mathbb{E}[\bar{Y}] \geq t\bigr] \leq \exp\!\left( - \frac{2 n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right) \qquad (5)$$

and

$$\mathbb{P}\bigl[\,\bigl|\bar{Y} - \mathbb{E}[\bar{Y}]\bigr| \geq t\bigr] \leq 2 \exp\!\left( - \frac{2 n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right). \qquad (6)$$

For an undirected, hollow RDPG with probability matrix $P$, we have $\mathbb{E}[A_{ij}] = P_{ij}$ for all $i \neq j$. As such, one can regard $A$ as a "noisy" version of $P$. It is tempting to believe that $A$ and $P$ are close in terms of the Frobenius norm, but this is sadly not true; indeed, it is easy to see that $\mathbb{E}\,\|A - P\|_F^2 = \sum_{i \neq j} P_{ij}(1 - P_{ij})$, which typically grows with the number of potential edges. To overcome this using only Hoeffding's inequality, we can instead consider the difference $A^2 - P^2$, whose entries are (essentially) sums of independent random variables. Hence, Hoeffding's inequality implies that each entry of $A^2 - P^2$ is, with high probability, of order $\sqrt{n \log n}$, so that $\|A^2 - P^2\|_F$ is of order $n^{3/2}\sqrt{\log n}$. Since the eigenvectors of $P^2$ and $P$ coincide, this is itself sufficient to show concentration of the adjacency spectral embedding [91, 81]. However, somewhat stronger and more elegant results can be shown by considering the spectral norm instead. In particular, a nontrivial body of recent work on matrix concentration implies that, under certain assumptions on the sparsity of $P$, the spectral norm of $A - P$ can be well-controlled. We focus on the following important result of Oliveira [76] and Tropp [103] and further improvements of Lu and Peng [62] and Lei and Rinaldo [57], all of which establish that $A$ and $P$ are close in spectral norm.

Theorem 3 (Spectral norm control of $\|A - P\|$ from [76, 103]).

Let $A$ be the adjacency matrix of an independent-edge random graph on $n$ vertices with matrix of edge probabilities $P$. For any constant $c > 0$, there exists another constant $C > 0$, independent of $n$ and $P$, such that if $\delta(P) \geq C \log n$, then for all $n$ sufficiently large,

$$\mathbb{P}\Bigl[\|A - P\| \leq C \sqrt{\delta(P)\, \log n}\Bigr] \geq 1 - n^{-c}. \qquad (7)$$

In [62], the authors give an improvement under slightly stronger density assumptions (a similar bound is provided in [57], with a slightly different sparsity parameter and a correspondingly modified density assumption):

Theorem 4 (Spectral norm control of $\|A - P\|$ [62]).

With notation as above, suppose there exist positive constants $c_1$ and $c_2$ such that for $n$ sufficiently large, $\delta(P) \geq c_1 (\log n)^{c_2}$. Then for any $c > 0$ there exists a constant $C$ depending on $c$ such that

$$\mathbb{P}\Bigl[\|A - P\| \leq C \sqrt{\delta(P)}\Bigr] \geq 1 - n^{-c}. \qquad (8)$$
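As a quick numerical sanity check of these concentration results (our own experiment with arbitrary latent-position choices, not one reported in the surveyed works), the ratio $\|A - P\| / \sqrt{\delta(P)}$ should remain bounded as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(3)
for n in (500, 1000, 2000):
    X = rng.uniform(0.3, 0.7, size=(n, 2)) / np.sqrt(2)   # inner products stay in [0, 1]
    P = X @ X.T
    A = np.triu(rng.random((n, n)) < P, k=1).astype(int)
    A = A + A.T                                            # symmetric, hollow adjacency matrix
    delta = P.sum(axis=1).max()                            # (approximately) the max expected degree
    print(n, np.linalg.norm(A - P, ord=2) / np.sqrt(delta))   # ratio stays O(1) as n grows
```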

3.2 Matrix perturbations and spectral decompositions

The above results formalize our intuition that $A$ provides a "reasonable" estimate for $P$. Moreover, in the RDPG case, where $P$ is of low rank and is necessarily positive semidefinite, these results have implications for the nonnegativity of the top eigenvalues of $A$. Specifically, we use Weyl's Theorem to infer bounds on the differences between the eigenvalues of $A$ and $P$ from the spectral norm of their difference, and the Gershgorin Disks Theorem to infer lower bounds on the maximum row sums of $P$ from assumptions on the eigengap of $P$ (since both $A$ and $P$ are nonnegative matrices, one could also obtain the same lower bounds by invoking the Perron-Frobenius Theorem). For completeness, we recall the Gershgorin Disks Theorem and Weyl's Theorem. The former relates the eigenvalues of a matrix to the sums of the absolute values of the entries in each row, and the latter establishes bounds on the differences in eigenvalues between a matrix and a perturbation of it.

Theorem 5 (Gershgorin Disks [45]).

Let $M$ be an $n \times n$ complex matrix with entries $M_{ij}$. For $i \in [n]$, let $R_i = \sum_{j \neq i} |M_{ij}|$. Let the $i$-th Gershgorin disk $D_i$ be the closed disk centered at $M_{ii}$ with radius $R_i$. Then every eigenvalue of $M$ lies within at least one of the Gershgorin disks $D_1, \dots, D_n$.

Theorem 6 (Weyl [45]).

Let $M$ and $H$ be $n \times n$ Hermitian matrices. Suppose $M$ and $H$ have eigenvalues $\lambda_1 \geq \dots \geq \lambda_n$ and $\mu_1 \geq \dots \geq \mu_n$, respectively, and suppose the eigenvalues of $M + H$ are given by $\nu_1 \geq \dots \geq \nu_n$. Then for all $i \in [n]$,

$$\lambda_i + \mu_n \leq \nu_i \leq \lambda_i + \mu_1,$$

and in particular $|\nu_i - \lambda_i| \leq \|H\|$.

From our random graph model assumptions and our graph density assumptions, we can conclude that, with high probability and for sufficiently large $n$, the top $d$ eigenvalues of $A$ will be nonnegative:

Remark 5 (Nonnegativity of the top $d$ eigenvalues of $A$).

Suppose $A \sim \mathrm{RDPG}(X)$. Since $P = X X^{\top}$, it is necessarily positive semidefinite, and thus has nonnegative eigenvalues. If we now assume that $\gamma(P) \geq c_0$ for some constant $c_0 > 0$, then this assumption, along with the Gershgorin Disks Theorem, guarantees that the top $d$ eigenvalues of $P$ are all of order $\delta(P)$, and our rank assumption on $P$ mandates that the remaining eigenvalues be zero. If $\delta(P) \geq c_1 (\log n)^{c_2}$, the spectral norm bound in (8) applies, ensuring that $\|A - P\| \leq C \sqrt{\delta(P)}$ for $n$ sufficiently large, with high probability. Thus, by Weyl's inequality, we see that the top $d$ eigenvalues of $A$ are, with high probability, of order $\delta(P)$, and the remaining eigenvalues are, with high probability, within $C\sqrt{\delta(P)}$ of zero.

Since $P = U_P S_P U_P^{\top}$ and $A$ is close to $P$, it is intuitively appealing to conjecture that, in fact, the adjacency spectral embedding $\hat{X} = U_A S_A^{1/2}$ should be close to some rotation of $X$. That is, if $X$ is the matrix of true latent positions, so that $X = U_P S_P^{1/2} W$ for some orthogonal matrix $W$, then it is plausible that $\|\hat{X} - X W'\|$ ought to be comparatively small for a suitable orthogonal matrix $W'$. To make this precise, however, we need to understand how both eigenvalues and eigenvectors of a matrix behave when the matrix is perturbed. Weyl's inequality [45] addresses the former. The impact of matrix perturbations on associated eigenspaces is significantly more complicated, and the Davis-Kahan Theorem [26, 9] provides one approach to the latter. The Davis-Kahan Theorem has a significant role in several approaches to spectral estimation for graphs: for example, Rohe, Chatterjee, and Yu leverage it in [81] to prove the accuracy of spectral estimates in high-dimensional stochastic blockmodels. The version we give below is from [107], which is a user-friendly guide to the Davis-Kahan Theorem and its statistical implications.

The Davis-Kahan Theorem is often stated as a result on canonical angles between subspaces. To that end, we recall that if $V_1$ and $V_2$ are two $n \times d$ matrices with orthonormal columns, then we define the vector of canonical or principal angles between their column spaces to be the vector $\Theta(V_1, V_2) = (\theta_1, \dots, \theta_d)$ such that

$$\theta_i = \cos^{-1}(\sigma_i),$$

where $\sigma_1 \geq \dots \geq \sigma_d$ are the singular values of $V_1^{\top} V_2$. We define the matrix $\sin \Theta(V_1, V_2)$ to be the $d \times d$ diagonal matrix for which $\bigl(\sin \Theta(V_1, V_2)\bigr)_{ii} = \sin \theta_i$.

Theorem 7 (A variant of Davis-Kahan [107]).

Suppose $M$ and $\hat{M}$ are two $n \times n$ symmetric matrices with real entries, with spectra given by $\lambda_1 \geq \dots \geq \lambda_n$ and $\hat{\lambda}_1 \geq \dots \geq \hat{\lambda}_n$, respectively, and let $v_1, \dots, v_n$ and $\hat{v}_1, \dots, \hat{v}_n$ denote their corresponding orthonormal eigenvectors. Let $d < n$ be fixed, let $V$ be the $n \times d$ matrix whose columns are the eigenvectors $v_1, \dots, v_d$, and similarly let $\hat{V}$ be the matrix whose columns are the eigenvectors $\hat{v}_1, \dots, \hat{v}_d$. Then

$$\bigl\| \sin \Theta(\hat{V}, V) \bigr\|_F \leq \frac{2 \sqrt{d}\, \|\hat{M} - M\|}{\lambda_d - \lambda_{d+1}}.$$

Observe that if we assume that $P$ is of rank $d$ and has a sufficient eigengap $\lambda_d(P) > 0$, the Davis-Kahan Theorem gives us an immediate bound on the spectral norm of the difference between the projections $U_A U_A^{\top}$ and $U_P U_P^{\top}$ in terms of this eigengap and the spectral norm of $A - P$, namely:

$$\bigl\| U_A U_A^{\top} - U_P U_P^{\top} \bigr\| \leq \frac{C\, \|A - P\|}{\lambda_d(P)}.$$

Recall that the Frobenius norm of a matrix $M$ satisfies $\|M\| \leq \|M\|_F$, and further that if $M$ is of rank $r$, then $\|M\|_F \leq \sqrt{r}\, \|M\|$, and hence for low-rank matrices, spectral norm bounds are easily translated into bounds on the Frobenius norm. It is worth noting that [81] guarantees that a difference in projections can be transformed into a difference between the eigenvectors themselves: namely, given the above bound for $\|U_A U_A^{\top} - U_P U_P^{\top}\|$, there is a constant $C$ and an orthonormal matrix $W$ such that

$$\bigl\| U_A - U_P W \bigr\|_F \leq C\, \bigl\| U_A U_A^{\top} - U_P U_P^{\top} \bigr\|_F. \qquad (9)$$
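The following sketch (our own) computes the principal angles between the top-$d$ eigenspaces of $A$ and $P$ from the singular values of $U_A^{\top} U_P$ and compares $\|\sin\Theta\|_F$ with the Davis-Kahan-style bound above; the simulation parameters are illustrative only:

```python
import numpy as np

def top_eigvecs(M, d):
    evals, evecs = np.linalg.eigh(M)
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:d]], evals[order]

rng = np.random.default_rng(4)
n, d = 1000, 2
X = rng.uniform(0.3, 0.7, size=(n, d)) / np.sqrt(d)
P = X @ X.T
A = np.triu(rng.random((n, n)) < P, k=1).astype(int)
A = A + A.T

U_A, _ = top_eigvecs(A, d)
U_P, evals_P = top_eigvecs(P, d)
cosines = np.linalg.svd(U_A.T @ U_P, compute_uv=False)    # cosines of the principal angles
sin_theta = np.sqrt(np.clip(1.0 - cosines**2, 0.0, None))
lhs = np.linalg.norm(sin_theta)                           # ||sin Theta(U_A, U_P)||_F
rhs = 2 * np.sqrt(d) * np.linalg.norm(A - P, ord=2) / (evals_P[d - 1] - evals_P[d])
print(lhs, rhs)                                           # lhs is (much) smaller than rhs
```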

While these important results provide the backbone for much of our theory, the detailed bounds and distributional results described in the next section rely on a decomposition of $\hat{X} W - X$, for a suitable orthogonal $W$, into a leading term of the form $(A - P) X (X^{\top} X)^{-1}$ plus a remainder. This first term can be viewed as an application of the power method for finding eigenvectors, and standard univariate and multivariate concentration inequalities and distributional results can be readily applied to it. The remainder term, on the other hand, can be shown to be of smaller order than the first, and much of the technical challenge of this work lies in carefully bounding the remainder term.

4 Spectral embeddings and estimation for RDPGs

There is a wealth of literature on spectral methods for estimating model parameters in random graphs, dating back more than half a century to estimation in simple Erdős-Rényi models. More specifically, for Erdős-Rényi graphs, we would be remiss not to point to Füredi and Komlós's classic work [39] on the eigenvalues and eigenvectors of the adjacency matrix of an ER graph. Again, despite their model simplicity, Erdős-Rényi graphs veritably teem with open questions; to cite but one example, in a very recent manuscript, Arias-Castro and Verzelen [7] address, in the framework of hypothesis testing, the question of subgraph detection within an ER graph.

Moving up to the slightly more heterogeneous stochastic block model, we again find a rich literature on consistent block estimation. Fortunato [36] provides an overview of partitioning techniques for graphs in general; consistent partitioning of stochastic block models with two blocks was accomplished by Snijders and Nowicki [87], and with equal-sized blocks by Condon and Karp in 2001. For the more general case, Bickel and Chen [11] demonstrate a stronger version of consistency via maximizing Newman-Girvan modularity [74] and other modularities. For a growing number of blocks, Choi et al. [14] prove consistency of likelihood-based methods, and Bickel et al. [13] provide a method to consistently estimate the stochastic block model parameters using subgraph counts and degree distributions. This work and the work of Bickel and Chen [12] both consider the case of very sparse graphs. In [3], Airoldi et al. define the important generalization of a mixed-membership stochastic blockmodel, in which block membership may change depending on vertex-to-vertex interactions, and the authors demonstrate methods of inference for the mixed membership and block probabilities.

Rohe, Chatterjee and Yu show in [81] that spectral embeddings of the Laplacian give consistent estimates of block memberships in a stochastic block model, and one of the earliest corresponding results on the consistency of the adjacency spectral embedding is given by Sussman, Tang, Fishkind, and Priebe in [90]. In [90], it is proved that for a stochastic block model with $K$ blocks and a rank-$d$ block probability matrix $B$, clustering the rows of the adjacency spectral embedding via $K$-means clustering (see [77]) results in at most an asymptotically negligible number of vertices being misclassified. An improvement to this can be found in [35], where consistent recovery is possible even if the embedding dimension is unknown.

In [67], under the assumption of distinct eigenvalues for the second moment matrix $\Delta = \mathbb{E}[X_1 X_1^{\top}]$ of a random dot product graph, it is shown that clustering the rows of the adjacency spectral embedding results in asymptotically almost surely perfect recovery of the block memberships in a stochastic blockmodel; i.e., for sufficiently large $n$, the probability of all vertices being correctly assigned is close to 1. An especially strong recovery is exhibited here: it is shown that in the $2 \to \infty$ norm, $\hat{X}$ is sufficiently close to a rotation of the true latent positions. In fact, each row of $\hat{X}$ is uniformly within a small distance (of order $\log n / \sqrt{n}$ in the dense regime) of the corresponding row of $X W$ for a suitable orthogonal matrix $W$. Unlike a Frobenius norm bound, in which it is possible that some rows of $\hat{X}$ may be close to the true positions but others may be significantly farther away, this bound implies that the adjacency spectral embedding has a uniform consistency.

Furthermore, [96] gives a nontrivial tightening of the Frobenius norm bound for the difference between the (rotated) true and estimated latent positions: the Frobenius norm is not merely bounded from above, but in fact concentrates around a constant. This constant-order Frobenius bound forms the basis of a principled two-sample hypothesis test for determining whether two random dot product graphs have the same generating latent positions (see Section 5.2).

In [68], the $2 \to \infty$ norm bound is extended even to the case when the second moment matrix does not have distinct eigenvalues. This turns out to be critical in guaranteeing that the adjacency spectral embedding can be effectively deployed for community detection in hierarchical block models. We present this bound for the $2 \to \infty$ norm in some detail here; it illustrates the confluence of our key techniques and provides an effective roadmap for several subsequent results on asymptotic normality and two-sample testing.

4.1 Consistency of latent position estimates

We state here one of our lynchpin results on consistency, in the $2 \to \infty$ norm, of the adjacency spectral embedding for latent position estimation of a random dot product graph. We give an outline of the proof here, and refer the reader to Appendix 8 for the details, which essentially follow the proof given in [68]. We emphasize that our setting is a sequence of random dot product graphs for increasing $n$, and thus we make the following density assumption on $P^{(n)}$ as $n$ increases:

Assumption 1.

Let $A^{(n)} \sim \mathrm{RDPG}(X^{(n)})$ for $n = 1, 2, \dots$ be a sequence of random dot product graphs, with $A^{(n)}$ being an $n \times n$ adjacency matrix. Suppose that $P^{(n)} = X^{(n)} (X^{(n)})^{\top}$ is of rank $d$ for all $n$ sufficiently large. Suppose also that there exist constants $c_0 > 0$ and $c_1, c_2 > 0$ such that for all $n$ sufficiently large,

$$\delta\bigl(P^{(n)}\bigr) \geq c_1 (\log n)^{c_2} \quad \text{and} \quad \gamma\bigl(P^{(n)}\bigr) \geq c_0.$$

Our consistency result for the $2 \to \infty$ norm is Theorem 8 below. In the proof of this particular result, we consider a random dot product graph with non-random (i.e. fixed) latent positions, but our results apply also to the case of random latent positions. In Section 4.2, where we provide a central limit theorem, we focus on the case in which the latent positions are themselves random. Similarly, in Section 5.2, in our analysis of the semiparametric two-sample hypothesis test for the equality of latent positions in a pair of random dot product graphs, we return to the setting in which the latent positions are fixed, and in the nonparametric hypothesis test of equality of distributions, we analyze again the case when the latent positions are random. It is convenient to be able to move fluidly between the two versions of a random dot product graph, adapting our results as appropriate in each case.

In the Appendix (Section 8), we give a detailed proof of Theorem 8, and we point out that the argument used therein also sets the stage for the central limit theorem for the rows of the adjacency spectral embedding given in Subsection 4.2.

Theorem 8.

Let $A^{(n)} \sim \mathrm{RDPG}(X^{(n)})$ for $n = 1, 2, \dots$ be a sequence of random dot product graphs satisfying Assumption 1, where $P^{(n)}$ is assumed to be of rank $d$ for all $n$ sufficiently large. Denote by $\hat{X}^{(n)}$ the adjacency spectral embedding of $A^{(n)}$, and let $\hat{X}^{(n)}_i$ and $X^{(n)}_i$ be the $i$-th rows of $\hat{X}^{(n)}$ and $X^{(n)}$, respectively. Let $E_n$ be the event that there exists an orthogonal transformation $W_n \in \mathbb{R}^{d \times d}$ such that

$$\max_{i \in [n]} \bigl\| W_n \hat{X}^{(n)}_i - X^{(n)}_i \bigr\| \leq \frac{C\, d^{1/2} \log^2 n}{\sqrt{\delta\bigl(P^{(n)}\bigr)}},$$

where $C$ is some fixed constant. Then $E_n$ occurs asymptotically almost surely; that is, $\mathbb{P}(E_n) \to 1$ as $n \to \infty$.

Under the stochastic blockmodel, previous bounds on $\|\hat{X} - X W\|_F$ implied that $K$-means applied to the rows of $\hat{X}$ would approximately correctly partition the vertices into their true blocks, with a small but nonzero number of errors. However, a Frobenius norm bound does not imply that there are no large outliers among the rows of $\hat{X} - X W$, thereby precluding any guarantee of zero errors. The improvements provided by Theorem 8 overcome this hurdle and, as shown in [67], under suitable sparsity and eigengap assumptions, $K$-means applied to $\hat{X}$ will exactly correctly partition the vertices into their true blocks. This implication demonstrates the importance both of improving the overall bounds and of focusing on the correct metrics for a given task; in this case, block identification.
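As a hedged illustration of this clustering pipeline (our own sketch, with illustrative SBM parameters and scikit-learn's KMeans standing in for the clustering step), we embed a simulated two-block SBM and cluster the rows of the embedding:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)

# Two-block SBM with a positive semidefinite block matrix (illustrative parameters)
B = np.array([[0.5, 0.2], [0.2, 0.4]])
pi = np.array([0.6, 0.4])
n = 2000
tau = rng.choice(2, size=n, p=pi)

evalsB, evecsB = np.linalg.eigh(B)
nu = evecsB * np.sqrt(evalsB)                    # block latent positions (rows)
X = nu[tau]
P = X @ X.T
A = np.triu(rng.random((n, n)) < P, k=1).astype(int)
A = A + A.T

# Adjacency spectral embedding into R^2, then K-means on its rows
evalsA, evecsA = np.linalg.eigh(A)
idx = np.argsort(evalsA)[::-1][:2]
Xhat = evecsA[:, idx] * np.sqrt(np.clip(evalsA[idx], 0.0, None))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xhat)

# Misclassification rate, minimized over the two possible label assignments
err = min(np.mean(labels != tau), np.mean(labels != 1 - tau))
print("misclassification rate:", err)            # typically very small, often exactly zero
```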

For a brief outline of the proof of this result, we note several key ingredients. First is a lemma guaranteeing the existence of an orthogonal matrix $W^*$ that nearly aligns $U_A$ with $U_P$ and approximately commutes with the matrices of top eigenvalues, in the sense that $W^* S_A \approx S_P W^*$. That is, there is an approximate commutativity between right and left multiplication of the corresponding matrices of eigenvalues by this orthogonal transformation. The second essential component is, at heart, a bound inspired by the power method: specifically, we show that there exists an orthogonal matrix $\tilde{W}$ for which $U_A S_A^{1/2} \tilde{W}$ is well-approximated, in the $2 \to \infty$ norm, by $A U_P S_P^{-1/2}$. Finally, from this point, establishing the bound on the $2 \to \infty$ norm is a consequence of Hoeffding's inequality applied to sums of the form $\sum_{j} (A_{ij} - P_{ij}) (U_P)_{jk}$.

The bound in Theorem 8 has several important implications. As we mentioned, [67] establishes an earlier form of this result, with more restrictive assumptions on the second moment matrix, and shows how this can be used to cluster vertices in an SBM perfectly, i.e. with no vertices misclassified. In addition, [67] shows how clustering the rows of the ASE can be useful for inference in a degree-corrected stochastic block model as well. In Section 6, we see that because of Theorem 8, the adjacency spectral embedding and a novel angle-based clustering procedure can be used for accurately identifying subcommunities in an affinity-structured, hierarchical stochastic blockmodel [68]. In the next section, we see how our proof technique here can be used to obtain distributional results for the rows of the adjacency spectral embedding.

4.2 Distributional results for the ASE

In the classical statistical task of parametric estimation, one observes a collection of i.i.d. observations $Y_1, \dots, Y_n$ from some family of distributions $\{F_{\theta} : \theta \in \Theta\}$, where $\Theta$ is some subset of Euclidean space, and one seeks to find a consistent estimator $\hat{\theta}_n$ for $\theta$. As we mentioned in Section 1, often a next task is to determine the asymptotic distribution, as $n \to \infty$, of a suitable normalization of this estimator $\hat{\theta}_n$. Such distributional results, in turn, can be useful for generating confidence intervals and testing hypotheses.

We adopt a similar framework for random graph inference. In the previous section, we demonstrate the consistency of the adjacency spectral embedding for the true latent positions of a random dot product graph. In this section, we establish the asymptotic normality of the rows of this embedding and of the Laplacian spectral embedding. In the subsequent section, we examine how our methodology can be deployed for multisample graph hypothesis testing.

We emphasize that distributional results for spectral decompositions of random graphs are comparatively few. The classic results of Füredi and Komlós [39] describe the eigenvalues of the Erdős-Rényi random graph, and the work of Tao and Vu [102] is focused on distributions of eigenvectors of more general random matrices under moment restrictions, but [8] and [98] are among the only references for central limit theorems for spectral decompositions of adjacency and Laplacian matrices for a wider class of independent-edge random graphs than merely the Erdős-Rényi model. Apart from their inherent interest, these limit theorems point us to current open questions on efficient estimation and the relative merits of different estimators and embeddings, in part by rendering possible a comparison of asymptotic variances and allowing us to quantify relative efficiency (see [101]), and by allowing us to precisely conjecture a decomposition of the sources of variance in different spectral embeddings for multiple graphs (see [58]).

Specifically, we show that for a $d$-dimensional random dot product graph with i.i.d. latent positions, there exists a sequence of orthogonal matrices $(W_n)$ such that for any row index $i$, $\sqrt{n}\,\bigl(W_n \hat{X}_i - X_i\bigr)$ converges, as $n \to \infty$, to a mixture of multivariate normals.

Theorem 9 (Central Limit Theorem for the rows of the ASE).

Let $(A^{(n)}, X^{(n)}) \sim \mathrm{RDPG}(F)$ be a sequence of adjacency matrices and associated latent positions of a $d$-dimensional random dot product graph according to an inner product distribution $F$. Let $\Phi(z, \Sigma)$ denote the cdf of a (multivariate) Gaussian with mean zero and covariance matrix $\Sigma$, evaluated at $z \in \mathbb{R}^d$. Then there exists a sequence of orthogonal $d$-by-$d$ matrices $(W_n)$ such that for all $z \in \mathbb{R}^d$ and for any fixed index $i$,

$$\mathbb{P}\Bigl[ \sqrt{n}\, \bigl( W_n \hat{X}^{(n)}_i - X^{(n)}_i \bigr) \leq z \Bigr] \;\longrightarrow\; \int \Phi\bigl(z, \Sigma(x)\bigr)\, dF(x),$$

where

$$\Sigma(x) = \Delta^{-1}\, \mathbb{E}\Bigl[ \bigl( x^{\top} X_1 - (x^{\top} X_1)^2 \bigr)\, X_1 X_1^{\top} \Bigr]\, \Delta^{-1}, \qquad \Delta = \mathbb{E}\bigl[ X_1 X_1^{\top} \bigr]. \qquad (10)$$

We also note the following important corollary of Theorem 9 for the case when $F$ is a mixture of point masses, i.e., when $A$ is a $K$-block stochastic blockmodel graph. Then for any fixed index $i$, the event that vertex $i$ is assigned to block $k$ has non-zero probability, and hence one can condition on the block assignment of vertex $i$ to show that the conditional distribution of $\sqrt{n}\,(W_n \hat{X}_i - X_i)$ converges to a multivariate normal. This is in contrast to the unconditional distribution, which is a mixture of multivariate normals as given in Theorem 9.

Corollary 1 (SBM).

Assume the setting and notation of Theorem 9, and let

$$F = \sum_{k=1}^{K} \pi_k\, \delta_{\nu_k}$$

be a mixture of point masses in $\mathbb{R}^d$, where $\delta_{\nu_k}$ is the Dirac delta measure at $\nu_k$. Then there exists a sequence of orthogonal matrices $(W_n)$ such that for all $z \in \mathbb{R}^d$ and for any fixed index $i$,

$$\mathbb{P}\Bigl[ \sqrt{n}\, \bigl( W_n \hat{X}^{(n)}_i - X^{(n)}_i \bigr) \leq z \,\Bigm|\, X_i = \nu_k \Bigr] \;\longrightarrow\; \Phi\bigl(z, \Sigma(\nu_k)\bigr), \qquad (11)$$

where $\Sigma(\nu_k)$ is as defined in Eq. (51).

We relegate the full details of the proof of this central limit theorem to the Appendix, in Section 8.1, but a few points bear noting here. First, both Theorem 9 and Corollary 1 are very similar to results proved in [8], but with the crucial difference that we no longer require that the second moment matrix $\Delta = \mathbb{E}[X_1 X_1^{\top}]$ of $F$ have distinct eigenvalues (for more details, see [98]). As in [8], our proof here depends on writing the difference between a row of the adjacency spectral embedding and its corresponding latent position as a pair of summands: the first, to which a classical central limit theorem can be applied, and the second, essentially a combination of residual terms, which we show, using techniques similar to those in the proof of Theorem 8, converges to zero. The weakening of the assumption of distinct eigenvalues necessitates significant changes from [8] in how to bound the residual terms, because [8] adapts a result of [10] (which depends on the assumption of distinct eigenvalues) to control these terms. Here, we resort to somewhat different methodology: we prove instead that bounds analogous to those in [68, 98] hold for the estimated latent positions, and this enables us to establish that here, too, the rows of the adjacency spectral embedding are approximately normally distributed.

We stress that this central limit theorem depends on a delicate bounding of a sequence of so-called residual terms, but its essence is straightforward. Specifically, there exists an orthogonal transformation $W_n$ such that for each row index $i$ we can write

$$\sqrt{n}\,\bigl( W_n \hat{X}_i - X_i \bigr) \;=\; \frac{1}{\sqrt{n}} \sum_{j \neq i} \bigl( A_{ij} - X_i^{\top} X_j \bigr)\, \Delta^{-1} X_j \;+\; R^{(n)}_i, \qquad (12)$$

where the residual terms $R^{(n)}_i$ converge to zero in probability. Now, to handle the first term in Eq. (12), we can condition on a fixed latent position $X_i = x$; when this is fixed, the classical Lindeberg-Feller Central Limit Theorem establishes the asymptotic normality of this term. The order of the residual terms then guarantees, by Slutsky's Theorem, the desired asymptotic normality of the gap between estimated and true latent positions, and finally we need only integrate over the possible latent positions $x$ to obtain a mixture of normals.

4.3 An example under the stochastic block model

To illustrate Theorem 9, we consider random graphs generated according to a two-block stochastic block model whose block probability matrix $B$ and block membership probabilities $\pi = (0.6, 0.4)$ are given in

(13)

In this model, each node is either in block 1 (with probability 0.6) or block 2 (with probability 0.4). Adjacency probabilities are determined by the entries in $B$ based on the block memberships of the incident vertices. The above stochastic blockmodel corresponds to a random dot product graph model in which the distribution of the latent positions is a mixture of two point masses, located at $\nu_1$ (with prior probability 0.6) and $\nu_2$ (with prior probability 0.4).

We sample adjacency matrices for graphs on $n$ vertices from the above model for various choices of $n$. For each graph $A$, let $\hat{X}$ denote the adjacency spectral embedding of $A$ into $\mathbb{R}^2$ and let $\hat{X}_i$ denote the $i$-th row of $\hat{X}$. In Figure 1, we plot the rows of $\hat{X}$ for the various choices of $n$. The points are denoted by symbols according to the block membership of the corresponding vertex in the stochastic blockmodel. The ellipses show the 95% level curves for the distribution of $\hat{X}_i$ for each block, as specified by the limiting distribution.

Figure 1: Plot of the estimated latent positions in a two-block stochastic blockmodel graph on $n$ vertices, for four choices of $n$ (panels (a)-(d)). The points are denoted by symbols according to the block membership of the corresponding vertices. Dashed ellipses give the 95% level curves for the distributions as specified in Theorem 9.

We then estimate the covariance matrices for the residuals $\sqrt{n}\,(W_n \hat{X}_i - X_i)$. The theoretical covariance matrices are given in the last column of Table 1, where $\Sigma_1$ and $\Sigma_2$ are the covariance matrices for the residual when $X_i$ is from the first block and second block, respectively. The empirical covariance matrices, denoted $\hat{\Sigma}_1$ and $\hat{\Sigma}_2$, are computed by evaluating the sample covariance of the rows of $\sqrt{n}\,(\hat{X} W_n - X)$ corresponding to vertices in blocks 1 and 2, respectively. The estimates of the covariance matrices are given in Table 1. We see that as $n$ increases, the sample covariances tend toward the specified limiting covariance matrices given in the last column.

Table 1: The sample covariance matrices for each block in a stochastic blockmodel with two blocks, computed for $n = 2000$, $4000$, $8000$, and $16000$. The last column gives the theoretical covariance matrices for the limiting distribution.
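The covariance comparison can be reproduced in miniature with the following sketch (our own, using placeholder SBM parameters in place of those in Eq. (13) and a Procrustes alignment to resolve the orthogonal nonidentifiability); the theoretical covariance is computed from the formula in Eq. (10) specialized to a mixture of point masses:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(6)

# Placeholder two-block SBM parameters (standing in for those of Eq. (13))
B = np.array([[0.5, 0.2], [0.2, 0.4]])
pi = np.array([0.6, 0.4])
n = 4000

evalsB, evecsB = np.linalg.eigh(B)
nu = evecsB * np.sqrt(evalsB)                    # block latent positions nu_1, nu_2
tau = rng.choice(2, size=n, p=pi)
X = nu[tau]
P = X @ X.T
A = np.triu(rng.random((n, n)) < P, k=1).astype(int)
A = A + A.T

evalsA, evecsA = np.linalg.eigh(A)
idx = np.argsort(evalsA)[::-1][:2]
Xhat = evecsA[:, idx] * np.sqrt(np.clip(evalsA[idx], 0.0, None))

# Align the embedding to the true latent positions (resolves the orthogonal nonidentifiability)
W, _ = orthogonal_procrustes(Xhat, X)
resid = np.sqrt(n) * (Xhat @ W - X)

# Limiting covariance Sigma(nu_k) from Eq. (10), with Delta = E[X_1 X_1^T]
Delta = sum(p * np.outer(v, v) for p, v in zip(pi, nu))
Dinv = np.linalg.inv(Delta)
def sigma(x):
    t = nu @ x                                   # inner products nu_l^T x over blocks l
    mid = sum(p * (ti - ti**2) * np.outer(v, v) for p, ti, v in zip(pi, t, nu))
    return Dinv @ mid @ Dinv

for k in range(2):
    print("block", k)
    print(np.cov(resid[tau == k].T))             # empirical covariance of the aligned residuals
    print(sigma(nu[k]))                          # limiting covariance from Theorem 9
```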

We also investigate the implications of the multivariate normal distribution specified in Theorem 9 for subsequent inference procedures. It is shown in [90, 93] that the approach of embedding a graph into some Euclidean space, followed by inference (for example, clustering or classification) in that space, can be consistent. However, these consistency results are, in a sense, only first-order results. In particular, they demonstrate only that the error of the inference procedure converges to zero as the number of vertices in the graph increases. We now illustrate how Theorem 9 may lead to a more refined error analysis.

We construct a sequence of random graphs on $n$ vertices, for a range of increasing values of $n$, following the stochastic blockmodel with parameters as given above in Eq. (13). For each graph on $n$ vertices, we embed $A$ and cluster the embedded vertices of $A$ via Gaussian mixture model and $K$-means clustering. Gaussian mixture model-based clustering was done using the MCLUST implementation of [37]. We then measure the classification error of the clustering solution. We repeat this procedure 100 times to obtain an estimate of the misclassification rate. The results are plotted in Figure 2. For comparison, we plot the Bayes optimal classification error rate under the assumption that the embedded points do indeed follow a multivariate normal mixture with covariance matrices $\Sigma_1$ and $\Sigma_2$ as given in the last column of Table 1. We also plot the theoretical upper bound on the misclassification rate as given in [90], where the constant was chosen to match the misclassification rate of $K$-means clustering at a reference value of $n$. For the number of vertices considered here, the upper bound for the constant from [90] gives a vacuous upper bound for the misclassification rate in this example. Finally, we recall that the $2 \to \infty$ norm bound of Theorem 8 implies that, for large enough $n$, even the $K$-means algorithm will exactly recover the true block memberships with high probability [65].
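A condensed version of this experiment (our own sketch, with placeholder SBM parameters and scikit-learn's GaussianMixture in place of MCLUST) fits both a Gaussian mixture model and K-means to the ASE and reports their average misclassification rates:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def ase(A, d):
    evals, evecs = np.linalg.eigh(A)
    idx = np.argsort(evals)[::-1][:d]
    return evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0.0, None))

def error_rate(labels, tau):
    # two blocks, so minimize over the single possible label swap
    return min(np.mean(labels != tau), np.mean(labels != 1 - tau))

rng = np.random.default_rng(7)
B = np.array([[0.5, 0.2], [0.2, 0.4]])           # placeholder block probabilities
pi = np.array([0.6, 0.4])
evalsB, evecsB = np.linalg.eigh(B)
nu = evecsB * np.sqrt(evalsB)

for n in (500, 1000, 2000):
    km_err, gmm_err = [], []
    for _ in range(10):                          # Monte Carlo replicates
        tau = rng.choice(2, size=n, p=pi)
        X = nu[tau]
        A = np.triu(rng.random((n, n)) < X @ X.T, k=1).astype(int)
        A = A + A.T
        Xhat = ase(A, 2)
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xhat)
        gm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit_predict(Xhat)
        km_err.append(error_rate(km, tau))
        gmm_err.append(error_rate(gm, tau))
    print(n, np.mean(km_err), np.mean(gmm_err))  # GMM typically matches or beats K-means
```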

Figure 2: Comparison of classification error for the Gaussian mixture model (red curve), K-means (green curve), and the Bayes classifier (cyan curve). The classification errors for each $n$ were obtained by averaging 100 Monte Carlo iterations and are plotted on a log scale. The plot indicates that the assumption of a mixture of multivariate normals can yield non-negligible improvement in the accuracy of the inference procedure. The log-bound curve (purple) shows an upper bound on the error rate as derived in [90].