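The classical statistical pattern recognition setting involves $(X,Y),(X_1,Y_1),\ldots,(X_n,Y_n)\overset{\text{i.i.d.}}{\sim}F_{X,Y}$, where the $X_i:\Omega\to\mathbb{R}^d$ are observed feature vectors and the $Y_i:\Omega\to\{0,1\}$ are observed class labels for some probability space $\Omega$. We define $\mathcal{D}=\{(X_i,Y_i)\}_{i=1}^n$ as the training set. The goal is to learn a classifier $h(\cdot;\mathcal{D}):\mathbb{R}^d\to\{0,1\}$ such that the probability of error $\mathbb{P}[h(X;\mathcal{D})\neq Y\mid\mathcal{D}]$ approaches Bayes optimal as $n\to\infty$ for all distributions $F_{X,Y}$, a property known as universal consistency (Devroye et al., 1996). Here we consider the case wherein the feature vectors $X,X_1,\ldots,X_n$ are unobserved, and we observe instead a latent position graph $G(X,X_1,\ldots,X_n)$ on $n+1$ vertices. We show that a universally consistent classification rule (specifically, $k$-nearest neighbors) remains universally consistent for this extension of the pattern recognition setup to latent position graph models.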
Latent space models for random graphs (Hoff et al., 2002) offer a framework in which a graph structure can be parametrized by latent vectors associated with each vertex. Then, the complexities of the graph structure can be characterized using well-known techniques for vector spaces. One approach, which we adopt here, is that given a latent space model for a graph, we first estimate the latent positions and then use the estimated latent positions to perform subsequent analysis. When the latent vectors determine the distribution of the random graph, accurate estimates of the latent positions will often lead to accurate subsequent inference.
In particular, this paper considers the random dot product graph model introduced in Nickel (2006) and Young and Scheinerman (2007). This model supposes that each vertex is associated with a latent vector in $\mathbb{R}^d$. The probability that two vertices are adjacent is then given by the dot product of their respective latent vectors. We investigate the use of an eigen-decomposition of the observed adjacency matrix to estimate the latent vectors. The motivation for this estimator is that, had we observed the expected adjacency matrix (the matrix of adjacency probabilities), then this eigen-decomposition would return the original latent vectors (up to an orthogonal transformation).
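The motivation above can be checked numerically. The following is a minimal sketch, not the authors' code: it builds an expected adjacency matrix from hypothetical latent vectors (Dirichlet draws are an assumption chosen only so that all dot products lie in $[0,1]$) and verifies that the scaled top eigenvectors reproduce the same matrix of dot products.

```python
import numpy as np

def scaled_eigenvectors(M, d):
    """Return the n x d matrix U_d S_d^{1/2} from the top-d eigenpairs of symmetric M."""
    eigvals, eigvecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]           # indices of the d largest eigenvalues
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

rng = np.random.default_rng(0)
# Hypothetical latent vectors: first two coordinates of Dirichlet(1, 1, 1) draws,
# so every pairwise dot product lies in [0, 1].
X = rng.dirichlet([1.0, 1.0, 1.0], size=50)[:, :2]
P = X @ X.T                                       # expected adjacency matrix
Z = scaled_eigenvectors(P, d=2)

# Z has the same Gram matrix as X, so Z = X W for some orthogonal W.
gram_matches = np.allclose(Z @ Z.T, P)
```

Because `Z` and `X` are both full-rank $n\times 2$ matrices with the same Gram matrix, they can differ only by an orthogonal transformation, which is the sense in which the latent vectors are recovered.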
Provided the latent vectors are i.i.d. from any distribution on a suitable space $\mathcal{X}$, we show that we can accurately recover the latent positions. Because the graph model is invariant to orthogonal transformations of the latent vectors, the distribution is identifiable only up to orthogonal transformations. Consequently, our results show only that we estimate latent positions which can then be orthogonally transformed to be close to the true latent vectors. As many subsequent inference tasks are invariant to orthogonal transformations, it is not necessary to achieve a rotationally accurate estimate of the original latent vectors.
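The invariance described above can be illustrated with a short sketch. The latent vectors and the random orthogonal matrix here are hypothetical. The check shows that an orthogonal transformation changes neither the edge probabilities nor the pairwise Euclidean distances, which is why distance-based procedures such as nearest neighbors are unaffected.

```python
import numpy as np

def pairwise_distances(M):
    """Matrix of Euclidean distances between the rows of M."""
    diff = M[:, None, :] - M[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(1)
X = rng.dirichlet([1.0, 1.0, 1.0], size=20)[:, :2]   # hypothetical latent vectors
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))     # a random orthogonal matrix
X_rot = X @ Q

same_probabilities = np.allclose(X @ X.T, X_rot @ X_rot.T)
same_distances = np.allclose(pairwise_distances(X), pairwise_distances(X_rot))
```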
For this paper, we investigate the inference task of vertex classification. This supervised or semi-supervised problem supposes that we have observed class labels for some subset of vertices and that we wish to classify the remaining vertices. To do this, we train a $k$-nearest-neighbor classifier on estimated latent vectors with observed class labels, which we then use to classify vertices with unobserved class labels. Our result states that this classifier is universally consistent, meaning that regardless of the distribution for the latent vectors, the error for our classifier trained on the estimated vectors converges to Bayes optimal for that distribution.
The theorems as stated can be generalized in various ways without much additional work. For ease of notation and presentation, we chose to provide an illustrative example for the kind of results that can be achieved for the specific random dot product model. In the discussion we point out various ways that this can be generalized.
The remainder of the paper is structured as follows. Section 2 discusses previous work related to the latent space approach and spectral properties of random graphs. In section 3, we introduce the basic framework for random dot product graphs and our proposed latent position estimator. In section 4, we argue that the estimator is consistent, and in section 5 we show that the $k$-nearest-neighbors algorithm yields consistent vertex classification. In section 6 we consider some immediate ways the results presented herein can be extended and discuss some possible implications. Finally, section 7 provides illustrative examples of applications of this work through simulations and a graph derived from Wikipedia articles and hyperlinks.
2 Related Work
The latent space approach is introduced in Hoff et al. (2002). Generally, one posits that the adjacency of two vertices is determined by a Bernoulli trial with parameter depending only on the latent positions associated with each vertex, and edges are independent conditioned on the latent positions of the vertices.
If we suppose that the latent positions are i.i.d. from some distribution, then the latent space approach is closely related to the theory of exchangeable random graphs (Bickel and Chen, 2009; Kallenberg, 2005; Aldous, 1981). For exchangeable graphs, we have a (measurable) link function $g:[0,1]^2\to[0,1]$ and each vertex is associated with a latent i.i.d. uniform $[0,1]$ random variable denoted $X_i$. Conditioned on the $\{X_i\}$, the adjacency of vertices $i$ and $j$ is determined by a Bernoulli trial with parameter $g(X_i,X_j)$. For a treatment of exchangeable graphs and estimation using the method of moments, see Bickel et al. (2011).
The latent space approach replaces the latent uniform random variables with random variables in some $\mathcal{X}\subset\mathbb{R}^d$, and the link function $g$ has domain $\mathcal{X}^2$. These random graphs still have exchangeable vertices and so could be represented in the i.i.d. uniform framework. On the other hand, $d$-dimensional latent vectors allow for additional structure and aid interpretation of the latent positions.
In fact, the following result provides a characterization of finite-dimensional exchangeable graphs as random dot product graphs. First, we say $g$ is rank $d<\infty$ and positive semi-definite if $g$ can be written as $g(x,y)=\sum_{i=1}^{d}\psi_i(x)\psi_i(y)$ for some linearly independent functions $\psi_i:[0,1]\to[-1,1]$. Using this definition and the inverse probability transform, one can easily show the following.
An exchangeable random graph has rank $d<\infty$ and positive semi-definite link function if and only if the random graph is distributed according to a random dot product graph with i.i.d. latent vectors in $\mathbb{R}^d$.
Put another way, random dot product graphs are exactly the finite-dimensional exchangeable random graphs, and hence, they represent a key area for exploration when studying exchangeable random graphs.
An important example of a latent space model is the stochastic blockmodel (Holland et al., 1983), where each latent vector can take one of only $K$ distinct values. The latent positions can be taken to be $\mathcal{X}=\{1,\ldots,K\}$ for some positive integer $K$, the number of blocks. Two vertices with the same latent position are said to be members of the same block, and block membership of each vertex determines the probabilities of adjacency. Vertices in the same block are said to be stochastically equivalent. This model has been studied extensively, with many efforts focused on unsupervised estimation of vertex block membership (Snijders and Nowicki, 1997; Bickel and Chen, 2009; Choi et al., 2012). Note that Sussman et al. (In press) discusses the relationship between stochastic blockmodels and random dot product graphs. The value of the stochastic blockmodel is its strong notions of communities and parsimonious structure; however, the assumption of stochastic equivalence may be too strong for many scenarios.
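To make the link between stochastic blockmodels and random dot product graphs concrete, here is a minimal sketch in which each block is assigned one latent vector. The two block vectors and the membership assignment are hypothetical values chosen for illustration. The edge probability between two vertices then depends only on their blocks.

```python
import numpy as np

def blockmodel_edge_probabilities(block_vectors, memberships):
    """Edge-probability matrix when vertex i carries the latent vector of its block."""
    X = block_vectors[memberships]        # n x d matrix with only K distinct rows
    return X @ X.T

# Hypothetical latent vectors for K = 2 blocks in d = 2 dimensions.
block_vectors = np.array([[0.8, 0.1],
                          [0.2, 0.7]])
B = block_vectors @ block_vectors.T       # K x K block probability matrix
memberships = np.array([0, 0, 1, 1, 0])   # block label of each of 5 vertices
P = blockmodel_edge_probabilities(block_vectors, memberships)
```

Vertices 0, 1, and 4 share a block here, so the corresponding rows of `P` are identical, which is the stochastic equivalence property described above.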
Many latent space approaches seek to generalize the stochastic blockmodel to allow for variation within blocks. For example, the mixed membership model of Airoldi et al. (2008) posits that a vertex could have partial membership in multiple blocks. In Handcock et al. (2007), latent vectors are presumed to be drawn from a mixture of multivariate normal distributions with the link function depending on the distance between the latent vectors. They use Bayesian techniques to estimate the latent vectors.
Rohe et al. (2011) prove that the eigenvectors of the normalized Laplacian can be orthogonally transformed to closely approximate the eigenvectors of the population Laplacian. Their results do not use a specific model but rather rely on assumptions for the Laplacian. Sussman et al. (In press) shows that for the directed stochastic blockmodel, the eigenvectors/singular vectors of the adjacency matrix can be orthogonally transformed to approximate the eigenvectors/singular vectors of the population adjacency matrix. Fishkind et al. (2012) extends these results to the case when the number of blocks in the stochastic blockmodel is unknown. Marchette et al. (2011) also uses techniques closely related to those presented here to investigate the semi-supervised vertex nomination task.
Finally, another line of work is exemplified by Oliveira (2009). This work shows that, under the independent edge assumption, the adjacency matrix and the normalized Laplacian concentrate around the respective population matrices in the sense of the induced $L^2$ norm. This work uses techniques from random matrix theory. Other work, such as Chung et al. (2004), investigates the spectra of the adjacency and Laplacian matrices for random graphs under a different type of random graph model.
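Let $\mathcal{M}_n(A)$ and $\mathcal{M}_{n,m}(A)$ denote the set of $n\times n$ and $n\times m$ matrices with values in $A$ for some set $A$. Additionally, for $M\in\mathcal{M}_n(\mathbb{R})$, let $\lambda_i(M)$ denote the eigenvalue of $M$ with the $i$-th largest magnitude. All vectors are column vectors.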
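Let $\mathcal{X}$ be a subset of the unit ball $\mathcal{B}(0,1)\subset\mathbb{R}^d$ such that $\langle x_1,x_2\rangle\in[0,1]$ for all $x_1,x_2\in\mathcal{X}$, where $\langle\cdot,\cdot\rangle$ denotes the standard Euclidean inner product. Let $F$ be a probability measure on $\mathcal{X}$ and let $X,X_1,X_2,\ldots,X_n\overset{\text{i.i.d.}}{\sim}F$. Define $\mathbf{X}:=[X_1,X_2,\ldots,X_n]^\top\in\mathcal{M}_{n,d}(\mathbb{R})$ and $P:=\mathbf{X}\mathbf{X}^\top\in\mathcal{M}_n([0,1])$.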
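We assume that the (second moment) matrix $E[X_1X_1^\top]\in\mathcal{M}_d(\mathbb{R})$ is rank $d$ and has distinct eigenvalues $\{\lambda_i(E[X_1X_1^\top])\}_{i=1}^{d}$. In particular, we suppose there exists $\delta>0$ such that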
The distinct eigenvalue assumption is not critical to the results that follow but is assumed for ease of presentation. The theorems hold in the general case with minor changes.
Additionally, we assume that the dimension $d$ of the latent positions is known.
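Let $A$ be a random symmetric hollow matrix such that the entries $\{A_{ij}\}_{i<j}$ are independent Bernoulli random variables with $\mathbb{P}[A_{ij}=1]=P_{ij}$ for all $i,j\in\{1,\ldots,n\}$, $i<j$. We will refer to $A$ as the adjacency matrix that corresponds to a graph with vertex set $\{1,\ldots,n\}$. Let $\tilde{U}_A\tilde{S}_A\tilde{U}_A^\top$ be the eigen-decomposition of $|A|$ where $|A|=(AA^\top)^{1/2}$ with $\tilde{S}_A$ having positive decreasing diagonal entries. Let $U_A\in\mathcal{M}_{n,d}(\mathbb{R})$ be given by the first $d$ columns of $\tilde{U}_A$ and let $S_A\in\mathcal{M}_d(\mathbb{R})$ be given by the first $d$ rows and columns of $\tilde{S}_A$. Let $U_P$ and $S_P$ be defined similarly.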
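The sampling model and the estimator of this section can be sketched as follows. This is an illustrative sketch under the assumption that the decomposed matrix is $|A|=(AA^\top)^{1/2}$, whose eigenvectors are those of the symmetric matrix $A$ and whose eigenvalues are the absolute values of those of $A$. The function names are hypothetical.

```python
import numpy as np

def sample_rdpg(X, rng):
    """Sample a symmetric, hollow 0/1 adjacency matrix with P[A_ij = 1] = <X_i, X_j>."""
    P = X @ X.T
    n = P.shape[0]
    upper = np.triu(rng.random((n, n)) < P, k=1)   # independent Bernoulli trials for i < j
    return (upper | upper.T).astype(float)         # mirror to the lower triangle

def adjacency_spectral_embedding(A, d):
    """Rows are estimated latent positions: top-d eigenvectors of |A| scaled by sqrt eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(eigvals))[::-1][:d]    # d eigenvalues of largest magnitude
    return eigvecs[:, top] * np.sqrt(np.abs(eigvals[top]))
```

Working with `np.linalg.eigh(A)` directly avoids forming the matrix square root, since taking absolute values of the eigenvalues of a symmetric matrix gives the same decomposition.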
4 Estimation of Latent Positions
The key result of this section is the following theorem which shows that, using the eigen-decomposition of , we can accurately estimate the true latent positions up to an orthogonal transformation.
Theorem 4.1.
With probability greater than , there exists an orthogonal matrix such that
Let be as above and define with row denoted by . Then, for each and all ,
We now proceed to prove this result. First, the following result, proved in Sussman et al. (In press), provides a useful Frobenius bound for the difference between and .
Proposition 4.2 (Sussman et al. (In press)).
For and as above, it holds with probability greater than that
The proof of this theorem is omitted and uses the same Hoeffding bound as is used to prove Eq. (7) below.
First, for . Note each entry of is the sum of independent random variables each in : . This means we can apply Hoeffding’s inequality to each entry of to obtain
Using a union bound we have that . Using Weyl’s inequality (Horn and Johnson, 1985), we have the result.
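As a worked instance of the Hoeffding step, consider a sum $S$ of $n$ independent random variables each taking values in $[-1,1]$ (products of coordinates of vectors in the unit ball satisfy this). The threshold $t=2\sqrt{n\log n}$ below is an assumption chosen for illustration, since the displayed bound is not legible here.

```latex
\begin{align*}
\mathbb{P}\big[\,|S-E[S]|\ge t\,\big]
  &\le 2\exp\!\left(-\frac{2t^{2}}{\sum_{l=1}^{n}\big(1-(-1)\big)^{2}}\right)
   = 2\exp\!\left(-\frac{t^{2}}{2n}\right),\\
t=2\sqrt{n\log n}\;&\Longrightarrow\;
\mathbb{P}\big[\,|S-E[S]|\ge 2\sqrt{n\log n}\,\big]
  \le 2\exp(-2\log n)=\frac{2}{n^{2}}.
\end{align*}
```

A union bound over the $d^2$ entries of a $d\times d$ matrix then gives a total failure probability of at most $2d^2/n^2$ under this choice of $t$.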
This next lemma shows that we can bound the difference between the eigenvectors of and , while our main results are for scaled versions of the eigenvectors.
With probability greater than , there exists a choice for the signs of the columns of such that for each ,
This is a result of applying the Davis-Kahan Theorem (Davis and Kahan (1970); see also Rohe et al. (2011)) to and . Propositions 4.2 and 4.3 give that the eigenvalue gap for is greater than and that with probability greater than . Apply the Davis-Kahan theorem to each eigenvector of and , which are the same as the eigenvectors of and , respectively, to get
for each . The claim then follows by choosing so that minimizes Eq. (9) for each . ∎
We now have the ingredients to prove our main theorem.
Proof of Theorem 4.1.
where the numerator of the right hand side is less than by Proposition 4.2 and the denominator is greater than by Proposition 4.3. The first term in Eq. (10) is thus bounded by . For the second term, and . We have established that with probability greater than ,
We now will show that an orthogonal transformation will give us the same bound in terms of . Let . Then and thus . Because , we have that is non-singular and hence . Let . It is straightforward to verify that and that . is thus an orthogonal matrix, and . Eq. (2) is thus established.
Now, we will prove Eq. (3). Note that because the are i.i.d., the are exchangeable and hence identically distributed. As a result, each of the random variables are identically distributed. Note that for sufficiently large , by conditioning on the event in Eq. (2), we have
because the worst case bound is with probability 1. We also have that
and because the are identically distributed, the left hand side is simply . ∎
5 Consistent Vertex Classification
So far we have shown that using the eigen-decomposition of , we can consistently estimate all latent positions simultaneously (up to an orthogonal transformation). One could imagine that this will lead to accurate inference for various exploitation tasks of interest. For example, Sussman et al. (In press) explored the use of this embedding for unsupervised clustering of vertices in the simpler stochastic blockmodel setting. In this section, we will explore the implications of consistent latent position estimation in the supervised classification setting. In particular, we will prove that universally consistent classification using $k$-nearest-neighbors remains valid when we select the neighbors using the estimated vectors rather than the true but unknown latent positions.
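First, let us expand our framework. Let $\mathcal{X}$ be as in section 3 and let $F_{X,Y}$ be a distribution on $\mathcal{X}\times\{0,1\}$. Let $(X_1,Y_1),\ldots,(X_{n+1},Y_{n+1})\overset{\text{i.i.d.}}{\sim}F_{X,Y}$ and let $P\in\mathcal{M}_{n+1}([0,1])$ and $A\in\mathcal{M}_{n+1}(\{0,1\})$ be as in section 3. Here the $Y_i$s are the class labels for the vertices in the graph corresponding to the adjacency matrix $A$.
We suppose that we observe only $A$, the adjacency matrix, and $Y_1,\ldots,Y_n$, the class labels for all but the last vertex. Our goal is to accurately classify this last vertex, so for notational convenience define $X:=X_{n+1}$ and $Y:=Y_{n+1}$. Let the rows of $U_AS_A^{1/2}$ be denoted by $\zeta_1^\top,\ldots,\zeta_{n+1}^\top$. The $k$-nearest-neighbor rule for $k$ odd is defined as follows. For $1\le i\le n$, let $W_{ni}(\zeta_{n+1})=1/k$ only if $\zeta_i$ is one of the $k$ nearest points to $\zeta_{n+1}$ from among $\{\zeta_j\}_{j=1}^{n}$; $W_{ni}(\zeta_{n+1})=0$ otherwise. (We break ties by selecting the neighbor with the smallest index.)
The $k$-nearest-neighbor rule is then given by $h_n(\zeta_{n+1})=I\{\sum_{i=1}^{n}W_{ni}(\zeta_{n+1})Y_i>1/2\}$. It is a well known theorem of Stone (1977) that, had we observed the original $\{X_i\}$, the $k$-nearest neighbor rule using the Euclidean distance from $\{X_i\}$ to $X$ is universally consistent provided $k\to\infty$ and $k/n\to0$. This means that for any distribution $F_{X,Y}$,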
as , where is the standard -nearest-neighbor rule trained on the and is the (optimal) Bayes rule. This theorem relies on the following very general result, also of Stone (1977), see also Devroye et al. (1996), Theorem 6.3.
Theorem 5.1 (Stone (1977)).
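Assume that for any distribution of $X$, the weights $W_{ni}$ satisfy the following three conditions:
(i) There exists a constant $c$ such that for every nonnegative measurable function $f$ satisfying $E[f(X)]<\infty$, $E\left[\sum_{i=1}^{n}W_{ni}(X)f(X_i)\right]\le c\,E[f(X)]$.
(ii) For all $a>0$, $\lim_{n\to\infty}E\left[\sum_{i=1}^{n}W_{ni}(X)I\{\|X_i-X\|>a\}\right]=0$.
(iii) $\lim_{n\to\infty}E\left[\max_{1\le i\le n}W_{ni}(X)\right]=0$.
Then the rule $h_n(x)=I\{\sum_{i=1}^{n}W_{ni}(x)Y_i>1/2\}$ is universally consistent.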
Recall that the are defined in Theorem 4.1. Because the are obtained via an orthogonal transformation of the , the nearest neighbors of are the same as those of . As a result of this and the relationship between and , we work using the , even though these cannot be known without some additional knowledge.
To prove that the -nearest-neighbor rule for the is universally consistent, we must show that the corresponding satisfy these conditions. The methods to do this are adapted from the proof presented in Devroye et al. (1996). We will outline the steps of the proof, but the details follow mutatis mutandis from the standard proof.
First, the following Lemma is adapted from Devroye et al. (1996) by using a triangle inequality argument.
Suppose . If , then almost surely, where is the -th nearest neighbor of among .
Condition (iii) follows immediately from the definition of the . The remainder of the proof follows with few changes after recognizing that the random variables are exchangeable. Overall, we have the following universal consistency result.
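If $k\to\infty$ and $k/n\to0$ as $n\to\infty$, then the weights $W_{ni}$ satisfy the conditions of Theorem 5.1 and hence the resulting $k$-nearest-neighbor rule is universally consistent.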
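The classification rule of this section can be sketched in a few lines. This is a minimal illustration for binary labels, assuming the embedded points are given as rows of an array. Ties in distance go to the smallest index, as in the text. The function names are hypothetical.

```python
import numpy as np

def knn_weights(points, query, k):
    """Weight 1/k on each of the k nearest rows of `points` to `query`, 0 elsewhere."""
    dists = np.linalg.norm(points - query, axis=1)
    order = np.argsort(dists, kind="stable")   # stable sort: ties go to the smallest index
    weights = np.zeros(len(points))
    weights[order[:k]] = 1.0 / k
    return weights

def knn_classify(points, labels, query, k):
    """Plug-in rule: predict 1 when the weighted vote for class 1 exceeds 1/2."""
    return int(knn_weights(points, query, k) @ labels > 0.5)
```

With $k$ odd and labels in $\{0,1\}$ the weighted vote can never equal exactly $1/2$, so no further tie-breaking between classes is needed.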
The results presented thus far are for the specific problem of determining one unobserved class label for a vertex in a random dot product graph. In fact, the techniques used can be extended to somewhat more general settings without significant additional work.
For example, the results in section 5 are stated in the case that we have observed the class labels for all but one vertex. However, the universal consistency of the -nearest-neighbor classifier remains valid provided the number of vertices with observed vertex class labels goes to infinity and as the number of vertices . In other words, we may train the -nearest neighbor on a smaller subset of the estimated latent vectors provided the size of that subset goes to .
On the other hand, if we fix the number of observed class labels and the classification rule and let the number of vertices tend to , then we can show the probability of incorrectly classifying a vertex will converge to . Additionally, our results also hold when the class labels can take more than two but still finitely many values.
In fact, the results in section 5 and Eq. (3) from Theorem 4.1 rely only on the fact that the are i.i.d. and bounded, the are exchangeable, and can be bounded with high probability by a function. The random graph structure provided in our framework is of interest, but it is the total noise bounds that are crucial for the universal consistency claim to hold.
6.2 Latent Position Estimation
In section 4, we state our results for the random dot product graph model. We can generalize our results immediately by replacing the dot product with a bi-linear form, , where is the identity matrix. This model has the interpretation that similarities in the first dimensions increase the probability of adjacency, while similarities in the last reduce the probability of adjacency. All the results remain valid under this model, and in fact, arguments in Oliveira (2009) can be used to show that the signature of the bi-linear form can also be estimated consistently. We also recall that the assumption of distinct eigenvalues for can be removed with minor changes. Particularly, Lemma 4.4 applies to groups of eigenvalues, and subsequent results can be adapted without changing the order of the bounds.
This work focuses on undirected graphs and this assumption is used explicitly throughout section 4. We believe moderate modifications would lead to similar results for directed graphs, such as in Sussman et al. (In press); however at present we do not investigate this problem. We also note that we assume the graph has no loops so that is hollow. This assumption can be dropped, and in fact, the impact of the diagonal is asymptotically negligible, provided each entry is bounded. Marchette et al. (2011) suggest that augmenting the diagonal may improve latent position estimation for finite samples.
In Rohe et al. (2011), the number of blocks in the stochastic blockmodel, which is related to in our setting (Sussman et al., In press), is allowed to grow with ; our work can also be extended to this setting. In this case, it will be the interaction between the rate of growth of and the rate that vanishes that controls the bounds in Theorem 4.1. Additionally, the consistency of -nearest-neighbors when the dimension grows is less well understood and results such as Stone’s Theorem 5.1 do not apply.
In addition to keeping fixed, we also assume that is known. Fishkind et al. (2012) and Sussman et al. (In press) suggest consistent methods to estimate the latent space dimension. The results in Oliveira (2009) can also be used to derive thresholds for eigenvalues to estimate .
Finally, Fishkind et al. (2012) and Marchette et al. (2011) also consider that the edges may be attributed; for example, if edges represent a communication, then the attributes could represent the topic of the communication. The attributed case can be thought of as a set of adjacency matrices, and we can embed each separately and concatenate the embeddings. Fishkind et al. (2012) argues that this method works under the attributed stochastic blockmodel and similar arguments could likely be used to extend the current work.
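The embed-separately-and-concatenate idea for attributed edges can be sketched as below. This is an illustration only: it assumes one symmetric adjacency matrix per attribute value and the same embedding dimension for each, and the function names are hypothetical.

```python
import numpy as np

def spectral_embed(A, d):
    """Top-d eigenvectors of symmetric A (by eigenvalue magnitude), scaled by sqrt |eigenvalue|."""
    eigvals, eigvecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(eigvals))[::-1][:d]
    return eigvecs[:, top] * np.sqrt(np.abs(eigvals[top]))

def embed_attributed(adjacency_by_attribute, d):
    """Embed each attribute's adjacency matrix separately, then concatenate column-wise."""
    return np.hstack([spectral_embed(A, d) for A in adjacency_by_attribute])
```

Each vertex then receives a vector of length $d$ times the number of attribute values, which can be passed to the same nearest-neighbor classifier.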
6.3 Extension to the Laplacian
The eigen-decomposition of the graph Laplacian is also widely used for similar inference tasks. In this section, we argue informally that our results extend to the Laplacian. We will consider a slight modification of the standard normalized Laplacian as defined in Rohe et al. (2011). This modification scales the Laplacian in Rohe et al. (2011) by so that the first eigenvalues of our matrix are rather than for the standard normalized Laplacian.
Let where is diagonal with . Additionally, let where is diagonal with
Finally, define as , and .
Because the pairwise dot products of the rows of are the same as the entries of , the scaled eigenvectors of must be an orthogonal transformation of the . Further, note that for large , and will be close with high probability because and the function is smooth almost surely. Additionally, the are i.i.d. and is one-to-one so that the Bayes optimal error rate is the same for the as for the : . If the further assumption that the minimum expected degree among all vertices is greater than holds, then the assumptions of Theorem 2.2 in Rohe et al. (2011) are satisfied.
Let denote the row of the matrix defined analogously to section 3 and let be the matrix with row given by . Using the results in Rohe et al. (2011) and similar tools to those we have used thus far, one can show that can be bounded with high probability by a function in . As discussed above, this is sufficient for -nearest-neighbors trained on to be universally consistent. In this paper we do not investigate the comparative values of the eigen-decompositions for the Laplacian versus the adjacency matrix, but one factor may be the properties of the map defined above as applied to different distributions on .
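A scaled normalized Laplacian of the kind discussed in this section can be sketched as follows. Two assumptions are made here because the defining formulas are not legible above: the normalized Laplacian is taken to be $D^{-1/2}AD^{-1/2}$ with $D$ the diagonal degree matrix, and the scaling factor is taken to be $n-1$. The function name is hypothetical.

```python
import numpy as np

def scaled_normalized_laplacian(A):
    """Return D^{-1/2} A D^{-1/2} with D_ii = degree_i / (n - 1) (assumed scaling)."""
    n = A.shape[0]
    scaled_degree = A.sum(axis=1) / (n - 1)
    inv_sqrt = np.zeros(n)
    positive = scaled_degree > 0                   # isolated vertices keep a zero row/column
    inv_sqrt[positive] = scaled_degree[positive] ** -0.5
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]
```

Under these assumptions the result equals $n-1$ times the unscaled normalized matrix, so its leading eigenvalues grow with $n$ in the same way as those of the adjacency matrix.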
In this section we present empirical results for a graph derived from Wikipedia links as well as simulations for an example wherein the arise from a Dirichlet distribution.
To demonstrate our results, we considered a problem where perfect classification is possible. Each is distributed according to a Dirichlet distribution with parameter where we keep just the first two coordinates. The class labels are determined by the with so in particular .
For each , we simulated 500 instances of the and sampled the associated random graph. For each graph, we used our technique to embed each vertex in two dimensions. To facilitate comparisons, we used the matrix to construct the matrix via transformation by the optimal orthogonal . Figure 1 illustrates our embedding for with each point corresponding to a row of with points colored according to the class labels . To demonstrate our results from section 4, figure 2 shows the average square error in the latent position estimation per vertex.
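One replicate of a simulation of this kind can be sketched as follows. This is not the authors' code: the Dirichlet parameter `(2, 2, 2)` is a hypothetical placeholder, and the optimal orthogonal transformation is computed with the standard orthogonal Procrustes solution via a singular value decomposition.

```python
import numpy as np

def spectral_embed(A, d):
    """Top-d eigenvectors of symmetric A (by eigenvalue magnitude), scaled by sqrt |eigenvalue|."""
    eigvals, eigvecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(eigvals))[::-1][:d]
    return eigvecs[:, top] * np.sqrt(np.abs(eigvals[top]))

def procrustes_align(Z, X):
    """Return Z W for the orthogonal W minimizing ||Z W - X||_F (SVD of Z^T X)."""
    U, _, Vt = np.linalg.svd(Z.T @ X)
    return Z @ (U @ Vt)

def mean_squared_error_per_vertex(n, rng, alpha=(2.0, 2.0, 2.0)):
    """Simulate one graph and return the average squared estimation error per vertex."""
    X = rng.dirichlet(alpha, size=n)[:, :2]        # keep just the first two coordinates
    upper = np.triu(rng.random((n, n)) < X @ X.T, k=1)
    A = (upper | upper.T).astype(float)            # symmetric, hollow adjacency matrix
    X_hat = procrustes_align(spectral_embed(A, 2), X)
    return float(np.mean(np.sum((X_hat - X) ** 2, axis=1)))
```

Averaging `mean_squared_error_per_vertex` over many replicates for several values of `n` gives an error curve of the kind described above. The alignment step requires the true latent vectors, so it is only available in simulation.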
For each graph, we used leave-one-out cross validation to evaluate the error rate for $k$-nearest-neighbors for . We suppose that we observe all but 1 class label as in section 5. Figure 3 shows the classification error rates. The black line shows the classification error when classifying using while the red line shows the classification error when classifying using . Unsurprisingly, classifying using gives worse performance. However, we still see steady improvement as the number of vertices increases, as predicted by our universal consistency result. Indeed, this figure suggests that the rates of convergence may be similar for both and .
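The leave-one-out evaluation can be sketched as below. This is a minimal version for binary labels that takes any array of points (estimated or true latent positions) and holds out one vertex at a time. The function name is hypothetical.

```python
import numpy as np

def loo_knn_error(points, labels, k):
    """Leave-one-out error rate of k-nearest-neighbors with binary labels."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)                # a held-out point cannot vote for itself
    errors = 0
    for i in range(n):
        neighbors = np.argsort(dists[i], kind="stable")[:k]
        predicted = int(labels[neighbors].mean() > 0.5)
        errors += int(predicted != labels[i])
    return errors / n
```

Calling this once on the embedded points and once on the true latent vectors gives the two error rates being compared.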
7.2 Wikipedia Graph
For this data (Ma et al. (2012), http://www.cis.jhu.edu/~zma/zmisi09.html), each vertex in the graph corresponds to a Wikipedia page and the edges correspond to the presence of a hyperlink between two pages (in either direction). We consider this as an undirected graph. Every article within two hyperlinks of the article “Algebraic Geometry” was included as a vertex in the graph. This resulted in 1382 vertices. Additionally, each document, and hence each vertex, was manually labeled as one of the following: Category (119), Person (372), Location (270), Date (191) and Math (430).
To investigate the implications of the results presented thus far, we performed a pair of illustrative investigations. First, we used our technique on random induced subgraphs and used leave-one-out cross validation to estimate error rates for each subgraph. We used and and performed 100 Monte Carlo iterates of random induced subgraphs with vertices. Figure 4 shows the mean classification error estimates using leave-one-out cross validation on each randomly selected subgraph. Note, the chance error rate is $1-430/1382\approx0.689$.
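Drawing a random induced subgraph amounts to selecting a random vertex subset and keeping only the rows and columns of the adjacency matrix for those vertices. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def random_induced_subgraph(A, labels, m, rng):
    """Sample m distinct vertices; return the induced adjacency matrix, labels, and indices."""
    idx = np.sort(rng.choice(A.shape[0], size=m, replace=False))
    return A[np.ix_(idx, idx)], labels[idx], idx
```

Each Monte Carlo iterate would call this once, embed the returned submatrix, and estimate the error rate on the returned labels.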
We also investigated the performance of our procedure for different choices of $d$, the embedding dimension, and $k$, the number of nearest neighbors. Because this data has 5 classes, we use the standard $k$-nearest-neighbor algorithm and break ties by choosing the first label as ordered above. Using leave-one-out cross validation, we calculated an estimated error rate for each $d$ and $k$. The results are shown in Figure 5. This figure suggests that our technique will be robust to different choices of $d$ and $k$ within some range.
Overall, we have shown that under the random dot product graph model, we can consistently estimate the latent positions provided they are independent and identically distributed. We have shown further that these estimated positions are also sufficient to consistently classify vertices. We have shown that this method works well in simulations and can be useful in practice for classifying documents based on their links to other documents.
- Airoldi et al. (2008) E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. The Journal of Machine Learning Research, 9:1981–2014, 2008.
- Aldous (1981) D. J. Aldous. Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis, 11(4):581–598, 1981.
- Bickel and Chen (2009) P. J. Bickel and A. Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences of the United States of America, 106(50):21068–73, 2009.
- Bickel et al. (2011) P. J. Bickel, A. Chen, and E. Levina. The method of moments and degree distributions for network models. Annals of Statistics, 39(5):2280–2301, 2011.
- Choi et al. (2012) D. S. Choi, P. J. Wolfe, and E. M. Airoldi. Stochastic blockmodels with a growing number of classes. Biometrika, 99(2):273–284, 2012.
- Chung et al. (2004) F. Chung, L. Lu, and V. Vu. The spectra of random graphs with given expected degrees. Internet Mathematics, 1(3):257–275, 2004.
- Davis and Kahan (1970) C. Davis and W. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7:1–46, 1970.
- Devroye et al. (1996) L. Devroye, L. Györfi, and G. Lugosi. A probabilistic theory of pattern recognition. Springer Verlag, 1996.
- Fishkind et al. (2012) D. E. Fishkind, D. L. Sussman, M. Tang, J.T. Vogelstein, and C.E. Priebe. Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown. Arxiv preprint arXiv:1205.0309, 2012.
- Handcock et al. (2007) M. S. Handcock, A. E. Raftery, and J. M. Tantrum. Model-based clustering for social networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(2):301–354, 2007.
- Hoff et al. (2002) P. D. Hoff, A. E. Raftery, and M. S. Handcock. Latent Space Approaches to Social Network Analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.
- Holland et al. (1983) P. W. Holland, K. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
- Horn and Johnson (1985) R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, 1985.
- Kallenberg (2005) O. Kallenberg. Probabilistic symmetries and invariance principles. Springer Verlag, 2005.
- Ma et al. (2012) Z. Ma, D. J. Marchette, and C. E. Priebe. Fusion and inference from multiple data sources in a commensurate space. Statistical Analysis and Data Mining, 5(3):187–193, 2012.
- Marchette et al. (2011) D. Marchette, C. E. Priebe, and G. Coppersmith. Vertex nomination via attributed random dot product graphs. In Proceedings of the 57th ISI World Statistics Congress, 2011.
- Nickel (2006) C. L. M. Nickel. Random dot product graphs: A model for social networks. PhD thesis, Johns Hopkins University, 2006.
- Oliveira (2009) R. I. Oliveira. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Arxiv preprint arXiv:0911.0600, 2009.
- Rohe et al. (2011) K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic blockmodel. Annals of Statistics, 39(4):1878–1915, 2011.
- Snijders and Nowicki (1997) T. A. B. Snijders and K. Nowicki. Estimation and Prediction for Stochastic Blockmodels for Graphs with Latent Block Structure. Journal of Classification, 14(1):75–100, 1997.
- Stone (1977) C. J. Stone. Consistent nonparametric regression. Annals of Statistics, 5(4):595–620, 1977.
- Sussman et al. (In press) D. L. Sussman, M. Tang, D. E. Fishkind, and C. E. Priebe. A consistent adjacency spectral embedding for stochastic blockmodel graphs. Journal of the American Statistical Association, In press.
- Young and Scheinerman (2007) S. Young and E. Scheinerman. Random dot product graph models for social networks. Algorithms and models for the web-graph, pages 138–149, 2007.