Out-of-sample Extension for Latent Position Graphs

by   Minh Tang, et al.
Johns Hopkins University

We consider the problem of vertex classification for graphs constructed from the latent position model. It was shown previously that embedding the graphs into some Euclidean space, followed by classification in that space, can yield a universally consistent vertex classifier. However, a major technical difficulty of the approach arises when classifying unlabeled out-of-sample vertices without including them in the embedding stage. In this paper, we study the out-of-sample extension for the graph embedding step and its impact on the subsequent inference tasks. We show that, under the latent position graph model and for sufficiently large n, the mapping of each out-of-sample vertex is close to its true latent position. We then demonstrate that successful inference for the out-of-sample vertices is possible.




1 Introduction

The classical statistical pattern recognition setting involves pairs of observed feature vectors and observed class labels drawn from some probability distribution on the product of the feature space and the label space. This setting has been extensively investigated and many important and interesting theoretical concepts and results, e.g., universal consistency, structural complexities, and arbitrarily slow convergence, are available. See, e.g., [Devroye et al.(1996)Devroye, Györfi, and Lugosi] for a comprehensive overview.

Now suppose that the feature vectors are unobserved, and that we observe instead a graph on n vertices. Suppose also that the graph is constructed in such a manner that there is a one-to-one relationship between its vertices and the feature vectors. The question of classifying the vertices based on the graph and the observed labels then arises naturally.

  Input: , training set and labels .
  Output: Class labels .
  Step 1: Compute the eigen-decomposition of .
  Step 2: Let be the “elbow” in the scree plot of , the diagonal matrix of the top eigenvalues of and the corresponding columns of .
  Step 3: Define to be . Denote by the -th row of . Define to be the rows of corresponding to the index set . is called the adjacency spectral embedding of .
  Step 4: Find a linear classifier that minimizes the empirical -loss when trained on where is a convex loss function that is a surrogate for 0-1 loss.
  Step 5: Apply on the to obtain the .
Algorithm 1 Vertex classifier on graphs
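
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes the adjacency spectral embedding takes the usual form of the top-d eigenvectors scaled by the square roots of their eigenvalues, uses squared loss as the convex surrogate, fixes d by hand instead of reading it off a scree plot, and simulates a two-block stochastic block model whose parameters are hypothetical. All function names are illustrative.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Rows are d-dimensional vertex embeddings (assumed form U_d S_d^{1/2})."""
    evals, evecs = np.linalg.eigh(A)            # ascending eigenvalues of symmetric A
    top = np.argsort(evals)[::-1][:d]           # indices of the d largest eigenvalues
    scale = np.sqrt(np.maximum(evals[top], 0))  # guard against negative eigenvalues
    return evecs[:, top] * scale

def fit_linear_classifier(Z, y):
    """Least-squares fit to labels in {-1, +1}; squared loss is one convex surrogate."""
    design = np.hstack([Z, np.ones((Z.shape[0], 1))])
    w, *_ = np.linalg.lstsq(design, y, rcond=None)
    return w

def predict(w, Z):
    design = np.hstack([Z, np.ones((Z.shape[0], 1))])
    return np.where(design @ w >= 0, 1, -1)

# Hypothetical example: two-block stochastic block model on 200 vertices.
rng = np.random.default_rng(0)
n, d = 200, 2
y = np.where(np.arange(n) < n // 2, 1, -1)
P = np.where(np.equal.outer(y, y), 0.5, 0.1)    # within-block 0.5, between-block 0.1
upper = np.triu(rng.random((n, n)) < P, k=1).astype(float)
A = upper + upper.T                             # symmetric, hollow, binary

Z = adjacency_spectral_embedding(A, d)          # Steps 1-3
train = rng.permutation(n)[: n // 2]            # training index set
test = np.setdiff1d(np.arange(n), train)
w = fit_linear_classifier(Z[train], y[train])   # Step 4
accuracy = np.mean(predict(w, Z[test]) == y[test])  # Step 5
```

Here all vertices, labeled or not, take part in the embedding; this is the in-sample setting that the out-of-sample extension is meant to avoid.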

A general approach to this classification problem is illustrated by Algorithm 1 wherein inference, e.g., classification or clustering, proceeds by first embedding the graph into some Euclidean space followed by inference in that space. This approach is well-represented in the literature of multidimensional scaling, spectral clustering, and manifold learning. The approach's popularity is due partly to its simplicity: after the embedding step, the vertices of the graph are points in Euclidean space, and classification or clustering can proceed in an almost identical manner to that of the classical setting, with a plethora of well-established and robust inference procedures available. In addition, theoretical justifications for the embedding step are also available. For example, in the spectral clustering and manifold learning literature, the embedding step is often based on the spectral decomposition of the (combinatorial or normalized) Laplacian matrices of the graph. It can then be shown that, under mild conditions on the construction of the graph, the Laplacian matrices converge in some sense to the corresponding Laplace-Beltrami operators on the domain. Thus, the eigenvalues and eigenvectors of the graph Laplacians converge to the eigenvalues and eigenfunctions of the corresponding operator. See for example [von Luxburg et al.(2008)von Luxburg, Belkin, and Bousquet, Hein et al.(2007)Hein, Audibert, and von Luxburg, von Luxburg(2007), Coifman and Lafon(2006), Belkin and Niyogi(2005), Hein et al.(2005)Hein, Audibert, and von Luxburg, Singer(2006), Rosasco et al.(2010)Rosasco, Belkin, and Vito] and the references therein for a survey of the results.

The above cited results suggest that the embedding is conducive to the subsequent inference task, but as they are general convergence results and do not explicitly consider the subsequent inference problem, they do not directly demonstrate that inference using the embeddings is meaningful. Recently, there have been investigations that couple the embedding step with the subsequent inference step for several widely-used random graph models. For example, [Rohe et al.(2011)Rohe, Chatterjee, and Yu, Sussman et al.(2012a)Sussman, Tang, Fishkind, and Priebe, Fishkind et al.(2013)Fishkind, Sussman, Tang, Vogelstein, and Priebe, Chaudhuri et al.(2012)Chaudhuri, Chung, and Tsiatas] showed that clustering using the embeddings can be consistent for graphs constructed based on the stochastic block model Holland1983, the random dot product model young2007random, and the extended partition model karrer11:_stoch. In related works, sussman12:_univer,tangs.:_univer showed that one can obtain universally consistent vertex classification for graphs constructed based on the random dot product model or its generalization, the latent position model Hoff2002.

However, a major technical difficulty of the approach arises when one tries to use it to classify unlabeled out-of-sample vertices without including them in the embedding stage. A possible solution is to recompute the embedding for each new vertex. However, as many of the popular embedding methods are spectral in nature, e.g., classical multidimensional scaling torgersen52:_multid, Isomap tenebaum00:_global_geomet_framew_nonlin_dimen_reduc, Laplacian eigenmaps belkin03:_laplac and diffusion maps coifman06:_diffus_maps, the computational cost of each new embedding is of order O(n^3), making this solution computationally expensive. To circumvent this technical difficulty, out-of-sample extensions for many of the popular embedding methods such as those listed above have been devised, see e.g. faloutsos95,platt05:_fastm_metric_mds_nyst,bengio04:_out_lle_isomap_mds_eigen,williams01:_using_nystr,silva03:_global,wang99:_evaluat,trosset08. In these out-of-sample extensions, the embedding for the in-sample points is kept fixed and the out-of-sample vertices are inserted into the configuration of the in-sample points. The computational costs are thus much lower, e.g., linear in the number of in-sample vertices for each insertion of an out-of-sample vertex into the existing configuration.

In this paper, we study the out-of-sample extension for the embedding step in Algorithm 1 and its impact on the subsequent inference tasks. In particular we show that, under the latent position graph model and for sufficiently large n, the mapping of each out-of-sample vertex is close to its true latent position. This suggests that inference for the out-of-sample vertices is possible.

The structure of our paper is as follows. We introduce the framework of latent position graphs in § 2. We describe the out-of-sample extension for the adjacency spectral embedding and analyze its properties in § 3. In § 4, we investigate via simulation the implications of performing inference using these out-of-sample embeddings. We conclude the paper with discussion of related work, how the results presented herein can be extended, and other implications.

2 Framework

Let be a compact metric space and let be a continuous positive definite kernel on . Let be a probability measure on the Borel σ-field of . Now, for a given , let . Let be arbitrary ( can depend on ). Define . Let be a symmetric, hollow, random binary matrix where the entries of are conditionally independent Bernoulli random variables given the , with for all , . A graph whose adjacency matrix is constructed as above is an instance of a latent position graph. The factor controls the sparsity of the resulting graph. For example, if on , then leads to sparse, connected graphs almost surely, leads to graphs with a single giant connected component, and for some fixed leads to dense graphs. We will denote by an instance of a latent position graph on with distribution , link function , and sparsity factor . We shall assume throughout this paper that for some . That is, the expected average degree of grows at least as fast as .

An example of a latent position graph model is the random dot product graph (RDPG) model of [Young and Scheinerman(2007)]. In the RDPG model, is taken to be the unit simplex in and the link function is the Euclidean inner product. One can then take to be a Dirichlet distribution on the unit simplex. Another example of a latent position graph model takes as a compact subset of and the link function is a radial basis function, e.g., a Gaussian kernel . This model is similar to the method of constructing graphs based on point clouds in in the manifold learning literature. The small difference is that in the case presented here, the Gaussian kernel is used for generating the edge probabilities in the Bernoulli trials, i.e., the edges are unweighted but random, whereas in the manifold learning literature, the Gaussian kernel is used to assign weights to the edges, i.e., the edges are weighted but deterministic.
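
The Gaussian-kernel construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the bandwidth convention of the kernel, the uniform distribution of the latent positions, the sparsity factor `rho`, and all function names are hypothetical choices made for the example and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel_matrix(X, bandwidth=1.0):
    """Pairwise Gaussian kernel values between the rows of X (assumed bandwidth form)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def sample_latent_position_graph(X, rho, rng, bandwidth=1.0):
    """Draw a symmetric, hollow, binary adjacency matrix with edge probabilities rho * kernel."""
    K = rho * gaussian_kernel_matrix(X, bandwidth)
    # Independent Bernoulli trials for the pairs i < j, then symmetrize.
    upper = np.triu(rng.random(K.shape) < K, k=1).astype(int)
    return upper + upper.T

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(50, 2))   # hypothetical latent positions in the unit square
A = sample_latent_position_graph(X, rho=0.8, rng=rng)
```

The kernel only sets the edge probabilities; the resulting edges are unweighted and random, in contrast to the weighted, deterministic graphs of the manifold learning literature.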

The latent position graph model and the related latent space approach Hoff2002 are widely used in network analysis. The latent position model is a generalization of the stochastic block model (SBM) Holland1983 and variants such as the degree-corrected SBM karrer11:_stoch, the mixed-membership SBM Airoldi2008 and the random dot product graph model young2007random. It is also closely related to the inhomogeneous random graph model bollobas07 and the exchangeable graph model diaconis08:_graph_limit_exchan_random_graph.

We now define a feature map for . will serve as our canonical feature map, i.e., our subsequent results for the out-of-sample extension are based on bounds for the deviation of the out-of-sample embedding from the canonical feature map representation, e.g., Theorem 9. The kernel defines an integral operator on , the space of -square-integrable functions on , via


is then a compact operator and is of trace class (see e.g., Theorem 4.1 in [Blanchard et al.(2007)Blanchard, Bousquet, and Zwald]). Let be the set of eigenvalues of in non-increasing order. The are non-negative and discrete, and their only accumulation point is at 0. Let be a set of orthonormal eigenfunctions of corresponding to the . Then by Mercer’s representation theorem cucker12, one can write

with the above sum converging absolutely and uniformly for each and in . We define the feature map via


We define a related feature map for by


We will refer to as the truncated feature map or as the truncation of . We note that the feature maps and are defined in terms of the spectrum and eigenfunctions of and thus do not depend on the scaling parameter .

We conclude this section with some notation that will be used in the remainder of the paper. Let us denote by and the set of matrices and matrices on , respectively. For a given adjacency matrix , let be the eigen-decomposition of . For a given , let be the diagonal matrix comprising the largest eigenvalues of and let be the matrix comprising the corresponding eigenvectors. The matrices and are defined similarly. For a matrix , denotes the spectral norm of and denotes the Frobenius norm of . For a vector , denotes the -th component of and denotes the Euclidean norm of .

3 Out-of-sample extension

We now introduce the out-of-sample extension for the adjacency spectral embedding of Algorithm 1. Suppose is an instance of on vertices. Let and denote by the matrix ; is the Moore-Penrose pseudo-inverse of . Let be the -th column of . For a given , let be the (random) mapping defined by


where is a vector of independent Bernoulli random variables with . The map is the out-of-sample extension of ; that is, extends the embedding for the sampled to any .

We make some quick remarks regarding Definition 3. First, we note that the out-of-sample extension gives rise to i.i.d. random variables, i.e., if are i.i.d. from , then the are i.i.d. random variables in . Secondly, is a random mapping for any given , even when conditioned on the . The randomness of arises from the randomness in the adjacency matrix induced by the in-sample points as well as the randomness in the Bernoulli random variables used in Eq. (4). Thirdly, Eq. (4) states that the out-of-sample extension of is the least squares solution to , i.e., is the least squares projection of the (random) vector onto the subspace spanned by the columns of . The use of the least squares solution to , or equivalently the projection of onto the subspace spanned by the configuration of the in-sample points, is standard in many of the out-of-sample extensions to the popular embedding methods, see e.g. [Bengio et al.(2004)Bengio, Paiement, Vincent, Delalleau, Roux, and Ouimet, Anderson and Robinson(2003), Faloutsos and Lin(1995), de Silva and Tenenbaum(2003), Wang et al.(1999)Wang, Wang, Lin, Shasha, Shapiro, and Zhang]. In general, is a vector containing the proximity (similarity or dissimilarity) between the out-of-sample point and the in-sample points, and the least squares solution can be related to the Nyström method for approximating the eigenvalues and eigenvectors of a large matrix, see e.g. [Bengio et al.(2004)Bengio, Paiement, Vincent, Delalleau, Roux, and Ouimet, Platt(2005), Williams and Seeger(2001)].

Finally, the motivation for Definition 3 can be gleaned by considering the setting of random dot product graphs. In this setting, where is the matrix whose rows correspond to the sampled latent positions as points in . Then is equivalent (up to rotation) to . Now let be a vector of Bernoulli random variables with . Then . Thus, if we can show that , then we have . As is “close” to tangs.:_univer,sussman12:_univer and is “small” with high probability, see e.g. tropp12:_user,yurinsky95:_sums_gauss, the relationship holds for random dot product graphs. As the latent position graphs with positive definite kernels can be thought of as being random dot product graphs with latent positions being “points” in , one expects a relationship of the form for the (truncated) feature map of . Precise statements of the relationships are given in Theorem 9 and Corollary 9 below.
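
The least squares construction of Definition 3 and the random dot product motivation above can be sketched as follows. This is a minimal illustration under stated assumptions: the in-sample embedding is taken to be the top-d eigenvectors scaled by the square roots of their eigenvalues, the Dirichlet parameters and sample size are hypothetical, and the function names are illustrative. Because the embedding is only identified up to rotation, the check compares inner products with the in-sample embedding against the true edge probabilities, not the embedded point against the latent position itself.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Rows are d-dimensional vertex embeddings (assumed form U_d S_d^{1/2})."""
    evals, evecs = np.linalg.eigh(A)
    top = np.argsort(evals)[::-1][:d]
    return evecs[:, top] * np.sqrt(np.maximum(evals[top], 0))

def out_of_sample_embed(Z, xi):
    """Least squares solution t of Z t = xi, i.e. pinv(Z) @ xi."""
    t, *_ = np.linalg.lstsq(Z, xi, rcond=None)
    return t

# Hypothetical random dot product graph with Dirichlet latent positions.
rng = np.random.default_rng(2)
n, d = 600, 3
X = rng.dirichlet(np.ones(d), size=n)          # in-sample latent positions
P = X @ X.T                                    # edge probabilities
upper = np.triu(rng.random((n, n)) < P, k=1).astype(float)
A = upper + upper.T
Z = adjacency_spectral_embedding(A, d)         # in-sample embedding, kept fixed

x_new = rng.dirichlet(np.ones(d))              # latent position of a new vertex
p_new = X @ x_new                              # its edge probabilities to the in-sample vertices
xi = (rng.random(n) < p_new).astype(float)     # observed Bernoulli edges
t_new = out_of_sample_embed(Z, xi)

# Rotation-invariant check: Z @ t_new should track the true edge probabilities.
error = np.mean(np.abs(Z @ t_new - p_new))
```

Each insertion only needs a product with the pseudo-inverse of the fixed in-sample configuration, which is the source of the computational savings over re-embedding.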

3.1 Out-of-sample extension and Nyström approximation

In the following discussion, we give a brief description of the relationship between Definition 3 and the Nyström approximation of [Drineas and Mahoney(2005), Gittens and Mahoney(2013)], which they call “sketching”. Let be symmetric and let with . Following [Gittens and Mahoney(2013)], let and . Then serves as a low-rank approximation to with rank at most , and [Gittens and Mahoney(2013)] refers to as the sketching matrix. Different choices for lead to different low-rank approximations. For example, a subsampling scheme corresponds to the entries of being binary variables with a single non-zero entry in each row or column. More general entries for correspond to a linear projection of the columns of . There are times when is ill-conditioned and one is instead interested in the best rank approximation to , i.e., the sketched version of is where is a rank approximation to .

Suppose now that corresponds to a subsampling scheme. Then corresponds to a sub-matrix of , i.e., corresponds to the rows and columns indexed by . Without loss of generality, we assume that is the first rows and columns of . That is, we have the following decomposition


where and . We have abused notations slightly by writing . Then can be written as


Let us now take to be the best rank approximation to in the positive semidefinite cone. Then can be written as


Now let be such that and let . Then Eq. (7), can be written as


We thus note that if is an adjacency matrix on a graph with vertices then is the adjacency matrix of the induced subgraph of on vertices. Then is the rank approximation to that arises from the adjacency spectral embedding of . Thus and therefore is the matrix each of whose rows corresponds to an out-of-sample embedding of the rows of into as defined in Definition 3.

In summary, in the context of adjacency spectral embedding, the embeddings of the in-sample and out-of-sample vertices generate a Nyström approximation to and a Nyström approximation to can be used to derive the embeddings (through an eigen-decomposition) for the in-sample and out-of-sample vertices.
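
The correspondence summarized above can be checked numerically. The sketch below is an illustration under stated assumptions: the names `A11`, `A21`, `Z` and `T` are introduced here for the in-sample block, the out-of-sample-to-in-sample block, the in-sample embedding and the out-of-sample embeddings, the in-sample embedding is taken to be the top-d eigenvectors of the in-sample block scaled by the square roots of their (assumed positive) eigenvalues, and the block model parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, d = 120, 80, 2                       # total vertices, in-sample vertices, dimension
labels = np.arange(n) % 2                  # hypothetical two-block structure
B = np.array([[0.6, 0.2], [0.2, 0.6]])
P = B[labels][:, labels]
upper = np.triu(rng.random((n, n)) < P, k=1).astype(float)
A = upper + upper.T

A11 = A[:m, :m]                            # induced subgraph on the in-sample vertices
A21 = A[m:, :m]                            # edges from out-of-sample to in-sample vertices

evals, evecs = np.linalg.eigh(A11)
top = np.argsort(evals)[::-1][:d]
S, U = evals[top], evecs[:, top]           # top-d eigenpairs of A11 (assumed positive)

Z = U * np.sqrt(S)                         # in-sample adjacency spectral embedding
T = A21 @ np.linalg.pinv(Z).T              # each row: least squares out-of-sample embedding

# Nystrom approximation with A11 replaced by its rank-d approximation.
C = A[:, :m]                               # the sampled columns of A
W_d_pinv = (U / S) @ U.T                   # pseudo-inverse of U diag(S) U^T
nystrom = C @ W_d_pinv @ C.T

E = np.vstack([Z, T])                      # stacked in-sample and out-of-sample embeddings
```

Under these assumptions, `nystrom` and `E @ E.T` agree up to floating-point error, so the stacked embeddings factor the Nyström approximation.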

3.2 Estimation of feature map

The main result of this paper is the following result on the out-of-sample mapping error . Its proof is given in the appendix. We note that the dependency on and is hidden in the spectral gap of , the integral operator induced by and . Let be given. Denote by the quantity and suppose that . Let be arbitrary. Then there exists an orthogonal such that


for some constant independent of , , , , and . We note the following corollary of the above result for the case where the latent position model is the random dot product graph model. For this case, the operator is of rank and the truncated feature map is equal (up to rotation) to the latent position . Let be an instance of . Denote by the smallest eigenvalue of . Let be arbitrary. Then there exists an orthogonal such that


for some constant independent of , , , and . We note the following result from tangs.:_univer that serves as an analogue of Theorem 9 for the in-sample points. We note that, other than the possibly different hidden constants , the bound for the out-of-sample points in Eq. (9) is almost identical to that of the in-sample points in Eq. (11). The main difference is in the power of the spectral gap in the bounds, i.e., against . This difference might be due to the proof technique and not inherent in the distinction of out-of-sample versus in-sample points. We also note that one can take the orthogonal matrix for the out-of-sample points to be the same as the in-sample points, i.e., the rotation that makes the in-sample points “close” to the truncated feature map also makes the out-of-sample points “close” to .

Let be given. Denote by the quantity and suppose that . Let be arbitrary. Let denote the -th row of . Then there exists a unitary matrix such that for all


for some constant independent of , , , , and .

Theorem 9 and its corollary state that in the latent position model, the out-of-sample embedded points can be rotated to be very close to the true feature map with high probability. This suggests that successful statistical inference on the out-of-sample points is possible. As an example, we investigate the problem of vertex classification for latent position graphs whose link functions belong to the class of universal kernels. Specifically, we consider an approach that proceeds by embedding the vertices into some followed by finding a linear discriminant in that space. It was shown in [Tang et al.(2013)Tang, Sussman, and Priebe] that such an approach can be made to yield a universally consistent vertex classifier if the vertex to be classified is embedded in-sample as the number of in-sample vertices increases to infinity. In the following discussion we present a variation of this result in the case where the vertex to be classified is embedded out-of-sample and the number of in-sample vertices is fixed and finite. We show that under this out-of-sample setting, the misclassification rate can be made arbitrarily small provided that the number of in-sample vertices is sufficiently large (see Theorem 3.2).

A continuous kernel on a metric space is said to be a universal kernel, if for some choice of feature map of to some Hilbert space , the class of functions of the form


is dense in , i.e., for any continuous and any , there exists such that . We note that if is such that is dense in for some feature map of , then is dense in for any feature map of , i.e., the universality of is independent of its choice of feature map. In addition, any feature map of a universal kernel is injective. The following result lists several well-known universal kernels. For more on universal kernels, the reader is referred to [Steinwart(2001), Micchelli et al.(2006)Micchelli, Xu, and Zhang]. Let be a compact subset of . Then the following kernels are universal on .

  • exponential kernel .

  • Gaussian kernel for .

  • The binomial kernel for .

  • inverse multiquadrics for .

Let be the class of linear functions on induced by the feature map whose linear coefficients are normalized to have norm at most , i.e., if and only if is of the form

for some . We note that the is a nested increasing sequence and furthermore that

Now, given , let be the class of linear functions on induced by the out-of-sample extension , i.e., if and only if is of the form


Let be a universal kernel on . Let be arbitrary. Then for any and any , there exists and such that if then


where is the out-of-sample mapping as defined in Definition 3 and is the Bayes risk for the classification problem with distribution . We make a brief remark regarding Theorem 3.2. The term in Eq. (14) refers to the Bayes risk for the classification problem given by the out-of-sample mapping . As noted earlier, is a random mapping for any given , even when conditioned on the as also depends on the latent position graph generated by the . As such, with slight abuse of notations, refers to the Bayes-risk of the mapping when not conditioned on any set of . That is, is the Bayes-risk for out-of-sample embedding in the presence of in-sample latent positions, i.e., the latent positions of the in-sample points are integrated out. As the information processing lemma implies , one can view Eq. (14) as a sort of converse to the information processing lemma in that the degradation due to the out-of-sample embedding transformation can be made negligible if the number of in-sample points is sufficiently large. Let be any classification-calibrated (see [Bartlett et al.(2006)Bartlett, Jordan, and McAuliffe]) convex surrogate of the loss. For any measurable function , let be defined by . Let be a measurable function such that where the infimum is taken over the set of all measurable functions on . As is dense in the set of measurable functions on , without loss of generality we can take . Now let be arbitrary. As , and the is a nested increasing sequence, there exists a such that for some , we have . Thus, for any , there exists a such that for some . Now let be arbitrary. Then for some . Let and consider the difference . As is convex, it is locally-Lipschitz and


for some constant . Furthermore, we can take to be independent of . Thus, there exists some such that for all , where the supremum is taken over all . Now let be such that . We then have


If is a classification-calibrated convex surrogate of the loss, then there exists a non-decreasing function such that bartlett06:_convex. Thus by Eq. (16) we have


As is arbitrary, the proof is completed.

4 Experimental Results

Figure 1: Comparison of the in-sample against out-of-sample classification performance for a simulated data example: (a) scatterplot of the data; (b) classification performance. The performance degradation due to out-of-sample embedding is less than 2%.

In this section, we illustrate the out-of-sample extension described in § 3 by studying its impact on classification performance through two simulation examples and a real data example. In our first example, data are simulated using a mixture of two multivariate normals in . The components of the mixture have equal prior and the first component of the mixture has mean parameter and identity covariance matrix while the second component has mean and identity covariance matrix. We sample data points from this mixture and assign class labels in to them according to the quadrant in which they fall, i.e., if then . Fig. 1(a) depicts the scatter plot of the sampled data colored according to their class labels. The Bayes risk is for classifying the . A latent position graph is then generated based on the sampled data points with being the Gaussian kernel.

To measure the in-sample classification performance, we embed into for ranging from through . A subset of vertices is then selected uniformly at random and designated as the training data set. The remaining vertices constitute the testing data set. For each choice of dimension , we select a linear classifier by performing a least squares regression on the training data points and measure the classification error of on the testing data points. The results are plotted in Fig. 1(b).

For the out-of-sample classification performance, we embed the induced graph formed by the training vertices in the above description. For each choice of dimension , we out-of-sample embed the testing vertices into . For each choice of dimension , a linear classifier is once again selected by linear regression using the in-sample training data points and tested on the out-of-sample embedded testing data points. The classification errors are also plotted in Fig. 1(b). A quick glance at the plots in Fig. 1(b) suggests that the classification performance degradation due to the out-of-sample embedding is negligible.

Our next example uses the abalone dataset from the UCI machine learning repository bache2013. The data set consists of 4177 observations of nine different abalone attributes. The attributes are sex, number of rings, and seven other physical measurements of the abalones, e.g., length, diameter, and shell weight. The number of rings in an abalone is an estimate of its age in years. We consider the problem of classifying an abalone based on its physical measurements. Following the description of the data set, the class labels are as follows. An abalone is classified as class 1 if its number of rings is eight or less. It is classified as class 2 if its number of rings is nine or ten, and it is classified as class 3 otherwise. The dataset is partitioned into a training set of 3133 observations and a test set of 1044 observations. The lowest misclassification rate is reported to be 35.39% waugh95:_exten_cascad_correl.

We form a graph on vertices following a latent position model with a Gaussian kernel where represents the physical measurements of the -th abalone observation. To measure the in-sample classification performance, we embed the vertices of into and train a multi-class linear SVM on the embedding of the training vertices. We then measure the misclassification rate of this classifier on the embedding of the testing vertices. For the out-of-sample setting, we randomly choose a subset of vertices from the training set and embed the resulting induced subgraph into , then out-of-sample embed the remaining vertices. We then train a multi-class linear SVM on the out-of-sample embedded vertices in the training set and measure the misclassification error on the vertices in the testing set. The results for various choices of are given in Table 1.

0.444 0.386 0.391 0.375 0.382 0.374 0.401
Table 1: Out-of-sample classification performance for the abalone dataset. The in-sample classification performance is . The lowest reported mis-classification rate is . The performance degradation due to the out-of-sample embedding is as low as .

Our final example is on the CharityNet dataset. The data set consists of 2 years of anonymized donation transactions between anonymized donors and charities. There are in total million transactions representing donations from 1.8 million donors to 5700 charities. Note that the data set does not contain any explicit information on the charity-to-charity relationships, i.e., the charities relate to one another only through the donation transactions between donors and charities. We investigate the problem of clustering the charities under the assumption that there is additional information on the donors, but virtually no information on the charities.

We can view the problem as embedding an adjacency matrix followed by clustering the vertices of . Here represents the (unobserved) donors-to-donors graph, represents the donors-to-charities graph and represents the (unobserved) charities-to-charities graph. Because we only have transactions between donors and charities, we do not observe any part of except . Using the additional information on the donors, e.g., geographical information of city and state, we can simulate by modeling each of , where is a pairwise distance between donors and . We then use to obtain an embedding of the donors. Given this embedding, we out-of-sample embed the charities and cluster them using the Mclust implementation of fraley99:_mclus for Gaussian mixture models. We note that for this example, a biclustering of is also applicable.

For this experiment, we randomly sub-sample donors and use the associated charities and transactions, which yields unique charities, unique donors, and transactions. There are unique states for the charities, and the model-based clustering yields clusters. We validate our clustering by calculating the adjusted Rand Index (ARI) between the clustering labels and the true labels of the charities. We use the state information of the charities as the true labels, and we obtain an ARI of . This number appears small at first sight, so we generate a null distribution of the adjusted Rand Index by shuffling the true labels. Figure 2 depicts the null distribution of the ARI with trials. It shows that and . The shaded area indicates the number of times the null ARIs are larger than the alternative ARI, which is the -value. With the -value of , we claim that the ARI obtained by clustering the out-of-sample embedded charities is significantly better than chance. In addition, this example also illustrates the applicability of out-of-sample embedding to scenarios where the lack of information regarding the relationships between a subset of the rows might prevent the use of spectral decomposition algorithms for embedding the whole matrix.
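
The permutation test described above can be sketched in a few lines of standard-library Python. This is an illustration on synthetic labels, not a reproduction of the CharityNet analysis: the label vectors, the noise level, and the number of trials are hypothetical, and the p-value uses the add-one estimate, which is an assumption that may differ slightly from the count used for Figure 2.

```python
import random
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(a)
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = 0.5 * (pairs_a + pairs_b)
    if maximum == expected:            # degenerate case, e.g. a single cluster
        return 1.0
    return (pairs_joint - expected) / (maximum - expected)

def permutation_p_value(clusters, truth, trials, rng):
    """Observed ARI and the share of shuffled-label ARIs at least as large."""
    observed = adjusted_rand_index(clusters, truth)
    shuffled = list(truth)
    exceed = 0
    for _ in range(trials):
        rng.shuffle(shuffled)          # null: labels unrelated to the clustering
        if adjusted_rand_index(clusters, shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (trials + 1)

# Hypothetical labels: a noisy copy of three true classes.
rng = random.Random(0)
truth = [i % 3 for i in range(60)]
clusters = [t if rng.random() < 0.7 else rng.randrange(3) for t in truth]
ari, p_value = permutation_p_value(clusters, truth, trials=500, rng=rng)
```

A small observed ARI can still be far in the tail of the null distribution, which is why the comparison against shuffled labels is more informative than the raw value.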

Figure 2: Density plots for the null distribution, under a permutation test, of the ARI values between the clustering labels and the permuted true labels (state information of the charities): (a) density plot; (b) zoomed-in tail of the density plot. The shaded area above the ARI value between the clustering labels and the true labels represents the estimated -value. The plot indicates that the ARI value of the clustering of the out-of-sample charities is statistically significantly better than chance.

5 Conclusions

In this paper we investigated the out-of-sample extension for embedding out-of-sample vertices in graphs arising from a latent position model with positive definite kernel . We showed, in Theorem 9, that if the number of in-sample vertices is sufficiently large, then with high probability, the embedding into given by the out-of-sample extension is close to the true (truncated) feature map . This implies that inference for the out-of-sample vertices using their embeddings is appropriate, e.g., Theorem 3.2. Experimental results on simulated data suggest that under suitable conditions, the degradation due to the out-of-sample extension is negligible.

The out-of-sample extension described in this paper is related to the notion of “sketching” and the Nyström approximation for matrices [bengio04:_out_lle_isomap_mds_eigen, williams01:_using_nystr, gittens:_revis_nystr, drineas05:_nystr_gram, platt05:_fastm_metric_mds_nyst]. This connection raises the question of how to select the in-sample vertices, via consideration of the Nyström approximation, so as to yield the best inferential performance on the out-of-sample vertices, i.e., whether one can use results on error bounds in the Nyström approximation to guide the selection of the in-sample vertices. A possible approach might be to select the sketching matrix , and hence the in-sample vertices, via non-uniform importance sampling based on the leverage scores of the rows of . The leverage score of row of is the norm of the -th row of in the eigen-decomposition of , and fast approximation methods to compute the leverage scores are available; see, e.g., [clarkson13:_low, drineas12:_fast]. We believe the investigation of this and other approaches to selecting the in-sample vertices will yield results that are useful and relevant for application domains.
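The leverage-score idea can be sketched as follows. This is only an illustration under stated assumptions: it uses the squared Euclidean norms of the rows of the top-d eigenvectors as scores (a common convention, which may differ from the exact norm intended above), it computes a full eigen-decomposition rather than one of the fast approximations cited, and the function names are hypothetical.

```python
import numpy as np


def leverage_scores(A, d):
    """Squared row norms of the eigenvectors for the d largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(A)
    top = np.argsort(eigvals)[::-1][:d]
    U = eigvecs[:, top]
    return np.sum(U ** 2, axis=1)  # the scores sum to d


def sample_in_sample_vertices(A, d, m, seed=0):
    """Draw m distinct vertices with probability proportional to leverage."""
    rng = np.random.default_rng(seed)
    scores = leverage_scores(A, d)
    probs = scores / scores.sum()
    chosen = rng.choice(A.shape[0], size=m, replace=False, p=probs)
    return np.sort(chosen)
```

In practice the full matrix would not be decomposed just to choose a subsample, so a sketch like this is mainly useful for experimenting with the sampling scheme on small graphs.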

Finally, as mentioned in Section 3, the out-of-sample extension as defined in this paper depends only on the in-sample vertices. Hence, the embedding of a batch of out-of-sample vertices does not use the information contained in the relationships among the out-of-sample vertices. A modification of the out-of-sample extension presented herein that uses this information in the batch setting is possible; see, e.g., [Trosset and Priebe(2008)] for such a modification in the case of classical multidimensional scaling. However, a construction similar to that in [Trosset and Priebe(2008)] yields a convex but non-linear optimization problem with no closed-form solution and is much more complicated to analyze. It is therefore of potential interest to introduce an out-of-sample extension in the batch setting that is simple and amenable to analysis.


This work was partially supported by the National Security Science and Engineering Faculty Fellowship (NSSEFF), the Johns Hopkins University Human Language Technology Center of Excellence (JHU HLT COE), the XDATA program of the Defense Advanced Research Projects Agency (DARPA) administered through Air Force Research Laboratory contract FA8750-12-2-0303, and the Acheson J. Duncan Fund for the Advancement of Research in Statistics.

Appendix A: Proof of Theorem 9

We now proceed to prove Theorem 9. First recall the definition of in terms of the Moore-Penrose pseudo-inverse of and from Definition 3. We consider the expression


where is the Moore-Penrose pseudo-inverse of and is some orthogonal matrix in . A rough sketch of the argument then goes as follows. We first show that is “close” (up to rotation) in operator norm to . This allows us to conclude that is “small”. We then show that is “small” as it is a sum of zero-mean random vectors in . We then relate to the projection of the feature map into where is induced by the eigenvectors of . Finally, we use results on the convergence of spectra of to the spectra of to show that the projection of is “close” (up to rotation) to the projection that maps into . We thus arrive at an expression of the form as in the statement of Theorem 9.

We first collect some assorted bounds for the eigenvalues of and and bounds for the projection onto the subspaces of or in the following proposition. Let and be the projection operators onto the subspace spanned by the eigenvectors corresponding to the largest eigenvalues of and , respectively. Denote by the quantity and suppose that . Assume also that satisfies . Then with probability at least , the following inequalities hold simultaneously.


[Sketch] Eq. (19) is from [Oliveira(2010)]. The bound for follows from the assumption that the range of is in . The bound for follows from Theorem 30 below. The bounds for the eigenvalues of follow from the bounds for the corresponding eigenvalues of , Eq. (19), and perturbation results, e.g., Corollary III.2.6 in [bhatia97:_matrix_analy]. Eq. (22) follows from Eq. (19) and the sin Θ theorem [davis70]. Eq. (23) follows from Eq. (22), Eq. (19), and an application of the triangle inequality.

We also note the following result on perturbation for pseudo-inverses from [Wedin(1973)]. Let and be matrices with . Let and be the Moore-Penrose pseudo-inverses of and , respectively. Then


We now provide a bound for the spectral norm of the difference . Let and . Then, with probability at least , there exists an orthogonal such that


We have and . Thus, and . Then by the preceding lemma on the perturbation of pseudo-inverses [Wedin(1973)], we have

for any orthogonal . By the proposition above, with probability at least ,

To complete the proof, we show that with probability at least , there exists some orthogonal such that

We proceed as follows. We note that and are matrices in and are of full column rank. Then by Lemma A.1 in [Tang et al.(2013)Tang, Sussman, and Priebe], there exists an orthogonal matrix such that

As and , we thus have


where the inequalities in Eq. (26) follow from the proposition above and hold with probability . Eq. (25) is thus established.

We now provide a bound for which, as sketched earlier, is one of the key steps in the proof of Theorem 9. We note that application of the multiplicative bound for the norm of a matrix-vector product, i.e., leads to a bound that is worse by a factor of . This is due to the fact that is a vector whose components are independent Bernoulli random variables and thus the scaling of the probabilities by a constant changes by a factor that is roughly . With probability at least , there exists an orthogonal such that


The proof of Lemma 26 uses the following concentration inequality for sums of independent matrices from tropp12:_user. Consider a finite sequence of independent random matrices with dimensions . Assume that each satisfies and that, for some independent of the