The classical statistical pattern recognition setting involves i.i.d. pairs $(X_1, Y_1), \dots, (X_n, Y_n) \sim F_{X,Y}$, where the $X_i$ are observed feature vectors and the $Y_i$ are observed class labels, for some probability distribution $F_{X,Y}$ on the product of the feature space and the label space. This setting has been extensively investigated and many important and interesting theoretical concepts and results, e.g., universal consistency, structural complexities, and arbitrarily slow convergence, are available. See, e.g., [Devroye et al.(1996)Devroye, Györfi, and Lugosi] for a comprehensive overview.
Now, suppose that the feature vectors are unobserved, and we observe instead a graph on vertices. Suppose also that is constructed in a manner such that there is a one-to-one relationship between the vertices of and the feature vectors . The question of classifying the vertices based on and the observed labels then arises naturally.
A general approach to this classification problem is illustrated by Algorithm 1 wherein inference, e.g., classification or clustering, proceeds by first embedding the graph into some Euclidean space followed by inference in that space. This approach is well-represented in the literature on multidimensional scaling, spectral clustering, and manifold learning. The approach’s popularity is due partly to its simplicity: after the embedding step, the vertices of the graph are points in Euclidean space, and classification or clustering can proceed in an almost identical manner to that of the classical setting, with a plethora of well-established and robust inference procedures available. In addition, theoretical justifications for the embedding step are also available. For example, in the spectral clustering and manifold learning literature, the embedding step is often based on the spectral decomposition of the (combinatorial or normalized) Laplacian matrices of the graph. It can then be shown that, under mild conditions on the construction of the graph, the Laplacian matrices converge in some sense to the corresponding Laplace-Beltrami operators on the domain. Thus, the eigenvalues and eigenvectors of the graph Laplacians converge to the eigenvalues and eigenfunctions of the corresponding operator. See, for example, [von Luxburg et al.(2008)von Luxburg, Belkin, and Bousquet, Hein et al.(2007)Hein, Audibert, and von Luxburg, von Luxburg(2007), Coifman and Lafon(2006), Belkin and Niyogi(2005), Hein et al.(2005)Hein, Audibert, and von Luxburg, Singer(2006), Rosasco et al.(2010)Rosasco, Belkin, and Vito] and the references therein for a survey of the results.
The results cited above suggest that the embedding is conducive to the subsequent inference task, but as they are general convergence results and do not explicitly consider the subsequent inference problem, they do not directly demonstrate that inference using the embeddings is meaningful. Recently, there have been investigations that couple the embedding step with the subsequent inference step for several widely-used random models for constructing the graph. For example, [Rohe et al.(2011)Rohe, Chatterjee, and Yu, Sussman et al.(2012a)Sussman, Tang, Fishkind, and Priebe, Fishkind et al.(2013)Fishkind, Sussman, Tang, Vogelstein, and Priebe, Chaudhuri et al.(2012)Chaudhuri, Chung, and Tsiatas] showed that clustering using the embeddings can be consistent for graphs constructed based on the stochastic block model Holland1983, the random dot product model young2007random, and the extended partition model karrer11:_stoch. In related works, sussman12:_univer,tangs.:_univer showed that one can obtain universally consistent vertex classification for graphs constructed based on the random dot product model or its generalization, the latent position model Hoff2002.
However, a major technical difficulty of the approach arises when one tries to use it to classify unlabeled out-of-sample vertices without including them in the embedding stage. A possible solution is to recompute the embedding for each new vertex. However, as many of the popular embedding methods are spectral in nature, e.g., classical multidimensional scaling torgersen52:_multid, Isomap tenebaum00:_global_geomet_framew_nonlin_dimen_reduc, Laplacian eigenmaps belkin03:_laplac and diffusion maps coifman06:_diffus_maps, the computational cost of each new embedding is of order $O(n^3)$ in the number of vertices $n$, making this solution computationally expensive. To circumvent this technical difficulty, out-of-sample extensions for many of the popular embedding methods such as those listed above have been devised, see e.g. 
faloutsos95,platt05:_fastm_metric_mds_nyst,bengio04:_out_lle_isomap_mds_eigen,williams01:_using_nystr,silva03:_global,wang99:_evaluat,trosset08. In these out-of-sample extensions, the embedding for the in-sample points is kept fixed and the out-of-sample vertices are inserted into the configuration of the in-sample points. The computational costs are thus much less, e.g., linear in the number of in-sample vertices for each insertion of an out-of-sample vertex into the existing configuration.
In this paper, we study the out-of-sample extension for the embedding step in Algorithm 1 and its impact on the subsequent inference tasks. In particular, we show that, under the latent position graph model and for a sufficiently large number of in-sample vertices, the embedding of each out-of-sample vertex is close to its true latent position. This suggests that inference for the out-of-sample vertices is possible.
The structure of our paper is as follows. We introduce the framework of latent position graphs in § 2. We describe the out-of-sample extension for the adjacency spectral embedding and analyze its properties in § 3. In § 4, we investigate via simulation the implications of performing inference using these out-of-sample embeddings. We conclude the paper with discussion of related work, how the results presented herein can be extended, and other implications.
Let be a compact metric space and let be a continuous positive definite kernel on . Let be a probability measure on the Borel σ-field of . Now, for a given , let . Let be arbitrary ( can depend on ). Define . Let be a symmetric, hollow, random binary matrix where the entries of are conditionally independent Bernoulli random variables given the latent positions, with for all , . A graph whose adjacency matrix is constructed as above is an instance of a latent position graph. The factor controls the sparsity of the resulting graph. For example, if on , then leads to sparse, connected graphs almost surely, leads to graphs with a single giant connected component, and for some fixed leads to dense graphs. We will denote by an instance of a latent position graph on with distribution , link function , and sparsity factor . We shall assume throughout this paper that for some . That is, the expected average degree of grows at least as fast as .
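The construction above can be illustrated with a minimal sketch. The names `gaussian_kernel`, `sample_latent_position_graph`, `bandwidth`, and `rho`, the uniform latent distribution, and the particular Gaussian parameterization are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix with entries in (0, 1] (assumed parameterization)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / bandwidth ** 2)

def sample_latent_position_graph(X, kernel, rho, rng):
    """Sample a symmetric, hollow, binary adjacency matrix.

    Given the latent positions X (one row per vertex), edge {i, j} with
    i < j is present independently with probability rho * kernel(X_i, X_j).
    """
    n = X.shape[0]
    P = rho * kernel(X, X)                          # edge probabilities
    upper = np.triu(rng.random((n, n)) < P, k=1)    # Bernoulli draws, i < j only
    return (upper | upper.T).astype(int)            # symmetrize; diagonal stays 0

# Illustrative run: latent positions uniform on the unit square (an assumption).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
A = sample_latent_position_graph(X, gaussian_kernel, rho=0.5, rng=rng)
```

Lowering `rho` thins the graph without changing the kernel, which is the role the sparsity factor plays in the definition.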
An example of a latent position graph model is the random dot product graph (RDPG) model of [Young and Scheinerman(2007)]. In the RDPG model, the latent space is taken to be the unit simplex in $\mathbb{R}^d$ and the link function is the Euclidean inner product. One can then take the distribution of the latent positions to be a Dirichlet distribution on the unit simplex. Another example of a latent position graph model takes the latent space to be a compact subset of $\mathbb{R}^d$ and the link function to be a radial basis function, e.g., a Gaussian kernel. This model is similar to the method of constructing graphs based on point clouds in $\mathbb{R}^d$ in the manifold learning literature. The small difference is that in the case presented here, the Gaussian kernel is used for generating the edge probabilities in the Bernoulli trials, i.e., the edges are unweighted but random, whereas in the manifold learning literature, the Gaussian kernel is used to assign weights to the edges, i.e., the edges are weighted but deterministic.
The latent position graph model and the related latent space approach Hoff2002 are widely used in network analysis. The latent position graph model is a generalization of the stochastic block model (SBM) Holland1983 and variants such as the degree-corrected SBM karrer11:_stoch, the mixed-membership SBM Airoldi2008 and the random dot product graph model young2007random. It is also closely related to the inhomogeneous random graph model bollobas07 and the exchangeable graph model diaconis08:_graph_limit_exchan_random_graph.
We now define a feature map for . will serve as our canonical feature map, i.e., our subsequent results for the out-of-sample extension are based on bounds for the deviation of the out-of-sample embedding from the canonical feature map representation, e.g., Theorem 9. The kernel defines an integral operator on , the space of -square-integrable functions on , via
is then a compact operator and is of trace class (see e.g., Theorem 4.1 in [Blanchard et al.(2007)Blanchard, Bousquet, and Zwald]). Let be the set of eigenvalues of in non-increasing order. The are non-negative and discrete, and their only accumulation point is at . Let be a set of orthonormal eigenfunctions of corresponding to the . Then by Mercer’s representation theorem cucker12, one can write
with the above sum converging absolutely and uniformly for each and in . We define the feature map via
We define a related feature map for by
We will refer to as the truncated feature map or as the truncation of . We note that the feature maps and are defined in terms of the spectrum and eigenfunctions of and thus do not depend on the scaling parameter .
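For orientation, the standard Mercer construction that this passage describes can be written out as follows. The symbols $\kappa$, $F$, $\mathscr{K}$, $\lambda_j$, $\psi_j$, $\Phi$, and $\Phi_d$ are assumed notation chosen to match common usage; this is a sketch of the textbook construction and may differ in detail from the paper's own displays.

```latex
% Assumed notation: \kappa is the kernel on \mathcal{X}, F the measure on \mathcal{X},
% (\lambda_j, \psi_j) the eigenvalue/eigenfunction pairs of the integral operator.
\begin{align*}
  (\mathscr{K} g)(x) &= \int_{\mathcal{X}} \kappa(x, x')\, g(x')\, dF(x'),
      \qquad g \in L^2(\mathcal{X}, F), \\
  \kappa(x, x') &= \sum_{j \geq 1} \lambda_j\, \psi_j(x)\, \psi_j(x'), \\
  \Phi(x) &= \bigl(\sqrt{\lambda_j}\, \psi_j(x)\bigr)_{j \geq 1} \in \ell^2,
      \qquad
  \Phi_d(x) = \bigl(\sqrt{\lambda_j}\, \psi_j(x)\bigr)_{j = 1}^{d} \in \mathbb{R}^d .
\end{align*}
```

Under this notation the kernel is recovered as an inner product, $\kappa(x, x') = \langle \Phi(x), \Phi(x') \rangle$, and the truncated map keeps only the first $d$ coordinates of $\Phi$.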
We conclude this section with some notation that will be used in the remainder of the paper. Let us denote by and the set of matrices and matrices on , respectively. For a given adjacency matrix , let be the eigen-decomposition of . For a given , let be the diagonal matrix comprising the largest eigenvalues of and let be the matrix comprising the corresponding eigenvectors. The matrices and are defined similarly. For a matrix , denotes the spectral norm of and denotes the Frobenius norm of . For a vector , denotes the -th component of and denotes the Euclidean norm of .
3 Out-of-sample extension
We now introduce the out-of-sample extension for the adjacency spectral embedding of Algorithm 1. Suppose is an instance of on vertices. Let and denote by the matrix ; is the Moore-Penrose pseudo-inverse of . Let be the -th column of . For a given , let be the (random) mapping defined by
where is a vector of independent Bernoulli random variables with . The map is the out-of-sample extension of ; that is, extends the embedding for the sampled to any .
We make some quick remarks regarding Definition 3. First, we note that the out-of-sample extension gives rise to i.i.d. random variables, i.e., if are i.i.d. from , then the are i.i.d. random variables in . Secondly, is a random mapping for any given , even when conditioned on the . The randomness of arises from the randomness in the adjacency matrix induced by the in-sample points as well as the randomness in the Bernoulli random variables used in Eq. (4). Thirdly, Eq. (4) states that the out-of-sample extension of is the least squares solution to , i.e., is the least squares projection of the (random) vector onto the subspace spanned by the columns of . The use of the least squares solution to , or equivalently the projection of onto the subspace spanned by the configuration of the in-sample points, is standard in many of the out-of-sample extensions to the popular embedding methods, see e.g. [Bengio et al.(2004)Bengio, Paiement, Vincent, Delalleau, Roux, and Ouimet, Anderson and Robinson(2003), Faloutsos and Lin(1995), de Silva and Tenenbaum(2003), Wang et al.(1999)Wang, Wang, Lin, Shasha, Shapiro, and Zhang]. In general, is a vector containing the proximity (similarity or dissimilarity) between the out-of-sample point and the in-sample points, and the least squares solution can be related to the Nyström method for approximating the eigenvalues and eigenvectors of a large matrix, see e.g. [Bengio et al.(2004)Bengio, Paiement, Vincent, Delalleau, Roux, and Ouimet, Platt(2005), Williams and Seeger(2001)].
Finally, the motivation for Definition 3 can be gleaned by considering the setting of random dot product graphs. In this setting, where is the matrix whose rows correspond to the sampled latent positions as points in . Then is equivalent (up to rotation) to . Now let be a vector of Bernoulli random variables with . Then . Thus, if we can show that , then we have . As is “close” to tangs.:_univer,sussman12:_univer and is “small” with high probability, see e.g. tropp12:_user,yurinsky95:_sums_gauss, the relationship holds for random dot product graphs. As the latent position graphs with positive definite kernels can be thought of as being random dot product graphs with latent positions being “points” in , one expects a relationship of the form for the (truncated) feature map of . Precise statements of the relationships are given in Theorem 9 and Corollary 9 below.
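The random dot product motivation can be tried numerically. The following is a minimal sketch, assuming Dirichlet latent positions and illustrative sizes; the function names `adjacency_spectral_embedding` and `out_of_sample_embed` are hypothetical, and the Procrustes alignment at the end is one convenient way (not necessarily the paper's) to account for the unidentifiable rotation.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Embed the in-sample vertices as the rows of Z = U_d S_d^{1/2}."""
    eigvals, eigvecs = np.linalg.eigh(A)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]           # indices of the d largest
    if np.any(eigvals[top] <= 0):
        raise ValueError("the d largest eigenvalues must be positive")
    return eigvecs[:, top] * np.sqrt(eigvals[top])

def out_of_sample_embed(Z, xi):
    """Least squares solution of Z t = xi, i.e. t = pinv(Z) @ xi.

    xi holds the 0/1 adjacencies between new vertices and the n in-sample
    vertices: shape (n,) for one vertex, or (n, m) for m vertices.
    """
    solution, *_ = np.linalg.lstsq(Z, xi, rcond=None)
    return solution.T

# Toy random dot product graph; all sizes below are illustrative assumptions.
rng = np.random.default_rng(0)
n, m, d = 500, 50, 3
X_in = rng.dirichlet(np.ones(d), size=n)      # in-sample latent positions
X_out = rng.dirichlet(np.ones(d), size=m)     # out-of-sample latent positions

P = X_in @ X_in.T                             # in-sample edge probabilities
upper = np.triu(rng.random((n, n)) < P, k=1)
A = (upper | upper.T).astype(float)           # symmetric, hollow adjacency matrix

# Column j holds the edges between out-of-sample vertex j and the in-sample ones.
Xi = (rng.random((n, m)) < X_in @ X_out.T).astype(float)

Z = adjacency_spectral_embedding(A, d)
T = out_of_sample_embed(Z, Xi)                # shape (m, d)

# Orthogonal Procrustes rotation aligning Z with the in-sample latent positions.
U, _, Vt = np.linalg.svd(Z.T @ X_in)
W = U @ Vt
mean_error = np.linalg.norm(T @ W - X_out, axis=1).mean()
```

The same rotation `W` fitted on the in-sample points is applied to the out-of-sample points, so `mean_error` measures how far the rotated out-of-sample embeddings are from their latent positions in this toy run.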
3.1 Out-of-sample extension and Nyström approximation
In the following discussion, we give a brief description of the relationship between Definition 3 and the Nyström approximation of [Drineas and Mahoney(2005), Gittens and Mahoney(2013)], which they called “sketching”. Let be symmetric and let with . Following [Gittens and Mahoney(2013)], let and . Then serves as a low-rank approximation to with rank at most , and [Gittens and Mahoney(2013)] refers to as the sketching matrix. Different choices for lead to different low-rank approximations. For example, a subsampling scheme corresponds to the entries of being binary variables with a single non-zero entry in each row or column. More general entries for correspond to a linear projection of the columns of . There are times when is ill-conditioned and one is instead interested in the best rank approximation to , i.e., the sketched version of is where is a rank approximation to .
Suppose now that corresponds to a subsampling scheme. Then corresponds to a sub-matrix of , i.e., corresponds to the rows and columns indexed by . Without loss of generality, we assume that is the first rows and columns of . That is, we have the following decomposition
where and . We have abused notation slightly by writing . Then can be written as
Let us now take to be the best rank approximation to in the positive semidefinite cone. Then can be written as
Now let be such that and let . Then Eq. (7) can be written as
We thus note that if is an adjacency matrix on a graph with vertices then is the adjacency matrix of the induced subgraph of on vertices. Then is the rank approximation to that arises from the adjacency spectral embedding of . Thus and therefore is the matrix each of whose rows corresponds to an out-of-sample embedding of the rows of into as defined in Definition 3.
In summary, in the context of adjacency spectral embedding, the embeddings of the in-sample and out-of-sample vertices generate a Nyström approximation to and a Nyström approximation to can be used to derive the embeddings (through an eigen-decomposition) for the in-sample and out-of-sample vertices.
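The correspondence summarized above can be checked numerically. This is a minimal sketch on a synthetic positive semidefinite matrix; the names `M`, `M11`, `M21`, `rank_d_psd_factor`, and the sizes are assumptions for illustration, and an adjacency matrix would be used in place of `M` in the paper's setting.

```python
import numpy as np

def rank_d_psd_factor(W, d):
    """Return Z with Z @ Z.T the best rank-d PSD approximation of symmetric W."""
    vals, vecs = np.linalg.eigh(W)
    top = np.argsort(vals)[::-1][:d]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

rng = np.random.default_rng(1)
G = rng.normal(size=(60, 5))
M = G @ G.T                       # synthetic symmetric PSD matrix (assumed example)
m, d = 40, 3                      # number of in-sample rows, embedding dimension
M11, M21 = M[:m, :m], M[m:, :m]   # in-sample block and out-of-sample-by-in-sample block

Z = rank_d_psd_factor(M11, d)             # in-sample embedding, shape (m, d)
T = M21 @ np.linalg.pinv(Z).T             # out-of-sample embeddings, one per row
E = np.vstack([Z, T])                     # all embeddings stacked

# Nystrom approximation built from the first m columns and the rank-d block.
C = M[:, :m]
nystrom = C @ np.linalg.pinv(Z @ Z.T, rcond=1e-10) @ C.T
```

In this sketch `E @ E.T` and `nystrom` agree up to floating-point error, which is the sense in which the stacked in-sample and out-of-sample embeddings generate the Nyström approximation. The explicit `rcond` guards against inverting numerically zero singular values of the rank-`d` block.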
3.2 Estimation of feature map
The main result of this paper is the following result on the out-of-sample mapping error . Its proof is given in the appendix. We note that the dependency on and is hidden in the spectral gap of , the integral operator induced by and . Let be given. Denote by the quantity and suppose that . Let be arbitrary. Then there exists an orthogonal such that
for some constant independent of , , , , and . We note the following corollary of the above result for the case where the latent position model is the random dot product graph model. For this case, the operator is of rank and the truncated feature map is equal (up to rotation) to the latent position . Let be an instance of . Denote by the smallest eigenvalue of . Let be arbitrary. Then there exists an orthogonal such that
for some constant independent of , , , and . We note the following result from tangs.:_univer that serves as an analogue of Theorem 9 for the in-sample points. We note that, other than the possibly different hidden constants , the bound for the out-of-sample points in Eq. (9) is almost identical to that of the in-sample points in Eq. (11). The main difference is in the power of the spectral gap in the bounds, i.e., against . This difference might be due to the proof technique and not inherent in the distinction of out-of-sample versus in-sample points. We also note that one can take the orthogonal matrix for the out-of-sample points to be the same as that for the in-sample points, i.e., the rotation that makes the in-sample points “close” to the truncated feature map also makes the out-of-sample points “close” to . Let be given. Denote by the quantity and suppose that . Let be arbitrary. Let denote the -th row of . Then there exists a unitary matrix such that for all
for some constant independent of , , , , and .
Theorem 9 and its corollary state that in the latent position model, the out-of-sample embedded points can be rotated to be very close to the true feature map with high probability. This suggests that successful statistical inference on the out-of-sample points is possible. As an example, we investigate the problem of vertex classification for latent position graphs whose link functions belong to the class of universal kernels. Specifically, we consider an approach that proceeds by embedding the vertices into some Euclidean space, followed by finding a linear discriminant in that space. It was shown in [Tang et al.(2013)Tang, Sussman, and Priebe] that such an approach can be made to yield a universally consistent vertex classifier if the vertex to be classified is embedded in-sample as the number of in-sample vertices increases to infinity. In the following discussion we present a variation of this result in the case where the vertex to be classified is embedded out-of-sample and the number of in-sample vertices is fixed and finite. We show that under this out-of-sample setting, the misclassification rate can be made arbitrarily small provided that the number of in-sample vertices is sufficiently large (see Theorem 3.2).
A continuous kernel on a metric space is said to be a universal kernel, if for some choice of feature map of to some Hilbert space , the class of functions of the form
is dense in , i.e., for any continuous and any , there exists such that . We note that if is such that is dense in for some feature map of , then is dense in for any feature map of , i.e., the universality of is independent of its choice of feature map. In addition, any feature map of a universal kernel is injective. The following result lists several well-known universal kernels. For more on universal kernels, the reader is referred to [Steinwart(2001), Micchelli et al.(2006)Micchelli, Xu, and Zhang]. Let be a compact subset of . Then the following kernels are universal on .
The exponential kernel .
The Gaussian kernel for .
The binomial kernel for .
The inverse multiquadrics for .
Let be the class of linear functions on induced by the feature map whose linear coefficients are normalized to have norm at most , i.e., if and only if is of the form
for some . We note that the is a nested increasing sequence and furthermore that
Now, given , let be the class of linear functions on induced by the out-of-sample extension , i.e., if and only if is of the form
Let be a universal kernel on . Let be arbitrary. Then for any and any , there exists and such that if then
where is the out-of-sample mapping as defined in Definition 3 and is the Bayes risk for the classification problem with distribution .
We make a brief remark regarding Theorem 3.2. The term in Eq. (14) refers to the Bayes risk for the classification problem given by the out-of-sample mapping . As noted earlier, is a random mapping for any given , even when conditioned on the , as also depends on the latent position graph generated by the . As such, with slight abuse of notation, refers to the Bayes risk of the mapping when not conditioned on any set of . That is, is the Bayes risk for out-of-sample embedding in the presence of in-sample latent positions, i.e., the latent positions of the in-sample points are integrated out. As the information processing lemma implies , one can view Eq. (14) as a sort of converse to the information processing lemma in that the degradation due to the out-of-sample embedding transformation can be made negligible if the number of in-sample points is sufficiently large.
Let be any classification-calibrated (see [Bartlett et al.(2006)Bartlett, Jordan, and McAuliffe]) convex surrogate of the 0-1 loss. For any measurable function , let be defined by . Let be a measurable function such that where the infimum is taken over the set of all measurable functions on . As is dense in the set of measurable functions on , without loss of generality we can take . Now let be arbitrary. As , and the is a nested increasing sequence, there exists a such that for some , we have . Thus, for any , there exists a such that for some . Now let be arbitrary. Then for some . Let and consider the difference . As is convex, it is locally Lipschitz and
for some constant . Furthermore, we can take to be independent of . Thus, there exists some such that for all , where the supremum is taken over all . Now let be such that . We then have
If is a classification-calibrated convex surrogate of the 0-1 loss, then there exists a non-decreasing function such that bartlett06:_convex. Thus by Eq. (16) we have
As is arbitrary, the proof is completed.
4 Experimental Results
In this section, we illustrate the out-of-sample extension described in § 3 by studying its impact on classification performance through two simulation examples and a real data example. In our first example, data is simulated using a mixture of two multivariate normals in . The components of the mixture have equal prior probabilities; the first component has mean parameter and identity covariance matrix while the second component has mean and identity covariance matrix. We sample data points from this mixture and assign class labels in to them according to the quadrant in which they fall, i.e., if then . Fig. 1(a) depicts the scatter plot of the sampled data colored according to their class labels. The Bayes risk is for classifying the . A latent position graph is then generated based on the sampled data points with being the Gaussian kernel.
To measure the in-sample classification performance, we embed into for ranging from through . A subset of vertices is then selected uniformly at random and designated as the training data set. The remaining vertices constitute the testing data set. For each choice of dimension , we select a linear classifier by performing a least squares regression on the training data points and measure the classification error of on the testing data points. The results are plotted in Fig. 1(b).
For the out-of-sample classification performance, we embed the induced graph formed by the training vertices in the above description. For each choice of dimension , we out-of-sample embed the testing vertices into . For each choice of dimension , a linear classifier is once again selected by least squares regression using the in-sample training data points and tested on the out-of-sample embedded testing data points. The classification errors are also plotted in Fig. 1(b). A quick glance at the plots in Fig. 1(b) suggests that the classification performance degradation due to the out-of-sample embedding is negligible.
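The classifier-selection step in these experiments can be sketched as follows. This is a minimal illustration, assuming labels coded as -1 and +1 and an intercept term in the regression; the helper names and the toy Gaussian points standing in for embedded vertices are hypothetical and do not reproduce the paper's simulation settings.

```python
import numpy as np

def fit_least_squares_classifier(Z_train, y_train):
    """Regress labels in {-1, +1} on the embedded points plus an intercept."""
    design = np.column_stack([Z_train, np.ones(len(Z_train))])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return coef

def classify(coef, Z):
    """Predict the sign of the fitted linear function."""
    design = np.column_stack([Z, np.ones(len(Z))])
    return np.where(design @ coef >= 0, 1, -1)

def error_rate(coef, Z_test, y_test):
    """Fraction of test points whose predicted label differs from the truth."""
    return float(np.mean(classify(coef, Z_test) != y_test))

# Toy stand-in for embedded vertices: two Gaussian clouds (illustrative only).
rng = np.random.default_rng(2)
Z_train = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y_train = np.repeat([-1, 1], 100)
Z_test = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y_test = np.repeat([-1, 1], 100)

coef = fit_least_squares_classifier(Z_train, y_train)
test_error = error_rate(coef, Z_test, y_test)
```

In the in-sample experiment `Z_train` and `Z_test` would both come from one embedding of the full graph; in the out-of-sample experiment `Z_test` would instead hold the out-of-sample embeddings of the testing vertices.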
Our next example uses the abalone dataset from the UCI machine learning repository bache2013. The data set consists of observations of nine different abalone attributes. The attributes are sex, number of rings, and seven other physical measurements of the abalones, e.g., length, diameter, and shell weight. The number of rings in an abalone is an estimate of its age in years. We consider the problem of classifying an abalone based on its physical measurements. Following the description of the data set, the class labels are as follows. An abalone is classified as class 1 if its number of rings is eight or less. It is classified as class 2 if its number of rings is nine or ten, and it is classified as class 3 otherwise. The dataset is partitioned into a training set of observations and a test set of observations. The lowest misclassification rate is reported to be 35.39% waugh95:_exten_cascad_correl.
We form a graph on vertices following a latent position model with a Gaussian kernel where represents the physical measurements of the -th abalone observation. To measure the in-sample classification performance, we embed the vertices of into and train a multi-class linear SVM on the embedding of the training vertices. We then measure the misclassification rate of this classifier on the embedding of the testing vertices. For the out-of-sample setting, we randomly choose a subset of vertices from the training set, embed the resulting induced subgraph into , and then out-of-sample embed the remaining vertices. We then train a multi-class linear SVM on the out-of-sample embedded vertices in the training set and measure the misclassification rate on the vertices in the testing set. The results for various choices of are given in Table 1.
Our final example is on the CharityNet dataset. The data set consists of 2 years of anonymized donation transactions between anonymized donors and charities. There are in total million transactions representing donations from 1.8 million donors to 5700 charities. Note that the data set does not contain any explicit information on charity-to-charity relationships, i.e., the charities relate to one another only through the donation transactions between donors and charities. We investigate the problem of clustering the charities under the assumption that there is additional information on the donors, but virtually no information on the charities.
We can view the problem as embedding an adjacency matrix followed by clustering the vertices of . Here represents the (unobserved) donor-to-donor graph, represents the donor-to-charity graph and represents the (unobserved) charity-to-charity graph. Because we only have transactions between donors and charities, we do not observe any part of except . Using the additional information on the donors, e.g., geographical information of city and state, we can simulate by modeling each of , where is a pairwise distance between donors and . We then use to obtain an embedding of the donors. Given this embedding, we out-of-sample embed the charities and cluster them using the Mclust implementation of fraley99:_mclus for Gaussian mixture models. We note that for this example, a biclustering of is also applicable.
For this experiment, we randomly sub-sample donors and use the associated charities and transactions, which yields unique charities, unique donors, and transactions. There are unique states for the charities, and the model-based clustering yields clusters. We validate our clustering by calculating the adjusted Rand Index (ARI) between the clustering labels and the true labels of the charities. We use the state information of the charities as the true labels, and we obtain an ARI of . This number appears small at first sight, so we generate a null distribution of the adjusted Rand Index by shuffling the true labels. Figure 2 depicts the null distribution of the ARI with trials. It shows that and . The shaded area indicates the fraction of trials in which the null ARI is larger than the observed ARI, which is the p-value. With the p-value of , we claim that the ARI obtained by clustering the out-of-sample embedded charities is significantly better than chance. In addition, this example also illustrates the applicability of out-of-sample embedding to scenarios where the lack of information regarding the relationships between a subset of the rows might prevent the use of spectral decomposition algorithms for embedding the whole matrix.
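The label-shuffling validation described above can be sketched in a few lines. This is a minimal illustration with a self-contained ARI implementation; the function names, the synthetic labels, and the number of trials are assumptions, and the p-value is computed as the plain fraction of shuffles whose ARI exceeds the observed one.

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)[1]
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(table, (a, b), 1)                    # count co-occurrences

    def pairs(x):
        return x * (x - 1) / 2.0                   # number of unordered pairs

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(len(a))
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:                      # degenerate case, e.g. one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def permutation_p_value(cluster_labels, true_labels, n_trials, rng):
    """Null distribution of the ARI obtained by shuffling the true labels."""
    observed = adjusted_rand_index(cluster_labels, true_labels)
    null = np.array([
        adjusted_rand_index(cluster_labels, rng.permutation(true_labels))
        for _ in range(n_trials)
    ])
    return observed, null, float(np.mean(null > observed))

# Synthetic stand-in for cluster labels versus true labels (illustrative only).
rng = np.random.default_rng(4)
true_labels = rng.integers(0, 5, size=300)
noisy_labels = np.where(rng.random(300) < 0.3, true_labels, rng.integers(0, 5, size=300))
observed, null, p_value = permutation_p_value(noisy_labels, true_labels, 200, rng)
```

A small observed ARI can still be significant under this test, because shuffled labels concentrate the null ARI tightly around zero.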
In this paper we investigated the out-of-sample extension for embedding out-of-sample vertices in graphs arising from a latent position model with positive definite kernel . We showed, in Theorem 9, that if the number of in-sample vertices is sufficiently large, then with high probability, the embedding into given by the out-of-sample extension is close to the true (truncated) feature map . This implies that inference for the out-of-sample vertices using their embeddings is appropriate, e.g., Theorem 3.2. Experimental results on simulated data suggest that under suitable conditions, the degradation due to the out-of-sample extension is negligible.
The out-of-sample extension described in this paper is related to the notion of “sketching” and Nyström approximation for matrices bengio04:_out_lle_isomap_mds_eigen,williams01:_using_nystr,gittens:_revis_nystr,drineas05:_nystr_gram,platt05:_fastm_metric_mds_nyst. This connection suggests inquiry on how to select the in-sample vertices via consideration of the Nyström approximation so as to yield the best inferential performance on the out-of-sample vertices, i.e., whether one can use results on error bounds in the Nyström approximation to augment the selection of the in-sample vertices. A possible approach might be to select the sketching matrix , and hence the in-sample vertices, via a non-uniform importance sampling based on the leverage scores of the rows of . The leverage score of row of is the norm of the -th row of in the eigen-decomposition of , and fast approximation methods to compute the leverage scores are available, see e.g. clarkson13:_low,drineas12:_fast. We believe the investigation of this and other approaches to selecting the in-sample vertices will yield results that are useful and relevant for application domains.
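The leverage-score idea mentioned above can be sketched as follows. This is only an illustration of one possible selection rule: the function names are hypothetical, the scores use the squared Euclidean norm of each row of the top eigenvectors (a common convention, assumed here), and the exact eigen-decomposition is used rather than the fast approximations cited in the text.

```python
import numpy as np

def leverage_scores(A, d):
    """Squared Euclidean norms of the rows of U_d, the top-d eigenvectors of A."""
    eigvals, eigvecs = np.linalg.eigh(A)
    top = np.argsort(eigvals)[::-1][:d]
    return (eigvecs[:, top] ** 2).sum(axis=1)

def sample_in_sample_vertices(A, d, m, rng):
    """Draw m distinct vertices with probabilities proportional to leverage."""
    scores = leverage_scores(A, d)
    return rng.choice(A.shape[0], size=m, replace=False, p=scores / scores.sum())

# Illustrative graph: independent edges with probability 0.3 (an assumption).
rng = np.random.default_rng(3)
n = 100
upper = np.triu(rng.random((n, n)) < 0.3, k=1)
A = (upper | upper.T).astype(float)

scores = leverage_scores(A, d=2)
chosen = sample_in_sample_vertices(A, d=2, m=20, rng=rng)
```

Because the columns of `U_d` are orthonormal, the scores sum to `d`, so dividing by the sum gives a valid sampling distribution over the vertices.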
Finally, as mentioned in Section 3, the out-of-sample extension as defined in this paper depends only on the in-sample vertices. Hence, the embedding of a batch of out-of-sample vertices does not use the information contained in the relationship between the out-of-sample vertices. A modification of the out-of-sample extension presented herein that uses this information in the batch setting is possible, see e.g. [Trosset and Priebe(2008)] for such a modification in the case of classical multidimensional scaling. However, a construction similar to that in [Trosset and Priebe(2008)] will yield a convex but non-linear optimization problem with no closed-form solution and is much more complicated to analyze. We thus note that it is of potential interest to introduce an out-of-sample extension in the batch setting that is simple and amenable to analysis.
This work was partially supported by National Security Science and Engineering Faculty Fellowship (NSSEFF), Johns Hopkins University Human Language Technology Center of Excellence (JHU HLT COE), the XDATA program of the Defense Advanced Research Projects Agency (DARPA) administered through Air Force Research Laboratory contract FA8750-12-2-0303, and the Acheson J. Duncan Fund for the Advancement of Research in Statistics.
Appendix A: Proof of Theorem 9
where is the Moore-Penrose pseudo-inverse of and is some orthogonal matrix in . A rough sketch of the argument then goes as follows. We first show that is “close” (up to rotation) in operator norm to . This allows us to conclude that is “small”. We then show that is “small” as it is a sum of zero-mean random vectors in . We then relate to the projection of the feature map into where is induced by the eigenvectors of . Finally, we use results on the convergence of spectra of to the spectra of to show that the projection of is “close” (up to rotation) to the projection that maps into . We thus arrive at an expression of the form as in the statement of Theorem 9.
We first collect some assorted bounds for the eigenvalues of and and bounds for the projection onto the subspaces of or in the following proposition. Let and be the projection operators onto the subspace spanned by the eigenvectors corresponding to the largest eigenvalues of and , respectively. Denote by the quantity and suppose that . Assume also that satisfies . Then with probability at least , the following inequalities hold simultaneously.
[Sketch] Eq. (19) is from [Oliveira(2010)]. The bound for follows from the assumption that the range of is in . The bound for follows from Theorem 30 below. The bounds for the eigenvalues of follow from the bounds for the corresponding eigenvalues of , Eq. (19), and perturbation results, e.g. Corollary III.2.6 in bhatia97:_matrix_analy. Eq. (22) follows from Eq. (19) and the $\sin \Theta$ theorem davis70. Eq. (23) follows from Eq. (22), Eq. (19), and an application of the triangle inequality.
We also note the following result on perturbation for pseudo-inverses from [Wedin(1973)]. Let and be matrices with . Let and be the Moore-Penrose pseudo-inverses of and , respectively. Then
We now provide a bound for the spectral norm of the difference . Let and . Then, with probability at least , there exists an orthogonal such that
We have and . Thus, and . Then, by the pseudo-inverse perturbation lemma above, we have
for any orthogonal . By the proposition above, with probability at least ,
To complete the proof, we show that with probability at least , there exists some orthogonal such that
We proceed as follows. We note that and are matrices in and are of full column rank. Then by Lemma A.1 in [Tang et al.(2013)Tang, Sussman, and Priebe], there exists an orthogonal matrix such that
As and , we thus have
where the inequalities in Eq. (26) follow from the proposition above and hold with probability . Eq. (25) is thus established. We now provide a bound for which, as sketched earlier, is one of the key steps in the proof of Theorem 9. We note that application of the multiplicative bound for the norm of a matrix-vector product, i.e., leads to a bound that is worse by a factor of . This is due to the fact that is a vector whose components are independent Bernoulli random variables and thus the scaling of the probabilities by a constant changes by a factor that is roughly . With probability at least , there exists an orthogonal such that
The proof of Lemma 26 uses the following concentration inequality for sums of independent matrices from tropp12:_user. Consider a finite sequence of independent random matrices with dimensions . Assume that each satisfies and that, for some independent of the