Persistent Homology of Graph Embeddings

by Vinesh Solanki, et al.

Popular network models such as the mixed membership and standard stochastic block model are known to exhibit distinct geometric structure when embedded into R^d using spectral methods. The resulting point cloud concentrates around a simplex in the first model, whereas it separates into clusters in the second. By adopting the formalism of generalised random dot-product graphs, we demonstrate that both of these models, and different mixing regimes in the case of mixed membership, may be distinguished by the persistent homology of the underlying point distribution in the case of adjacency spectral embedding. Moreover, despite non-identifiability issues, we show that the persistent homology of the support of the distribution and its super-level sets can be consistently estimated. As an application of our consistency results, we provide a topological hypothesis test for distinguishing the standard and mixed membership stochastic block models.




1. Introduction

Graph embedding has diverse uses, including visualisation, community detection, classification and link prediction [14] and many different approaches to graph embedding now exist with highly cited modern examples including DeepWalk [23], LINE [27] and node2vec [19]. The focus of this paper is on spectral embedding of graphs, and we initiate a more comprehensive study of the geometry of the resulting point distribution.

After embedding a graph into $\mathbb{R}^d$, a clustering procedure is typically applied to the points in the hope of identifying network communities. The spectral clustering algorithm is a famous example of this approach, and its use is widespread in network science [1, 18, 29]. Embedded point clouds arising from real-world networks or realisations of network models can, however, often exhibit richer geometry going beyond that which clustering alone is able to describe. Examples include spectral embedding of the mixed membership stochastic block model [3] and the latent structure model [5].

Our focus on spectral embedding is motivated by an existing and expanding body of rigorous statistical theory [4, 8, 21, 25, 26]. Recent results have allowed for precise statements to be made about the extent to which the inference of latent structure is possible. In particular, modulo issues of non-identifiability, it now makes sense to speak of the embedded points as being estimates of points drawn from a true underlying point distribution. In this paper we build on some of the work in [26] to make this mathematically explicit, and our geometric analysis focusses on topological aspects of this point distribution.

Differences between important network models often boil down to topological observations about these distributions. In the stochastic block model, the number of communities is equal to the number of connected components in the support of the distribution. Under the mixed membership stochastic block model, the support is connected; however, in one regime some super-level sets of the corresponding density have a void, whereas none do in the others. These observations may be inferred using the tools of topological data analysis, in particular those of persistent homology (for a general overview see [9]).

Persistent homology provides a multi-scale approach to topological inference on point clouds via the construction of nested sequences of topological spaces. We focus on several such sequences in this paper. Considering a union of balls about each point of fixed radius, and letting this radius vary allows for inferential statements to be made about the support of the underlying point distribution, whereas finer information about the structure of this distribution may be obtained from the sequence of super-level sets of a kernel density estimate.

The principal objects of study of persistent homology are persistence diagrams and barcodes, which track the birth and death times of topological features, and persistence diagrams may be compared by use of various metrics such as the bottleneck distance. A number of statistical results now exist for quantifying how a persistence diagram estimated from a sample of points taken from a distribution compares to the relevant diagram constructed from the distribution itself [12, 13, 17].

A complication of spectral embedding previously alluded to is that the embedded points are non-identifiable up to an unknown distance-distorting transformation and persistent homology is itself sufficiently geometrically sensitive not to be invariant under such transformations. Nevertheless, we demonstrate in this paper that for a large class of random graph models, it is possible to consistently estimate the persistent homology of point distributions associated with them.

As an application of our consistency results, we formulate a topological hypothesis testing framework which is capable of distinguishing the mixed membership and stochastic block models in a number of regimes, and is to the best of our knowledge the first of its kind. The development of a fuller topological hypothesis testing framework along with a more detailed investigation of what other geometric features of the point distribution may be inferred in the presence of non-identifiability is the focus of ongoing work.

2. Preliminaries

2.1. Notation

The notation used in this paper is standard. The space of real-valued $n \times d$ matrices is denoted by $\mathbb{R}^{n \times d}$. The group of orthogonal matrices in $\mathbb{R}^{d \times d}$ is denoted by $\mathbb{O}(d)$ and, given non-negative integers $p, q$ such that $p + q = d$, the corresponding indefinite orthogonal group is denoted by $\mathbb{O}(p, q)$. Recall that this is the group of all linear transformations preserving the indefinite orthogonal bilinear form

$$\langle x, y \rangle_{p,q} = x_1 y_1 + \dots + x_p y_p - x_{p+1} y_{p+1} - \dots - x_d y_d,$$

where $x, y \in \mathbb{R}^d$. Equivalently (treating elements of $\mathbb{R}^d$ as row vectors), the indefinite orthogonal group is defined to consist of all matrices $Q \in \mathbb{R}^{d \times d}$ such that

$$Q I_{p,q} Q^\top = I_{p,q},$$

where $I_{p,q}$ is defined to be the diagonal matrix consisting of $p$ $1$’s followed by $q$ $-1$’s. If $M$ is a matrix, its spectral norm is denoted by $\|M\|$ and its Frobenius norm is denoted by $\|M\|_F$. The Euclidean norm of an element $x \in \mathbb{R}^d$ is denoted by $\|x\|$. We will deal exclusively with Borel probability measures on $\mathbb{R}^d$. If $\mu$ is such a measure, its support is denoted by $\operatorname{supp}(\mu)$. Recall that this is the minimal closed set $S$ (with respect to inclusion) such that $\mu(S) = 1$.

2.2. Spectral embedding and generalised random dot product graphs

We concentrate in this paper on spectrally embedding adjacency matrices of graphs sampled according to a random graph model to be introduced below. Consider an undirected graph on $n$ nodes with (symmetric) adjacency matrix $A \in \{0,1\}^{n \times n}$. The embedding is obtained from a truncation of the eigenvalue decomposition of $A$,

$$A \approx \hat{U} \hat{S} \hat{U}^\top.$$

For some $d \le n$, the matrix $\hat{U} \in \mathbb{R}^{n \times d}$ is defined to have as columns the $d$ orthonormal eigenvectors corresponding to the $d$ largest eigenvalues of $A$ by magnitude (taking account of multiplicity), with $\hat{S} \in \mathbb{R}^{d \times d}$ the diagonal matrix consisting of these eigenvalues.

Define the matrix $\hat{X} = \hat{U} |\hat{S}|^{1/2} \in \mathbb{R}^{n \times d}$. As the notation suggests, for a suitable random graph model from which $A$ is sampled, the rows $\hat{X}_1, \dots, \hat{X}_n$ of $\hat{X}$ may be viewed as being estimates of population quantities $X_1, \dots, X_n$ drawn independently from a measure supported on $\mathbb{R}^d$. The model is defined as follows.

Definition 1.

Let be a probability measure supported on and let

denote its second moment matrix, where

. If and are non-negative integers such that , define to be -admissible if for every ,

The rank of the second moment matrix of determines the level at which the spectral decomposition of is truncated and if a probability measure is -admissible, the indefinite inner product between any pair of points in the support of can be used to give a valid probability.
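The spectral embedding step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `adjacency_spectral_embedding` is hypothetical, and the scaling of the eigenvectors by the square root of the absolute eigenvalues is assumed from the usual convention for generalised random dot product graphs.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Embed a symmetric adjacency matrix A into R^d (hypothetical helper).

    Keeps the d eigenvalues of largest magnitude and scales the matching
    eigenvectors by the square root of their absolute values (assumed
    convention for generalised random dot product graphs).
    """
    A = np.asarray(A, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(A)        # symmetric eigendecomposition
    top = np.argsort(-np.abs(eigvals))[:d]      # indices of d largest |eigenvalue|
    S_hat = eigvals[top]
    U_hat = eigvecs[:, top]
    X_hat = U_hat * np.sqrt(np.abs(S_hat))      # row i is the embedding of node i
    return X_hat, S_hat

# Toy graph: two disjoint triangles on six nodes.
A = np.zeros((6, 6))
for block in ([0, 1, 2], [3, 4, 5]):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1.0

X_hat, S_hat = adjacency_spectral_embedding(A, d=2)
```

In this toy graph the two retained eigenvalues are both positive, so `X_hat @ X_hat.T` recovers the rank-two truncation of `A`; with negative eigenvalues the indefinite inner product would be needed instead.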

Definition 2 (Generalised random dot-product graph).

Let $\mu$ be a probability measure on $\mathbb{R}^d$ that is $(p,q)$-admissible. Then a symmetric matrix $A \in \{0,1\}^{n \times n}$ is distributed according to a generalised random dot product graph with signature $(p,q)$ and base probability measure $\mu$ if, conditional on $X_1, \dots, X_n$ sampled independently and identically with distribution $\mu$, for all $i < j$,

$$A_{ij} \sim \mathrm{Bernoulli}\left(X_i I_{p,q} X_j^\top\right) \quad \text{independently,}$$

where $I_{p,q}$ is the diagonal matrix consisting of $p$ ones followed by $q$ minus ones.

Definition 2 is a slight modification of that found in [26]. It is worth noting that the base probability measure in Definition 2 is not unique due to the symmetry group of the indefinite bilinear form. Given any suitable base measure $\mu$, it is clear that the push-forward of $\mu$ by any $Q \in \mathbb{O}(p,q)$ will give an identical random graph model. The lack of identifiability of the base measure is the reason why the $\hat{X}_i$ only estimate their true counterparts $X_i$ up to an unknown indefinite orthogonal transformation. The precise way in which they do so is given by the following theorem (stated without sparsity conditions).

Theorem 3 (Theorem 5 of [26]).

Suppose that are obtained by spectrally embedding an adjacency matrix of a generalised random dot product graph with base measure . Then there exist identically and independently distributed according to , a universal constant

and a random matrix

such that

The matrices are of the form where depends on the sample drawn from and

is a random orthogonal matrix. The

can be given an explicit description (Lemma 14 in Section 5.1) and it is a contribution of the present paper that these matrices converge in some sense asymptotically to a constant matrix which can be described explicitly in terms of the second moment matrix of (Lemma 16 of Section 5.1).
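The non-identifiability discussed in this subsection can be checked numerically. The sketch below uses signature (1, 1), a hyperbolic rotation as the indefinite orthogonal transformation, and three made-up latent positions; all names and values are illustrative assumptions. It shows that the edge probabilities are unchanged while Euclidean distances are distorted.

```python
import numpy as np

def indefinite_gram(X, p, q):
    """Matrix of indefinite inner products x I_{p,q} y^T between rows of X."""
    I_pq = np.diag([1.0] * p + [-1.0] * q)
    return X @ I_pq @ X.T

def hyperbolic_rotation(t):
    """An element Q of O(1, 1): it satisfies Q I_{1,1} Q^T = I_{1,1}."""
    return np.array([[np.cosh(t), np.sinh(t)],
                     [np.sinh(t), np.cosh(t)]])

# Hypothetical latent positions, one per row.
X = np.array([[0.8, 0.1],
              [0.6, 0.3],
              [0.9, 0.2]])
Q = hyperbolic_rotation(0.7)
Y = X @ Q                                   # push the points forward by Q

# Edge probabilities agree before and after the transformation...
same_probabilities = np.allclose(indefinite_gram(X, 1, 1),
                                 indefinite_gram(Y, 1, 1))
# ...but Euclidean distances between points do not.
dist_before = np.linalg.norm(X[0] - X[1])
dist_after = np.linalg.norm(Y[0] - Y[1])
```

Because distance-based constructions such as unions of balls depend on `dist_before` rather than on the inner products, this is the sense in which persistent homology is sensitive to the choice of base measure.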

2.3. The stochastic and mixed membership stochastic block models

Both the stochastic block model and the (undirected) mixed membership stochastic block model turn out to be specific examples of generalised random dot product graphs and the manner in which they are is derived in [26]. We define these models below for the reader’s convenience. Let be a matrix that is symmetric and of full rank.

Definition 4 (Stochastic block model).

Let be a -dimensional probability vector. We say that if is a symmetric, hollow matrix such that for all , conditional on ,

where .

Definition 5 (Mixed membership stochastic block model).

Let . We say that if is a symmetric, hollow matrix such that for all , conditional on ,


and .

The matrix is used to calculate the probability of there being an edge between any two nodes of a graph sampled according to either of these models. In the case of the stochastic block model, this probability depends solely on which of the communities each node belongs to. In the case of mixed membership, a node chooses a community at random when deciding whether to form an edge, and this choice is governed by a probability vector representing community affinity.
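The two generative mechanisms just described can be simulated as follows. This is a sketch under stated assumptions: the function names, the block matrix `B` and the parameter values are hypothetical, and for mixed membership the per-edge community choices are marginalised, so that an edge between nodes i and j appears with probability `theta[i] @ B @ theta[j]` given the affinity vectors.

```python
import numpy as np

def bernoulli_symmetric(P, rng):
    """Symmetric, hollow 0/1 matrix with independent entries above the diagonal."""
    n = P.shape[0]
    upper = np.triu(rng.random((n, n)) < P, k=1)   # keep entries with i < j
    return (upper | upper.T).astype(int)

def sample_sbm(n, B, pi, rng):
    """Stochastic block model: each node belongs to exactly one community."""
    z = rng.choice(len(pi), size=n, p=pi)          # community labels
    return bernoulli_symmetric(B[z][:, z], rng), z

def sample_mmsbm(n, B, alpha, rng):
    """Mixed membership: each node has a Dirichlet affinity vector."""
    theta = rng.dirichlet(alpha, size=n)           # rows lie on the simplex
    return bernoulli_symmetric(theta @ B @ theta.T, rng), theta

rng = np.random.default_rng(0)
B = np.array([[0.6, 0.2, 0.1],                     # illustrative block matrix
              [0.2, 0.5, 0.1],
              [0.1, 0.1, 0.4]])
A_sbm, z = sample_sbm(200, B, pi=[1 / 3, 1 / 3, 1 / 3], rng=rng)
A_mm, theta = sample_mmsbm(200, B, alpha=[0.5, 0.5, 0.5], rng=rng)
```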

Adopting the formal framework of generalised random dot-product graphs, base probability measures can be explicitly described for both models. Using the singular value decomposition of

, vector representatives for each of the communities, can be computed. If in Definition 4 then the stochastic block model is a generalised random dot-product graph with base measure a mixture of point masses

An i.i.d. sample from this distribution is a community assignment (for example, if , then and are assigned to community 1) and the corresponding observed latent positions are perturbed from possibly after the application of an unknown indefinite orthogonal transformation by Theorem 3. Similarly, under the mixed membership stochastic block model, the true latent positions are Dirichlet distributed on the simplex with vertices , with the barycentric coordinates giving the probabilities that the associated node will choose each community. In both cases, there is a probability measure compactly supported on and each true latent position is a random element taking its values in with this distribution.

Point clouds obtained by carrying out adjacency spectral embedding on graphs generated according to these two models are shown in Figure 1. These correspond to three simulated graphs on nodes in three regimes: a three-community stochastic block model (with , regime 1), a mixed membership stochastic block model with all (, regime 2), and a mixed membership stochastic block model again where at least one (, regime 3). The raw embedding of each is shown in the first row.

Figure 1. Illustration of different topological spaces constructed from point clouds corresponding to spectrally embedded graphs under the mixed membership (middle and right column) and standard (left column) stochastic block model. The blue boxes indicate which models can be distinguished by these different constructions. Top: the raw embedding; middle: a union of balls of fixed radius; bottom: a kernel density estimate. Further details in main text.

The topology of the support of a base measure can be used to differentiate the mixed membership from the standard stochastic block model. The number of connected components of a finite mixture of point masses is equal to the number of those point masses which are distinct (and hence the number of communities in the stochastic block model), whereas a simplex has a single connected component. A finite point cloud does not reflect this topology, however, since its number of connected components is equal to its size. For this reason, the connected components of the topological spaces obtained by taking unions of balls of a fixed radius around each estimated latent position are considered, in the hope that for a reasonable range of such radii the true topology of the support can be recovered. The second row of Figure 1 illustrates these topological spaces for a fixed radius for the three spectral embeddings considered, with the blue boxes indicating which of the regimes can be teased apart in this manner.
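Counting the connected components of a union of balls reduces to a graph problem, since two closed balls of radius r meet exactly when their centres are at distance at most 2r. The following sketch uses a small made-up point cloud and a simple union-find; the function name `count_components` is an illustrative choice.

```python
import math

def count_components(points, r):
    """Number of connected components of the union of closed balls of radius r."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            # Two closed balls of radius r intersect iff centres are within 2r.
            if math.dist(points[i], points[j]) <= 2 * r:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

cloud = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # first tight group
         (1.0, 1.0), (1.1, 1.0),               # second tight group
         (2.0, 0.0)]                           # isolated point

components_by_radius = {r: count_components(cloud, r) for r in (0.01, 0.1, 1.0)}
```

For very small radii every point is its own component, and for very large radii everything merges, which is why a range of radii has to be considered.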

Whilst the topology of the support of the distribution of true latent positions might be useful for differentiating the mixed membership from the standard stochastic block model, being able to get at the distribution of the true latent positions is more discriminatory. Again, topological spaces must be constructed from the estimated latent positions, and in this case it is appropriate to consider super-level sets of a kernel density estimate. The third row of Figure 1 illustrates how this estimated density varies, represented with heat colours. At an appropriately chosen threshold (roughly speaking, corresponding to red in the figure), the subset of $\mathbb{R}^d$ that has density higher than this threshold has three connected components in the first and second regimes, but a single connected component in the third.

The constructions on point clouds alluded to in the previous two paragraphs are familiar objects of persistent homology. In particular, for estimating the topology of the support one uses the persistent homology of the Čech filtration built from the point cloud [11, 16, 22] and for distributional inference one uses the persistent homology of the filtration of topological spaces given by the super-level sets of a kernel density estimate [17]. The relevant constructions are described in the following subsection.

2.4. Persistent homology

We introduce in this subsection relevant notation and terminology from persistent homology. The reader wishing to obtain a more comprehensive account of the theory is referred to [22].

If $S$ is a compact subset of $\mathbb{R}^d$, the distance function to $S$ (with respect to the Euclidean metric) is defined by

$$d_S(x) = \min_{y \in S} \|x - y\|.$$

For any $x \in \mathbb{R}^d$ and $r \ge 0$, the closed ball of radius $r$ centred at $x$ is denoted by $B(x, r)$. For each $r \ge 0$, we will be interested in computing the topology of the $r$-offset

$$S^r = d_S^{-1}([0, r]) = \{x \in \mathbb{R}^d : d_S(x) \le r\}.$$

Note that $S^r$ is a sub-level set of the distance function and so the collection of $r$-offsets gives a filtration of topological spaces, and is an example of a sub-level sets filtration of a function.

If $S = \{x_1, \dots, x_n\}$ is a finite set, it is easy to see that $S^r$ is just the union of the closed balls $B(x_i, r)$ centred at the $x_i$. In this case, the topology of this set can be computed by use of the Čech complex

$$\check{C}(S, r) = \Big\{ \sigma \subseteq S : \bigcap_{x \in \sigma} B(x, r) \neq \emptyset \Big\}.$$

If $S$ is not a finite set, the topology of an offset can still be computed by using a cover of the set. The justification for both cases is provided by the Nerve Lemma (Lemma of [22] and the comments following it for dealing with closed covers).
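To make the Čech construction concrete, the sketch below lists the vertices, edges and triangles of the Čech complex of a small planar point set at a given radius. It relies on the fact that balls of radius r about a set of points share a common point exactly when the smallest ball enclosing those points has radius at most r. The helper names are hypothetical, and only simplices up to dimension two are handled.

```python
import math
from itertools import combinations

def miniball_radius_triangle(a, b, c):
    """Radius of the smallest ball containing three points."""
    la, lb, lc = math.dist(b, c), math.dist(a, c), math.dist(a, b)
    longest = max(la, lb, lc)
    # Right, obtuse or degenerate triangle: the longest side is a diameter.
    if 2 * longest ** 2 >= la ** 2 + lb ** 2 + lc ** 2:
        return longest / 2
    # Acute triangle: the smallest enclosing ball is the circumscribed one.
    s = (la + lb + lc) / 2
    area = math.sqrt(s * (s - la) * (s - lb) * (s - lc))
    return la * lb * lc / (4 * area)

def cech_simplices(points, r):
    """Vertices, edges and triangles of the Čech complex at radius r."""
    n = len(points)
    vertices = [(i,) for i in range(n)]
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if math.dist(points[i], points[j]) <= 2 * r]
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if miniball_radius_triangle(points[i], points[j], points[k]) <= r]
    return vertices, edges, triangles

# Equilateral triangle with unit sides: the edges appear at r = 0.5,
# but the triangle is only filled in once r reaches 1 / sqrt(3).
tri = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
_, edges_small, triangles_small = cech_simplices(tri, 0.55)
_, edges_large, triangles_large = cech_simplices(tri, 0.60)
```

Between the two radii the complex is a hollow triangle, so the union of balls has a one-dimensional hole that closes once the triangle is added.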

The collection of all Čech complexes for each gives the Čech filtration from which one can compute the persistent homology of the offsets of collectively. In what follows, the resulting persistence diagram will be denoted by . We recall that persistence diagrams are collections of points in each showing the birth and death time of a topological feature, and that they can be compared by use of a variety of metrics, such as the bottleneck distance (a more detailed introduction is given in Section 3.1.1 of [22]). The key result we will rely on is that persistence diagrams are stable in the following sense.

Theorem 6.

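Let $f$ and $g$ be $q$-tame functions. Then

$$d_b\left(\mathrm{Dgm}(f), \mathrm{Dgm}(g)\right) \le \|f - g\|_\infty,$$

where $d_b(\mathrm{Dgm}(f), \mathrm{Dgm}(g))$ denotes the bottleneck distance between the persistence diagrams resulting from the sub-level sets filtrations of $f$ and $g$ respectively.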

The reader is referred to [11] for an account of tameness conditions for functions and it suffices to mention that the functions we deal with are. A particular case of Theorem 6 applies to distance functions to compact sets, and the bottleneck distance between the corresponding Čech filtrations is then bounded above by the Hausdorff distance between those two sets.
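Since the Hausdorff distance controls the bottleneck distance in the setting just described, it is a useful quantity to compute for finite point clouds. A brute-force sketch follows; the function name and the example sets are illustrative assumptions.

```python
import math

def hausdorff_distance(X, Y):
    """Hausdorff distance between two finite point sets (brute force)."""
    def directed(P, Q):
        # Largest distance from a point of P to its nearest point in Q.
        return max(min(math.dist(p, q) for q in Q) for p in P)
    return max(directed(X, Y), directed(Y, X))

X = [(0.0, 0.0), (1.0, 0.0)]
Y = [(0.0, 0.1), (1.0, 0.0), (3.0, 0.0)]

d_XY = hausdorff_distance(X, Y)
```

Here the outlying point `(3.0, 0.0)` dominates the distance, which illustrates that a single stray point is enough to make the resulting bound on the bottleneck distance loose.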

In much the same way that one can consider the sub-level sets of a function $f$, it is also possible to consider its super-level sets

$$f^{-1}([\lambda, \infty)) = \{x : f(x) \ge \lambda\}$$

for any $\lambda \in \mathbb{R}$. The super-level sets of a function also give rise to a filtration of topological spaces. We consider the persistent homology of these filtrations for probability densities which are sufficiently tame, obtained by convolving (potentially) singular distributions with a smooth kernel, and the resulting persistence diagrams are also denoted (by abuse of notation) by $\mathrm{Dgm}(f)$.

3. Consistent estimation of the persistent homology of base measures

In order to apply the tools of persistent homology, it is a requirement that the base measure of a generalised random dot-product graph have bounded (and hence compact) support. That this is indeed the case is proved in Section 5.2.

Theorem 7.

Suppose that $\mu$ is $(p,q)$-admissible for some $p, q$. Then $\operatorname{supp}(\mu)$ is a bounded set.

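As motivation for some of the considerations in Section 5.2, we remark that for an arbitrary measure $\mu$ it need not be the case that its support is bounded even if the indefinite inner product $\langle x, y \rangle_{p,q} = x I_{p,q} y^\top$ lies in $[0,1]$ for all $x, y \in \operatorname{supp}(\mu)$. For an example, consider the signature $(1,1)$ and the Gaussian distribution supported on the line $\{(t, t) : t \in \mathbb{R}\} \subset \mathbb{R}^2$. For any distinct points $x, y$ on this line, it is immediate that $\langle x, y \rangle_{1,1} = 0$ and yet it has unbounded support.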

The consistency results stated in this section concern a suitable choice of base probability measure. If $\mu$ is a $(p,q)$-admissible measure defining a generalised random dot-product graph, it may be pushed forward by an element of $\mathbb{O}(p,q)$ to another $(p,q)$-admissible measure with the property that its second moment matrix is diagonal. The way in which this is possible is described in Section 5.1. This base measure is referred to as being representative in the statements that follow, though it is worth bearing in mind that it is not unique.

With regard to Čech complexes, one can consistently estimate the persistent homology of the support of a representative base measure.

Theorem 8.

Suppose that is locally lower -regular, is admissible with signature and is a representative base measure for the corresponding generalised random dot-product graph. Let be a set of points obtained by spectrally embedding an adjacency matrix generated according to this model. Then


and is a universal constant.

The technical condition of local lower -regularity on which Theorem 8 relies, and the proof of the theorem, are given in Section 5.3. The three terms appearing in the asymptotics of Theorem 8 may be motivated as follows. The first term comes from the geometric discrepancy between latent position estimates and their true counterparts, as described by Theorem 3. The second term is due to the asymptotic behaviour of the matrices described in Section 5.1. The remaining term (at least when ) gives the rate of convergence of the point cloud of true latent positions to that of the support.

Concerning the distribution , the picture here is similar due to the following lemma.

Lemma 9.


be a probability distribution with density

and let denote its push-forward by . Then the corresponding density has the property that

Lemma 9 is a straightforward consequence of the persistence equivalence theorem of [16] and its proof is therefore omitted. The lemma can be used to judiciously push distributions around without affecting the relevant persistence diagrams. Because the base measures typically dealt with are singular (a mixture of point masses in the case of the stochastic block model and a distribution supported on a lower dimensional simplex in the case of a mixed membership stochastic block model), the persistent homology of these measures convolved with a smooth kernel is considered. We proceed to define the class of kernels.

Recall that a kernel is defined to be any real-valued bi-variate map on a set. We define a kernel to be Lipschitz radial if there exists a monotonically non-increasing Lipschitz function such that for all . It is also assumed that gives a probability density on with respect to Lebesgue measure. Given any Lipschitz radial kernel and , define the kernel

with . Because indefinite orthogonal transformations are volume-preserving, it can be verified that is a probability density. Given any finite set , the corresponding kernel density estimator can be defined by

with .

The consistency theorem to follow uses rates of convergence coming from empirical process theory which are adaptive to the dimension of as developed in [20]. To this end we use the notion of volume dimension introduced in [20], namely the quantity

Theorem 10.

Let be a -admissible and representative measure and suppose that is a Lipschitz radial kernel. Fix some . Suppose that the set of points has been obtained by spectrally embedding an adjacency matrix generated according to the corresponding generalised random dot-product graph. Then


where is a universal constant and denotes the convolution of with the kernel .

The three terms appearing in the rates are again motivated in much the same way as for the corresponding statement for Čech complexes. The first term is a consequence of the manner in which the estimated latent positions converge to their true counterparts according to Theorem 3. The second term arises from an asymptotic analysis of the matrices described in Section 5.1, and the third term is due to the way in which a kernel density estimate on any sample drawn from the base measure converges to the convolution of the measure itself, and comes out of the empirical process theory of [20].

An immediate corollary of Theorem 10 is that the persistent homology of certain probability densities can be consistently estimated regardless of the distortion introduced by the indefinite orthogonal group. This result does not therefore make any assumptions about the nature of the base measure.

Corollary 11.

Suppose that has bounded and continuous density with respect to Lebesgue measure on . Then

in probability as where for any .

We conclude with a corollary which makes explicit that persistence diagrams are asymptotically estimating an invariant of the class of all base measures defining any given generalised random dot-product graph.

Corollary 12.

Let be -admissible and and suppose that and are the estimated latent position point clouds obtained by embedding adjacency matrices generated according to the corresponding generalised random dot-product graph. Then

and for any and Lipschitz radial kernel ,

as in probability.


Let be the push-forward of by some . The second moment matrices are seen to be related by from which it follows that both and have the same spectrum. By considerations in Section 5.1, both and then have corresponding representative measures where one is pushed forward from the other by an orthogonal transformation. The claim then follows by Theorems 8 and 10. ∎

4. Hypothesis testing

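In this section we use persistent homology of graph embeddings to distinguish between standard and mixed membership stochastic block models. We present the problem as a hypothesis test,

$$H_0 : A \text{ follows a stochastic block model} \quad \text{versus} \quad H_1 : A \text{ follows a mixed membership stochastic block model.}$$

Given a graph observation $A$, we will compute a test statistic $t = T(A)$ and, from its distribution under $H_0$, report the p-value

$$p = \mathbb{P}_0\left(T(A^*) \ge t\right),$$

where $A^*$ is a replicate of $A$ under $H_0$ with probability measure $\mathbb{P}_0$.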

By the discussion of Section 2, the standard and mixed membership stochastic block models can be distinguished according to the topology of their base measures, which comprises multiple connected components only in the first case. For an embedding we will therefore compute the bottleneck distance between Čech complexes

restricted, as indicated by the superscript, to $0$-th homology. The test statistic is so defined to align with the common hypothesis testing convention of rejecting $H_0$ for a large value of the statistic, which occurs when the point cloud is, in topological terms, close to the trivial simply connected set and suggests the presence of mixed membership.
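For 0-th homology the Čech persistence diagram of a finite point cloud has a simple description: every component is born at radius 0, and components merge when the radius reaches half the distance between them. The finite death radii are therefore half the edge lengths of a minimum spanning tree. The sketch below computes them with Kruskal's algorithm; it is an illustration of this fact, not the computation used in the paper, and the function name is hypothetical.

```python
import math
from itertools import combinations

def h0_death_radii(points):
    """Finite death radii of the 0-dimensional classes of the Čech filtration."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Kruskal's algorithm: scan pairs by increasing distance.
    pairs = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    deaths = []
    for length, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(length / 2)   # two components merge at this radius
    return deaths                        # n - 1 values; one class never dies

deaths = h0_death_radii([(0.0,), (1.0,), (3.0,), (7.0,)])
```

A point cloud made of well separated clusters shows a few large death radii, whereas a cloud filling a connected set shows only small ones, which is the kind of contrast a statistic based on 0-th homology can exploit.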

To obtain , we need access to or at least i.i.d. replicates of

under the null hypothesis. These are available if

and are known, since it is then straightforward to simulate replicates of from the null hypothesis, obtain corresponding spectral embeddings, and estimate

We will assume is large enough to ignore simulation error and treat as . By our consistency results, the p-value

is an observation of a random variable

that has a uniform distribution under and satisfies in probability under .

We propose to use the parametric bootstrap [7] to cope with the (presumably common) case where and are unknown. For ease of presentation we will assume that is known, but the methodology proposed is easily adapted to the case where is estimated, for example, using profile likelihood [31]. To obtain replicates of under , we first estimate and from the observed using spectral clustering. A partition of the node set is obtained by applying -means with to the point cloud , from which and are constructed component-wise using the relevant empirical frequencies. Next, graphs are generated with those estimated parameters and each is spectrally embedded to generate approximate replicates of under . This finally provides a (doubly) approximate p-value

This p-value no longer has the formal probabilistic guarantees available in the known-parameter case above; nevertheless it can be expected to hold similar or more conservative statistical properties [24], that is, for small , where is a replicate of under the null hypothesis.
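The bootstrap procedure described above has a generic shape that can be sketched independently of the graph-specific steps. In the sketch below, `fit_null`, `simulate_null` and `statistic` are hypothetical user-supplied callables standing in for, respectively, the spectral clustering fit, the simulation and embedding of a stochastic block model, and the topological test statistic. The p-value is taken to be the proportion of replicate statistics at least as large as the observed one, which is an assumption about the exact estimator.

```python
import random

def bootstrap_p_value(observed, fit_null, simulate_null, statistic, n_rep, rng):
    """Approximate p-value by parametric bootstrap (generic sketch).

    fit_null(observed)         -> parameters estimated under the null model
    simulate_null(params, rng) -> one replicate data set from the fitted null
    statistic(data)            -> real number; large values count against the null
    """
    params = fit_null(observed)
    t_obs = statistic(observed)
    t_rep = [statistic(simulate_null(params, rng)) for _ in range(n_rep)]
    # Proportion of replicates at least as extreme as the observation.
    return sum(t >= t_obs for t in t_rep) / n_rep

# Toy stand-ins: the "null model" draws uniform numbers on [0, 1).
rng = random.Random(0)
fit = lambda data: len(data)
simulate = lambda m, rng: [rng.random() for _ in range(m)]

p_extreme = bootstrap_p_value([0.2, 0.5, 2.0], fit, simulate, max, 200, rng)
p_typical = bootstrap_p_value([0.0, 0.0, 0.0], fit, simulate, max, 200, rng)
```

Passing the model-specific steps as callables keeps the resampling logic separate from the choice of embedding and test statistic.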

We will now investigate the performance of our proposed approach under several simulated conditions. Fixing a matrix to

we generate four graphs on nodes respectively from a 3-community stochastic block model with and three 3-community mixed membership stochastic block models with ,

(a uniform distribution) and

. These are each spectrally embedded into and shown in the left column of Figure 2. The colour of each point indicates its community membership, so that a purely red, green or blue point corresponds to a node belonging to a single community whereas a point with RGB colour (1/3, 1/3, 1/3) (grey) corresponds to a node with equal membership of each (which is allowed only under the mixed membership stochastic block model when the relevant ). In the right-hand column, a single replicate embedding is shown, with its associated community memberships, and the p-value computed over all replicated embeddings is shown above.

If is rejected at a standard threshold such as

, the test gives a correct outcome in the first three examples. The p-value above the first example is not small (and is purported to be uniformly distributed). Correspondingly the replicate embedding appears to exhibit similar topology to the original. The p-values in the next two examples are very small (0, up to simulation error), and indeed the replicate embeddings look different. Note however that in the second of those examples (the third row overall), determining by eye without using colour that the support is “much more disconnected” on the right than on the left is not necessarily straightforward. Finally, the last row provides an example where the test fails, giving a type II error. For large enough

we cannot distinguish the mixed membership stochastic block model from a stochastic block model whose communities are close to each other. This difficulty of course only arises when is unknown. To summarise, the problem of distinguishing mixed membership gets harder as (when the null and alternative hypotheses merge) but also, perhaps unexpectedly, as .

Figure 2. Testing for mixed membership. The left column shows spectral embeddings of graphs simulated under mixed membership and standard stochastic block models. The right column shows a corresponding replicate embedding based on a stochastic block model fit, and the p-value shown above each is computed over such replications. If is rejected at a standard threshold such as , the test gives a correct outcome in the first three rows, but a type II error in the fourth, loosely demonstrating that difficulties in detecting mixed membership arise both when and . Further details in main text.

A more systematic investigation of the statistical properties of the test is now conducted. We simulate graphs on nodes following the stochastic block model above, and more graphs on nodes following the mixed membership stochastic block model with . The test proposed is applied to each, with , so that two samples of 100 p-values are obtained. The empirical distribution of each (known as power curves) is shown in Figure 3, showing rough agreement with the theory. In particular, the test is conservative (the distribution function associated with lies below ) but nevertheless has power under the alternative (the distribution function associated with lies above ).

Figure 3. Testing for mixed membership: power curves. The empirical distribution shown with a black line corresponds to simulated approximate p-values under (specific parameters given in main text); since the function lies below (straight black line) the test appears to be conservative. The empirical distribution shown in red corresponds to simulated approximate p-values under : the test nevertheless has power under the alternative.

5. Proofs

5.1. Asymptotic behaviour of

Fix some base probability measure according to Definition 2 and suppose that . If the matrix then there is the corresponding edge probability matrix

The non-identifiability of implies that the cannot be recovered from , and this non-identifiability is manifest at the level of edge probability matrices by there being a transitive action of on the set of matrices in leaving invariant. This is a consequence of the following general remark, which is a straightforward calculation.

Remark 13.

Let be a symmetric matrix of rank and suppose that with . Then there exists a matrix such that and .

Remark 13 can be used to explicitly describe how the matrix is related to the spectral decomposition of the edge probability matrix .

Lemma 14.

Suppose that the edge probability matrix has eigenvalue decomposition and consider the eigenvalue decomposition

where is orthogonal. Then


for some matrix .


First note that

Setting , it follows that for some by Remark 13. Now has the same spectrum as , which has the same spectrum as . Put

It is readily verified that (and so ) and that the columns of are eigenvectors of by noting that . The columns of are also seen to be eigenvectors of by noting that

Suppose that has diagonal entries with multiplicity respectively, and suppose that . There exist such that where . Now both and are in and so is too. Also,

and so is block orthogonal. It is immediate that commutes with and the claim follows. ∎

More can be said about the structure of the matrix in Lemma 14 insofar as it is readily seen to centralise , i.e. that it lies in the group

That it commutes past then implies that all possible matrices are of the form where is a matrix whose columns consist of eigenvectors of . In other words, the possible matrices are determined solely by the spectral decomposition of .

Given a base measure , Lemma 14 suggests a candidate which the matrices may converge to asymptotically in the form of

where there is the eigenvalue decomposition . Again, need not be unique due to the non-uniqueness of the spectral decomposition itself, and this lack of uniqueness has to be taken into account when speaking of convergence. The key ingredient is a variant of the Davis-Kahan theorem which allows for eigenvalue multiplicity in the population matrix and is stated below.

Theorem 15 (Theorem of [30]).

Let be symmetric matrices with eigenvalues and respectively. Fix and suppose that where and . Let and let have orthonormal columns satisfying and for . Then there exists a matrix such that

Lemma 16.

Fix a matrix where has eigenvalue decomposition and suppose that where we have spectral decomposition

Then there is an orthogonal matrix centralising such that


Suppose that has diagonal entries each with multiplicity . Put and . Then for each , and one can apply Theorem 15 to obtain orthogonal matrices such that


and . Let be the block diagonal matrix comprised of the . By construction,

leaves the eigenspaces of

invariant and so centralises . Moreover,

where is the minimum eigengap of . Now we consider the decompositions


We have that

where we have used that the support of is bounded by some (Theorem 7). Now the square-root function is operator monotone on ([6], Proposition V.1.8) and hence for any positive semi-definite matrices and ,

by Theorem X.1.1 of [6]. Taking the spectral norm of the decomposition for and applying the triangle inequality then gives

Likewise, applying the spectral norm to the decomposition for and using the triangle inequality yields

The first of the terms in the sum on the right-hand side is bounded above by

For the second term one can use the polarisation identity, namely the observation that for any invertible matrices and , , thus giving the bound

For the third term, we note that