# Graph reduction by local variation

How can we reduce the size of a graph without significantly altering its basic properties? We approach the graph reduction problem from the perspective of restricted similarity, a modification of a well-known measure for graph approximation. Our choice is motivated by the observation that restricted similarity implies strong spectral guarantees and can be used to prove statements about certain unsupervised learning problems. The paper then focuses on coarsening, a popular type of graph reduction. We derive sufficient conditions for a small graph to approximate a larger one in the sense of restricted similarity. Our theoretical findings give rise to a novel quasi-linear algorithm. Compared to both standard and advanced graph reduction methods, the proposed algorithm finds coarse graphs of improved quality, often by a large margin, without sacrificing speed.

## 1 Introduction

As graphs grow in size, it becomes pertinent to look for generic ways of simplifying their structure while preserving key properties. Simplified graph representations find profound use in the design of approximation algorithms, can facilitate storage and retrieval, and ultimately ease graph data analysis by separating overall trends from details.

There are two main ways to simplify graphs. First, one may reduce the number of edges, a technique commonly referred to as graph sparsification. In a series of works, it has been shown that it is possible to find sparse graphs that approximate all pairwise distances (Peleg and Schäffer, 1989), every cut (Karger, 1999), or every eigenvalue (Spielman and Teng, 2011), respectively referred to as spanners, cut sparsifiers and spectral sparsifiers. Spectral sparsification techniques in particular can yield computational benefits whenever the number of edges is the main bottleneck (Batson et al., 2013). Indeed, they form a fundamental component of nearly-linear time algorithms for linear systems involving symmetric diagonally dominant matrices (Koutis et al., 2010; Spielman and Srivastava, 2011), and have found application to machine learning problems involving graph-structured data (Calandriello et al., 2018).

Alternatively, one may seek to reduce directly the size of the graph, i.e., the number of its vertices $N$, by some form of vertex selection or re-combination scheme followed by re-wiring. This idea can be traced back to the multigrid literature, which targets the acceleration of finite-element methods using cycles of multi-level coarsening, lifting and refinement. After being generalized to graphs, reduction methods have become pervasive in computer science and form a key element of modern graph processing pipelines, especially with regard to graph partitioning (Hendrickson and Leland, 1995; Karypis and Kumar, 1998; Kushnir et al., 2006; Dhillon et al., 2007; Wang et al., 2014) and graph visualization (Koren, 2002; Hu, 2005; Walshaw, 2006). In machine learning, reduction methods are used to create multi-scale representations of graph-structured data (Lafon and Lee, 2006; Gavish et al., 2010; Shuman et al., 2016) and as a layer of graph convolutional neural networks (Bruna et al., 2014; Defferrard et al., 2016; Bronstein et al., 2017; Simonovsky and Komodakis, 2017; Ardizzone et al., 2018). In addition, having been shown to solve linear systems in (empirically) linear time (Koutis et al., 2011; Livne and Brandt, 2012) as well as to approximate the Fiedler vector (Urschel et al., 2014; Gandhi, 2016), reduction methods have been considered as a way of accelerating graph-regularized problems (Hirani et al., 2015; Colley et al., 2017). Some of their main benefits are the ability to deal with sparse graphs (graphs whose number of edges is at most near-linear in $N$) and to accelerate algorithms whose complexity depends on the number of vertices as well as edges.

Yet, in contrast to graph sparsification, there has been only circumstantial theory supporting graph reduction (Moitra, 2011; Dörfler and Bullo, 2013; Loukas and Vandergheynst, 2018). The lack of a concrete understanding of how different reduction choices affect fundamental graph properties is an issue: the significant majority of reduction algorithms in modern graph processing and machine learning pipelines have been designed based on intuition and possess no rigorous justification or provable guarantees.

#### A new perspective.

My starting point in this work is spectral similarity, a measure that has been proven useful in sparsification for determining how well a graph approximates another one. To render spectral similarity applicable to graphs of different sizes, I generalize it and restrict it over a subspace whose dimension is at most equal to the size of the reduced graph. I refer to the resulting definition as restricted spectral approximation (or restricted approximation for short). (Though similarly named, the definition of restricted spectral similarity previously proposed by Loukas and Vandergheynst (2018) concerns a set of vectors rather than subspaces and is significantly weaker than the one examined here.) Despite being a statement about subspaces, restricted similarity has important consequences. It is shown that when the subspace in question is a principal eigenspace (this is a data agnostic choice where one wants to preserve the global graph structure), the eigenvalues and eigenspaces of the reduced graph approximate those of the original large graph. It is then a corollary that (i) if the large graph has a good cut so does the smaller one; and (ii) unsupervised learning algorithms that utilize spectral embeddings, such as spectral clustering (Von Luxburg, 2007) and Laplacian eigenmaps (Belkin and Niyogi, 2003), can also work well when run on the smaller graph and their solution is lifted.

The analysis then focuses on graph coarsening—a popular type of reduction where, in each level, reduced vertices are formed by contracting disjoint sets of connected vertices (each such set is called a contraction set). I derive sufficient conditions for a small coarse graph to approximate a larger graph in the sense of restricted spectral approximation. Crucially, this result holds for any number of levels and is independent of how the subspace is chosen. Though the derived bound is global, a decoupling argument renders it locally separable over levels and contraction sets, facilitating computation. The final bound can be interpreted as measuring the local variation over each contraction set, as it involves the maximum variation of vectors supported on each induced subgraph.

These findings give rise to greedy nearly-linear time algorithms for graph coarsening, which I refer to as local variation algorithms. Each such algorithm starts from a predefined family of candidate contraction sets. Even though any connected set of vertices may form a valid candidate set, I opt for small well-connected sets, formed for example by pairs of adjacent vertices or neighborhoods. The algorithm then greedily contracts those sets whose local variation is the smallest. (Even after decoupling, the problem of candidate set selection is not only NP-hard but also cannot be approximated to a constant factor in polynomial time, by reduction to the maximum-weight independent set problem. For the specific case of edge-based families, where one candidate set is constructed for each pair of adjacent vertices, the greedy iterative contraction can be substituted by more sophisticated procedures accompanied by improved guarantees.) Depending on how the candidate family is constructed, the proposed algorithms obtain different solutions, trading off computational complexity for reduction.

#### Theoretical and practical implications.

Despite not providing a definitive answer on how much one may gain (in terms of reduction) for a given error, the analysis improves and generalizes upon previous works in a number of ways:

• Instead of directly focusing on specific constructions, a general graph reduction scheme is studied featuring coarsening as a special case. As a consequence, the implications of restricted similarity are proven in a fairly general setting where specifics of the reduction (such as the type of graph representation and the reduction matrices involved) are abstracted.

• Contrary to previous results on the analysis of coarsening (Loukas and Vandergheynst, 2018), the analysis holds for multiple levels of reduction. Given that the majority of coarsening methods reduce the number of vertices by a constant factor at each level, a multi-level approach is necessary to achieve significant reduction. Along that line, the analysis also brings an intuitive insight: rather than taking the common approach of approximating at each level the graph produced by the previous level, one should strive to preserve the properties of the original graph at every level.

• The proposed local variation algorithms are not heuristically designed, but greedily optimize (an upper bound of) the restricted spectral approximation objective. Despite the breadth of the literature that utilizes some form of graph reduction and coarsening, the overwhelming majority of known methods are heuristics; see for instance (Safro et al., 2015). A notable exception is Kron reduction (Dörfler and Bullo, 2013), an elegant method that aims to preserve the effective resistance distance. Compared to Kron reduction, the graph coarsening methods proposed here are accompanied by significantly stronger spectral guarantees (i.e., beyond interlacing), do not sacrifice the sparsity of the graph, and can ultimately be more scalable as they do not rely on the Schur complement of the Laplacian matrix.

To demonstrate the practical benefits of local variation methods, the analysis is complemented with numerical results on representative graphs ranging from scale-free graphs to meshes and road networks. Compared to both standard (Karypis and Kumar, 1998) and advanced reduction methods (Ron et al., 2011; Livne and Brandt, 2012; Shuman et al., 2016), the proposed methods yield small graphs of improved spectral quality, often by a large margin, without being much slower than naive heavy-edge matching. A case in point: when examining how close the principal eigenvalues of the coarse and original graph are for a reduction of 70%, local variation methods attain on average a 2.6× smaller error; this gain becomes 3.9× if one does not include Kron reduction in the comparison.

## 2 Graph reduction and coarsening

The following section introduces graph reduction. The exposition starts by considering a general reduction scheme. It is then shown how graph coarsening arises naturally if one additionally imposes requirements w.r.t. the interpretability of reduced variables.

### 2.1 Graph reduction

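Consider a positive semidefinite (PSD) matrix $L \in \mathbb{R}^{N \times N}$ whose sparsity structure captures the connectivity structure of a connected weighted symmetric graph $G = (V, E, W)$ of $N = |V|$ vertices and $M = |E|$ edges. In other words, $L(i,j) \neq 0$ only if $e_{ij}$ is a valid edge. Moreover, let $x$ be an arbitrary vector of size $N$.

I study the following generic reduction scheme, referred to below as Scheme 2.1:

Commence by setting $L_0 = L$ and $x_0 = x$ and proceed according to the following two recursive equations:

$$L_\ell = P_\ell^{\mp} L_{\ell-1} P_\ell^{+} \quad \text{and} \quad x_\ell = P_\ell x_{\ell-1},$$

where $P_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ are matrices with more columns than rows, $\ell = 1, 2, \ldots, c$ is the level of the reduction, symbol $\mp$ denotes the transposed pseudoinverse, and $N_\ell$ is the dimensionality at level $\ell$ such that $N_0 = N$ and $N_c = n$.

Vector $x_c$ is lifted back to $\mathbb{R}^N$ by the recursion $\tilde{x}_{\ell-1} = P_\ell^{+} \tilde{x}_\ell$, where $\tilde{x}_c = x_c$.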

Graph reduction thus involves a sequence of graphs

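$$G = G_0 = (V_0, E_0, W_0) \quad G_1 = (V_1, E_1, W_1) \quad \cdots \quad G_c = (V_c, E_c, W_c) \tag{1}$$

of decreasing size $N_0 > N_1 > \cdots > N_c$, where the sparsity structure of $L_\ell$ matches that of graph $G_\ell$, and each vertex of $G_\ell$ represents one or more vertices of $G_{\ell-1}$.

The multi-level design allows us to achieve a high dimensionality reduction ratio

$$r = 1 - \frac{n}{N},$$

even when at each level the dimensionality reduction ratio $r_\ell = 1 - N_\ell / N_{\ell-1}$ is small. For instance, supposing that $r_\ell \geq \varepsilon$ for each $\ell$, then $c = O(\log(n/N) / \log(1-\varepsilon))$ levels suffice to reduce the dimension to $n$.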
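As a quick sanity check of the level count, the following sketch counts how many levels are needed when every level removes a fixed fraction of the vertices. The function name `levels_needed` and the parameter `eps` (the assumed per-level reduction ratio) are illustrative and do not come from the original text.

```python
def levels_needed(N: int, n: int, eps: float) -> int:
    """Number of levels c after which N * (1 - eps)**c <= n.

    Assumes every level removes exactly a fraction eps of the vertices,
    i.e. the worst case allowed by a per-level ratio of at least eps.
    """
    if not (0.0 < eps < 1.0):
        raise ValueError("eps must lie strictly between 0 and 1")
    if not (0 < n <= N):
        raise ValueError("need 0 < n <= N")
    size, c = float(N), 0
    while size > n:
        size *= 1.0 - eps  # one more level of reduction
        c += 1
    return c


# Halving the graph at every level: going from 1024 vertices down to a
# single vertex takes 10 levels.
print(levels_needed(1024, 1, 0.5))
```

The count grows like $\log(N/n) / \log(1/(1-\varepsilon))$, so even a modest per-level ratio yields a large overall reduction after a few levels.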

One may express the reduced quantities in a more compact form:

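$$x_c = P x, \quad L_c = P^{\mp} L P^{+} \quad \text{and} \quad \tilde{x} = \Pi x, \tag{2}$$

where $P = P_c \cdots P_1$, $P^{+} = P_1^{+} \cdots P_c^{+}$ and $\Pi = P^{+} P$. For convenience, I drop zero indices and refer to a lifted vector as $\tilde{x}$.

The rationale of this scheme is that vector $\tilde{x}$ should be the best approximation of $x$ given $P$ in an $\ell_2$-sense, which is a consequence of the following property:

###### Property 2.1.

$\Pi$ is a projection matrix.

On the other hand, matrix $L$ is reduced such that $x_c^\top L_c x_c = \tilde{x}^\top L \tilde{x}$.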
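The reduction and lifting recursions can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the helper names `reduce_multilevel` and `lift` are hypothetical, the reduction matrices are random (hence full row rank with probability one), and the pseudoinverse is computed with `numpy.linalg.pinv`.

```python
import numpy as np


def reduce_multilevel(L, x, Ps):
    """Apply L_l = pinv(P_l).T @ L_{l-1} @ pinv(P_l) and x_l = P_l @ x_{l-1}."""
    L_l, x_l = np.array(L, dtype=float), np.array(x, dtype=float)
    for P in Ps:
        P_pinv = np.linalg.pinv(P)
        L_l = P_pinv.T @ L_l @ P_pinv
        x_l = P @ x_l
    return L_l, x_l


def lift(x_c, Ps):
    """Lift a reduced vector back to the original space, last level first."""
    x_t = np.array(x_c, dtype=float)
    for P in reversed(Ps):
        x_t = np.linalg.pinv(P) @ x_t
    return x_t


rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
L = A @ A.T                                   # an arbitrary PSD matrix
x = rng.standard_normal(8)
Ps = [rng.standard_normal((5, 8)),            # level 1: dimension 8 -> 5
      rng.standard_normal((3, 5))]            # level 2: dimension 5 -> 3

L_c, x_c = reduce_multilevel(L, x, Ps)
x_tilde = lift(x_c, Ps)

# Compact form: P = P_2 P_1, P^+ = P_1^+ P_2^+ and Pi = P^+ P.
P = Ps[1] @ Ps[0]
P_plus = np.linalg.pinv(Ps[0]) @ np.linalg.pinv(Ps[1])
Pi = P_plus @ P

print(np.allclose(Pi @ Pi, Pi))                             # Pi is idempotent
print(np.allclose(x_tilde, Pi @ x))                         # lifting equals Pi x
print(np.isclose(x_c @ L_c @ x_c, x_tilde @ L @ x_tilde))   # quadratic forms agree
```

For generic matrices such as the random ones used here, $\Pi$ is idempotent but need not be symmetric; with the coarsening matrices discussed later it is also symmetric.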

Though introduced here for the reduction of sparse PSD matrices representing the similarity structure of a graph, Scheme 2.1 can also be applied to any PSD matrix $L$. In fact, this and similar reduction schemes belong to the class of Nyström methods and, to the extent of my knowledge, they were first studied in the context of low-rank matrix approximation (Halko et al., 2011; Wang and Zhang, 2013). Despite the common starting point, interpreting $L$ and $L_c$ as sparse similarity matrices, as it is done here, incorporates a graph-theoretic twist to reduction that distinguishes it from previous methods (to achieve low-rank approximation, matrix $P$ is usually built by sampling columns of $L$): the constructions that we will study are eventually more scalable and interpretable as they maintain the graph structure of $L$ after reduction. Obtaining guarantees is also significantly more challenging in this setting, as the involved problems end up being combinatorial in nature.

### 2.2 Properties of reduced graphs

Even in this general context where $P$ is an arbitrary matrix, certain handy properties can be proven about the relation between $L_c$ and $L$.

To begin with, it is simple to see that the set of positive semidefinite matrices is closed under reduction.

###### Property 2.2.

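If $L$ is PSD, then so is $L_c$.

The proof is elementary: if $L$ is PSD then there exists a matrix $S$ such that $L = S^\top S$, implying that $L_c = P^{\mp} L P^{+}$ can also be written as $L_c = S_c^\top S_c$ if one sets $S_c = S P^{+}$.

I further consider the spectrum of the two matrices. Sort the eigenvalues of $L$ as $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_N$ and denote by $\tilde{\lambda}_k$ the $k$-th smallest eigenvalue of $L_c$ and by $\tilde{u}_k$ the associated eigenvector.

It turns out that the eigenvalues $\tilde{\lambda}$ and $\lambda$ are interlaced.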

###### Theorem 2.3.

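For any $P$ with full row rank and $k = 1, \ldots, n$, we have

$$\gamma_1 \lambda_k \leq \tilde{\lambda}_k \leq \gamma_2 \lambda_{k+N-n}$$

with $\gamma_1 = \lambda_1\big((PP^\top)^{-1}\big)$ and $\gamma_2 = \lambda_n\big((PP^\top)^{-1}\big)$, respectively the smallest and largest eigenvalue of $(PP^\top)^{-1}$.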

The above result is a generalization of the Cauchy interlacing theorem for the case that $PP^\top \neq I$. It also resembles the interlacing inequalities known for the normalized Laplacian (where the re-normalization is obtained by construction). Chen et al. (2004) showed in Theorem 2.7 of their paper an inequality for the eigenvalues obtained after contracting edges that resembles the upper bound above. The lower bound is akin to that given in (Chung, 1997, Lemma 1.15), again for the normalized Laplacian. Also notably, the inequalities are similar to those known for Kron reduction (Dörfler and Bullo, 2013, Lemma 3.6).

Theorem 2.3 is particularly pessimistic as it has to hold for every possible $P$ and $L$. Much stronger results will be obtained later on by restricting the attention to constructions that satisfy additional properties (see Theorem 3.3).
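The interlacing inequalities can be checked numerically on random instances. The sketch below assumes that $\gamma_1$ and $\gamma_2$ are the extreme eigenvalues of $(PP^\top)^{-1}$ and that eigenvalues are sorted in ascending order; the function name `interlacing_check` and the tolerance are illustrative choices.

```python
import numpy as np


def interlacing_check(L, P, tol=1e-8):
    """Return True if gamma_1*lambda_k <= lambda~_k <= gamma_2*lambda_{k+N-n}."""
    N, n = L.shape[0], P.shape[0]
    lam = np.linalg.eigvalsh(L)                      # ascending eigenvalues of L
    P_pinv = np.linalg.pinv(P)
    L_c = P_pinv.T @ L @ P_pinv                      # reduced matrix
    lam_c = np.linalg.eigvalsh((L_c + L_c.T) / 2)    # symmetrize against round-off
    gam = np.linalg.eigvalsh(np.linalg.inv(P @ P.T))
    gamma1, gamma2 = gam[0], gam[-1]
    lower = gamma1 * lam[:n]                         # gamma_1 * lambda_k, k = 1..n
    upper = gamma2 * lam[N - n:]                     # gamma_2 * lambda_{k+N-n}
    slack = tol * max(1.0, upper.max())              # absolute numerical slack
    return bool(np.all(lower <= lam_c + slack) and np.all(lam_c <= upper + slack))


rng = np.random.default_rng(0)
A = rng.standard_normal((9, 9))
L = A @ A.T                       # random PSD matrix
P = rng.standard_normal((4, 9))   # random full-row-rank reduction matrix
print(interlacing_check(L, P))
```

When $P$ has orthonormal rows, both constants equal one and the check reduces to the classical Cauchy interlacing theorem.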

One can also say something about the eigenvectors of $L_c$.

###### Property 2.4.

For every vector $x$ for which $x = \Pi x$, one has

$$x_c^\top L_c x_c = x^\top \Pi L \Pi x = x^\top L x \quad \text{and} \quad \tilde{x} = \Pi x = x.$$

In other words, reduction maintains the action of $L$ on every vector that lies in the range of $\Pi$. Naturally, after lifting the eigenvectors of $L_c$ are included in this class.

### 2.3 Coarsening as a type of graph reduction

Coarsening is a type of graph reduction abiding by a set of constraints that render the graph transformation interpretable. More precisely, in coarsening one selects for each level $\ell$ a surjective (i.e., many-to-one) map $\varphi_\ell: V_{\ell-1} \to V_\ell$ between the original vertex set $V_{\ell-1}$ and the smaller vertex set $V_\ell$. I refer to the set of vertices $V_{\ell-1}^{(r)} \subseteq V_{\ell-1}$ mapped onto the same vertex $v'_r$ of $V_\ell$ as a contraction set:

$$V_{\ell-1}^{(r)} = \{ v \in V_{\ell-1} : \varphi_\ell(v) = v'_r \}$$

For a graphical depiction of contraction sets, see Figure 3. I also constrain $\varphi_\ell$ slightly by requiring that the subgraph of $G_{\ell-1}$ induced by each contraction set is connected.

It is easy to deduce that contraction sets induce a partitioning of $G_{\ell-1}$ into $N_\ell$ subgraphs, each corresponding to a single vertex of $G_\ell$. Every reduced variable thus corresponds to a small set of adjacent vertices in the original graph and coarsening basically amounts to a scaling operation. An appropriately constructed coarse graph aims to capture the global problem structure, whereas neglected details can be recovered in a local refinement phase.

Coarsening can be placed in the context of Scheme 2.1 by restricting each $P_\ell$ to lie in the family of coarsening matrices, defined next:

###### Definition 2.5 (Coarsening matrix).

Matrix $P_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ is a coarsening matrix w.r.t. graph $G_{\ell-1}$ if and only if it satisfies the following two conditions:

• It is a surjective mapping of the vertex set, meaning that if $P_\ell(r,i) \neq 0$ then $P_\ell(r',i) = 0$ for every $r' \neq r$.

• It is locality preserving, equivalently, the subgraph of $G_{\ell-1}$ induced by the non-zero entries of $P_\ell(r,:)$ is connected for each $r$.

An interesting consequence of this definition is that, in contrast to graph reduction, with coarsening matrices the expensive pseudo-inverse computation can be substituted by simple transposition and re-scaling:

###### Proposition 2.1 (Easy inversion).

The pseudo-inverse of a coarsening matrix $P_\ell$ is given by $P_\ell^{+} = P_\ell^\top Q_\ell^{-2}$, where $Q_\ell$ is the diagonal matrix with $Q_\ell(r,r) = \|P_\ell(r,:)\|_2$.

Proposition 2.1 carries two consequences. First, coarsening can be done in linear time. Each coarsening level (both in the forward and backward directions) entails multiplication by a sparse matrix. Furthermore, both $P_\ell$ and $P_\ell^{+}$ have only $N_{\ell-1}$ non-zero entries, meaning that $O(N)$ and $O(M)$ operations suffice to coarsen respectively a vector and a matrix $L$ whose sparsity structure reflects the graph adjacency. In addition, the number of graph edges also decreases at each level. Denoting by $\mu_\ell$ the average number of edges of the graphs induced by contraction sets $V_{\ell-1}^{(r)}$ for every $r$, then a quick calculation reveals that the coarsest graph has $M_c = M - \sum_{\ell=1}^{c} N_\ell \mu_\ell$ edges. If, for instance, at each level all nodes are perfectly contracted into pairs then $\mu_\ell = 1$ and $N_\ell = N/2^\ell$, meaning that $M_c = M - N(1 - 2^{-c})$.
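The easy-inversion formula can be compared against a generic pseudoinverse routine. In this sketch the function name `coarsening_pinv` and the example matrix are hypothetical; the only property used is that the rows of a coarsening matrix have disjoint supports, so that $PP^\top$ is diagonal.

```python
import numpy as np


def coarsening_pinv(P):
    """Pseudo-inverse of a coarsening matrix computed as P^T Q^{-2}.

    Assumes every column of P has at most one non-zero entry and every
    row has at least one, so that P P^T is diagonal and invertible.
    """
    q_squared = np.sum(P ** 2, axis=1)   # Q(r, r)^2 = squared norm of row r
    return P.T / q_squared               # scales column r of P^T by 1 / Q(r, r)^2


# A hypothetical coarsening matrix with rows supported on disjoint vertex sets.
P = np.array([[0.5, 2.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, -3.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 4.0]])

print(np.allclose(coarsening_pinv(P), np.linalg.pinv(P)))
```

The transposition and row-wise rescaling touch each non-zero entry once, in contrast to the dense factorization that a general-purpose pseudoinverse performs.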

### 2.4 Laplacian consistent coarsening

A further restriction that can be imposed is that coarsening is consistent w.r.t. the Laplacian form. Let $L$ be the combinatorial Laplacian of $G$ defined as

$$L(i,j) = \begin{cases} d_i & \text{if } i = j \\ -w_{ij} & \text{if } e_{ij} \in E \\ 0 & \text{otherwise,} \end{cases}$$

where $w_{ij}$ is the weight associated with edge $e_{ij}$ and $d_i$ the weighted degree of $v_i$. The following proposition can then be proven:

###### Proposition 2.2 (Consistency).

Let $P$ be a coarsening matrix w.r.t. a graph with combinatorial Laplacian $L$. Matrix $L_c = P^{\mp} L P^{+}$ is a combinatorial Laplacian if and only if the non-zero entries of $P^{+}$ are equally valued.

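It is a corollary of Propositions 2.1 and 2.2 that in consistent coarsening, for any $v'_r \in V_\ell$ and $v_i \in V_{\ell-1}$, matrices $P_\ell$ and $P_\ell^{+}$ should be given by:

$$P_\ell(r,i) = \begin{cases} 1 / |V_{\ell-1}^{(r)}| & \text{if } v_i \in V_{\ell-1}^{(r)} \\ 0 & \text{otherwise} \end{cases} \qquad P_\ell^{+}(i,r) = \begin{cases} 1 & \text{if } v_i \in V_{\ell-1}^{(r)} \\ 0 & \text{otherwise,} \end{cases}$$

where the contraction sets $V_{\ell-1}^{(r)}$ were defined in Section 2.3.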

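The toy graph shown in Figure (a) illustrates an example where the gray vertices $v_1, v_2, v_3$ of $G$ are coarsened into vertex $v'_1$, as shown in Figure (b). The main matrices I have defined are

$$P_1 = \begin{bmatrix} 1/3 & 1/3 & 1/3 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} \qquad P_1^{+} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \Pi = P_1^{+} P_1 = \begin{bmatrix} 1/3 & 1/3 & 1/3 & 0 & 0 \\ 1/3 & 1/3 & 1/3 & 0 & 0 \\ 1/3 & 1/3 & 1/3 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$

and coarsening results in

$$L_c = P_1^{\mp} L P_1^{+} = \begin{bmatrix} 2 & -1 & -1 \\ -1 & 1 & 0 \\ -1 & 0 & 1 \end{bmatrix} \qquad x_c = P_1 x = \begin{bmatrix} (x(1)+x(2)+x(3))/3 \\ x(4) \\ x(5) \end{bmatrix}.$$

Finally, when lifted $x_c$ becomes

$$\tilde{x} = P_1^{+} x_c = \begin{bmatrix} (x(1)+x(2)+x(3))/3 \\ (x(1)+x(2)+x(3))/3 \\ (x(1)+x(2)+x(3))/3 \\ x(4) \\ x(5) \end{bmatrix}.$$

Since vertices $v_4$ and $v_5$ are not affected, the respective contraction sets $V_0^{(2)}$ and $V_0^{(3)}$ are singleton sets.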
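The toy example can be reproduced numerically. The edge list below is an assumption, since the figure is not available here: it is one unit-weight graph on five vertices (a triangle on the first three vertices plus two pendant edges) whose coarsening yields the coarse Laplacian stated above. The helper names are hypothetical and indices are zero-based.

```python
import numpy as np


def laplacian(N, edges):
    """Combinatorial Laplacian from a list of (i, j, weight) edges."""
    L = np.zeros((N, N))
    for i, j, w in edges:
        L[i, i] += w
        L[j, j] += w
        L[i, j] -= w
        L[j, i] -= w
    return L


def consistent_coarsening(N, sets):
    """P averages over each contraction set; P_plus copies values back."""
    P, P_plus = np.zeros((len(sets), N)), np.zeros((N, len(sets)))
    for r, S in enumerate(sets):
        P[r, S] = 1.0 / len(S)
        P_plus[S, r] = 1.0
    return P, P_plus


# Assumed toy graph: triangle {0, 1, 2} with pendant vertices 3 and 4.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (1, 3, 1.0), (2, 4, 1.0)]
L = laplacian(5, edges)
P1, P1_plus = consistent_coarsening(5, [[0, 1, 2], [3], [4]])

L_c = P1_plus.T @ L @ P1_plus          # transposed pseudoinverse on the left
x = np.array([3.0, 6.0, 9.0, 1.0, 2.0])
x_c = P1 @ x                           # averages the first three entries
x_tilde = P1_plus @ x_c                # copies the average back

print(L_c)
print(x_c, x_tilde)
```

The off-diagonal entries of the coarse Laplacian equal minus the total weight between the corresponding contraction sets, and both matrices map constant vectors to constant vectors.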

### 2.5 Properties of Laplacian consistent coarsening

Due to its particular construction, Laplacian consistent coarsening is accompanied by a number of interesting properties. We lay out three in the following:

Cuts. To begin with, weights of edges in $G_c$ correspond to weights of cuts in $G$.

###### Property 2.6.

For any level $\ell$, the weight $W_\ell(r,q)$ between vertices $v'_r, v'_q \in V_\ell$ is equal to

$$W_\ell(r,q) = \sum_{v_i \in S_\ell^{(r)}} \sum_{v_j \in S_\ell^{(q)}} w_{ij},$$

where $S_\ell^{(r)}$ contains all vertices of $V_0$ contracted onto $v'_r \in V_\ell$.

In the toy example, there exists a single edge of unit weight connecting vertices in $S_1^{(1)}$ and $S_1^{(2)}$, and as such the weight between $v'_1$ and $v'_2$ is equal to one.

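Eigenvalue interlacing. For a single level of Laplacian consistent coarsening, matrix $(P_1 P_1^\top)^{-1}$ is diagonal with entries equal to the sizes of the contraction sets, implying that the multiplicative constants in Theorem 2.3 are:

$$\gamma_1 = \min_{v_i \in V} \big|V_0^{(\varphi_1(v_i))}\big| \geq 1 \quad \text{and} \quad \gamma_2 = \max_{v_i \in V} \big|V_0^{(\varphi_1(v_i))}\big|.$$

Above, $\varphi_1(v_i)$ is the vertex onto which $v_i$ is mapped and the set $V_0^{(\varphi_1(v_i))}$ contains all vertices also contracted onto $\varphi_1(v_i)$. Thus in the toy example, $\lambda_k \leq \tilde{\lambda}_k \leq 3 \lambda_{k+2}$ for every $k \leq 3$. If multiple levels are utilized these terms become dependent on the sequence of contractions. To obtain a general bound let $\varphi_{\ell \circ 1}(v_i) = \varphi_\ell \circ \cdots \circ \varphi_1(v_i)$ be the vertex onto which $v_i$ is contracted in the $\ell$-th level.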

###### Property 2.7.

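If $L_c$ is obtained from $L$ by Laplacian consistent coarsening, then

$$\gamma_1 \geq \min_{v_i \in V} \prod_{\ell=1}^{c} \big|V_{\ell-1}^{(\varphi_{\ell \circ 1}(v_i))}\big| \geq 1 \quad \text{and} \quad \gamma_2 \leq \max_{v_i \in V} \prod_{\ell=1}^{c} \big|V_{\ell-1}^{(\varphi_{\ell \circ 1}(v_i))}\big|,$$

with the set $V_{\ell-1}^{(\varphi_{\ell \circ 1}(v_i))}$ containing all vertices of $V_{\ell-1}$ that are contracted onto $\varphi_{\ell \circ 1}(v_i)$.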

Though not included, the proof follows from the diagonal form of $P_\ell P_\ell^\top$ and the special row structure of each $P_\ell$ for every $\ell$. The dependency of $\gamma_1$ and $\gamma_2$ on the size of contraction sets can be removed either by enforcing at each level that all contraction sets have identical size and dividing the graph weights by that size, or by re-normalizing each $P_\ell$ such that $P_\ell P_\ell^\top = I$. The latter approach was used by Loukas and Vandergheynst (2018) but is not adopted here as it results in $L_c$ losing its Laplacian form.

Nullspace. Finally, as is desirable, the structure of the nullspace of $L$ is preserved both by coarsening and lifting:

###### Property 2.8.

If $P$ is a (multi-level) Laplacian consistent coarsening matrix, then

$$P 1_N = 1_n \quad \text{and} \quad P^{+} 1_n = 1_N,$$

where the subscript indicates the dimensionality of the constant vector.

Thus, we can safely ignore vectors parallel to the constant vector in our analysis.

## 3 Restricted notions of approximation

This section aims to formalize how a graph should be reduced such that the structure of the reduced and original problems is as close as possible. Inspired by work in graph sparsification, I introduce a measure of approximation that is tailored to graph reduction. The new definition implies strong guarantees about the distance of the original and coarsened spectrum and gives conditions such that the cut structure of a graph is preserved by coarsening.

### 3.1 Restricted spectral approximation

One way to define how close a PSD matrix $L$ is to its reduced counterpart $L_c$ is to establish an isometry guarantee w.r.t. the following induced semi-norms:

$$\|x\|_L = \sqrt{x^\top L x} \quad \text{and} \quad \|x_c\|_{L_c} = \sqrt{x_c^\top L_c x_c}$$

Ideally, one would hope that there exists $\epsilon > 0$ such that

$$(1-\epsilon) \|x\|_L \leq \|x_c\|_{L_c} \leq (1+\epsilon) \|x\|_L \tag{3}$$

for all $x \in \mathbb{R}^N$.

If the equation holds, matrices $L_c$ and $L$ are called $\epsilon$-similar. The objective of constructing sparse spectrally similar graphs is the main idea of spectral graph sparsifiers, a popular method for accelerating the solution of linear systems involving the Laplacian. In addition, spectral similarity carries a number of interesting consequences that are of great help in the construction of approximation algorithms: the eigenvalues and eigenvectors of two similar graphs are close and, moreover, all vertex partitions have similar cut size.

In contrast to graph sparsification however, since here the dimension of the space changes it is impossible to satisfy (3) for every $x \in \mathbb{R}^N$ unless one trivially sets $\epsilon \geq 1$ (this follows by a simple rank argument). To carry out a meaningful analysis, one needs to consider a subspace of dimension $k \leq n$ and aim to approximate the behavior of $L$ solely within it.

I define the following generalization of spectral similarity:

###### Definition 3.1 (Restricted spectral approximation).

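Let $\mathbf{R}$ be a $k$-dimensional subspace of $\mathbb{R}^N$. Matrices $L_c$ and $L$ are $(\mathbf{R}, \epsilon)$-similar if there exists an $\epsilon \geq 0$ such that

$$\|x - \tilde{x}\|_L \leq \epsilon \|x\|_L, \quad \text{for all } x \in \mathbf{R},$$

where $\tilde{x} = P^{+} P x$.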

In addition to the restriction on $\mathbf{R}$, the above definition differs from (3) in the way error is measured. In fact, it asserts a property that is slightly stronger than an approximate isometry w.r.t. a semi-norm within $\mathbf{R}$. The strengthening of the notion of approximation deviates from the restricted spectral similarity property proposed by Loukas and Vandergheynst (2018) and is a key ingredient in obtaining multi-level bounds. Nevertheless, one may recover a restricted spectral similarity-type guarantee as a direct consequence:

###### Corollary 3.2.

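If $L_c$ and $L$ are $(\mathbf{R}, \epsilon)$-similar, then

$$(1-\epsilon) \|x\|_L \leq \|x_c\|_{L_c} \leq (1+\epsilon) \|x\|_L, \quad \text{for all } x \in \mathbf{R}.$$

###### Proof.

Let $S$ be defined such that $L = S^\top S$. By the triangle inequality:

$$\big| \|x\|_L - \|x_c\|_{L_c} \big| = \big| \|Sx\|_2 - \|S P^{+} P x\|_2 \big| \leq \|Sx - S P^{+} P x\|_2 = \|x - \tilde{x}\|_L \leq \epsilon \|x\|_L,$$

which is equivalent to the claimed relation. ∎

Clearly, if $L_c$ and $L$ are $(\mathbf{R}, \epsilon)$-similar then they are also $(\mathbf{R}', \epsilon')$-similar, where $\mathbf{R}'$ is any subspace of $\mathbf{R}$ and $\epsilon' \geq \epsilon$. As such, results about large subspaces and small $\epsilon$ are the most desirable.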

It will be shown in Sections 3.2 and 3.3 that the above definition implies restricted versions of the spectral and cut guarantees provided by spectral similarity. For instance, instead of attempting to approximate the entire spectrum as done by spectral graph sparsifiers, here one can focus on a subset of the spectrum with particular significance.
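For a given subspace, the smallest constant satisfying the restricted approximation inequality can be computed as the square root of a largest generalized eigenvalue. The sketch below is one way to do so under stated assumptions: the subspace is given by a basis matrix `V` whose span meets the nullspace of the Laplacian only at zero, the example graph (a weighted path whose heavy edges are contracted) is hypothetical, and all function names are illustrative.

```python
import numpy as np


def laplacian(N, edges):
    """Combinatorial Laplacian from a list of (i, j, weight) edges."""
    L = np.zeros((N, N))
    for i, j, w in edges:
        L[i, i] += w
        L[j, j] += w
        L[i, j] -= w
        L[j, i] -= w
    return L


def coarsening_matrices(N, sets):
    """Laplacian consistent P (averaging) and P_plus (copying)."""
    P, P_plus = np.zeros((len(sets), N)), np.zeros((N, len(sets)))
    for r, S in enumerate(sets):
        P[r, S] = 1.0 / len(S)
        P_plus[S, r] = 1.0
    return P, P_plus


def restricted_epsilon(L, P, P_plus, V):
    """Smallest eps with ||x - Pi x||_L <= eps * ||x||_L for all x in span(V)."""
    N = L.shape[0]
    R = (np.eye(N) - P_plus @ P) @ V          # residuals (I - Pi) V
    A, B = R.T @ L @ R, V.T @ L @ V
    # Solve max_a (a^T A a) / (a^T B a) by whitening with a Cholesky factor of B.
    C_inv = np.linalg.inv(np.linalg.cholesky((B + B.T) / 2))
    M = C_inv @ A @ C_inv.T
    top = np.linalg.eigvalsh((M + M.T) / 2)[-1]
    return float(np.sqrt(max(top, 0.0)))


# Hypothetical example: a path whose heavy edges (weight 10) are contracted.
weights = [10.0, 1.0, 10.0, 1.0, 10.0, 1.0, 10.0]
L = laplacian(8, [(i, i + 1, w) for i, w in enumerate(weights)])
P, P_plus = coarsening_matrices(8, [[0, 1], [2, 3], [4, 5], [6, 7]])

lam, U = np.linalg.eigh(L)
V = U[:, 1:3]                  # second and third eigenvectors of L
eps = restricted_epsilon(L, P, P_plus, V)
print(eps)
```

Shrinking the subspace can only decrease the returned constant, which matches the remark that results about large subspaces and small constants are the most desirable.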

### 3.2 Implications for the graph spectrum

One of the key benefits of restricted spectral approximation is that it implies a relation between the spectra of matrices $L$ and $L_c$ that goes beyond interlacing (see Theorem 2.3).

To this effect, consider the $k$ smallest eigenvalues and respective eigenvectors and define the following matrices:

$$U_k = [u_1, u_2, \ldots, u_k] \in \mathbb{R}^{N \times k} \quad \text{and} \quad \Lambda_k = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k)$$

As I will show next, ensuring that $\epsilon$ in Proposition 4.1 is small when $\mathbf{R} = \mathrm{span}(U_k)$ suffices to guarantee that the first $k$ eigenvalues and eigenvectors of $L$ and $L_c$ are aligned.

The first result concerns eigenvalues.

###### Theorem 3.3 (Eigenvalue approximation).

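If $L_c$ and $L$ are $(\mathrm{span}(U_k), \epsilon_k)$-similar, then

$$\gamma_1 \lambda_k \leq \tilde{\lambda}_k \leq \gamma_2 \frac{(1+\epsilon_k)^2}{1 - \epsilon_k^2 (\lambda_k / \lambda_2)} \lambda_k,$$

whenever $\epsilon_k^2 < \lambda_2 / \lambda_k$.

Crucially, the bound depends on $\lambda_k$ instead of $\lambda_{k+N-n}$ and thus can be significantly tighter than the one given by Theorem 2.3. Noticing that $\epsilon_k \leq \epsilon_{k'}$ whenever $k \leq k'$, one also deduces that it is stronger for smaller eigenvalues. For $k = 2$ in particular, one has

$$\gamma_1 \lambda_2 \leq \tilde{\lambda}_2 \leq \gamma_2 \frac{(1+\epsilon_2)^2}{1 - \epsilon_2^2} \lambda_2,$$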

which is small when $\epsilon_2 \ll 1$.
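The eigenvalue bounds can be evaluated on a small example. The following sketch assumes a single level of Laplacian consistent coarsening, so that the two multiplicative constants are the smallest and largest contraction set sizes, and computes the restricted approximation constant over the span of the first $k$ eigenvectors. The function name `eigenvalue_bounds` and the example graph are hypothetical; the upper bound is reported as `None` when its validity condition fails.

```python
import numpy as np


def eigenvalue_bounds(L, sets, k):
    """Return (lower, coarse eigenvalue, upper, eps_k) for the k-th eigenvalue."""
    N, n = L.shape[0], len(sets)
    P, P_plus = np.zeros((n, N)), np.zeros((N, n))
    for r, S in enumerate(sets):
        P[r, S] = 1.0 / len(S)
        P_plus[S, r] = 1.0
    lam, U = np.linalg.eigh(L)                       # ascending eigenvalues
    lam_c = np.linalg.eigvalsh(P_plus.T @ L @ P_plus)
    # eps_k over span(u_1..u_k): the constant vector u_1 is reproduced exactly
    # by Pi, so only u_2..u_k contribute (this requires k >= 2).
    Uk, lk = U[:, 1:k], lam[1:k]
    R = (np.eye(N) - P_plus @ P) @ Uk / np.sqrt(lk)  # residuals, L-normalized
    A = R.T @ L @ R
    eps = float(np.sqrt(max(np.linalg.eigvalsh((A + A.T) / 2)[-1], 0.0)))
    sizes = [len(S) for S in sets]
    gamma1, gamma2 = min(sizes), max(sizes)
    lower = gamma1 * lam[k - 1]
    denom = 1.0 - eps ** 2 * lam[k - 1] / lam[1]
    upper = gamma2 * (1 + eps) ** 2 / denom * lam[k - 1] if denom > 0 else None
    return lower, lam_c[k - 1], upper, eps


# Hypothetical example: weighted path with heavy edges inside contraction sets.
N = 8
L = np.zeros((N, N))
for i, w in enumerate([10.0, 1.0, 10.0, 1.0, 10.0, 1.0, 10.0]):
    L[i, i] += w
    L[i + 1, i + 1] += w
    L[i, i + 1] -= w
    L[i + 1, i] -= w
sets = [[0, 1], [2, 3], [4, 5], [6, 7]]

lower, coarse, upper, eps = eigenvalue_bounds(L, sets, k=2)
print(lower, coarse, upper, eps)
```

Contraction sets joined internally by heavy edges keep the low-frequency eigenvectors nearly constant within each set, which is the regime where the restricted approximation constant, and hence the gap between the two bounds, is small.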

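I also analyze the angle between principal eigenspaces of $L$ and $L_c$. I follow Li (1994) and split the eigendecompositions of $L$ and $L_c$ as

$$L = (U_k, U_{k\perp}) \begin{pmatrix} \Lambda_k & \\ & \Lambda_{k\perp} \end{pmatrix} \begin{pmatrix} U_k^\top \\ U_{k\perp}^\top \end{pmatrix} \qquad P^\top L_c P = (P^\top \tilde{U}_k, P^\top \tilde{U}_{k\perp}) \begin{pmatrix} \tilde{\Lambda}_k & \\ & \tilde{\Lambda}_{k\perp} \end{pmatrix} \begin{pmatrix} \tilde{U}_k^\top P \\ \tilde{U}_{k\perp}^\top P \end{pmatrix},$$

where $\tilde{U}_k$ and $\tilde{\Lambda}_k$ are defined analogously to $U_k$ and $\Lambda_k$. Davis and Kahan (1970) defined the canonical angles between the spaces spanned by $U_k$ and $P^\top \tilde{U}_k$ as the singular values of the matrix

$$\Theta(U_k, P^\top \tilde{U}_k) \triangleq \arccos\big(U_k^\top P^\top \tilde{U}_k \tilde{U}_k^\top P U_k\big)^{-1/2},$$

see also (Stewart, 1990). The smaller the sines of the canonical angles are, the closer the two subspaces lie. The following theorem reveals a connection between the Frobenius norm of the sine of the canonical angles and restricted spectral approximation.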

###### Theorem 3.4 (Eigenspace approximation).

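If $L_c$ and $L$ are $(\mathrm{span}(U_k), \epsilon_k)$-similar then

$$\big\| \sin\Theta(U_k, P^\top \tilde{U}_k) \big\|_F^2 \leq \frac{1}{\lambda_{k+1} - \lambda_k} \left( \sum_{i \leq k} \lambda_i \left( \frac{(1+\epsilon_i)^2}{\gamma_1} - 1 \right) + \lambda_k \sum_{i \leq k} \epsilon_i \right).$$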

Note that the theorem above utilizes all $\epsilon_i$ with $i \leq k$, corresponding to the restricted spectral approximation constants for $\mathrm{span}(U_i)$, respectively. However, all these can be trivially relaxed to $\epsilon_k$, since $\epsilon_i \leq \epsilon_k$ for all $i \leq k$.
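The left-hand side of the eigenspace bound can be computed directly. The sketch below uses the standard characterization of canonical (principal) angles, namely that their cosines are the singular values of the product of orthonormal bases of the two subspaces; this is an assumption about how the definition above is meant to be read, and the function name `sin_theta_fro2` is illustrative.

```python
import numpy as np


def sin_theta_fro2(A, B):
    """Squared Frobenius norm of sin(Theta) between span(A) and span(B).

    A and B hold basis vectors as columns and must have full column rank.
    """
    Qa, _ = np.linalg.qr(A)                  # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)                  # orthonormal basis of span(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)     # guard against round-off
    return float(np.sum(1.0 - cosines ** 2))


E = np.eye(4)
print(sin_theta_fro2(E[:, :2], E[:, :2]))    # identical subspaces
print(sin_theta_fro2(E[:, :2], E[:, 2:]))    # orthogonal subspaces
```

To evaluate the quantity for a coarsened graph, one would pass the first $k$ eigenvectors of the original Laplacian as `A` and the lifted coarse eigenvectors as `B`.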

### 3.3 Implications for graph partitioning

One of the most popular applications of coarsening is to accelerate graph partitioning (Hendrickson and Leland, 1995; Karypis and Kumar, 1998; Kushnir et al., 2006; Dhillon et al., 2007; Wang et al., 2014). In the following, I provide a rigorous justification for this choice by showing that if the (Laplacian consistent) coarsening is done well and $G$ contains a good cut, then so will $G_c$. For the specific case of spectral clustering, I also provide an explicit bound on the coarse solution quality.

#### Existence results.

For consistent coarsening, the spectrum approximation results presented previously imply similarities between the cut-structures of $G$ and $G_c$.

To formalize this intuition, the conductance of any subset $S$ of $V$ is defined as

$$\phi(S) \triangleq \frac{w(S, \bar{S})}{\min\{ w(S), w(\bar{S}) \}},$$

where $\bar{S}$ is the complement set, $w(S, \bar{S})$ is the weight of the cut and $w(S)$ is the volume of $S$.

The $k$-conductance of a graph measures how easy it is to cut it into $k$ disjoint subsets of balanced volume:

$$\phi_k(G) = \min_{S_1, \ldots, S_k} \max_i \phi(S_i)$$

The smaller $\phi_k(G)$ is, the better the partitioning.
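The conductance of a vertex subset is straightforward to compute from a weighted adjacency matrix. The sketch below assumes that the volume of a set is the sum of the weighted degrees of its vertices and that the graph has no isolated vertices; the brute-force search over bipartitions is only meant for very small graphs, and both function names are illustrative.

```python
from itertools import combinations

import numpy as np


def conductance(W, S):
    """Cut weight between S and its complement over the smaller volume."""
    mask = np.zeros(W.shape[0], dtype=bool)
    mask[list(S)] = True
    cut = W[mask][:, ~mask].sum()
    deg = W.sum(axis=1)                       # weighted degrees
    return cut / min(deg[mask].sum(), deg[~mask].sum())


def phi_2_bruteforce(W):
    """2-conductance by enumerating all bipartitions (exponential cost)."""
    N = W.shape[0]
    best = np.inf
    for size in range(1, N // 2 + 1):
        for S in combinations(range(N), size):
            best = min(best, conductance(W, S))
    return best


# Two unit-weight triangles joined by a single bridge edge (2, 3).
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

print(conductance(W, [0, 1, 2]))   # cut weight 1, volume 7 on each side
print(phi_2_bruteforce(W))
```

For two parts the conductance of a set and of its complement coincide, so minimizing over single subsets is enough.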

As it turns out, restricted spectral approximation can be used to relate the conductance of the original and coarse graphs. To state the result, it will be useful to denote by the diagonal degree matrix and further to suppose that contains the first eigenvectors of the normalized Laplacian , whose eigenvalues are .

###### Theorem 3.5.

For any graph and integer , if and are -similar combinatorial Laplacian matrices then

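$$\phi_k(G) \leq \phi_k(G_c) = O\left( \sqrt{ \frac{\gamma_2 (1+\epsilon_{2k})^2 \, \xi_k(G)}{1 - \epsilon_{2k}^2 (\mu_{2k} / \mu_2)} \, \phi_k(G) } \right)$$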

with and , whenever . If is planar then More generally, if excludes as a minor, then For , supposing that and are -similar, we additionally have

$$\phi_2(G) \leq \phi_2(G_c) \leq 2 \sqrt{ \frac{\gamma_2 (1+\epsilon_2)^2}{1 - \epsilon_2^2} \, \phi_2(G) }.$$

This is a non-constructive result: it does not reveal how to find the optimal partitioning, but provides conditions such that the latter is of similar quality in the two graphs.

#### Spectral clustering.

It is also possible to derive approximation results about the solution quality of unsupervised learning algorithms that utilize the first eigenvectors in order to partition . I focus here on spectral clustering. To perform the analysis, let and be the spectral embedding of the vertices w.r.t. and , respectively, and define the optimal partitioning as

 (4)

where, for any embedding $X$, the $k$-means cost induced by partitioning $\mathcal{P} = \{S_1, \ldots, S_k\}$ of the vertices into clusters is defined as

$$F_k(X, \mathcal{P}) \triangleq \sum_{z=1}^{k} \sum_{v_i, v_j \in S_z} \frac{\|X(i,:) - X(j,:)\|_2^2}{2 |S_z|}.$$

One then measures the quality of $\tilde{\mathcal{P}}^*$ by examining how far the correct minimizer $F_k(U_k, \mathcal{P}^*)$ is to $F_k(U_k, \tilde{\mathcal{P}}^*)$. Boutsidis et al. (2015) noted that if the two quantities are close then, despite the clusters themselves possibly being different, they both feature the same quality with respect to the $k$-means objective.
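The pairwise form of the k-means cost can be evaluated directly. The sketch below assumes that the inner sum ranges over all ordered pairs of vertices in a cluster, in which case the cost coincides with the familiar sum of squared distances to the cluster centroids; the function name `kmeans_cost` and the random embedding are illustrative.

```python
import numpy as np


def kmeans_cost(X, partition):
    """Pairwise k-means cost of an embedding X (one row per vertex)."""
    cost = 0.0
    for S in partition:
        Xs = X[list(S)]
        diffs = Xs[:, None, :] - Xs[None, :, :]    # all ordered pairs in S
        cost += np.sum(diffs ** 2) / (2 * len(S))
    return float(cost)


rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))                    # hypothetical embedding
partition = [[0, 1, 2], [3, 4, 5]]

# Equivalent centroid form: squared distances to each cluster mean.
centroid_cost = sum(np.sum((X[S] - X[S].mean(axis=0)) ** 2) for S in partition)
print(np.isclose(kmeans_cost(X, partition), centroid_cost))
```

Comparing the cost of two partitions on the same embedding, as done above with the original eigenvectors, separates the quality of a clustering from the particular cluster labels.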

An end-to-end control of the $k$-means error is obtained by combining the inequality derived by Loukas and Vandergheynst (2018), based on the works of Boutsidis et al. (2015), Yu et al. (2014) and Martin et al. (2018), with Theorem 3.4:

###### Corollary 3.6.

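If $L_c$ and $L$ are $(\mathrm{span}(U_k), \epsilon_k)$-similar then

$$\left( F_k(U_k, \mathcal{P}^*)^{1/2} - F_k(U_k, \tilde{\mathcal{P}}^*)^{1/2} \right)^2 \leq \frac{8}{\lambda_{k+1} - \lambda_k} \left( \sum_{i \leq k} \lambda_i \left( \frac{(1+\epsilon_i)^2}{\gamma_1} - 1 \right) + \lambda_k \sum_{i \leq k} \epsilon_i \right).$$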

Contrary to previous analysis (Loukas and Vandergheynst, 2018), the approximation result here is applicable to any number of levels and it can be adapted to hold for the eigenvectors of the normalized Laplacian (for the normalized Laplacian, one should perform (combinatorial) Laplacian consistent coarsening on a modified eigenspace, as in the proof of Theorem 3.5). Nevertheless, it should be stressed that at this point it is an open question whether the above analysis yields benefits over other approaches tailored especially to the acceleration of spectral clustering. A plethora of such specialized algorithms are known (Tremblay et al., 2016; Boutsidis et al., 2015); arguing about the pros and cons of each extends beyond the scope of this work.

One might be tempted to change the construction so as to increase . For example, this could be achieved by multiplying with a small constant (see Theorem 2.3). In reality however, such a modification would not yield any improvement as the increase of would also be accompanied by an increase of .

### 3.4 Some limits of restricted spectral approximation

The connection between spectral clustering and coarsening runs deeper than what was shown so far. As it turns out, the first $k$ restricted spectral approximation constants associated with a Laplacian consistent coarsening are linked to the $n$-means cost induced by the contraction sets. The following lower bound is a direct consequence:

###### Proposition 3.1.

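Let $L$ be a Laplacian matrix. For any $L_c$ obtained by a single level of Laplacian consistent coarsening, if $L$ and $L_c$ are $(U_k, \epsilon_k)$-similar then it must be that

$$\sum_{i \le k} \epsilon_k \ge F_n(U_k, P^*),$$

with $F_n(U_k, P^*)$ being the optimal $n$-means cost for the points given by the rows of $U_k$.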

Computing the aforementioned lower bound is known to be NP-hard, so the result is mostly of theoretical interest.

## 4 Graph coarsening by local variation

This section proposes algorithms for Laplacian consistent graph coarsening. I suppose that $L$ is a combinatorial graph Laplacian and, given subspace $R$ and target graph size $n$, aim to find an $(R, \epsilon)$-similar Laplacian $L_c$ of size $n \times n$ with $\epsilon$ smaller than some threshold $\epsilon'$.

Local variation algorithms differ only in the type of contraction sets that they consider. For instance, the edge-based local variation algorithm only contracts edges, whereas in the neighborhood-based variant each contraction set is a subset of the neighborhood of a vertex. Otherwise, all local variation algorithms follow the same general methodology and aim to minimize an upper bound of $\epsilon$. To this end, two bounds are exploited. First, $L_c$ is shown to be $(R, \epsilon)$-similar to $L$ with $\epsilon \le \prod_{\ell} (1 + \sigma_\ell) - 1$, where the variation cost $\sigma_\ell$ depends only on previous levels (see Section 4.1). The main difficulty with minimizing $\sigma_\ell$ is that it depends on interactions between contraction sets. For this reason, the second bound shows that these interactions can be decoupled by considering each local variation cost, i.e., the cost of contracting solely the vertices in $C$, independently on a slightly modified subgraph (see Section 4.2). Having achieved this, Section 4.3 considers ways of efficiently identifying disjoint contraction sets with small local variation cost.
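To make the two families concrete, the following minimal sketch enumerates candidate contraction sets from a dense symmetric weight matrix. The function names are hypothetical, and using each vertex together with all of its neighbors as the neighborhood candidate is an assumption made for illustration; the text only requires each set to be a subset of a neighborhood.

```python
import numpy as np

def edge_candidates(W):
    """One candidate set {v_i, v_j} per edge with positive weight."""
    n = W.shape[0]
    return [frozenset((i, j))
            for i in range(n) for j in range(i + 1, n) if W[i, j] > 0]

def neighborhood_candidates(W):
    """One candidate set per vertex: the vertex and all of its neighbors
    (an illustrative choice, not necessarily the paper's exact family)."""
    n = W.shape[0]
    return [frozenset([i, *np.flatnonzero(W[i] > 0).tolist()]) for i in range(n)]

# Toy path graph 0 - 1 - 2 with unit weights (hypothetical data).
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
edges = edge_candidates(W)
hoods = neighborhood_candidates(W)
```

Both functions return candidates that may overlap; selecting a disjoint subset is a separate step.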

### 4.1 Decoupling levels and the variation cost

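Guaranteeing restricted spectral approximation w.r.t. subspace $R$ boils down to minimizing, at each level $\ell$, the variation cost

$$\sigma_\ell = \|\Pi_\ell^\perp A_{\ell-1}\|_{L_{\ell-1}} = \|S_{\ell-1} \Pi_\ell^\perp A_{\ell-1}\|_2,$$

where $L_{\ell-1} = S_{\ell-1}^\top S_{\ell-1}$ and $\Pi_\ell^\perp$ is a projection matrix. Matrix $A_{\ell-1}$ captures two types of information: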

1. Foremost, it encodes the behavior of the target matrix $L$ w.r.t. $R$. This is clearly seen in the first level, where $A_0$ is constructed from an orthonormal basis of $R$.

2. When $\ell > 1$, one needs to consider $R$ in view of the reduction done in previous levels. The necessary modification of $A_{\ell-1}$ is expressed in a recursive manner.
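The norm identity in the definition of the variation cost, $\|M\|_{L} = \|S M\|_2$ whenever $L = S^\top S$, can be checked numerically. The sketch below is a hedged illustration: it assumes $S$ is the weighted edge-vertex incidence matrix (any factor with $S^\top S = L$ would do), and `M` is an arbitrary stand-in for the product $\Pi_\ell^\perp A_{\ell-1}$.

```python
import numpy as np

def incidence_factor(W):
    """Return S (edges x vertices) such that S^T S is the combinatorial Laplacian of W.
    Each edge (i, j) contributes the row sqrt(w_ij) * (e_i - e_j)."""
    n = W.shape[0]
    rows = []
    for i in range(n):
        for j in range(i + 1, n):
            if W[i, j] > 0:
                row = np.zeros(n)
                row[i], row[j] = np.sqrt(W[i, j]), -np.sqrt(W[i, j])
                rows.append(row)
    return np.array(rows)

def norm_L(M, L):
    """||M||_L as the square root of the largest eigenvalue of M^T L M."""
    return np.sqrt(max(np.linalg.eigvalsh(M.T @ L @ M).max(), 0.0))

# Toy weighted graph (hypothetical data): a triangle with a pendant vertex.
W = np.array([[0, 1, 2, 0],
              [1, 0, 1, 0],
              [2, 1, 0, 3],
              [0, 0, 3, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
S = incidence_factor(W)
M = np.random.default_rng(0).standard_normal((4, 2))  # arbitrary test matrix

lhs = norm_L(M, L)              # semi-norm induced by L
rhs = np.linalg.norm(S @ M, 2)  # spectral norm of S M
```

The second form is often the more convenient one in practice, since it reduces the cost to an ordinary spectral norm.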

The following result makes explicit the connection between $\epsilon$ and the variation costs $\sigma_\ell$.

###### Proposition 4.1.

Matrices $L$ and $L_c$ are $(R, \epsilon)$-similar with

$$\epsilon \le \prod_{\ell=1}^{c} (1 + \sigma_\ell) - 1.$$

Crucially, the above makes it possible to design a multi-level coarsening greedily, by starting from the first level and optimizing consecutive levels one at a time:

It is a consequence of Proposition 4.1 that the above algorithm returns a Laplacian matrix $L_c$ that is $(R, \epsilon)$-similar to $L$ with $\epsilon \le \prod_{\ell=1}^{c} (1 + \sigma_\ell) - 1 \le \epsilon'$, where $c$ is the last level $\ell$. On the other hand, setting $\epsilon'$ to a large value ensures that the same algorithm always attains the target reduction, at the expense of loose restricted approximation guarantees.
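A greedy multi-level loop of this kind might be organized as in the minimal sketch below. It is not the paper's algorithm listing: it assumes that the per-level costs compose as $(1 + \epsilon_{\ell-1})(1 + \sigma_\ell) - 1$, and `coarsen_one_level` is a hypothetical callback that returns a coarser Laplacian together with the variation cost it incurred. The toy callback only mimics the interface and performs no actual coarsening.

```python
import numpy as np

def multilevel_coarsen(L, target_size, eps_threshold, coarsen_one_level):
    """Greedy loop: coarsen one level at a time while tracking the accumulated bound.
    `coarsen_one_level(L, target_size, sigma_budget)` is a hypothetical callback
    returning (L_next, sigma) with sigma the variation cost of that level."""
    eps, levels = 0.0, 0
    while L.shape[0] > target_size and eps < eps_threshold:
        # Largest sigma for which (1 + eps)(1 + sigma) - 1 stays below the threshold.
        sigma_budget = (1.0 + eps_threshold) / (1.0 + eps) - 1.0
        L_next, sigma = coarsen_one_level(L, target_size, sigma_budget)
        if L_next.shape[0] == L.shape[0]:
            break  # no further reduction possible within the budget
        L = L_next
        eps = (1.0 + eps) * (1.0 + sigma) - 1.0  # assumed composition rule
        levels += 1
    return L, eps, levels

def toy_level(L, target_size, sigma_budget):
    """Stand-in coarsener (NOT a real method): halves the size, reports sigma <= 0.1."""
    n_next = max(target_size, L.shape[0] // 2)
    return np.zeros((n_next, n_next)), min(0.1, sigma_budget)

L_c, eps, levels = multilevel_coarsen(np.zeros((16, 16)), 4, 0.5, toy_level)
```

Passing a very large `eps_threshold` makes the loop stop only when the target size is reached, mirroring the trade-off described above.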

Remark. The variation cost simplifies when is an eigenspace of . I demonstrate this for the choice of , though an identical argument can be easily derived for any eigenspace. Denote by the diagonal eigenvalue matrix placed from top-left to bottom-right in non-decreasing order and by the respective full eigenvector matrix. Furthermore, let be the sub-matrix of with the smallest eigenvalues in its diagonal. By the unitary invariance of the spectral norm, it follows that . Simplifying and eliminating zero columns, one may redefine , such that once more . This is computationally attractive because now at each level one needs to take the pseudo-inverse-square-root of a matrix , with .

### 4.2 Decoupling contraction sets and local variation

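Suppose that $\Pi_C^\perp$ is the (complement) projection matrix obtained by contracting solely the vertices in set $C$, while leaving all other vertices untouched:

$$[\Pi_C^\perp x](i) = \begin{cases} x(i) - \dfrac{\sum_{v_j \in C} x(j)}{|C|} & \text{if } v_i \in C \\ 0 & \text{otherwise.} \end{cases}$$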

(Here, for convenience, the level index is suppressed.)
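The action of this operator is simple to express in code. The following is a minimal sketch with a hypothetical function name; it applies the definition directly to a vector (or, row-wise, to a matrix whose rows are indexed by vertices).

```python
import numpy as np

def complement_projection(x, C):
    """Apply Pi_C^perp: on C, subtract the average over C; elsewhere, return zero."""
    x = np.asarray(x, dtype=float)
    idx = list(C)
    out = np.zeros_like(x)
    out[idx] = x[idx] - x[idx].mean(axis=0)
    return out

# Toy signal on four vertices (hypothetical data), contracting vertices 0 and 1.
x = np.array([1.0, 3.0, 5.0, 7.0])
y = complement_projection(x, [0, 1])
```

Applying the function twice gives the same result as applying it once, as expected of a projection.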

Furthermore, let $L_C$ be the combinatorial Laplacian whose weight matrix is

$$[W_C](i,j) = \begin{cases} W_{\ell-1}(i,j) & \text{if } v_i, v_j \in C \\ 2\, W_{\ell-1}(i,j) & \text{if } v_i \in C \text{ and } v_j \notin C \\ 0 & \text{otherwise.} \end{cases} \tag{5}$$

That is, $W_C$ is zero everywhere other than at the edges touching at least one vertex in $C$. The following proposition shows us how to decouple the contribution of each contraction set to the variation cost.
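The construction in (5) can be sketched as follows. The function name is hypothetical, and the boundary case is applied symmetrically (to both orientations of an edge with exactly one endpoint in the contraction set), which is an assumption made here so that the resulting weight matrix is symmetric and defines a valid combinatorial Laplacian.

```python
import numpy as np

def local_laplacian(W, C):
    """Build W_C and its combinatorial Laplacian L_C for contraction set C.
    Edges inside C keep their weight, edges with exactly one endpoint in C get
    twice their weight (applied symmetrically, an assumption), all others vanish."""
    W = np.asarray(W, dtype=float)
    in_C = np.zeros(W.shape[0], dtype=bool)
    in_C[list(C)] = True
    inside = np.outer(in_C, in_C)       # both endpoints in C
    boundary = np.outer(in_C, ~in_C)    # v_i in C, v_j not in C
    boundary = boundary | boundary.T    # symmetrize the boundary mask
    W_C = np.where(inside, W, 0.0) + np.where(boundary, 2.0 * W, 0.0)
    L_C = np.diag(W_C.sum(axis=1)) - W_C
    return W_C, L_C

# Toy path graph 0 - 1 - 2 - 3 with unit weights (hypothetical data).
W = np.zeros((4, 4))
for i in range(3):
    W[i, i + 1] = W[i + 1, i] = 1.0
W_C, L_C = local_laplacian(W, [1, 2])
```

For the path graph above and the set containing vertices 1 and 2, the inner edge keeps weight 1 while the two boundary edges receive weight 2.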

###### Proposition 4.2.

The variation cost is bounded by

$$\sigma_\ell^2 \le \sum_{C \in \mathcal{P}_\ell} \|\Pi_C^\perp A_{\ell-1}\|_{L_C}^2,$$

where $\mathcal{P}_\ell$ is the family of contraction sets of level $\ell$.

The above argument therefore entails bounding the difficult-to-optimize variation cost $\sigma_\ell$ as a function of locally computable and independent costs $\|\Pi_C^\perp A_{\ell-1}\|_{L_C}$. The obtained expression is a relaxation, as it assumes that the interaction between contraction sets will be the worst possible. Notably, the quality of the relaxation depends on the weight of the cut between contraction sets: the inequality converges to an equality as the weight of the cut shrinks. Also of note, the bound becomes tighter the larger the dimensionality reduction requested (the smaller the coarse graph is, the fewer inequalities are involved in the derivation).
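The decoupling bound can be sanity-checked numerically on a small example. The sketch below is an illustration under stated assumptions: the graph, the partition, and the matrix `A` are random or hand-picked toy data (any matrix can stand in for $A_{\ell-1}$ when checking the inequality), the helper names are hypothetical, and the boundary weights of (5) are applied symmetrically.

```python
import numpy as np

def laplacian(W):
    """Combinatorial Laplacian of a symmetric weight matrix."""
    return np.diag(W.sum(axis=1)) - W

def proj_perp(n, C):
    """Matrix form of Pi_C^perp: I - (1/|C|) 1 1^T on C, zero elsewhere."""
    idx = np.array(sorted(C))
    P = np.zeros((n, n))
    P[np.ix_(idx, idx)] = np.eye(len(idx)) - 1.0 / len(idx)
    return P

def local_weights(W, C):
    """Weight matrix of (5), with the boundary case applied symmetrically."""
    in_C = np.zeros(W.shape[0], dtype=bool)
    in_C[list(C)] = True
    inside = np.outer(in_C, in_C)
    boundary = np.outer(in_C, ~in_C)
    boundary = boundary | boundary.T
    return np.where(inside, W, 0.0) + np.where(boundary, 2.0 * W, 0.0)

def sq_norm(M, L):
    """||M||_L^2, i.e. the largest eigenvalue of M^T L M."""
    return float(np.linalg.eigvalsh(M.T @ L @ M).max())

rng = np.random.default_rng(1)
n = 8
W = np.triu(rng.random((n, n)), 1)
W = W + W.T                                   # random dense weighted graph
A = rng.standard_normal((n, 3))               # stand-in for A_{l-1}
partition = [[0, 1, 2], [3, 4], [5], [6, 7]]  # contraction sets (singletons allowed)

# Contracting all sets at once: the supports are disjoint, so the projections add up.
Pi_perp = sum(proj_perp(n, C) for C in partition)
sigma_sq = sq_norm(Pi_perp @ A, laplacian(W))
bound = sum(sq_norm(proj_perp(n, C) @ A, laplacian(local_weights(W, C)))
            for C in partition)
print(sigma_sq, bound)
```

Each term of `bound` depends only on one contraction set and the edges touching it, which is what makes the local costs independently computable.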

### 4.3 Local variation coarsening algorithms

Starting from a candidate family, that is, an appropriately sized family of candidate contraction sets, the strategy will be to search for a small contraction family $\mathcal{P}_\ell$ with minimal variation cost ($\mathcal{P}_\ell$ is valid if it partitions the vertex set into contraction sets). Every coarse vertex