# Clustering of graph vertex subset via Krylov subspace model reduction

Clustering via graph-Laplacian spectral imbedding is ubiquitous in data science and machine learning. However, it becomes less efficient for large data sets due to two factors. First, computing the partial eigendecomposition of the graph-Laplacian typically requires a large Krylov subspace. Second, after the spectral imbedding is complete, the clustering is typically performed with various relaxations of k-means, which may become prone to getting stuck in local minima and scale poorly in terms of computational cost for large data sets. Here we propose two novel algorithms for spectral clustering of a subset of the graph vertices (target subset) based on the theory of model order reduction. They rely on realizations of a reduced order model (ROM) that accurately approximates the diffusion transfer function of the original graph for inputs and outputs restricted to the target subset. While our focus is limited to this subset, our algorithms produce its clustering that is consistent with the overall structure of the graph. Moreover, working with a small target subset reduces greatly the required dimension of Krylov subspace and allows to exploit the approximations of k-means in the regimes when they are most robust and efficient, as verified by the numerical experiments. There are several uses for our algorithms. First, they can be employed on their own to clusterize a representative subset in cases when the full graph clustering is either infeasible or not required. Second, they may be used for quality control. Third, as they drastically reduce the problem size, they enable the application of more powerful approximations of k-means like those based on semi-definite programming (SDP) instead of the conventional Lloyd's algorithm. Finally, they can be used as building blocks of a divide-and-conquer algorithm for the full graph clustering. The latter will be reported in a separate article.

## Authors

• 6 publications
• 5 publications
• 5 publications
• ### Scalable Spectral Clustering with Nystrom Approximation: Practical and Theoretical Aspects

Spectral clustering techniques are valuable tools in signal processing a...
06/25/2020 ∙ by Farhad Pourkamali-Anaraki, et al. ∙ 0

• ### Mini-Batch Spectral Clustering

The cost of computing the spectrum of Laplacian matrices hinders the app...
07/07/2016 ∙ by Yufei Han, et al. ∙ 0

• ### Scalable and Robust Sparse Subspace Clustering Using Randomized Clustering and Multilayer Graphs

Sparse subspace clustering (SSC) is one of the current state-of-the-art ...
02/21/2018 ∙ by Maryam Abdolali, et al. ∙ 0

• ### Towards Scalable Spectral Clustering via Spectrum-Preserving Sparsification

The eigendeomposition of nearest-neighbor (NN) graph Laplacian matrices ...
10/12/2017 ∙ by Yongyu Wang, et al. ∙ 0

• ### Two provably consistent divide and conquer clustering algorithms for large networks

08/18/2017 ∙ by Soumendu Sundar Mukherjee, et al. ∙ 0

• ### Articulated Shape Matching Using Laplacian Eigenfunctions and Unsupervised Point Registration

Matching articulated shapes represented by voxel-sets reduces to maximal...
12/14/2020 ∙ by Diana Mateus, et al. ∙ 0

• ### Incrementally Updated Spectral Embeddings

Several fundamental tasks in data science rely on computing an extremal ...
09/03/2019 ∙ by Vasileios Charisopoulos, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Imbedding via graph-Laplacian eigenmaps (a.k.a. spectral imbedding) is heavily employed in unsupervised machine learning and data science. Its power comes from the dimensionality reduction property, i.e. unveiling the low-dimensional manifold structure within the data. Subsequently, this structure can be exploited efficiently by clustering algorithms. Such strategy is commonly known as spectral clustering [6]

. On the downside, spectral clustering becomes computationally expensive for large graphs, due to the cost of computing the spectral data. The use of combinatorial clustering algorithms, such as the true k-means (known to be NP-hard), on the spectral data can further increase the cost, as typically those do not scale well with graph size. Alternatively, one may apply heuristic approximations of k-means such as Lloyd’s algorithm to the spectral data. In this case robustness may become an issue since the heuristics are more likely to get stuck in local minima for larger graphs. Here we should also point out a recent promising approach that substitutes k-means by a direct algorithm

[11].

Spectral imbedding reduces the dimension of the data set space, but typically does not change the number of data points. In principle, reducing both can be achieved via graph contraction and model order reduction methods, see e.g., [9]. One can view the graph-Laplacian as a second order finite-difference discretization of Laplace-Beltrami operator that governs the diffusion on a Riemannian data manifold. The graph contraction is conceptually similar to multigrid coarsening, it constructs a coarse grain graph-Laplacian based on local information. As such, it can be at most second order accurate approximation of the original fine-scale graph-Laplacian with respect to the maximum edge weight. On the other hand, well developed tools of model order reduction for linear time-invariant dynamical systems produce exponential convergence with respect to the order of the reduced order model (ROM), e.g., see [1, 19, 5]. Model order reduction was successfully used for finding nearest local clusters [30], PageRank computations [33] and dynamics of quantum graphs [2].

The objective of this work is to design a reduced order proxy of the graph-Laplacian, targeted at accurate clustering of an a priory prescribed arbitrary subset of vertices of the full graph.

While a major goal is to avoid clustering the full graph, the resulting subset clustering must nevertheless be consistent with full graph clustering that one may perform. Therefore, the reduced order model of the graph-Laplacian must somehow take into account the whole graph structure. The main motivation behind the proposed approach is to ultimately develop a divide-and-conquer algorithm for full graph clustering. Obviously, the method proposed here is a crucial building block of such an algorithm. Full graph clustering algorithm based on subset clustering presented here is a topic of current research and will be presented in a separate article.

In order to employ model reduction techniques, we consider the entire graph-Laplacian as a system matrix of the linear time-invariant (LTI) dynamical system of diffusion type. A great success of model reduction for diffusive problems is that a multi-input/multi-output (MIMO) transfer function of an LTI dynamical system with moderate sizes of input and output arrays can be well approximated by a comparably small numbers of equivalent eigenmodes.

In this work we employ reduced models of square MIMO LTI systems, e.g., [1, 29]

. These are the systems in which the input and output functionals are the same. In our case the input/output is defined by the indicator vectors of the vertices in the prescribed target subset. To obtain an accurate approximation of the diffusive LTI system response at the target subset, the ROM needs to have additional, “interior” degrees of freedom that account for diffusion in the rest of the graph outside the target subset. These interior degrees of freedom are obtained by subsequently projecting the full graph Laplacian on certain Krylov subspaces which is performed via two-stage block-Lanczos algorithm, as in

[18].

After the ROM is constructed, in order to clusterize the target subset, the interior degrees of freedom must be sampled. Although the clustering of the internal degrees of freedom is an auxiliary construction that is discarded, it is essential to accounting for the structure of the full graph. Here we consider two algorithms for sampling the interior degrees of freedom. The first algorithm samples the harmonic Ritz vectors computed from the ROM. The second algorithm transforms the ROM realization to a symmetric semi-definite block-tridiagonal matrix with zero sum rows. This matrix mimics a finite-difference discretization of an elliptic PDE operator with Neumann boundary conditions. It can be viewed as a multi-dimensional manifold generalization of the so-called “finite-difference Gaussian quadrature rule” or simply the optimal grid [16] imbedding the ROM realization to the space of the full graph-Laplacian. The interior degrees of freedom then correspond to the sampling of the data manifold at the nodes of that grid. Once the interior degrees of freedom are sampled, any conventional clustering algorithm can be applied, e.g., Lloyd’s algorithm or approximations to k-means based on semi-definite programming (SDP) [11, 28, 32].

The main computational cost of the proposed approach is ROM construction via Krylov subspace projection. However, due to adaptivity to the target subset, it requires a smaller Krylov subspace dimension compared to that used by the partial eigensolvers in conventional spectral clustering algorithms. The subspace dimension reduction is especially pronounced for graphs with a large number of connected components. The significant dimensionality reduction and the regular sparse (deflated block-tridiagonal) structure of ROM realization matrix leads to more efficient and robust performance of approximate k-means algorithms. Moreover, the reduced dimensionality enables the use of more accurate approximations of k-means like those based on SDP that are otherwise computationally infeasible.

The paper is organized as follows. We begin with subset clustering problem formulation and some key notations in Section 2. A two-stage algorithm for constructing the ROM for the graph-Laplacian in described in Section 3. The harmonic Ritz vectors and a clustering algorithm based on their sampling is given in Section 4. The second clustering algorithm based on the reduced-order graph-Laplacian (ROGL) is presented in Section 5 along with the realization of the ROM via a deflated block-tridiagonal matrix in the form mimicking the random-walk normalized graph-Laplacian. In Section 6 we discuss the results of the numerical experiments and examine the quality of clustering for synthetic and real-world data sets. We conclude with the summary and discussion of further research directions in Section 7. Finally, in Appendix A we discuss an interpretation of the ROGL in terms of finite-difference Gaussian rules, in Appendix B we present the details of the deflated block-Lanczos procedure, the main building block of our model reduction algorithms, and in Appendix C we prove that ROGL preserves certain distances on the graph that are relevant for clustering.

## 2 Problem formulation and notation

Consider a graph with vertex set . Our goal is to clusterize the target subset , where

 Gm={i1,i2,…,im}, (1)

and the case of most interest is when .

The connectivity of the graph and the weights associated with graph edges are encoded in the graph-Laplacian matrix that has the following properties: is symmetric positive-definite, the sum of entries in every of its rows is zero, and the off-diagonal entries are non-positive. Hereafter we denote matrices with bold uppercase letters and vectors are bold lowercase, while the individual entries of matrices and vectors are non-bold uppercase and lowercase letters, respectively.

In what follows an important role is played by the matrix containing the diagonal part of :

 D=diag(L11,L22,…,LNN). (2)

It allows us to define the random-walk normalized graph-Laplacian

 LRW=D−1L, (3)

and the normalized symmetric graph-Laplacian

 A=D−1/2LD−1/2. (4)

We also need the following quantities associated with the target subset. First, denote by the j column of the identity matrix. Then, the columns of

 B=[ei1,…,eim]∈RN×m, (5)

are the “indicator vectors” of the vertices in the target subset . Likewise, we can also define , where in a slight abuse of notation are the columns of the identity matrix, for some .

## 3 Two-stage model reduction algorithm

Here we introduce the two-stage model reduction algorithm that projects the normalized symmetric graph-Laplacian consecutively on two Krylov subspaces. The Krylov subspaces are chosen to accurately approximate the discrete-time diffusion transfer function

 F(p)=BT(I−A)pB∈Rm×m,p=1,2,… (6)

that has both its inputs and outputs in the target subset , hence we will refer to as the input/output matrix. This approach follows the methodology of [18] for multiscale model reduction for the wave propagation problem. Note that and, therefore, (6) is stable. Also, the rescaled components of the matrix are related to the diffusion time distance (see [10], also Section 5.4).

At the first stage we use the deflated block-Lanczos process, Algorithm B

, to compute an orthogonal matrix

, the columns of which span the block Krylov subspace

 Kk1(A,B)=colspan{B,AB,…,Ak1−1B}. (7)

After projecting on , the normalized symmetric graph-Laplacian takes the deflated block tridiagonal form

 T1=QT1AQ1∈Rn1×n1, (8)

as detailed in Appendix B. Note that the input/output matrix is transformed simply to

 E1=QT1B∈Rn1×m. (9)

Observe also that , the number of columns of , satisfies with a strict inequality in case of deflation occurring.

###### Remark

Since the input/output matrix is supported at the vertices in the target subset , repeated applications of cannot propagate outside of the connected components of the graph that contain . Therefore, the support of the columns of is included in these connected components. As a result, projection (8) is only sensitive to the entries of corresponding to graph vertices that can be reached from with a path of at most steps. Hereafter, we will assume by default for simplicity that , and only contain the entries corresponding to the connected components of the graph that contain vertices from .

The number of block-Lanczos steps is chosen to attain the desired accuracy of the discrete-time diffusion transfer function approximation

 F1(p)=ET1(I−T1)pE1≈BT(I−A)pB=F(p), (10)

for some sufficiently large range of , i.e., so that (10) is a good approximation of the stationary limit and late-time dynamics, with the number of convergent modes roughly equal to the number of expected clusters in . In particular, the nullspace of should be well-approximated by Krylov subspace . Hereafter the computation of the reduced-order discrete-time transfer function is stable due to stability of the original transfer function (6) and the min-max property of projection (8) and the subsequent ones.

While the first stage provides a certain level of graph-Laplacian compression, the approximation considerations presented above may lead to the number of block-Lanczos steps and the resulting subspace dimension to be relatively large still. Therefore, our approach includes the second stage to compress the ROM even further. This is achieved by another application of the deflated block-Lanczos process that computes the orthonormal basis for the rational block Krylov subspace

 Kk2((T1+s0I)−1,E1)=colspan{E1,(T1+s0I)−1E1,…,(T1+s0I)−k2+1E1}, (11)

where compression is achieved by choosing . As in the first stage, the vectors forming the orthonormal basis for (11) are arranged into the columns of , where with a strict inequality in case of deflation. Also, similarly to the first stage, we obtain a deflated block tridiagonal matrix

 T2=QT2(T1+s0I)−1Q2∈Rn2×n2. (12)

The use of rational block Krylov subspace (11

) is equivalent to a single point Padé approximation of the frequency domain transfer function (see

[5])

 ˆF1(s)=ET1(T1+sI)−1E1, (13)

at the point , which converges exponentially (spectral convergence), albeit with a sub-optimal rate***Optimal or near optimal convergence rate for a prescribed spectral interval can be achieved using Padé-Chebyshev approximation [16] or frequency-banded truncation [1].. Therefore, it is natural to choose to be in the vicinity of the spectral interval of interest. In spectral clustering we are interested in the lower part of the spectrum, hence we choose

empirically as the first positive (Fiedler) eigenvalue of

estimated via the spectrum of .

The dimension of the rational block Krylov subspace (11) must be controlled by the desired accuracy of the transfer function approximation

 F2(p)=ET1(I−(T−12−s0I))pE1≈ET1(I−T1)pE1=F1(p), (14)

similarly to (10) in the first stage. Accurate approximation in (14) provides accurate approximation of ’s nullspace by the rational Krylov subspace.

We summarize the model reduction algorithm below.

###### Algorithm (Two-stage model reduction)

Input: normalized symmetric graph-Laplacian , target subset , numbers of Lanczos steps , for the first and second stage deflated block-Lanczos processes, respectively, the shift for the second stage block-Lanczos process and the truncation tolerance .
Output: deflated block tridiagonal matrix , and the orthogonal matrices and such that for

 T2≈QT12(A+s0I)−1Q12. (15)

Stage 1: Form the input/output matrix (5) and perform the deflated block-Lanczos process with steps on and with truncation tolerance , as described in Appendix B, to compute the orthonormal basis for the block Krylov subspace (7) and the deflated block tridiagonal matrix .
Stage 2: Perform the deflated block-Lanczos process with steps on and and truncation tolerance , as described in Appendix B, to compute the orthonormal basis for the rational block Krylov subspace (11) and the deflated block tridiagonal matrix . Set

 Q12=Q1Q2. (16)
###### Remark

Relation (15) in Algorithm 3 becomes an equality if, e.g., , which is neither feasible nor desirable in practice. Otherwise, the quality of approximation in (15) is determined by the quality of (10).

###### Remark

Due to good compression properties of Krylov (7) and rational Krylov (11) subspaces, , thus the computational cost of Algorithm 3 is dominated by the first stage block-Lanczos process.

To illustrate the compression properties of both stages of Algorithm 3, we display in Figure 1 the convergence of to and of to (corresponding to the late, null-space dominated, part of the diffusion curve) for the Astro Physics collaboration network data set with described in Section 6.3. The second stage was performed using with and corresponding to converging to the relative error level of . Both curves exhibit superlinear (in logarithmic scale) convergence in agreement with the bounds of [15]. Even without accounting for deflation, the first stage provides more than -fold compression of the full graph, and due to much faster convergence of , the second stage provides more than two-fold additional compression.

## 4 Clustering via harmonic Ritz vectors sampling

In this section we present the first clustering algorithm. As mentioned in Section 1, in order for the clustering of the target subset to be consistent with the overall graph structure, the internal degrees of freedom (degrees of freedom corresponding to graph vertices outside of ) must be sampled somehow. Our first clustering algorithm is based on sampling the harmonic Ritz vectors of , determined by the eigendecomposition of obtained with Algorithm 3. We discuss the harmonic Ritz vectors in detail in Section 4.1, while the clustering algorithm is given in Section 4.2.

### 4.1 Spectral imbedding via harmonic Ritz vectors sampling

Let be the matrix computed by Algorithm 3. Denote by the eigenvalues of

and the corresponding eigenvectors by

:

 T2yi=θiyi,∥yi∥=1,i=1,2,…,n2. (17)

We define the harmonic Ritz pairsAssuming the approximation (15) is exact, are the Ritz vectors of on the column space of , therefore, strictly speaking, are the shifted harmonic Ritz values, but we omit “shifted” for brevity. Non-increasing ordering of obviously yields a non-decreasing sequence of . of as , where

 λi=1θi−s0,vi=Q12yi,i=1,2,…,n2. (18)

The discrete-time diffusion on can be written as the action of matrix polynomials on the input/output matrix in terms of the harmonic Ritz pairs

 (I−A)pB≈Q12(I−(T−12−s0I))pE1=n2∑i=1(1−λi)p(vTiB)vi, (19)

similarly for the transfer function

 F(p)≈F2(p)=n2∑i=1(1−λi)p(vTiB)T(vTiB). (20)

It is known [6, 31] that the best spectral clustering performance is attained when the embedding is done via the eigenmap of the random-walk normalized graph-Laplacian . While it has the same spectrum as the normalized symmetric graph-Laplacian , its eigenvectors differ from those of by scaling. Therefore, we also introduce the scaled harmonic Ritz vectors

 wi=D−1/2vi=D−1/2Q12yi,i=1,2,…,n2, (21)

which are sampled by the clustering algorihm described in the Section 4.2.

Let , be the sampling vertices, such that

 qj=ij∈Gm if1≤j≤m, (22) qj∈G∖Gm ifm+1≤j≤ns, (23)

i.e. the first vertices form the target subset and the rest are auxiliary chosen as explained in Section 4.2. To sample the scaled harmonic Ritz vectors at these vertices, we need to compute the inner products

 eTqjwk=eTqjD−1/2Q1Q2yk,j=1,2,…,ns,k=1,2,…,n2. (24)

To minimize computations and storage we compute (24) as

 eTqjwk=gjQ2yk, (25)

where

 gj=eTqjD−1/2Q1 (26)

can be recursively precomputed during the first stage block-Lanczos process for .

###### Remark

The harmonic Ritz pairs yield an approximation of the lower part of the spectral measure of (of for scaled vectors ) projected on , and they constitute a significantly smaller set than the spectral data typically used in spectral clustering algorithms [6, 11]. Such compression is partly explained in Remark 3. However, the convergence of the late time diffusion on controlled by (20) does not guarantee the same approximation level of (19) outside (the latter is just the square root of the former, e.g., see [29]), and modes with small (but nonzero) projections on (corresponding to strongly separated clusters that do not include but belong to the same connected components) may be missing in the ROM spectral measure. Validity of the ROM approach will become more clear when we discuss its geometric interpretation via reduced order graph-Laplacian in Section 5.

### 4.2 Clustering algorithm

###### Algorithm (Harmonic Ritz vectors sampling clustering [HRVSC])

Input: normalized symmetric graph-Laplacian , target subset , number of clusters to construct, size of imbedding subspace, set of sampling points, numbers of Lanczos steps , for the first and second stage deflated block-Lanczos processes, respectively, shift for the second stage block-Lanczos process and truncation tolerance .
Output: Clusters .
Step 1: Use Algorithm 3 and (26) to compute deflated block tridiagonal matrix , parameters , , and orthogonal matrix .
Step 2: Use (25) to compute the scaled harmonic Ritz vectors sampled at for smallest . Assemble these values in the matrix :

 Hjk=eTqjwk,j=1,…,ns,k=1,…,nev. (27)

Step 3: Apply an approximate k-means algorithm, e.g., Lloyd’s, to the matrix to clusterize into clusters .
Step 4: Set , .

Typically, we choose equal to the ROM order and the sampling vertices are chosen randomly following the reasoning outlined in Appendix A.

## 5 Clustering via reduced order graph-Laplacian

In this section we present the second clustering algorithm, based on transforming the ROM to the graph-Laplacian form. Unlike Algorithm 4.2, no spectral imbedding into is needed. Instead, the internal degrees of freedom are sampled directly from the eigendecomposition of the transformed ROM. The transformation of the ROM to the graph-Laplacian form is considered in Sections 5.1 and 5.2. The geometry of the corresponding reduced-order graph is discussed in Section 5.3, while the properties of the reduced-order graph-Laplacian (ROGL) approximating certain distances on the original graph are considered in Section 5.4. The clustering algorithm is presented in Section 5.5.

### 5.1 The third block-Lanczos process

The output of Algorithm 3, matrix , is an approximate projection of according to (15). Therefore, in order to obtain a ROM for the normalized symmetric graph-Laplacian itself, we need to consider the matrix .

Note that while , the output of the second block-Lanczos process in Algorithm 3, has the sparsified (block-tridiagonal) form, the inverse is in general completely filled in. Therefore, we transform to the block tridiagonal form, mimicking a finite-difference approximation of an elliptic symmetric semidefinite operator, as in [18]. This can be achieved by employing yet another deflated block-Lanczos process, third if we count the two in Algorithm 3, to compute the orthogonal matrix , the columns of which span the block Krylov subspace

 Kk2(T−12,E1)={E1,T−12E1,…,T−k2+12E1}, (28)

and the deflated block-tridiagonal matrix

 T3=QT3(T−12−s0I)Q3∈Rn2×n2. (29)

Compared to previous two Lanczos processes, deflation here is exact up to machine precision and allows to equivalently transform to without reducing its dimensionality. Therefore, the third one preserves the diffusion transfer function obtained after the second Lanzcos process exactly (up to machine precision):

 F3(p)=ET1(I−T3)pE1≡ET1(I−(T−12−s0I))pE1=F2(p), (30)

which follows trivially from (29) and . To emphasize the fact that is a ROM for , we rewrite (29) as

 (31)

The approximation error in (31) is roughly of the same order as the errors in (10) and (14) combined.

### 5.2 Transformation to the graph-Laplacian form

The ROM realization in terms of loses some important properties of graph-Laplacian matrices, namely the zero-sum of the rows and columns and nonpositivity of the off-diagonal elements, which is a known drawback of conventional model reduction methods. While the latter does not significantly affects the clustering quality, the former is critical [23]. The zero row sum condition can be (approximately) enforced by a rescaling of . Obviously, this is possible only if approximates the nullspace of . Therefore, we make the following assumption.

###### Assumption

Vector of ones We assume that the components of are on the support of and outside following the default convention of Remark 3 is in the range of . This is achieved by the accuracy control in Algorithm 3 discussed in Section 3. In practice this is approximate up to accuracy of ROM for late diffusion times.

###### Assumption

All entries of are nonzero.

The validity of Assumption 5.2 is discussed in Appendix A.

With the two assumptions above in mind we can define the scaling of that transforms it to the graph-Laplacian form. It is summarized in the proposition below.

###### Proposition

Let satisfy (31) with projection matrix satisfying Assumption 5.2, and

 z0=VTD1/21 (32)

satisfies Assumption 5.2.

Then

 ˜D1/2=\diag(z0)∈Rn2×n2, (33)

is a non-singular diagonal scaling matrix and

 ˜L=˜D1/2T3˜D1/2∈Rn2×n2, (34)

is a symmetric positive semidefinite matrix with the sum of elements in every row (column) equal to zero, which we refer to hereafter as the reduced-order graph-Laplacian (ROGL).

Since is symmetric positive semidefinite, then (31) in conjunction with (34)–(33) and Assumption 5.2 implies trivially that is symmetric and positive semidefinite.

From Assumption 5.2 it follows that , thus

 AVz0=0. (35)

Multiplying both sides from the left by and assuming for simplicity that (31) is exact we get

 T3z0=VTAVz0=0. (36)

If we denote the vector of all ones, then the definition (33) implies . Substituting this expression for into (36) and multiplying both sides by we arrive at

 ˜D1/2T3˜D1/21n2=˜L1n2=0, (37)

which concludes the proof.

### 5.3 Reduced-order graph geometry and implications for spectral clustering

To illustrate the performance of Algorithm 3 followed by the third block-Lanczos process (29) and rescaling (32)–(34) we display in Figure 2 the reduced-order graph corresponding to ROGL . It is computed for the same data set as the one used in Figure 1 with and . Since the first block of corresponds to the original graph’s vertices in the target subset , we label them as vertices through of the reduced-order graph, in general denoted hereafter by . While the edge weights of the reduced order graph are strongly non-uniform (see the discussion in Appendix A), we plot all edges in the same way just to highlight the deflated block-tridiagonal structure of . Deflation is observed in the shrinking of the blocks away from . Note that deflation results in additional two-fold reduction of ROM size from to . Overall, we observe about -fold model order reduction () compared to the original problem size ().

As discussed above and also in Appendix A, the weight matrix in (33) differs significantly from the diagonal of , unlike in the random-walk normalized graph Laplacian definition, which provides the best results for spectral clustering [6, 31]. This difference manifests itself as “ghost clusters” outside the target set, in particular in the parts of the reduced order graph corresponding to last blocks of that have little influence on diffusion at .

The above difficulty can be resolved using the following empirical procedure that introduces auxiliary clusters. Suppose the reduced order graph is clusterized into clusters, we denote by the number of such clusters that have a nonempty intersection with vertices in . This defines as a function of . This function is piece-wise constant with a number of plateaus, the intervals of where assumes the same value. Note that in practice . Next, suppose that is the desired number of clusters, then we choose the optimal value of , denoted by , such that the corresponding is as close as possible to (ideally, ) and is at the midpoint of the corresponding plateau. This ensures that the number of auxiliary clusters produces the number of clusters at the target subset as close as possible to the desired one in the most stable manner.

Although the above empirical procedure works well in our examples, more complicated problems may require a better algorithm for choosing with stability conditions not only on but also on the structure of the auxiliary clusters. Such an algorithm will provide extra flexibility in finding the optimal number of auxiliary clusters at a small cost of running an approximate k-means algorithm (possibly in parallel) for different choices of on the precomputed ROGL .

### 5.4 Commute-time and diffusion distances approximations by ROGL

Diffusion and commute-time distances are widely used measures that quantify random walks on graphs. In this section we show that the ROGL computed by Algorithm 3 followed by the third block-Lanczos process (29) and rescaling (32)–(34) preserves both distances on the target subset with exponential accuracy. For completeness of exposition we briefly discuss both distances first.

For denote by

the probability clouds on graph vertices

, i.e., the probability distributions at the discrete time

for a random walk originating at node at the discrete time . The diffusion distance between two nodes at time is defined as a distance between probability clouds of random walks originating at those nodes (see [10]):

 (Dpjk(G))2=∥Ppj(G)−Ppk(G)∥2D. (38)

It can be expressed in terms of normalized symmetric graph-Laplacian as

 (Dpjk(G))2=(√LjjeTj−√LkkeTk)(I−A)2p(√Ljjej−√Lkkek). (39)

Commute-time distance between two vertices is defined as the expected time it takes a random walk to travel from one vertex to the other and return back. Note that another metric, the so-called resistance distance, differs from the commute-time distance just by a constant factor. It can be expressed in terms of the graph-Laplacian as follows (see [31]):

 C2jk(G)=(eTj−eTk)L†(ej−ek) (40)

where is Moore-Penrose pseudo-inverse of . Replacing by its normalized symmetric counterpart , we obtain

 C2jk(G)=(1√LjjeTj−1√LkkeTk)A†(1√Ljjej−1√Lkkek). (41)

Both the diffusion distance for large times and the commute-time distance take into account the entire structure of the graph. Hence, they represent useful tools for graph clustering compared, e.g., to the shortest-path distance. The structure of ROGL is obviously totally different than the structure of the original . Therefore, even for the vertices in (that correspond to ) the shortest-path distance is not preserved in general. However, we establish below that commute-time and diffusion distances between the vertices of are approximately preserved in the reduced-order graph with exponential accuracy.

###### Proposition

Let and be the vertex sets of the original graph and the reduced-order graph with deflated block-tridiagonal ROGL , respectively. Then for any two vertices that correspond to vertices , we have

 Dpij,ik(G)≈Dpjk(˜G) and Cij,ik(G)≈Cjk(˜G) (42)

with exponential accuracy with respect to .

The proof is provided in Appendix C. Note that ROGL may have positive off-diagonal elements in general. Hence, the random walk on can be considered from a quasi-probability distribution point of view, i.e., with negative probabilities (first introduced in [13]) of traveling along certain edges.

### 5.5 Clustering algorithm

###### Algorithm (Reduced order graph-Laplacian clustering [ROGLC])

Input: normalized symmetric graph-Laplacian , target subset , number of desired clusters in , dimension of the imbedding subspace, numbers of steps , for the first and second stage deflated block-Lanczos processes, respectively, shift for the second stage block-Lanczos process and truncation tolerance .
Output: The optimal number of clusters in and the clusters .
Step 1: Use Algorithm 3 to compute the deflated block tridiagonal matrix and the orthogonal matrices and .
Step 2: Use Algorithm B to perform the deflated block-Lanczos process with steps on and with truncation tolerance , to compute the orthonormal basis for the block Krylov subspace (28) and the deflated block tridiagonal matrix

 T3=QT3(T−12+s0I)−1Q3. (43)

Here rounds the argument to the nearest integer greater or equal than the argument.
Step 3: Set and use (32)–(34) to compute the reduced-order graph-Laplacian and the diagonal normalization matrix .
Step 4: Compute the eigenvectors of the matrix pencil for smallest eigenvalues and assemble them in the matrix .
Step 5: Apply an approximate k-means algorithm, e.g., Lloyd’s, to for various values of to clusterize reduced-order graph nodes into clusters , and find , , as their nonempty intersections with .
Step 6: Compute the optimal numbers of clusters and as described in Section 5.3 and set the output clusters to the corresponding from Step 5.

## 6 Numerical results

We validate the performance of our clustering algorithms over three scenarios. The first one is a synthetic weighted 2D graph with circular clusters. The other two graphs were taken from SNAP repository [25]: one with ground-truth communities (E-Mail network) and one without (collaboration network of arXiv Astro Physics). For the sake of brevity, hereafter we refer to both clustering algorithms by their acronyms, i.e., to Algorithm 4.2 as HRVSC and to Algorithm 5.5 as ROGLC.

### 6.1 Synthetic 2D weighted graph

The first scenario we consider is a synthetic data set consisting of points in the 2D plane consisting of equally sized circular “clouds” of points (we reserve the name “clusters” to the subsets computed by the clustering algorithms), as shown in Figure 3. Corresponding to this data set, we construct a weighted fully connected graph with the corresponding graph-Laplacian entries defined via the heat kernel:

 Lij=−e−||xi−xj||2/τ2,i≠j, (44)

where is the standard Euclidean norm in 2D. The similarity measure and, consequently, graph clustering depend strongly on the choice of parameter in (44). For small the heat kernel is close to Dirac’s function, so the distance between any two vertices is close to . In contrast, for large the distance between all vertices is approximately the same. The most difficult and interesting case for clustering would be an intermediate case, i.e. when the nodes from each cloud form their separate clusters, but each cluster remains coupled with some of its neighbors. For the numerical experiments in this scenario we choose .

We choose a target subset corresponding to two randomly selected points from each of the clouds. We benchmark the clustering of using our HRVSC and ROGLC algorithms against the random-walk normalized graph Laplacian spectral clustering (RWNSC) of the full graph, as described in [31]. We used parameters , for Algorithm 3. For these parameters and threshold no deflation occurred, so the resulting Krylov subspaces have dimensions and , respectively. For the clustering with both our algorithms and RWNSC we use the spectral imbedding into the subspace of dimension . We used the same number of clusters for all 3 algorithms. That yielded and for ROGLC.

In this example we benchmarked just the clustering outcome and not the computational time. Indeed, for such a small dataset Lloyd’s method took a fraction of second to compute the clusters for all the algorithms. The comparison of all three approaches is shown in Figure 4. We observe that all three algorithms managed to separate the points in from all clouds into separate clusters. Obviously, while for ROGLC, not all of these auxiliary clusters contain the vertices from the target set. In Figure 4 we only display the clusters containing vertices from , all auxiliary degrees of freedom (including the vertices in for RWNSC) are discarded.

### 6.2 E-Mail network with ground-truth communities

In the second scenario we consider the graph of “email-Eu-core” network generated using email data from a large European research institution, available at the SNAP repository [25]. The original graph is directed, so we symmetrize for our numerical experiment. The network consists of nodes that are split into

ground-truth communities. However, not all of the ground-truth communities are clearly recognizable from the graph structure. For example, the graph contains outliers, i.e., isolated vertices that are contained in some nontrivial ground-truth communities. Hence, any clustering algorithm preserving the structure of the simply-connected components would put each of those vertices into separate clusters.

We consider the clustering of consisting of two randomly selected vertices from each of the ground-truth communities. For the reference clustering with RWNSC we use the spectral imbedding into the subspace of dimension and set . The same parameters were used for HRVSC algorithm. For ROGLC we used the same imbedding subspace dimension and that corresponds to ? The numbers of block-Lanczos steps for Algorithm 3 were set at and . Similarly to the previous example, for threshold no deflation occurred, so the resulting Krylov subspace dimensions are and .

Similarly to the synthetic example in Section 6.1, the data set is too small for a meaningful comparison of computational time. The typical results of clustering comparing the three algorithms are shown in Figure 5. We display the ground-truth communities in that were reproduced by the clustering algorithms, i.e., the corresponding vertices in were assigned to the same clusters without mixing with the vertices from other ground-truth communities. In this example our algorithms managed to recover slightly more communities than RWNSC. We ran multiple scenarios with different choices of , and in approximately half of the cases our algorithms recover the same number of communities as RWNSC, while in remaining half of the cases HRVSC and ROGLC recover slightly more communities, with the latter being the best performer. We can speculate that better performance is achieved due to smaller size of data set that allows us to exploit Lloyd’s algorithm in the regime where it is the most robust and efficient.

Dimensionality reduction not only makes Lloyd’s algorithm more robust, but it also allows to apply more accurate approximations to k-means like those based on semi-definite programming (SDP). These algorithms typically provide more accurate clustering results, but become infeasible for large data sets due to the prohibitive computational cost. In our next example we show the benefits of replacing Lloyd’s method by SDP [27] in ROGLC. We chose to consist of two randomly chosen vertices from randomly chosen ground-truth communities. Our reference clustering with RWNSC was the same as in the previous example. For ROGLC we used the numbers of block-Lanczos steps and which resulted in Krylov subspaces of dimension and . The number of clusters was set to .

In Figure 6 we compare the typical results of clusterizations using RWNSC and two variants of ROGLC, with Lloyd’s algorithm and with SDP. Similarly to the previous example, ROGLC with Lloyd’s algorithm reproduced more communities compared to conventional RWNSC ( against ). Moreover, replacing Lloyd’s algorithm with SDP in ROGLC allowed to recover one more community. As mentioned above, the size of the target subset was reduced to since running SDP for would be computationally infeasible. We ran multiple experiments with different random choices of and observed that ROGLC with SDP always performed as well or better than with Lloyd’s lagorithm in terms of the number of recovered ground-truth communities. We should stress out that 20 was the maximum size of target subset for which applying the SDP algorithm was feasible due to rapid complexity increase.

### 6.3 Astro Physics collaboration network

Following [11], in the third scenario we consider “ca-AstroPh” network (available at [25]) representing collaborations between papers submitted to Astro Physics category of the e-print arXiv. This is our largest example graph with vertices and connected components. The relatively large graph size presents multiple challenges for clustering algorithms.

It is crucial to ensure that different connected components are separated into different clusters. As was shown in [11] on the same data set, the conventional RWNSC algorithm does not guarantee that. A necessary condition for RWNSC algorithm to achieve this objective is to compute the entire null-space of the graph-Laplacian. In contrast, for HRVSC and ROGLC clustering of the target subset we only need to take into account the connected components of the graph that have nonempty intersections with . Hence, for small we only need to compute a small subspace of . Since the eigenmode computation in RWNSC is typically performed using Krylov subspaces similar to (7), RWNSC would require a significantly larger Krylov subspace compared to our algorithms.

Since the ground-truth clustering is not available for this scenario, we modify our testing methodology from the one used in the previous two.

First, we clusterize the full graph into clusters using RWNSC with the imbedding subspace of dimension , which becomes our reference clustering. Here Lloyd’s algorithm takes seconds. In order to avoid difficulties in finding the connected components (as reported in [11]) we made a special choice of the initial guess for the k-means algorithm for RWNSC.

Next, we chose of two random vertices per each of some randomly chosen reference clusters. The goal of this benchmark is to check whether our algorithms can put the vertices of in the reference clusters. For both our algorithms we used the imbedding subspace of dimension . We took and block-Lanczos steps in Algorithm 3. In this example some connected components are significantly smaller than others. This results in early deflation of Krylov subspace (7) if contains vertices from small connected components. In particular, for the random realization of reported here, the resulting Krylov subspaces have dimensions and , compared to , .

In Figure 7 we display the clusters of computed using HRVSC and ROGLC with , for which the latter produced and . We observe that both our algorithms have successfully reproduced all the reference clusters of . Due to small in HRVSC and ROGLC, Lloyd’s algorithm took less than second, compared to seconds reported above for full RWNSC clustering, which represents a significant computational speed-up.

Note that in this example the dimension of Krylov subspace at the first stage of Algorithm 3 is . In contrast, RWNSC with ARPACK eigensolver [24] requires a subspace of dimension . Moreover, for smaller the difference between the sizes of Krylov subspaces becomes even more pronounced. For example, for of two random vertices from each of the three randomly chosen reference clusters, it suffices to take to reproduce the reference clustering, as shown in Figure 8.

## 7 Summary and future work

We presented two algorithms for spectral clustering of a target subset of graph vertices based on Krylov subspace model order reduction. Both algorithms use the two-stage ROM construction procedure consisting of two block-Lanczos processes that compute a reduced model for the graph-Laplacian that approximates the diffusion transfer function on the target subset. Since the clustering of the target subset must be consistent with the overall graph structure, both algorithms need to sample the degrees of freedom corresponding to the vertices outside the target subset. While the first algorithm, “harmonic Ritz vectors sampling clustering” (HRVSC), achieves this by sampling the harmonic Ritz vectors of the ROM, the second one, “reduced order graph-Laplacian clustering” (ROGLC), transforms the ROM further to a form resembling a graph-Laplacian and then samples its eigenvectors.

The performance of both proposed clustering algorithms is verified numerically on one synthetic and two real data sets against the conventional random-walk normalized graph Laplacian spectral clustering (RWNSC) of the full graph. For the scenarios with the synthetic and one real data set our algorithms show results fully consistent with the RWNSC. In the scenario with real data containing multiple outliers (isolated vertices), our algorithms produce better results on the target subset than RWNSC by detecting more clusters corresponding to the ground-truth communities. For all the cases our algorithms show clear computational advantages, such as reduction of the dimension of Krylov subspaces compared to partial eigensolvers required for the RWNSC.

Our algorithms can be used as standalone to clusterize a representative “skeleton” subset when the full graph clustering is not required. However, a more important application is to use them as building blocks of a divide-and-conquer type method for clustering of the full graph. We believe that such approach has multiple advantages. First, if graph vertices are split into disjoint target subsets, each of them can be clusterized independently thus making this step perfectly parallelizable.

Second, since the approximate k-means algorithms are applied only to the target subsets individually, we avoid the problems of their reduced robustness with respect to the initial guess for large data sets. Once the individual subset clusters are computed, one needs to merge them to obtain the clusterization of the full graph. This is where another advantage lies, as the merging step provides for greater flexibility and the possibility to control the quality of the clustering. The development of a divide-and-conquer clustering method is a topic of the ongoing research and will be reported in a separate paper.

Third, the small size and special structure of the ROGL opens potential opportunities to substitute simple k-means approximations, e.g., Lloyd’s algorithm, by more accurate and expensive ones. For example, one can use SDP relaxations of k-means [11, 28, 32] that produce better results in certain cases, but do not scale well with the size of the problem.

Finally, while the focus of this work is spectral clustering, the ROGL developed here has other potential uses. For example, one may look into adapting the min/max cut graph clustering algorithms (e.g., [22, 12, 20]) to graphs with edge weights of arbitrary signs, like those produced by ROGL construction. Another example could be the direct solution of the NP-hard problem of modularity optimization [26] infeasible for large graphs, but computationally tractable for small target subsets.

## Acknowledgments

This material is based upon research supported in part by the U.S. Office of Naval Research under award number N00014-17-1-2057 to Mamonov. Mamonov was also partially supported by the National Science Foundation Grant DMS-1619821. Druskin acknowledges support by Druskin Algorithms.

The authors thank Andrew Knyazev, Alexander Lukyanov, Cyrill Muratov and Eugene Neduv for useful discussions.

## Appendix A Interpretation in terms of the finite-difference Gaussian rules

To connect the clustering approaches presented here to the so-called finite-difference Gaussian rules, a.k.a optimal grids [16, 17, 21, 3, 4, 8], we view the random-walk normalized graph-Laplacian as a finite-difference approximation of the positive-semidefinite elliptic operator

 Lu(x)=−1σ(x)∇⋅[σ(x)∇u(x)], (45)

on a grid uniform in some sense defined on the data manifold, e.g., see [6]. Note that since we assume the grid to be “uniform”, all the variability of the weights of is absorbed into the coefficient .

For simplicity, following the setting of [7], let us consider the single input/single output (SISO) 1D diffusion problem on , :

 ut(x,t)−1σ(x)[σ(x)ux(x,t)]x=0,u(x,0)=δ(x),ux(0,t)=0,ux(1,t)=0, (46)

with a regular enough , and the diffusion transfer function defined as

 F(t)=u(0,t). (47)

Since both input and output are concentrated at , the “target set” consists of a single “vertex” corresponding to . Therefore, it does not make sense to talk about clustering, however, we can still use the SISO dynamical system (46)–(47) to give a geometric interpretation of the imbedding properties of our reduced model and to provide the reasoning for Assumption 5.2.

The ROM (34)–(33) constructed for the system (46) transforms it into

 ˜ut−˜D−1˜L˜u=0,˜u|t=0=˜D−1e1, (48)

where , , , with , and is the second order finite-difference operator defined by

 [˜L˜u]i=σihi(˜ui−˜ui−1)−σi+1hi+1(˜ui+1−˜ui),i=1,…,n2, (49)

with and defined to satisfy the discrete Neumann boundary conditions

 ˜u0=˜u1,˜un2=˜un2+1. (50)

As we expect from Sections 5.1 and 5.2, is indeed a tridiagonal matrix. Parameters , can be interpreted as the steps of primary and dual grids of a staggered finite-difference scheme, respectively, whereas are respectively the values of at the primary and dual grid points. Assumption 5.2 then follows from the positivity of primary and dual grid steps (a.k.a. Stieljes parameters) given by the Stieljes theorem [16]. Note that the approximation condition (14) in this case means .

As an illustration, we display in Figure 9 the optimal grid with steps , computed for . The continuum operator is discretized on an equidistant grid on with nodes, i.e., . The optimal grid steps were computed using the gridding algorithm from [14], which coincides with Algorithm 3 and the subsequent transformations (29) and (34) for the 1D Laplacian operator. We observe that the grid is imbedded in the domain and has a pronounced stretching away from the origin. We should point out, that such stretching is inconsistent with random-walk normalized graph-Laplacian formulation, employing uniform grids. Grid non-uniformity is the price to pay for the spectral convergence of the transfer function approximation. One can view Algorithm 4.2 as imbedding of the reduced-order graph back to the space of the original normalized random walk graph Laplacian . The randomized choice of sampling vertices provides a uniform graph sampling.

For the general MIMO problem, Proposition 5.2 yields a symmetric positive-semidefinite block-tridiagonal with zero sum of elements in every row. The classical definition of graph-Laplacians requires non-positivity of the off-diagonal entries, which may not hold for our . However, it is known that operators with oscillating off-diagonal elements still allow for efficient clustering if the zero row sum condition remains valid [23]. Matrix still fits to a more general PDE setting, i.e., when the continuum Laplacian is discretized in an anisotropic media or using high order finite-difference schemes. Such schemes appear naturally when one wants to employ upscaling or grid coarsening in approximation of elliptic equations in multidimensional domains, which is how we can interpret the transformed ROM (34).

## Appendix B Deflated block-Lanczos tridiagonalization process for symmetric matrices

Let be a symmetric matrix and be a tall () orthogonal matrix: , where is the identity matrix. The conventional block-Lanczos algorithm successively constructs an orthonormal basis, the columns of the orthogonal matrix , for the block Krylov subspace

 K