An important objective in modern machine learning, and part of many scientific and data analysis pipelines, is clustering [berkhin2007survey]. By clustering, we generally mean the separation of data into groups, in a way that is somehow meaningful for the domain-specific relationships that govern the underlying data and problem in question. However, the demands that the clustering scheme should satisfy are of course inherently vague.
For data that form a point cloud in Euclidean space, and where one expects clusters to exist, one may employ elementary methods such as $k$-means clustering [steinhaus1956division]. For data in a more abstract “similarity space”, for which no obviously meaningful Euclidean embedding exists, researchers invented the schemes [sorensen1948method, sneath1957application] that we today refer to as hierarchical clustering. Alternatively, one can derive a graph structure from some notion of similarity between the data points. Treating the data points as vertices of a graph allows one to exploit the popular and highly successful spectral clustering techniques [vonluxburg2007tutorial, ng2001], which developed from the field of spectral graph theory [chung1997spectral].
Although the graph structure provides us with additional information about the data, graphs are intrinsically limited to modeling pairwise interactions. The success of topological methods in studying data, and the parallel establishment of topological data analysis (TDA) as a field [edelsbrunner2000topological, zomorodian2005computing] (see also [carlsson2008, chazal2017, edelsbrunner2010computational, ghrist2008barcodes] for modern introductions and surveys), have confirmed the usefulness of viewing data through a higher-dimensional analog of graphs [moore2012, patania2017]. Such a higher-dimensional analog is called a simplicial complex, a mathematical object whose structure can describe $k$-fold interactions between points. Their ability to capture hidden patterns in the data has led to various applications from biology [giusti2015, reimann2017] to materials science [hiraoka2016]. Recent work has also expanded classical graph-centric results — such as Cheeger inequalities [gundert2014higher, braxton2018], isoperimetric inequalities [parzanchevski2016] and spectral methods [horak2013spectra] — to simplicial complexes. This leads naturally towards a novel domain of “spectral TDA” methods.
In this paper we present the harmonic clustering algorithm, a novel clustering scheme inspired by the well-known spectral clustering algorithm for graphs. Our method, like spectral clustering, does not require any parameter optimization and involves only computing the smallest-eigenvalue eigenvectors of a sparse matrix. The harmonic clustering algorithm is applied directly to a simplicial complex and it outputs a clustering of the simplices (of a fixed degree) that is sensitive to the homological structure of the complex, something that is highly relevant in TDA. Moreover, since simplices can encode interactions of higher order than just the pairs captured by graphs, our algorithm allows us to cluster complex community structures rather than just the entities they comprise.
Our method can be seen as complementary to the one presented in [braxton2018].
1.1 Spectral graph theory
The method we present in this paper does not require many formal results from spectral graph theory. The notions relevant for our purposes are described below for the sake of completeness.
In its simplest form, the Laplacian of an undirected and unweighted finite graph $G$ is taken to be the positive semi-definite matrix $L = D - A$, where $A$ is the adjacency matrix of $G$ and $D$ its diagonal degree matrix (i.e. the row/column sums of $A$). The normalized Laplacian is then defined as $\mathcal{L} = D^{-1/2} L D^{-1/2}$. For reasons that will become clear later on (see 2.1), we will write $C_0$ for the free real vector space generated by the vertices of $G$, and consider $L$ as the matrix of a linear map in this basis.
Already in the middle of the 19th century it was clear that the eigenvalue spectrum of $L$ has a lot to say about $G$, as is evident from Kirchhoff’s historic theorem relating the eigenvalues of the Laplacian to the number of spanning trees in the graph [kirchhoff1847]. From the 1950s onward, graph theorists and quantum chemists independently discovered more relationships between a graph and the eigenspectrum of its Laplacian. However, the publication of the book [cvetkovic1979] may be said to mark the start of spectral graph theory as a field in its own right. A modern introduction to the field and references to the results listed below can be found in [chung1997spectral].
The spectrum of $L$ encodes information about the connectivity of the graph. For instance, the number of connected components of the graph is equal to the dimension of the kernel of $L$. Moreover, the eigenvectors associated to the zero eigenvalue, also called harmonic representatives, take constant values on connected components. A perhaps more interesting result involves the Cheeger constant [cheeger1969lower], a measure of how far a connected graph is from being disconnected, which bounds the smallest non-zero eigenvalue of $\mathcal{L}$.
Theorem 1 (Cheeger [cheeger1969lower]; see also e.g. [chung1997spectral]).
Let $G$ be a finite, connected, undirected, unweighted graph with vertex set $V$ and edge set $E$. Write $\mathcal{P}$ for the pairs $(A, B)$ of non-empty subsets $A, B \subseteq V$ such that $A \cup B = V$ and $A \cap B = \emptyset$.
Define the Cheeger constant of $G$ as
\[
h(G) = \min_{(A, B) \in \mathcal{P}} \frac{\lvert E(A, B) \rvert}{\min\{\operatorname{vol}(A), \operatorname{vol}(B)\}},
\]
where $E(A, B) \subseteq E$ is the set of edges with one endpoint in $A$ and the other in $B$, and $\operatorname{vol}(A)$ is the sum of the degrees of the vertices in $A$.
Then the first non-zero eigenvalue $\lambda_1$ of the graph’s normalized Laplacian $\mathcal{L}$ satisfies
\[
\frac{h(G)^2}{2} \leq \lambda_1 \leq 2\, h(G).
\]
A partition of $V$ as $A \sqcup B$ that attains the Cheeger constant is called a Cheeger cut. It is known that finding an exact Cheeger cut is an NP-hard problem [szlam2010total]. One of the best-known approaches to approximating the Cheeger cut is the spectral clustering method, which takes the eigenvector associated to the first non-zero eigenvalue of the graph Laplacian as a relaxed real-valued solution of the original discrete optimization problem [vonluxburg2007tutorial]. Namely, this eigenvector of $\mathcal{L}$, also called the Fiedler vector or the connectivity vector [fiedler1973algebraic], can be exploited to find the best partition of the graph into two “almost disconnected” components. The Cheeger cut can be easily generalized to find $k$ “almost disconnected” components using the first $k$ non-zero-eigenvalue eigenvectors of the graph Laplacian [vonluxburg2007tutorial].
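To make this concrete, the following minimal NumPy sketch (the barbell graph and all variable names are our own illustrative choices, not taken from any cited work) computes the Fiedler vector of two triangles joined by a bridge edge; the sign pattern of its entries recovers the intuitive Cheeger cut through the bridge.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for an adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return np.eye(len(A)) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

# Barbell graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

eigvals, eigvecs = np.linalg.eigh(normalized_laplacian(A))

# eigvals[0] is (numerically) zero since the graph is connected;
# eigvecs[:, 1] is the Fiedler vector, and its sign pattern cuts the bridge.
fiedler = eigvecs[:, 1]
cut = fiedler > 0
```

The cut separates vertices 0–2 from vertices 3–5, which is exactly the partition severing the single bridge edge.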
1.2 Graph spectral clustering
The fact that the Fiedler vector is a relaxed solution of the Cheeger cut has implications for clustering the vertices of a graph into “almost disconnected” components [vonluxburg2007tutorial, belkin2002laplacian]. For the remainder of this section we will assume that the graph under consideration is connected.
Graph spectral clustering of a graph $G$ with Laplacian $\mathcal{L}$ works in two steps. First, one uses the information encoded in the lowest-eigenvalue eigenvectors of $\mathcal{L}$ to map the vertex set $V$ into low-dimensional Euclidean space. One thereafter uses standard $k$-means or any applicable Euclidean clustering technique on the points in the image of this map, before pulling the clustering back to $V$. Specifically, we will write $v_1, \dots, v_k$ for the eigenvectors associated with the first $k$ non-zero eigenvalues of $\mathcal{L}$. One defines a function $f \colon V \to \mathbb{R}^k$, also called a spectral embedding, by
\[
f(x) = \left( \langle x, v_1 \rangle, \dots, \langle x, v_k \rangle \right),
\tag{1}
\]
where $\langle -, - \rangle$ is the inner product on $C_0$ that makes $V$ orthonormal. As a finite Euclidean point cloud, $f(V)$ is then clustered in $\mathbb{R}^k$ by standard $k$-means or any suitable clustering algorithm. The clustering obtained is then pulled back to $V$. Figure 1 shows an example. Observe that in this case, mapping into the real line using only the Fiedler vector would suffice (i.e. $k = 1$).
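The two-step procedure can be sketched end to end. This is a toy illustration under our own choices (a chain of three triangles, the unnormalized Laplacian, and a hand-rolled Lloyd iteration instead of a library $k$-means implementation):

```python
import numpy as np

def spectral_embedding(A, k):
    """Map vertices to R^k via the k lowest non-trivial Laplacian eigenvectors."""
    L = np.diag(A.sum(axis=1)) - A  # unnormalized Laplacian; normalized works too
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1:k + 1]         # skip the constant zero-eigenvalue vector

def kmeans(points, init, n_iter=50):
    """A minimal Lloyd iteration; `init` is an array of starting centroids."""
    centroids = init.astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels

# Three triangles joined in a chain by single bridge edges.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5),
         (6, 7), (6, 8), (7, 8), (2, 3), (5, 6)]
A = np.zeros((9, 9))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

Y = spectral_embedding(A, 2)           # one point in R^2 per vertex
labels = kmeans(Y, init=Y[[0, 4, 8]])  # pull the clustering back to the vertices
```

With this embedding the three triangles form three well-separated groups in $\mathbb{R}^2$, and the Lloyd iteration assigns each triangle its own label.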
As pointed out in [vonluxburg2007tutorial], spectral clustering is one of the standard approaches to identify groups of “similar behavior” in empirical data. It is therefore not surprising that it has been successfully employed in many fields ranging from computer science and statistics to biology and social science. Moreover, compared to other approaches, such as Gaussian Mixture Model clustering, spectral clustering does not require any parameter optimization and can be solved efficiently by standard linear algebra methods.
2 Harmonic clustering in simplicial complexes
Our method is inspired by spectral clustering in graphs, but applies instead to a higher-dimensional analog, namely simplicial complexes. Instead of clustering only vertices, which are the zero-dimensional building blocks of graphs and simplicial complexes, the method independently clusters building blocks of any fixed dimension.
This section outlines the prerequisite basic constructions from algebraic topology before describing our method. A reader interested in more background on algebraic topology is directed to standard textbooks [hatcher].
Those wishing a quick overview of the method can view it in algorithmic form in figure 3.
2.1 Algebraic topology
A simplicial complex is a collection of finite sets closed under taking subsets. We refer to a set in a simplicial complex as a simplex of dimension $k$ if it has cardinality $k + 1$. Such a $k$-simplex $\sigma$ has $k + 1$ faces of dimension $k - 1$, namely the subsets omitting one element, which we will denote as $\sigma_{-i}$ when omitting the $i$’th element. While this definition is entirely combinatorial, we will soon see that there is a geometric interpretation, and it will make sense to refer to and think of $0$-simplices as vertices, $1$-simplices as edges, $2$-simplices as triangles, $3$-simplices as tetrahedra, and so forth.
Let $C_k$ be the free real vector space with basis $X_k$, the set of $k$-simplices in a simplicial complex $K$. The elements of $C_k$ are called $k$-chains. These vector spaces come equipped with boundary maps, namely linear maps $\partial_k \colon C_k \to C_{k-1}$ defined by
\[
\partial_k \sigma = \sum_{i=0}^{k} (-1)^i \sigma_{-i},
\]
with the convention, for convenience, that $C_k = 0$ and $\partial_k = 0$ when $k < 0$ or $k$ exceeds the dimension of $K$. Figure 2 shows how the boundary maps give a geometric interpretation of simplicial complexes.
One readily verifies that $\partial_k \circ \partial_{k+1} = 0$, and so $(C_\bullet, \partial_\bullet)$ is a real chain complex. By the $k$’th homology vector space of $K$ we will mean the $k$’th homology vector space of this chain complex, namely
\[
H_k(K) = \ker \partial_k / \operatorname{im} \partial_{k+1}.
\]
The elements of $\ker \partial_k$ are called $k$-cycles, while those of $\operatorname{im} \partial_{k+1}$ are called $k$-boundaries, as can be seen geometrically in Figure 2. The Betti numbers are the dimensions of the homology vector spaces, and we write $\beta_k = \dim H_k(K)$. Intuitively, the Betti numbers count connected components, non-bounding loops, non-bounding cavities, and so forth.
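As a sanity check of these definitions, the boundary matrix of a hollow triangle (our own toy example, assuming NumPy) has a one-dimensional kernel in degree 1, recovering $\beta_0 = \beta_1 = 1$ for a circle:

```python
import numpy as np

def boundary_matrix(k_simplices, km1_simplices):
    """Matrix of the boundary map, columns indexed by k-simplices (sorted tuples)."""
    index = {s: i for i, s in enumerate(km1_simplices)}
    B = np.zeros((len(km1_simplices), len(k_simplices)))
    for j, s in enumerate(k_simplices):
        for i in range(len(s)):
            face = s[:i] + s[i + 1:]        # omit the i-th vertex
            B[index[face], j] = (-1) ** i   # alternating sign from the definition
    return B

# Hollow triangle: three vertices, three edges, no 2-simplex (a circle).
V = [(0,), (1,), (2,)]
E = [(0, 1), (0, 2), (1, 2)]
B1 = boundary_matrix(E, V)

rank = np.linalg.matrix_rank
beta0 = len(V) - rank(B1)        # dim ker(d0) - rank(d1), with d0 = 0
beta1 = len(E) - rank(B1)        # dim ker(d1) - rank(d2); no 2-simplices here
```

Here `B1` has rank 2, so `beta0` and `beta1` both come out as 1, matching the intuition that a circle has one connected component and one non-bounding loop.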
We emphasize again that this is homology with real coefficients, not with the integer or finite-field coefficients that are common in TDA.
2.2 Simplicial Laplacians
We are in this paper concerned with finite simplicial complexes, and assume that they are built in a way that encodes useful information about the data being studied. We will briefly discuss the case where each simplex in $K$ comes equipped with extra data — including, but not limited to, the filtration/weighting information that is ubiquitous in TDA — or with a normalization factor derived from the complex’s structure, in the form of a function $w \colon X \to \mathbb{R}_{>0}$ on the set $X$ of all simplices. The latter is analogous to the various normalization schemes that are often used in graph spectral theory. Our computational experiments, however, will only consider the case $w \equiv 1$.
The weights are encoded into the chain complex by endowing each degree with the inner product that makes all simplices orthogonal, and gives each simplex the norm given by its weight, i.e.
\[
\langle \sigma, \tau \rangle = w(\sigma)^2 \, \delta_{\sigma\tau}
\]
for simplices $\sigma, \tau$ of the same degree.
Further discussions of weighting schemes can be found in [horak2013spectra].
We place no further assumptions on the simplicial complex that we take as input. In particular, it is not necessary for it to come equipped with some embedding into Euclidean space, nor do we demand that it triangulates a Riemannian manifold. Therefore dualities like the Hodge star, which is used to construct the Hodge–de Rham Laplacian in the smooth setting [madsen1997calculus] that motivates us, are unavailable for our method. The same is true for discrete versions of the Hodge star, such as that of Hirani [hirani2003thesis]. Instead of dualizing with respect to a Hodge star, to define a discrete version of the Laplacian for simplicial complexes, we simply take the linear adjoint of the boundary operator with respect to the inner product, defining $\partial_k^* \colon C_{k-1} \to C_k$ by
\[
\langle \partial_k^* x, y \rangle = \langle x, \partial_k y \rangle
\quad \text{for all } x \in C_{k-1},\ y \in C_k.
\]
In analogy with Hodge–de Rham theory, we then define the degree-$k$ simplicial Laplacian of a simplicial complex $K$ as the linear operator $\Delta_k \colon C_k \to C_k$ such that
\[
\Delta_k = \partial_{k+1} \partial_{k+1}^* + \partial_k^* \partial_k.
\]
The harmonics are defined as
\[
\mathcal{H}_k = \ker \Delta_k.
\]
Observe that there are $d + 1$ Laplacians $\Delta_0, \dots, \Delta_d$ for a complex of dimension $d$. In most practical applications, the matrices for the Laplacians are very sparse and can easily be computed as products of sparse boundary matrices and their transposes.
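For example, for a hollow triangle (a toy complex of our choosing) the degree-1 Laplacian reduces to $\partial_1^* \partial_1$, and its kernel is spanned by the cycle going around the hole. A minimal sketch, assuming NumPy and $w \equiv 1$ (so the adjoint is just the transpose):

```python
import numpy as np

# Boundary matrix of d1 for a hollow triangle, edges (0,1), (0,2), (1,2);
# columns follow the alternating-sign definition of the boundary map.
B1 = np.array([[-1., -1.,  0.],
               [ 1.,  0., -1.],
               [ 0.,  1.,  1.]])

# Delta_1 = d2 d2* + d1* d1; there are no 2-simplices, so the first term vanishes.
L1 = B1.T @ B1

eigvals, eigvecs = np.linalg.eigh(L1)
harmonics = eigvecs[:, np.isclose(eigvals, 0.0)]  # kernel of Delta_1
```

The kernel is one-dimensional (so $\beta_1 = 1$), and the single harmonic is, up to sign, the normalized non-bounding cycle $((0,1) - (0,2) + (1,2))/\sqrt{3}$ around the hole.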
The following discrete version of the important Hodge decomposition theorem is a simple exercise in linear algebra in the current setting.
Theorem 2 (Eckmann [eckmann1944]).
The vector spaces of chains decompose orthogonally as
\[
C_k = \operatorname{im} \partial_{k+1} \oplus \mathcal{H}_k \oplus \operatorname{im} \partial_k^*,
\]
with $\mathcal{H}_k \cong H_k(K)$. Moreover,
the harmonics are both cycles and cocycles (i.e. cycles with respect to $\partial^*$), and
the harmonics are the norm-minimal representatives of their (co)homology classes, i.e. if $h \in \mathcal{H}_k$ and $c \in C_k$ are homologous, then $\lVert h \rVert \leq \lVert c \rVert$.
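Both parts of the theorem can be checked numerically on a small example of our own devising (two triangles glued along an edge, only one of them filled in), again assuming NumPy and $w \equiv 1$:

```python
import numpy as np

def boundary(simplices, faces):
    """Matrix of the boundary map, columns indexed by `simplices`."""
    idx = {f: i for i, f in enumerate(faces)}
    B = np.zeros((len(faces), len(simplices)))
    for j, s in enumerate(simplices):
        for i in range(len(s)):
            B[idx[s[:i] + s[i + 1:]], j] = (-1) ** i  # omit the i-th vertex
    return B

# Two triangles glued along the edge (1,2); only (0,1,2) is filled in,
# so beta_1 = 1 (the hollow triangle on vertices 1, 2, 3).
V = [(0,), (1,), (2,), (3,)]
E = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
T = [(0, 1, 2)]
B1, B2 = boundary(E, V), boundary(T, E)

L1 = B2 @ B2.T + B1.T @ B1
vals, vecs = np.linalg.eigh(L1)
H = vecs[:, np.isclose(vals, 0.0)]  # harmonic 1-chains; dim H = beta_1 = 1

# Hodge: dim C_1 = rank(d2) + dim(H_1) + rank(d1*); here 5 = 1 + 1 + 3.
r1, r2 = np.linalg.matrix_rank(B1), np.linalg.matrix_rank(B2)

# Norm-minimality: adding a boundary to a harmonic cycle only grows the norm.
h = H[:, 0]
c = h + B2[:, 0]  # a homologous, non-harmonic cycle
```

Since `h` is orthogonal to the boundary `B2[:, 0]`, the Pythagorean theorem forces $\lVert c \rVert > \lVert h \rVert$, as the theorem asserts.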
The first detailed work on the spectral properties of this kind of simplicial Laplacian was carried out by Horak and Jost [horak2013spectra]. Recently Steenbergen et al. [steenbergen2014cheeger] provided a notion of a higher-dimensional Cheeger constant for simplicial complexes. At the same time, Gundert and Szedlák [gundert2014higher] proved a lower bound for a modified version of the higher-dimensional Cheeger constant for simplicial complexes, which was later generalized to weighted complexes by Braxton et al. [braxton2018]. Mukherjee and Steenbergen [mukherjee2016random] developed an appropriate notion of random walks on simplicial complexes, and related the asymptotic properties of these walks to the simplicial Laplacians and harmonics. It is worth mentioning that, to the best of our knowledge, no connection between the eigenvectors of the simplicial Laplacian and an optimal cut for simplices in higher dimensions is known.
Our contribution is a notion of spectral clustering for simplicial complexes using the harmonics.
2.3 Harmonic clustering
Observe that the ordinary graph Laplacian, as described in section 1.1, is just the matrix of $\Delta_0$ in the standard basis for $C_0$. The function $f$ in equation (1) can thus be seen as projecting the $0$-simplices onto a subspace of low-but-non-zero-eigenvalue eigenvectors of $\Delta_0$. The zero part of the spectrum is not used. Theorem 2 makes the reason clear: harmonics in $\mathcal{H}_0$ have the same coefficient for every vertex in a connected component of the graph. As connectivity information is easy to obtain anyway, there is little use in adding these eigenvectors to the subspace that $f$ projects onto. This is not so for the higher Laplacians. In fact, our method primarily uses the harmonics, and only optionally ventures into the non-zero part of the eigenspectrum.
In what follows, $K$ is a fixed simplicial complex arising from data. The particulars of how $K$ was built from data are outside the scope of this paper, and this is a topic that is well studied in the field of TDA in general. Our goal is to obtain a useful clustering of $X_k$ for some chosen degree $k$. We assume that $K$ is of low “homological complexity” in degree $k$, by which we mean that $\beta_k$ is small.
Analogously to above, we define the harmonic embedding
\[
f_k = \phi \circ \pi \colon X_k \to \mathbb{R}^{\beta_k},
\]
where $\pi \colon C_k \to \mathcal{H}_k$ is orthogonal projection, and $\phi \colon \mathcal{H}_k \to \mathbb{R}^{\beta_k}$ is any vector space isomorphism. In practice, we simply pick an orthonormal basis $h_1, \dots, h_{\beta_k}$ for $\mathcal{H}_k$ and let
\[
f_k(\sigma) = \left( \langle \sigma, h_1 \rangle, \dots, \langle \sigma, h_{\beta_k} \rangle \right).
\]
In many situations of practical use, it turns out that many points in $f_k(X_k)$ lie along one-dimensional subspaces of $\mathbb{R}^{\beta_k}$. The membership of a point in such a subspace is what is used to cluster the corresponding $k$-simplex (or to leave it unclustered in case it is not judged to be sufficiently close to lying in one of the subspaces). This amounts to clustering $X_k$ by performing Euclidean subspace clustering of $f_k(X_k)$. A variety of Euclidean subspace clustering methods are available, but these are outside the scope of this paper. Examples include independent component analysis [ica], SUBCLU [subclu], and density maximization on the unit sphere in $\mathbb{R}^{\beta_k}$ (or, more precisely, on the corresponding projective space), which itself has a multitude of approaches, including purely TDA-based ones by means of persistent homology of sublevel sets.
We point out that the choice of the isomorphism $\phi$ does not matter on a theoretical level. It may, however, have practical implications for how easy it is to perform subspace clustering. In experiments we typically choose $\phi$ to be the isomorphism that sends $h_i$ to the $i$’th standard basis vector. Choosing a different orthonormal basis for $\mathcal{H}_k$ then just amounts to an element of the orthogonal group $O(\beta_k)$ acting on $\mathbb{R}^{\beta_k}$.
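The harmonic embedding can be sketched on a toy figure-eight complex of our own devising (two hollow triangles sharing a vertex, so $\beta_1 = 2$), assuming NumPy. The images of the edges of the two triangles land on two orthogonal lines through the origin, which is exactly the subspace structure described above:

```python
import numpy as np

def boundary(simplices, faces):
    """Matrix of the boundary map, columns indexed by `simplices`."""
    idx = {f: i for i, f in enumerate(faces)}
    B = np.zeros((len(faces), len(simplices)))
    for j, s in enumerate(simplices):
        for i in range(len(s)):
            B[idx[s[:i] + s[i + 1:]], j] = (-1) ** i
    return B

# Figure-eight: two hollow triangles sharing vertex 0, so beta_1 = 2.
V = [(i,) for i in range(5)]
E = [(0, 1), (0, 2), (1, 2), (0, 3), (0, 4), (3, 4)]
B1 = boundary(E, V)

L1 = B1.T @ B1                      # no 2-simplices, so Delta_1 = d1* d1
vals, vecs = np.linalg.eigh(L1)
H = vecs[:, np.isclose(vals, 0.0)]  # orthonormal harmonic basis, shape (6, 2)

# Harmonic embedding f_1: row e of H is the point f_1(e) in R^{beta_1}.
points = H
dirs = points / np.linalg.norm(points, axis=1, keepdims=True)
# Edges 0-2 (first triangle) and 3-5 (second) lie on two orthogonal lines.
```

Because the two harmonic cycles have disjoint supports here, the separation into lines is exact; in the noisy experimental complexes below, points merely concentrate near such lines.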
Figure 3 summarizes our method in algorithmic form.
3 Experimental results
In this section we present experimental results for the harmonic clustering algorithm on synthetic data. Specifically, we focus on clustering the edges of various constructed simplicial complexes. The outcomes of our experiments suggest that the harmonic clustering algorithm provides clusters sensitive to the homology of the complex. Comparing our results with those of the traditional spectral clustering algorithm applied to the graph underlying the simplicial complex reinforces the idea that our algorithm reveals substantially different patterns in data compared to the classical method.
Below, we consider four simplicial complexes. Three of them are complexes built from Euclidean point clouds by standard methods from TDA, while one is a triangulation of a torus. We reiterate that our method works with abstract simplicial complexes without utilizing any embedding of these into an ambient space. Euclidean point clouds just happen to be a good and common source of simplicial complexes in TDA, and allow for visualization of the obtained clustering in a way that easily relates to the original data.
An important step in preprocessing many kinds of input data in TDA is constructing a simplicial complex satisfying certain theoretical properties. In particular, if the input data come from points sampled from a topological space $X$, one may wish for the homology of the complex to coincide with the homology of $X$.
Two constructions for which some such guarantees exist are the alpha complex [alpha] and the Vietoris–Rips (VR) complex [vietoris1927]. Both can be seen as taking a point cloud and producing a filtered simplicial complex, i.e. a sequence $(K_\varepsilon)_{\varepsilon \geq 0}$ with the property that $K_\varepsilon \subseteq K_{\varepsilon'}$ whenever $\varepsilon \leq \varepsilon'$. We wish to work with a single simplicial complex, not a filtration, so we use persistent homology (see e.g. [ghrist2008barcodes]) to find a filtration scale $\varepsilon$ for which $K_\varepsilon$ has the appropriate homology. Of course, since in practice one probably has little or no knowledge of $X$ itself, one cannot necessarily know the “correct” $\varepsilon$ to consider. However, it is often the case in TDA that long-lived homological features — that is to say, homology classes that remain non-trivial under the induced maps $H_*(K_\varepsilon) \to H_*(K_{\varepsilon'})$ for a long range of scales — express interesting properties of the underlying space. We therefore choose an $\varepsilon$ to consider by looking for a scale within the range of a manageable number of long-lived features and few short-lived ones in the degree under consideration.
In the following experiments, we simplify the setup in the algorithm in figure 3 by performing the subspace clustering subroutine in a somewhat ad hoc, semi-manual way. Specifically, all the images of the $f_k$’s lie in $\mathbb{R}^2$ or $\mathbb{R}^3$ in these experiments, so we manually pick out the one-dimensional subspaces in question. Then, the points in $f_k(X_k)$ are orthogonally projected onto each of the subspaces. A point is determined to lie on subspace number $i$ if its projection onto that subspace has norm above a fixed threshold, while its projections onto all other subspaces have norms below the threshold. The simplex is then said to be in cluster number $i$. If the above is not true for any of the subspaces, the simplex is considered unclustered.
In many of the experiments that follow, many points in $f_k(X_k)$ end up determined as “unclustered” because they project well onto none of the one-dimensional subspaces, or project too well onto several of them, as described in 2.3. This is not necessarily a problem, as the parts that are clustered contain a lot of useful information. Moreover, the problem can be reduced by choosing less ad hoc subspace clustering methods than we are currently employing.
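The ad hoc assignment rule can be sketched as follows. The threshold value and all names here are hypothetical choices of ours, not the ones used in the experiments:

```python
import numpy as np

def assign_to_lines(points, directions, threshold=0.9):
    """Assign each point to the one-dimensional subspace it projects well onto.

    `directions` are unit vectors spanning the candidate lines; `threshold`
    is a hypothetical cutoff on the fraction of a point's norm recovered by
    the projection. Returns -1 for points left unclustered.
    """
    labels = np.full(len(points), -1)
    for i, x in enumerate(points):
        n = np.linalg.norm(x)
        if n == 0:
            continue
        # |<x, d>| / |x| is close to 1 exactly when x lies near the line R*d.
        scores = np.abs(x @ np.asarray(directions).T) / n
        hits = np.flatnonzero(scores >= threshold)
        if len(hits) == 1:          # unambiguous membership only
            labels[i] = hits[0]
    return labels

dirs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
pts = np.array([[2.0, 0.1], [0.0, -3.0], [1.0, 1.0]])  # last point is ambiguous
labels = assign_to_lines(pts, dirs)  # -> [0, 1, -1]
```

The diagonal point projects equally well onto both lines and is therefore left unclustered, mirroring the behavior described above.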
To ease visualization, we focus on simplicial complexes that naturally live in $\mathbb{R}^2$ or $\mathbb{R}^3$ because they arise from point clouds.
3.1 Wedge of a sphere and two circles
In this experiment, we consider a noisy sampling of the wedge $S^2 \vee S^1 \vee S^1$, realized as a central unit sphere with unit circles wedged onto antipodal points. We sampled points uniformly at random from the central sphere and from each of the two circles, adding uniform radial noise. The resulting point cloud is shown in figure 4. The VR complex is certainly a suboptimal choice of simplicial complex to build on this kind of data, but we chose it to demonstrate that our method works well also for such an overly dense complex. The complex, denoted $K$ within this section, has Betti numbers $\beta_0 = 1$, $\beta_1 = 2$ and $\beta_2 = 1$, as for $S^2 \vee S^1 \vee S^1$ itself.
We focus on clustering the $1$-simplices of the complex. The image of $X_1$ under $f_1$ in $\mathbb{R}^2$ is shown in figure 5. The points are colored according to which of the two one-dimensional subspaces they are deemed to belong to. The determination was made by the simple criterion of projecting well enough onto one of the lines, but not the other. Points that project well onto both or neither are considered unclustered and shown in red. Figure 6 shows this clustering pulled back to the complex itself, excluding the unclustered edges. Observe how the method separates the $1$-simplices of the VR complex in a manner that is sensitive to the two non-bounding cycles that generate homology in degree $1$ (the two circles).
We also repeated the experiment with one of the circles moved to be attached to the other circle instead of the sphere. This space is obviously homotopy equivalent to the original wedge, but is geometrically very different. Figure 7 shows the result. Observe that the sphere is now captured by its adjacent circle, and that the unclustered edges tend to be those near where the two circles intersect.
3.2 Punctured plane
In this experiment we uniformly randomly sample points from a unit square in $\mathbb{R}^2$ with three disks cut out. The points are seen as faint dots in figure 8. We construct the alpha complex, denoted by $K$ in this section. It has Betti numbers $\beta_0 = 1$, $\beta_1 = 3$ and $\beta_2 = 0$. We again focus on the $1$-simplices for clustering. The codomain of $f_1$ is now $\mathbb{R}^3$, and the image is clustered according to three one-dimensional linear subspaces. The result is shown in figure 8, and we again observe how the obtained clustering occurs with respect to the punctures of the square.
3.3 Tori
We next perform clustering of the edges of two different tori.
3.3.1 From a point cloud
We uniformly randomly sampled points from the unit square and mapped these under the standard parametrization to produce a point sample of a torus in $\mathbb{R}^3$. The points were then given uniformly random noise in both radii. Again a VR complex was built. It has the homology of a torus, i.e. $\beta_0 = 1$, $\beta_1 = 2$, $\beta_2 = 1$. The VR complex was chosen in order for the clustering task to be more complicated than in a more orderly alpha complex.
3.3.2 A triangulation of the flat torus
As a smaller, more abstract and noise-free example, we consider a triangulation of a flat torus. The image of its $1$-simplices in $\mathbb{R}^2$ under $f_1$ is shown in figure 11. The arrangement into a perfect hexagon means that there are in fact three subspaces that can be chosen for clustering. The clusters are shown in figure 12. The arrangement into a hexagon, and therefore the result of three instead of the expected two clusters, disappears if one breaks some of the symmetry in the triangulation, for example by having some of the diagonal edges go in the opposite direction.
3.4 Clustering $2$-simplices
We have so far illustrated our method only on $1$-simplices for ease of visualization. To show that it also performs well in other dimensions, we sampled points from each of two disjoint spheres in $\mathbb{R}^3$, each with uniform radial random noise. We computed the alpha complex at a parameter chosen so as to create a rather messy region between the spheres. The Betti numbers are $\beta_0 = 2$, $\beta_1 = 0$ and $\beta_2 = 2$, as expected. Our clustering method performs as expected, producing clusters of $X_2$ that correspond to the homological features, as is shown in figure 13.
3.5 Comparison with graph spectral clustering
It is worth comparing the clusterings obtained from our method with the ones obtained by clustering the nodes of the graph underlying each simplicial complex using the graph spectral clustering algorithm. Figure 14 shows the results of graph spectral clustering on the nodes of the graph underlying the complex in figure 7. The first two graph Laplacian eigenvectors were used to map the nodes into $\mathbb{R}^2$, and then $k$-means was used to find two clusters. Similarly, figure 15 displays three clusters on the nodes of the graph underlying the complex representing the punctured plane with three holes in figure 8. The nodes are mapped to $\mathbb{R}^3$ using the first three graph Laplacian eigenvectors, after which $k$-means was used to find three clusters. In both cases we see that the clusters do not reflect any obviously meaningful property of the underlying data, unlike our method, which clusters in a way sensitive to homology.
4 Conclusions and future work
In this paper we have presented a novel clustering method for simplicial complexes, one that is sensitive to the homology of the complex. We see the method as a contribution to an emerging field of spectral TDA [cascade, barbarossa2018learning]. Our results suggest that the algorithm can be used to extract homological features for simplices of any degree. Experiments on various simplicial complexes demonstrate the ability of the method to accurately detect edges belonging to different non-bounding cycles. Similar results, not shown in this article for practical considerations of visualization, have been obtained by clustering simplices in higher dimensions. The sub-problems of understanding the structure of the linear subspaces in the image of $f_k$, and of how to accurately cluster based on them, require further investigation of both a mathematical and an algorithmic nature.
While our method seems robust to noise in the underlying data, a more thorough investigation into the output’s dependence on noise, and the output’s dependence on the scale at which a point-cloud-derived simplicial complex is built, is warranted.
Moreover, it has not eluded us that the method as outlined is not restricted to clustering just simplices. Other finitely generated chain complexes, such as discrete Morse complexes or cubical complexes, naturally lend themselves to the same analysis. One may also want to consider whether there are theoretical implications even in the smooth case.
Further development will also include enlarging the target of the projection in $f_k$ to include non-zero eigenvectors of $\Delta_k$, as in graph spectral clustering. Preliminary results indicate that this yields a further refinement of the homologically sensitive clusters into “fair” subdivisions.
Finally, further work needs to explore the effects of weighting. This includes both structural weighting, i.e. deriving weights from the local connectivity properties of the complex, as is often done with graph spectral clustering, and weighting originating from the underlying data itself, as is common in TDA.
A potential future application that we suspect fits our method well is collaboration networks [patania2017], where $n$-fold collaborations clearly cannot accurately be encoded as pairwise ones.
Alpha complexes and persistent homology were computed using GUDHI [GUDHI]. Eigenvector computations were done with SLEPc [SLEPc].
We would like to thank K. Hess for valuable discussions.
Both authors were supported by the Swiss National Science Foundation grant number 200021_172636.