## I Introduction

Molecules, protein interaction networks, XML documents, social media interactions, and image segments all have in common that they can be modeled by labeled graphs. The ability to represent both topological and semantic information makes graphs one of the most versatile data structures in computer science. In the age of Big Data, huge amounts of graph data are collected, and the demand to analyze this data grows with the amount collected.
We focus on the special case of clustering large sets of small labeled graphs. Our main motivation stems from the need to cluster large-scale molecular databases for drug discovery, such as PubChem (https://pubchem.ncbi.nlm.nih.gov/), ChEMBL (https://www.ebi.ac.uk/chembl/), ChemDB (http://cdb.ics.uci.edu/) or synthetically constructed de-novo databases [KUW+2010], which contain up to a billion molecules. However, the presented approach is not limited to this use case.

Clustering techniques aim to find homogeneous subsets in a set of objects. Classical approaches do not interpret the objects directly, but abstract them by utilizing some intermediate representation, such as feature vectors or pairwise distances. While the abstraction over pairwise distances is beneficial in terms of generality, it can be disadvantageous in the case of *intrinsically high-dimensional* datasets [CN2001, GKL2003]. In this case the *concentration effect* may cause the pairwise distances to lose their relative contrast; i.e., the distances converge towards a common value [BGRS1999]. The concentration effect is closely related to poor *clusterability* [AB2009].

Sets of graphs are usually clustered by transforming the graphs to feature vectors or by using graph theoretic similarity measures.

Typical feature extraction methods for graphs are: counting graphlets, that is, small subgraphs [WWK2008, SVP+2009], counting walks [VSKB2010], and using eigenvectors of adjacency matrices (spectral graph theory) [FPV2014]. The enumeration of all subgraphs is considered intractable even for graphs of moderate size, because the number of subgraphs can be exponential in the graph size. Many efficient clustering algorithms have been proposed for vector spaces. Therefore, the transformation to feature vectors might look beneficial at first glance. However, the above-mentioned feature extraction methods tend to produce a large number of distinct and often unrelated features. This results in datasets with a high intrinsic dimensionality [TK2006, KMS2014a]. Additionally, the extracted features only approximate the graph structure, which implies that feature vectors cannot be transformed back into a graph. Hence, the interpretability of clustering algorithms which perform vector modifications (e.g., calculating centroids) is limited.

Besides the utilization of various feature extraction techniques, it is possible to compare graphs directly using graph theoretic distances such as (maximum) common subgraph derived distances [BS1998, WSKR2001, FV2001] or the graph edit distance [Bun1997]. The computation of the previously mentioned graph theoretic distances is NP-hard and as shown in [KMS2014a] their application results in datasets with a high intrinsic dimensionality as well. High quality clustering methods for arbitrary (metric) and high dimensional datasets furthermore require a superlinear number of exact distance computations. These factors render graph theoretic distance measures in combination with generic clustering algorithms infeasible for large-scale datasets.

Subspace and projected clustering methods tackle high dimensional datasets by identifying subspaces in which well separated clusters exist. However, generic subspace algorithms come with a high runtime burden and are often limited to a Euclidean vector space.

Our structural projection clustering algorithm approaches the dimensionality problems by explicitly selecting cluster representatives in the form of common subgraphs. Consider a feature mapping to binary feature vectors that contain one feature for each subgraph that is found in the complete dataset. For each graph, its feature vector has binary entries encoding the presence of the associated subgraph. Selecting a common subgraph $S$ as cluster representative is another way of selecting a subspace in these feature vectors, namely the one consisting of all features associated with subgraphs of $S$.

Our main contributions in this paper are: We present a novel structural projection clustering algorithm for datasets of small labeled graphs which scales linearly with the dataset size. A set of representatives provides an intuitive description of each cluster, supports the clustering process, and helps to interpret the clustering results. To the best of our knowledge, this is the first approach actively selecting representative sets for each cluster based on a new ranking function. The candidates for the representatives are constructed using frequent subgraph sampling. In order to speed up the computation, we suggest a new error bounded sampling strategy for support counting in the context of frequent subgraph mining. Our experimental evaluation shows that our new approach outperforms competitors in both runtime and quality.

The paper is structured as follows: Section II provides an overview of related clustering algorithms. Basic definitions are given in Section III. Section IV presents the main algorithm and a runtime analysis. Our experimental evaluation, in which we compare our new algorithm to SCAP [SKK2014], PROCLUS [APW+1999] and Kernel K-Means [Gir2002], is presented in Section V.

## II Related Work

Several clustering algorithms for graph and molecule data have been proposed in recent years. In [TK2006], an EM algorithm is presented that uses a binomial mixture model over very high dimensional binary vectors indicating the presence of all frequent substructures. Two years later, [TK2008] presented a similar graph clustering algorithm using a Dirichlet Process mixture model and pruning the set of frequent substructures to achieve smaller feature vectors. A K-Median like graph clustering algorithm has been presented in [FVS+2009]; it maps each graph into Euclidean space utilizing the edit distance to some pivot elements. A median graph is then selected based on the distance to the Euclidean median. Furthermore, a parallel greedy overlapping clustering algorithm has been presented in [SBSK2011]. It adds a graph to a cluster whenever a common substructure of a user-defined minimum size exists. However, none of the previously mentioned algorithms is suitable for large datasets as a result of their high computational complexity. XProj [ATW+2007] uses a projection-based approach by selecting all (enumerated) frequent substructures of a fixed size as cluster representatives. The approach scales well with the dataset size but is limited to trees. A generalization to graphs would result in a huge performance degradation. Furthermore, there exist some hybrid approaches that pre-cluster the dataset by using a vector-based representation and refine the results using structural clustering algorithms. The most relevant with respect to large-scale datasets is the SCAP algorithm proposed in [SKK2014].

Of course, many subspace algorithms are also applicable after using a feature extraction method. Giving a comprehensive overview of subspace techniques is out of the scope of this article. However, there are two algorithms that are of special interest for this work: In [YM2003] it is shown that frequent pattern mining can be used for feature selection in vector space. This relates to XProj and our approach in that the selection of a graph representative, with the help of frequent substructure mining, is another way of selecting a subspace in the feature space of substructures. In the later evaluation we will compare our approach to the PROCLUS [APW+1999] algorithm. It is a fast projected clustering algorithm with noise detection that selects features by minimizing variance. The algorithm has been studied intensively and performed well in various subspace clustering comparisons [MGAS2009, PM2006].

## III Preliminaries

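An *undirected labeled graph* $G = (V, E, l)$ consists of a finite set of *vertices* $V(G) = V$, a finite set of *edges* $E(G) = E \subseteq \{\{u, v\} \subseteq V \mid u \neq v\}$ and a labeling function $l: V \cup E \rightarrow \mathcal{L}$, where $\mathcal{L}$ is a finite set of *labels*.
$|G|$ is used as a short term for $|V(G)| + |E(G)|$.
A *path* of length $n$ is a sequence of vertices $(v_0, \dots, v_n)$ such that $\{v_i, v_{i+1}\} \in E$ and $v_i \neq v_j$ for $i \neq j$.
Let $G$ and $H$ be two undirected labeled graphs. A *(label preserving) subgraph isomorphism* from $G$ to $H$ is an injection $\psi: V(G) \rightarrow V(H)$, where $\forall v \in V(G): l(v) = l(\psi(v))$ and $\forall \{u, v\} \in E(G): \{\psi(u), \psi(v)\} \in E(H) \wedge l(\{u, v\}) = l(\{\psi(u), \psi(v)\})$. Iff there exists a subgraph isomorphism from $G$ to $H$ we say $G$ is *supported* by $H$, $G$ is a *subgraph* of $H$, $H$ is a *supergraph* of $G$ or write $G \subseteq H$.
If there exists a subgraph isomorphism from $G$ to $H$ and from $H$ to $G$, the two graphs are isomorphic.
A *common subgraph* of $G$ and $H$ is a graph $S$ that is subgraph isomorphic to $G$ and $H$.
Furthermore, the *support* $\mathrm{supp}(G, \mathcal{G})$ of a graph $G$ over a set of graphs $\mathcal{G}$ is the fraction of graphs in $\mathcal{G}$ that support $G$. $G$ is said to be *frequent* iff its support is larger than or equal to a *minimum support threshold* $\mathrm{supp}_{\min}$. A frequent subgraph $G$ is *maximal* iff there exists no frequent supergraph of $G$. For a set of graphs $\mathcal{G}$, we write $\mathcal{F}(\mathcal{G})$ for the set of all frequent subgraphs and $\mathcal{M}(\mathcal{G})$ for the set of all maximal frequent subgraphs.
A *clustering* of a graph dataset $\mathcal{X}$ is a partition $\mathcal{C} = \{C_1, \dots, C_n\}$ of $\mathcal{X}$. Each *cluster* $C \in \mathcal{C}$ consists of a set of graphs and is linked to a *set of cluster representatives* $\mathcal{R}(C)$ which are themselves undirected labeled graphs. Please note that we consider each graph in our dataset to be a distinct object. As a result, it is possible to have isomorphic graphs in a single set.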
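The notions of support, frequent and maximal frequent subgraphs can be illustrated with a small sketch. It is not the implementation used in this paper: as a loudly stated simplification, a graph is modeled as a set of labeled edges and the subgraph test is plain set inclusion, whereas real subgraph isomorphism is far more expensive. The names `Graph`, `is_subgraph`, `support` and `maximal_frequent` are hypothetical, and maximality is only checked within the given candidate list.

```python
from typing import FrozenSet, Iterable, List, Tuple

# Toy stand-in (assumption): a graph is a frozenset of labeled edges.
Graph = FrozenSet[Tuple[str, str]]


def is_subgraph(pattern: Graph, graph: Graph) -> bool:
    # Assumption: set inclusion replaces a real subgraph isomorphism test.
    return pattern <= graph


def support(pattern: Graph, graphs: Iterable[Graph]) -> float:
    """Fraction of graphs that support (contain) the pattern."""
    graphs = list(graphs)
    if not graphs:
        return 0.0
    return sum(1 for g in graphs if is_subgraph(pattern, g)) / len(graphs)


def maximal_frequent(candidates: List[Graph], graphs: List[Graph],
                     supp_min: float) -> List[Graph]:
    """Frequent candidates that have no frequent proper supergraph
    among the candidates themselves."""
    frequent = [c for c in candidates if support(c, graphs) >= supp_min]
    return [c for c in frequent if not any(c < other for other in frequent)]
```

In this toy setting, a pattern contained in two of three graphs has support 2/3 and is frequent for any threshold up to that value.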

## IV The *StruClus* Algorithm

A high level description of the *StruClus* algorithm is given in Algorithm 1. Initially, it partitions the dataset using a lightweight pre-clustering algorithm. Afterwards, the clustering is refined using an optimization loop similar to the K-Means algorithm. In order to fit the number of clusters to the dataset structure (i.e., to achieve a good cluster separation and homogeneity, see Sections IV-D and IV-B), we use a cluster splitting and merging strategy in each iteration.

An important ingredient of our algorithm is the set of representatives $\mathcal{R}(C)$ for each cluster $C$.
Representatives serve as an intuitive description of the cluster and define the substructures over which intra-cluster similarity is measured.
The set $\mathcal{R}(C)$ is chosen such that for every graph $G \in C$ there exists at least one representative $R \in \mathcal{R}(C)$ which is subgraph isomorphic to $G$ (i.e., supported by $G$, written $R \subseteq G$).
With the exception of a single *noise* cluster the following invariant holds after each iteration:

$$\forall G \in C: \exists R \in \mathcal{R}(C): R \subseteq G \tag{1}$$

The representative set $\mathcal{R}(C)$ of a cluster $C$ is constructed using maximal frequent subgraphs of $C$ (see Section IV-A). Having a representative set instead of a single representative has the advantage that graphs composed of multiple common substructures can be represented. In order to be meaningful and human interpretable, the cardinality of $\mathcal{R}(C)$ is limited by a user-defined maximum. Figure 1 shows two example clusters and their representatives of a real world molecular dataset generated with *StruClus*.

### IV-A Stochastic Representative Mining

We construct the representative set $\mathcal{R}(C)$ of a cluster $C$ using maximal frequent subgraphs of $C$. Since the set of all maximal frequent subgraphs may have exponential size with respect to the maximal graph size in $C$, we restrict ourselves to a subset of candidate representatives using a randomized maximal frequent connected subgraph sampling technique from ORIGAMI [HCS+2007] combined with a new stochastic sampling strategy for support counting. In a second step, the final representative set is selected using a ranking function (see Section IV-B1).

ORIGAMI constructs a maximal frequent connected subgraph over a set of graphs $\mathcal{G}$ by extending a random frequent vertex with frequent paths of length one leading to a graph $S$.
In a first step, all frequent vertices and all frequent paths of length one are enumerated with a single scan of $\mathcal{G}$. Then, for each extension, a random vertex of $S$ is chosen and a random, label preserving path is connected to it in a forward (creating a new vertex) or backward (connecting two existing vertices) fashion. After each extension, the support $\mathrm{supp}(S, \mathcal{G})$ is evaluated by solving a subgraph isomorphism test for all graphs in $\mathcal{G}$. If $\mathrm{supp}(S, \mathcal{G}) \geq \mathrm{supp}_{\min}$, the extension is permanently added to $S$; otherwise it is removed. If no further extension is possible without violating the minimum support threshold, a maximal frequent subgraph has been found. This process is justified by the *monotonicity property* of subgraphs of graphs in $\mathcal{G}$:

$$S' \subseteq S \;\Rightarrow\; \mathrm{supp}(S', \mathcal{G}) \geq \mathrm{supp}(S, \mathcal{G}) \tag{2}$$

While ORIGAMI greatly improves performance in comparison with enumeration algorithms, the subgraph isomorphism tests for each extension remain a major performance bottleneck for *StruClus*. For this reason, we have added a stochastic sampling strategy for support counting.

Initially, we draw a random sample $\mathcal{H} \subseteq \mathcal{G}$. Then $\hat{p} = \mathrm{supp}(S, \mathcal{H})$ is an estimator for the parameter $p$, where $p = \mathrm{supp}(S, \mathcal{G})$ is the true probability of the underlying Bernoulli distribution. We are interested in whether the true value of $p$ is smaller than the minimum support threshold. Without loss of generality, let us focus on the case $\hat{p} < \mathrm{supp}_{\min}$ in the following. We can take advantage of a binomial test under the null hypothesis that $p \geq \mathrm{supp}_{\min}$ and thereby determine the probability of an error, if we assume $p < \mathrm{supp}_{\min}$. With a predefined significance level we can decide if the sample gives us enough confidence to justify our assumption. If we cannot discard our null hypothesis, we continue by doubling the sample size and repeat the process. In the extreme case we will therefore calculate the exact value of $\mathrm{supp}(S, \mathcal{G})$. The statistical test is repeated for each extension and each sample size doubling. As a consequence, a multiple hypothesis testing correction is necessary to bound the real error for $S$ to be a maximal frequent substructure of $\mathcal{G}$.
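The doubling scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the function names, the default values for `alpha` and `s_min`, and the use of an exact one-sided binomial tail evaluated at the threshold are choices made here for illustration, and sampling without replacement is treated as binomial for simplicity. The callback `supports(g)` stands for one subgraph isomorphism test.

```python
import math
import random
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def binom_tail_ge(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def binom_tail_le(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(0, k + 1))


def sampled_is_frequent(graphs: Sequence[T], supports: Callable[[T], bool],
                        supp_min: float, alpha: float = 0.05, s_min: int = 8,
                        rng: Optional[random.Random] = None) -> bool:
    """Decide 'support >= supp_min' on a growing random sample."""
    order = list(graphs)
    total = len(order)
    if total == 0:
        return False
    (rng or random.Random()).shuffle(order)
    n = min(max(1, s_min), total)
    hits = counted = 0
    while True:
        # Only the newly added part of the sample is tested.
        hits += sum(1 for g in order[counted:n] if supports(g))
        counted = n
        if n == total:
            return hits / total >= supp_min  # exact support, no test needed
        if hits / n >= supp_min:
            # Error probability if the pattern were in fact not frequent.
            if binom_tail_ge(hits, n, supp_min) <= alpha:
                return True
        elif binom_tail_le(hits, n, supp_min) <= alpha:
            return False
        n = min(2 * n, total)  # not confident enough: double the sample
```

For clear-cut patterns the decision is reached on the first small sample, so only a handful of subgraph isomorphism tests are executed; borderline patterns fall back to the exact support.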

###### Proposition IV-A.1.

Let $\mathcal{G}$ be a set of undirected labeled graphs, $s_{\min}$ the minimal sample size, $\mathcal{F}$ the set of all frequent paths of length one, $\mathrm{supp}_{\min}$ a minimum support threshold, and $V_q$ the $(1 - \mathrm{supp}_{\min})$-quantile of the sorted (increasing order) graph sizes of each graph in $\mathcal{G}$. Then the maximal number of binomial tests to construct a maximal frequent substructure over $\mathcal{G}$ is bounded by:

###### Proof.

The sample size is doubled at most $\lceil \log_2(|\mathcal{G}| / s_{\min}) \rceil$ times if the test never reaches the desired significance level. The size of some $S$ is bounded by the size of each supporting graph. In the worst case $S$ is supported by the $(\mathrm{supp}_{\min} \cdot |\mathcal{G}|)$-largest graphs of $\mathcal{G}$. The graph size of the smallest supporting graph is then equal to the $(1 - \mathrm{supp}_{\min})$-quantile of the sorted graph sizes in increasing order. The number of backward extensions is bounded by the number of vertex pairs times the number of applicable extensions $|\mathcal{F}|$. Additionally, we need to check forward extensions for each vertex to conclude that $S$ is maximal. ∎

Finally, with the value of Proposition IV-A.1 we are able to apply a Bonferroni correction to our significance level. We can afford a relatively high error, because the selection of the final representatives will filter out bad candidates. However, the significance level used has an influence on the runtime. On the one hand, a high error leads to many bad candidates and we need to increase the number of candidates to mine. On the other hand, a low error will lead to larger sample sizes to reject the null hypothesis. A maximal error of has turned out to be a good choice during our experimental evaluation.
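As a hedged illustration of the correction step, assume that $m$ denotes the bound on the number of tests from Proposition IV-A.1 and $\alpha$ the overall error accepted for one mined candidate (both symbols are introduced here only for illustration). Testing each hypothesis at level $\alpha / m$ bounds the probability of at least one wrong decision via the union bound:

$$P\left(\bigcup_{i=1}^{m} E_i\right) \leq \sum_{i=1}^{m} P(E_i) \leq m \cdot \frac{\alpha}{m} = \alpha$$

where $E_i$ is the event that the $i$-th test decides wrongly.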

### IV-B Update Representatives

#### IV-B1 Representative Selection

In its role as a cluster description, a good representative $R$ explains a large portion of its cluster $C$. Accordingly, it should (a) be supported by a large fraction of $C$ and (b) cover a large fraction of vertices and edges of each graph $G \in C$ supporting $R$. We define the coverage of a graph $G$ by a representative $R$ as $\mathrm{cov}(R, G) = |R| / |G|$, i.e., the relative size of $R$ measured in vertices and edges. The two criteria are closely related to the cluster homogeneity. A uniform cluster, that is, a cluster that contains only isomorphic graphs, can achieve optimal values for both criteria. Vice versa, the monotonicity property (2) implies non-optimal values for inhomogeneous clusters for at least one of the two criteria. As homogeneous clusters are desired, we use a product of the two criteria for our ranking function.

In order to discriminate clusters from each other, a cluster representative should be cluster specific as well. Thus, its support in the rest of the dataset should be low. For this reason, we use the following ranking function for a dataset $\mathcal{X}$, cluster $C$ and representative $R$:

(3) |

Finally, we select the highest ranked sampled subgraphs as the cluster representatives $\mathcal{R}(C)$.

#### IV-B2 Balancing Cluster Homogeneity

Besides the representative selection, the choice of the minimum support threshold for representative mining has an influence on the cluster homogeneity. The fraction of unsupported graphs for a cluster after updating the cluster representatives is bounded by ($1 - \mathrm{supp}_{\min}$). The bound clearly relates to criterion (a) from the representative ranking (see Section IV-B1). However, unsupported graphs are removed from the cluster and will be assigned to a different cluster at the end of the current iteration (see Section IV-C). Due to the monotonicity property (2), this process of sorting out graphs by choosing a minimum support threshold below $1$ will increase the size of the representatives and therefore our coverage value, i.e., criterion (b). A decrease of the minimum support threshold will lead to an increase in the size of the representatives. Subsequently, this process will also reduce the cluster cardinality. Therefore, increasing the homogeneity to an optimal value will result in a clustering with uniform or singleton clusters, which is clearly not the desired behavior.

To get around this, we aim for a similar homogeneity for all clusters and choose the minimum support threshold per cluster. However, choosing a fixed homogeneity level a priori is not an easy task, as an appropriate value depends on the dataset. Therefore we calculate an average coverage score over all clusters and use this as a baseline adjustment. For ease of computation, we choose a slightly simplified coverage approximation:

(4) | ||||

(5) |

Finally, we can define a linear mapping from the relative coverage relCov to a cluster specific minimum support threshold with the help of two predefined tuples and , where and . The parameter () denotes the lowest (highest) support value and () the relative coverage value mapped to the lowest (highest) minimum support:

(6) |

To result in a minimum support threshold of for all clusters (i.e. stopping the process of sorting out graphs if the clustering is balanced), we will set the values of the parameters very close or equal to and .
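The linear mapping from relative coverage to a cluster specific minimum support can be sketched as a clamped interpolation between two anchor points. This is only one reading of the description above: the parameter names (`cov_low`, `supp_low`, `cov_high`, `supp_high`), the clamping outside the anchor range, and the assumption that higher relative coverage maps to higher support are all choices made for this illustration, not values or symbols taken from this paper.

```python
def cluster_min_support(rel_cov: float,
                        cov_low: float, supp_low: float,
                        cov_high: float, supp_high: float) -> float:
    """Map a relative coverage value linearly to a minimum support threshold.

    (cov_low, supp_low) and (cov_high, supp_high) are hypothetical anchor
    tuples; values outside [cov_low, cov_high] are clamped to the anchors.
    """
    if cov_high <= cov_low:
        raise ValueError("cov_high must be larger than cov_low")
    t = (rel_cov - cov_low) / (cov_high - cov_low)
    t = min(1.0, max(0.0, t))  # clamp to the lowest / highest support value
    return supp_low + t * (supp_high - supp_low)
```

With such a mapping, clusters whose coverage lies well below the dataset average receive a lower threshold and therefore sort out more graphs, while clusters at or above the upper anchor keep all of their graphs.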

### IV-C Cluster Assignment

Each graph $G$ in the dataset is assigned to its *most similar cluster* in the assignment phase. As a measure of similarity, we sum up the squared sizes of the representatives of a cluster which are subgraph isomorphic to $G$. This choice of similarity is once more justified by the representative ranking criteria. We square the representative sizes to prefer a high coverage over a high number of representatives being subgraph isomorphic to the assigned graph.

As mentioned in Section IV-B2, it is possible that a graph will no longer be supported by any representative of its cluster after updating the representatives. In this situation we create a single noise cluster, where all graphs (of all clusters) that are not supported by any representative are collected. As the minimum support threshold is bounded from below by a fixed value (see Section IV-B2) and the number of representatives per cluster is limited, it is not guaranteed that we can find an appropriate set of representatives for this most likely largely inhomogeneous noise cluster. It is therefore excluded from the invariant (1).
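The assignment rule and the fallback to the noise cluster can be sketched as follows. As a loudly stated simplification, graphs are again modeled as sets of labeled edges, the subgraph test is set inclusion, and the size of a representative is its number of edges; the function names and the use of `None` for the noise cluster are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# Toy stand-in (assumption): a graph is a frozenset of labeled edges.
Graph = FrozenSet[Tuple[str, str]]


def similarity_to_cluster(graph: Graph, representatives: List[Graph]) -> int:
    """Sum of squared sizes of all representatives contained in the graph."""
    return sum(len(r) ** 2 for r in representatives if r <= graph)


def assign(graph: Graph, clusters: Dict[str, List[Graph]]) -> Optional[str]:
    """Return the name of the most similar cluster, or None (noise)
    if no representative of any cluster is contained in the graph."""
    best_name, best_score = None, 0
    for name, reps in clusters.items():
        score = similarity_to_cluster(graph, reps)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Squaring has the intended effect in this toy setting: one matching representative with two edges (score 4) outweighs three matching single-edge representatives (score 3).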

The problem of finding all subgraph isomorphic graphs in a graph database is also known as the *subgraph search problem* and was extensively studied in the past. We apply the fingerprint pre-filtering technique CT-Index [KKM2011], which has emerged from this research topic, to speed up the assignment phase. CT-Index enumerates trees and cycles up to a specified size for a given graph and hashes the presence of these subgraphs into a binary fingerprint of fixed length. If the fingerprint of a graph $G$ has a bit set that is unset in the fingerprint of a graph $H$, we can conclude that no subgraph isomorphism from $G$ to $H$ exists, because $G$ contains a subgraph that is not present in $H$. We calculate a fingerprint for each graph and representative and only perform a subgraph isomorphism test in our assignment phase if the fingerprint comparison cannot rule out the presence of a subgraph isomorphism.
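The fingerprint comparison itself reduces to a bit test. The sketch below is not CT-Index; it only illustrates the filter logic under the assumption that the enumerated trees and cycles are available as canonical strings. The function names, the fingerprint length of 64 bits and the use of CRC32 as hash function are choices made for this illustration.

```python
import zlib
from typing import Iterable


def fingerprint(features: Iterable[str], bits: int = 64) -> int:
    """Hash each feature string to one bit of a fixed-length fingerprint."""
    fp = 0
    for feature in features:
        fp |= 1 << (zlib.crc32(feature.encode("utf-8")) % bits)
    return fp


def may_be_subgraph(fp_pattern: int, fp_graph: int) -> bool:
    """False if the pattern has a bit set that the graph lacks, i.e. a
    subgraph isomorphism is ruled out; True means 'test needed'."""
    return (fp_pattern & ~fp_graph) == 0
```

A `True` result is not a proof of containment, because distinct features may hash to the same bit; the exact subgraph isomorphism test is still required in that case.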

### IV-D Cluster Splitting and Merging

Without the operation of cluster splitting,
the above mentioned process of creating noise clusters would create at most one extra cluster in each iteration of our main loop.
A large difference in the initial and final number of clusters therefore would lead to a slow convergence of the *StruClus* algorithm towards its final result. As mentioned before, it is also possible that no representative is found at all for the noise cluster, and therefore the process of sorting out graphs from the noise cluster to increase its homogeneity is stopped completely in the worst case.
A similar situation can occur for regular clusters. For example, if a cluster is composed of uniform sets of graphs, we will require a minimum support threshold less than or equal to to sort the smallest possible number of graphs out.
For this reason, a cluster splitting step is necessary (see Algorithm 1). In this step, all clusters that have a relative coverage value below an a priori specified threshold will be merged into a single set of graphs and the pre-clustering algorithm is applied on them. The resulting clusters are added back to the clustering.

In contrast to cluster splitting, which focuses on cluster homogeneity, cluster merging ensures a minimum separation between clusters. Separation can be measured on different levels. Many classical measures define separation as the minimum distance between two cluster elements. However, this type of definition is not suitable for projected clustering algorithms, because the comparison does not take the cluster specific subspace into account. As mentioned in the introduction, the cluster representatives in *StruClus* define the subspace of the cluster. Additionally, they serve as a description of the graphs inside the cluster itself.
A high coverage value leads to an accurate cluster description. Thus, we define *separation* between two clusters $C$ and $C'$ solely over the representative sets $\mathcal{R}(C)$ and $\mathcal{R}(C')$. This definition is also beneficial from a runtime perspective, as the separation calculation is independent of the cluster size.
Without cluster merging it is possible that clusters with very similar representatives exist. Although the pre-clustering ensures that the initial clusters have dissimilar representatives (see Section IV-E), it may happen that two clusters converge towards each other or that newly formed clusters are similar to an already existing one. Therefore, we merge two clusters whenever their representatives are too similar.

To compare two single representatives we calculate the size of their maximum common subgraph (MCS) and use its relative size as similarity:

$$\mathrm{sim}(R, R') = \frac{|\mathrm{MCS}(R, R')|}{\max\{|R|, |R'|\}} \tag{7}$$

The maximum of the representative sizes is chosen as denominator to support different clusters with subgraph isomorphic representatives which differ largely in size. Finally, we will merge two clusters $C$ and $C'$ if the following condition holds:

(8) | |||

where is a minimum number of representative pairs which have a similarity greater than or equal to . Note that the calculated MCS between two representatives is supported by all the graphs , that support either or , because the subgraph isomorphism relation is transitive. The coverage for these graphs in the merged cluster is furthermore bounded by if we reuse the MCS as representative. For this reason, we recommend to set close to the number of representatives per cluster to support a large fraction of graphs in the merged cluster. The parameter is furthermore an intuitive knob to adjust the granularity of the clustering.

Finally, the representatives for the merged clusters are updated. We do not use the calculated MCSs as representatives, because better representatives may exist and the decision problem “Does an MCS larger than some threshold exist?” is computationally less demanding than calculating the MCS itself.
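The merge decision described in this section can be sketched as a count over representative pairs. This is one reading of the textual description, not a transcription of condition (8): the parameter names `min_sim` and `min_pairs` are hypothetical, graphs are modeled as sets of labeled edges, and the size of the maximum common subgraph is approximated by the size of the edge set intersection, which is an assumption made purely for illustration.

```python
from typing import FrozenSet, List, Tuple

# Toy stand-in (assumption): a graph is a frozenset of labeled edges.
Graph = FrozenSet[Tuple[str, str]]


def similarity(r1: Graph, r2: Graph) -> float:
    """Relative size of the common part, normalized by the larger graph.
    The intersection stands in for a real maximum common subgraph."""
    largest = max(len(r1), len(r2))
    return len(r1 & r2) / largest if largest else 0.0


def should_merge(reps_a: List[Graph], reps_b: List[Graph],
                 min_sim: float, min_pairs: int) -> bool:
    """Merge if at least min_pairs representative pairs reach min_sim."""
    similar = sum(1 for a in reps_a for b in reps_b
                  if similarity(a, b) >= min_sim)
    return similar >= min_pairs
```

Because only representatives are compared, the cost of this check depends on the number of representatives per cluster and not on the number of graphs in the clusters.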

### IV-E Pre-Clustering

The pre-clustering serves as an initial partitioning of the dataset. A random partitioning of all graphs would be problematic as representatives may not be found for all partitions and the found representatives are most likely not cluster specific. This will result in a high number of clusters to be merged to a few inhomogeneous clusters and in a slow convergence of the *StruClus* algorithm.

To pre-cluster the dataset, we compute maximal frequent subgraphs, as described in Section IV-A, with a fixed minimum support. These frequent subgraphs serve as representative candidates for the initial clusters. To avoid very similar representatives we first greedily construct maximal sets of dissimilar graphs. As a measure of similarity we re-use the similarity (7) and the threshold from cluster merging. In other words, we pick the mined subgraphs in a random order and add a subgraph to our dissimilar set if its similarity to every graph already in the set is below this threshold. This process is repeated several times and the largest set is used to create one cluster for each of its graphs, with this graph as the single representative. Afterwards we run a regular assignment phase as described in Section IV-C. As a result of re-using the merging threshold we can expect well separated clusters, so that no cluster merging is necessary in the first iteration of the main loop (excluding the noise cluster).
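The greedy construction of a dissimilar candidate set can be sketched as follows. The function is generic in the similarity measure; the names `greedy_dissimilar`, `threshold` and `rounds` are hypothetical and introduced only for this illustration.

```python
import random
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def greedy_dissimilar(candidates: Sequence[T],
                      sim: Callable[[T, T], float],
                      threshold: float,
                      rounds: int = 10,
                      rng: Optional[random.Random] = None) -> List[T]:
    """Repeat a randomized greedy pass and keep the largest set in which
    all pairs have a similarity below the threshold."""
    rng = rng or random.Random()
    best: List[T] = []
    for _ in range(rounds):
        order = list(candidates)
        rng.shuffle(order)
        chosen: List[T] = []
        for cand in order:
            # Accept a candidate only if it is dissimilar to all chosen ones.
            if all(sim(cand, other) < threshold for other in chosen):
                chosen.append(cand)
        if len(chosen) > len(best):
            best = chosen
    return best
```

Each returned element would then seed one initial cluster with itself as the only representative, followed by a regular assignment phase.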

### IV-F Convergence

*StruClus* optimizes cluster homogeneity while maintaining a minimal cluster separation. As described in Section IV-B2, homogeneity is defined by two criteria with respect to the cluster representatives. With the exception of the noise cluster, the representative support criterion (a) can be dismissed, because graphs that are not represented will be sorted out of the cluster. It was further discussed that an optimal homogeneity can be achieved for singleton clusters. However, the cluster separation constraint introduced later will limit the granularity of the clustering, because two clusters with similar representatives will be merged.
Nevertheless, there might exist several clusterings with different granularity that respect the separation constraint. For this reason the objective function balances the coverage criterion (b) of the homogeneity and the granularity of the clustering. A parameter adjusts this granularity:

(9) |

As a consequence of the cluster splitting and merging, the objective function will fluctuate and contain local optima. We will therefore smooth the objective function. Let $\mathcal{C}_i$ be the clustering after the $i$-th iteration. Algorithm 1 will terminate after the first iteration for which the following condition holds:

(10) |

where is the averaging width and is the minimum relative increase of the objective function in iterations.

### IV-G Runtime Analysis

The subgraph isomorphism problem and the maximum common subgraph problem are both NP-complete even in their decision variants [GJ1979]. *StruClus* solves these problems in order to calculate the support and to decide if clusters need to be merged.
Furthermore, the size of a representative can be linear in the sizes of the supporting graphs.
Thus, *StruClus* scales exponentially wrt the size of the graphs in the dataset . Nevertheless, these problems can be solved sufficiently fast for small graphs with a few hundred vertices and edges, e.g., molecular structures. We will therefore consider the graph size a constant in the following analysis and focus on the scalability wrt the dataset size. Let be the maximal number of vertices for a graph in and the maximal number of clusters during the clustering process.

##### Representative Mining

As described in Proposition IV-A.1 the number of extensions to mine a single maximal frequent subgraph from a set of graphs is bounded by . The variable is obviously a constant. Also is only bounded by and the lowest minimum support threshold , as a fraction of of all edges in must be isomorphic to be frequent. Thus, there exist at most frequent paths of length one. Note, that the number of distinct labels does not have an influence on this theoretical bound. For each mined maximal frequent subgraph we need to calculate the support over and this involves many subgraph isomorphism tests. Therefore, the runtime to mine a single maximal frequent subgraph is bounded by .

##### Representative Update

Updating the representatives involves two steps: representative mining and representative selection. Representative mining includes the calculation of the cluster specific minimum support with the help of relCov. The relCov values for each cluster can be calculated in if we maintain a sum of the graph sizes of each cluster and update this value during cluster assignment. Since the cluster sizes sum up to the runtime of the actual frequent subgraph mining is in . To rank a representative candidate , the set , the value , and the value can be calculated by a single scan over , i.e. calculating whether is subgraph isomorphic to each graph in . As the number of candidates per cluster is a constant, the runtime of the ranking is in , which is also the overall runtime of the representative update.

##### Other Parts

The runtimes of the cluster assignment and the pre-clustering are in . Cluster splitting is computable in , cluster merging in , and converge in time.

##### Overall Runtime

A single iteration of Algorithm 1 has a runtime of , and hence is linear. This is justified by the observation that . Let be the number of iterations after Algorithm 1 terminates. Then, the overall runtime is in .

## V Evaluation

In the following evaluation we compare *StruClus* to SCAP [SKK2014], PROCLUS [APW+1999] and Kernel K-Means [Gir2002] wrt their runtime and the clustering quality. Furthermore, we evaluate the influence of our sampling strategy for support counting and evaluate the parallel scaling of our implementation.

##### Hardware & Software

All tests were performed on a dual socket NUMA system (Intel Xeon E5-2640 v3) with 128 GiB of RAM. The applications were pinned to a single NUMA domain (i.e. 8 cores + HT / 64 GiB RAM) to eliminate random memory effects. Turbo Boost was deactivated to minimize external runtime influences.
Ubuntu Linux 14.04.1 was used as operating system. The Java implementations of *StruClus*, PROCLUS and Kernel K-Means were running in an Oracle Java Hotspot VM 1.8.0_66. SCAP was compiled with GCC 4.9.3 and -O3 optimization level. *StruClus* and SCAP are shared memory parallelized.

##### Test Setup & Evaluation Measures

The tests were repeated 30 times if the runtime was below 2 hours and 15 times otherwise. Quality is measured by ground truth comparisons. We use the Normalized Variation of Information (NVI) [Mei2007] and Fowlkes-Mallows (FW) [FM1983] measures for *StruClus*, PROCLUS and Kernel K-Means. While NVI and FW are established quality measures, they are not suited for overlapping clusterings as produced by SCAP. Thus, Purity [dataclustering-chap-validation] was used for comparison with SCAP.
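For reference, two of the ground truth measures can be sketched for the simple case of non-overlapping clusterings given as label lists. This is a minimal sketch using the common textbook definitions (pair counting for Fowlkes-Mallows, majority class per cluster for Purity); it is not the evaluation code used here, and it does not cover the overlapping clusterings produced by SCAP.

```python
from collections import Counter
from math import sqrt
from typing import Hashable, Sequence


def _pairs(count: int) -> int:
    """Number of unordered pairs among `count` objects."""
    return count * (count - 1) // 2


def fowlkes_mallows(truth: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """TP / sqrt((TP + FP) * (TP + FN)) over all object pairs."""
    tp = sum(_pairs(c) for c in Counter(zip(truth, pred)).values())
    same_pred = sum(_pairs(c) for c in Counter(pred).values())    # TP + FP
    same_truth = sum(_pairs(c) for c in Counter(truth).values())  # TP + FN
    if same_pred == 0 or same_truth == 0:
        return 0.0
    return tp / sqrt(same_pred * same_truth)


def purity(truth: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Fraction of objects that belong to the majority class of their cluster."""
    by_cluster = {}
    for t, p in zip(truth, pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(truth)
```

Both measures reach 1.0 for a clustering that reproduces the ground truth exactly; Purity alone also rewards very fine-grained clusterings, which is why it is usually reported together with the number of clusters.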

##### Datasets

We evaluate the algorithms on synthetic datasets of different sizes and three real world datasets. The synthetic datasets have vertices and edges on average with vertex and edge labels (weights drawn from an exponential distribution). They contain clusters and random noise graphs. For each cluster, seed patterns were randomly generated with a Poisson distributed number of vertices (mean ) and an edge probability of . Additionally, we created a common seed pool with (noise) seeds. The cluster specific graphs were then generated by connecting the cluster specific seeds with to common noise seeds. The first real world dataset (AnchorQuery) is a molecular de-novo database. Each molecule is the result of a chemical reaction of multiple purchasable building blocks. We have used reaction types, i.e., class labels, from AnchorQuery (http://anchorquery.csb.pitt.edu/reactions/). The second real world dataset (Heterocyclic) is similar to the first one, but contains heterocyclic compounds and distinct reaction types. The third real world dataset (ChemDB) contains million molecules from ChemDB [chemdb2007]. ChemDB is a collection of purchasable molecules from chemical vendors. The dataset has no ground truth and will mainly serve as proof that we are able to cluster large real world datasets. Table I shows additional statistics about these real world datasets.
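The construction of the synthetic graphs described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the generator used for the evaluation: all function names are hypothetical, the label alphabets and numeric parameters in the usage lines are placeholders, empty seeds are excluded by assumption, seeds are not forced to be connected, and joining two seeds by a single random bridging edge is only one plausible reading of "connecting" seeds.

```python
import random
from math import exp


def poisson(mean, rng):
    """Sample a Poisson-distributed integer (Knuth's multiplication method)."""
    limit, k, p = exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def label_sampler(labels, rng):
    """Return a function that draws labels with exponentially distributed weights."""
    weights = [rng.expovariate(1.0) for _ in labels]
    return lambda: rng.choices(labels, weights=weights)[0]


def random_seed(mean_vertices, edge_prob, draw_vlabel, draw_elabel, rng):
    """Generate one seed pattern as (vertex label list, {(u, v): edge label})."""
    n = max(1, poisson(mean_vertices, rng))  # assumption: no empty seeds
    vertices = [draw_vlabel() for _ in range(n)]
    edges = {(u, v): draw_elabel()
             for u in range(n) for v in range(u + 1, n)
             if rng.random() < edge_prob}
    return vertices, edges


def connect(a, b, draw_elabel, rng):
    """Join two patterns: disjoint union plus one random bridging edge (assumption)."""
    (va, ea), (vb, eb) = a, b
    offset = len(va)
    edges = dict(ea)
    edges.update({(u + offset, v + offset): lab for (u, v), lab in eb.items()})
    bridge = (rng.randrange(offset), offset + rng.randrange(len(vb)))
    edges[bridge] = draw_elabel()
    return va + vb, edges


rng = random.Random(42)
draw_v = label_sampler(["C", "N", "O", "S"], rng)   # placeholder vertex labels
draw_e = label_sampler(["single", "double"], rng)   # placeholder edge labels
cluster_seed = random_seed(8, 0.3, draw_v, draw_e, rng)  # placeholder parameters
noise_seed = random_seed(8, 0.3, draw_v, draw_e, rng)
graph = connect(cluster_seed, noise_seed, draw_e, rng)
```

Drawing the label weights once per alphabet makes a few labels frequent and most labels rare, which loosely mimics the skewed label distributions of molecular data.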

To cluster the datasets with PROCLUS the graph data was transformed to vectors by counting distinct subgraphs of size , resulting in to features on the synthetic datasets. The application to the AnchorQuery and Heterocyclic datasets results in much lower feature counts of and . The same vectors were used as feature space representation for Kernel K-Means. Additionally, we have evaluated various other graph kernels with explicit feature mapping for PROCLUS and Kernel K-Means, such as the Weisfeiler-Lehman shortest path, Weisfeiler-Lehman subtree and fixed length random walk kernels [SSvL+2011]. However, we observed feature vectors with even higher dimensionality and clustering results with lower quality. For this reason, the results of these graph kernels are omitted here.

##### Algorithm Configuration

The parameters of *StruClus* were set as follows: maximal error for support counting: ; number of representative candidates: ; minimum support threshold calculated with: ; splitting threshold: ; minimum separation ; convergence determined with: .
SCAP has two parameters: (a) the minimum size of a common substructure that must be present in a single cluster and (b) a parameter that influences the granularity of the fingerprint-based pre-partitioning. (a) was set to of the mean seed size for the synthetic dataset, otherwise to . (b) was set to the highest number that results in a reasonable clustering quality, as the clustering process is faster with fine grained pre-partitioning, but cannot find any clusters that span a pre-partition boundary. For the synthetic dataset this was .
PROCLUS has two parameters as well: (a) the number of clusters and (b) the average dimensionality of the cluster subspaces. (a) was set to the number of clusters in the ground truth. The ChemDB dataset, which has no ground truth, was too large for PROCLUS. (b) was set to , for which we got the best quality results. The number of clusters is the only parameter of Kernel K-Means and was set in exactly the same manner as for PROCLUS. Please note that the a priori selection of this optimal value gives PROCLUS and Kernel K-Means an advantage over their competitors during the evaluation.

##### Results

As shown in Table II, *StruClus* outperforms all competitors in terms of quality on the synthetic datasets. The differences of all corresponding mean quality values are always larger than small multiples of the standard deviations. For small synthetic datasets, SCAP was the fastest algorithm. This changes for larger dataset sizes. *StruClus* outperforms SCAP by a factor of at size . Furthermore, we were unable to cluster the largest dataset with SCAP in less than days. PROCLUS and Kernel K-Means were the slowest algorithms with a huge gap to their next competitor. The sublinear growth of *StruClus*’s runtime for the smaller dataset sizes is caused by our sampling strategy for support counting. Running *StruClus* with exact support counting yields a linear growth in runtime. It is important to mention that the clustering quality is not significantly influenced by the support counting strategy.

Our implementation of *StruClus* scales well with the number of cores as shown in Fig. 2. With cores we get a speedup of . Including the Hyper-Threading cores, we get a speedup of .
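For readers unfamiliar with the scaling terminology, the following minimal sketch shows how speedup and parallel efficiency are commonly derived from wall-clock runtimes. The function names and the runtimes in the usage lines are hypothetical placeholders, not measurements from this evaluation.

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-core runtime to parallel runtime."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, workers):
    """Speedup per worker; 1.0 corresponds to ideal linear scaling."""
    return speedup(t_serial, t_parallel) / workers


# Placeholder runtimes in seconds, for illustration only.
s = speedup(80.0, 10.0)        # runtime on 1 core vs. runtime on 8 cores
e = efficiency(80.0, 10.0, 8)
```

Hyper-Threading cores share execution units with their physical sibling, so the additional speedup they contribute is typically well below that of a further physical core.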

The evaluation results wrt the real world datasets are summarized in Table III. We were unable to cluster the AnchorQuery and the ChemDB datasets with PROCLUS, Kernel K-Means and SCAP (even for high values of parameter (b)) in less than days. For the chemical reaction datasets, *StruClus* also needs more time compared to a synthetic dataset of similar size. This runtime increase can be partially explained by the larger graph sizes. However, other hidden parameters of the datasets, such as the size of maximal common substructures, have an influence on the runtime as well. *StruClus* had some runtime outliers on the Heterocyclic dataset. They were caused by small temporary clusters with large ( vertices) representatives. The high runtime of the SCAP algorithm on the AnchorQuery dataset is somewhat surprising, as the common substructures processed by SCAP are limited in their size (maximum vertices). We consider a larger frequent pattern search space to be the reason for this runtime increase. Kernel K-Means was surprisingly slow on the Heterocyclic dataset and took more than hours for a single run. We have therefore created a random subset with a size of graphs for it.

*StruClus* always outperforms the competitors wrt the quality scores. For the AnchorQuery dataset *StruClus* created clusters on average. The high Purity score shows that *StruClus* split some of the real clusters but maintains a good inter-cluster separation. ChemDB was clustered by *StruClus* in hours with clusters. As a consequence of the high runtime of the ChemDB measurement and the lack of competitors, we repeated the test only 3 times. The aCov value for the final clustering was on average. This highlights the ability of *StruClus* to cluster large-scale real-world datasets.

## VI Conclusion

In this paper we have presented a new structural clustering algorithm for large scale datasets of small labeled graphs. With the help of explicitly selected cluster representatives, we were able to achieve a linear runtime in the worst case wrt the dataset size. A novel support counting sampling strategy with multiple hypothesis testing correction was able to accelerate the algorithm significantly without lowering the clustering quality. We have furthermore shown that cluster homogeneity can be balanced with a dynamic minimum support strategy for representative mining. A cluster merging and splitting step was introduced to achieve a well separated clustering even in the high dimensional pattern space. Our experimental evaluation has shown that our new approach outperforms the competitors wrt clustering quality, while attaining significantly lower runtimes for large scale datasets.
Although we have shown that *StruClus* greatly improves the clustering performance compared to its competitors, de-novo datasets with several billion molecules are still outside the scope of this work. For this reason, we consider the development of a distributed variant of the algorithm to be future work. Another way to further improve the quality of the algorithm is to integrate a discriminative frequent subgraph miner for representative mining. Integrating the discriminative property into the mining process has the advantage that higher quality representative candidates are mined. This will result in a lower number of necessary candidate patterns, which has a positive effect on the runtime. Furthermore, it allows mining highly discriminant, non-maximal subgraphs. However, it is non-trivial to extend the support counting sampling strategy to such miners. Additionally, discriminative scores are usually non-monotonic on the subgraph lattice [YCHY2008, TCG+2010], which imposes another runtime burden.

## Acknowledgment

This work was supported by the German Research Foundation (DFG), priority programme *Algorithms for Big Data (SPP 1736)*. We would like to thank Nils Kriege for providing a fast subgraph isomorphism and Madeleine Seeland, Andreas Karwath, and Stefan Kramer for providing their SCAP implementation.