
Dissecting graph measure performance for node clustering in LFR parameter space

Graph measures that express closeness or distance between nodes can be employed for graph node clustering using metric clustering algorithms. There are numerous measures applicable to this task, and which one performs better is an open question. We study the performance of 25 graph measures on generated graphs with different parameters. While measure comparisons are usually limited to a general ranking of measures on a particular dataset, we aim to explore the performance of various measures depending on graph features. Using an LFR graph generator, we create a dataset of 11780 graphs covering the whole LFR parameter space. For each graph, we assess the quality of clustering with the k-means algorithm for each considered measure. Based on this, we determine the best measure for each area of the parameter space. We find that the parameter space consists of distinct zones where one particular measure is the best. We analyze the geometry of the resulting zones and describe it with simple criteria. Given particular graph parameters, this allows us to recommend a particular measure to use for clustering.





1 Introduction

Graph node clustering is one of the central tasks in graph structure analysis. It provides a partition of nodes into disjoint clusters (communities), which are groups of nodes that are characterized by strong mutual connections or similar external connections. It can be of practical use for graphs representing real-life systems, such as social networks or industrial processes. Clustering allows one to infer some information about the system: the nodes of the same cluster are highly similar, while the nodes of different clusters are dissimilar. The technique can be applied without any labeled data to extract important insights about a network.

There are different approaches to clustering, including ones based on modularity optimization [newman2004finding, blondel2008fast], the label propagation algorithm [raghavan2007near, barber2009detecting], the Markov cluster process [van2000graph, enright2002efficient], and spectral clustering [von2007tutorial]. In this work, we use an approach based on choosing a closeness measure on a graph, which allows one to use any metric clustering algorithm (e.g., [yen2009graph]).

The choice of the measure significantly affects the quality of clustering. Classical measures are the Shortest Path [buckley1990distance] and the Commute Time [gobel1974random] distances. The former is the minimum number of edges in a path between a given pair of nodes. The latter is the expected number of steps from one node to the other and back in a random walk on the graph. There are a number of other measures, including recent ones (e.g., [estrada2017accounting, jacobsen2018generalized]); many of them are parametric. Although graph measures are compatible with any metric algorithm, in this paper we restrict ourselves to the kernel k-means algorithm (e.g., [fouss2016algorithms]).
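The Shortest Path distance described above can be computed on an unweighted graph with a breadth-first search. The following is a minimal sketch on a hypothetical toy graph; the function name and the adjacency-list representation are illustrative assumptions and are not taken from our framework.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Return hop counts from `source` to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:  # first visit gives the minimum number of edges
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Hypothetical toy graph: path 0-1-2-3 with an extra edge 1-3.
toy_adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
sp_from_0 = shortest_path_lengths(toy_adj, 0)
```

Running the search from every node yields the full matrix of SP distances.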

We base our research on a generated set of graphs. There are various algorithms to generate graphs with community structures. The well-known ones are the Stochastic Block Model [holland1983stochastic] and the Lancichinetti–Fortunato–Radicchi benchmark [lancichinetti2008benchmark] (hereafter, LFR). The first one is an extension of the Erdős–Rényi model with different intra- and intercluster probabilities of edge creation. The second one involves power law distributions of node degrees and community sizes. There are other generation models, e.g., Naive Scale-free Clustering [pasta2017topology]. We choose the LFR model: although it misses some key properties of real graphs, like diameter or the clustering coefficient, this model has been proven to be effective in meta-learning [prokhorenkova2019using].

There are many measure benchmarking studies considering node classification and clustering for both generated graphs and real-world datasets, including [fouss2012experimental, sommer2016comparison, sommer2017modularity, avrachenkov2017kernels, ivashkin2016logarithmic, guex2018randomized, guex2019covariance, aynulin2019efficiency, aynulin2019impact, courtain2020randomized, leleux2020sparse]. Despite a large number of experimental results, an exact theory is still a matter of the future. One of the most interesting theoretical results on graph measures is [luxburg2010getting], where some unattractive features of the Commute Time distance on large graphs were explained theoretically, and a reasonable amendment was proposed to fix the problem. Beyond the complexity of such proofs, there is still very little empirical understanding of what effects need to be proven. Our empirical work has two main differences from the previous ones. First, we consider a large number of graph measures, which for the first time gives a fairly complete picture. Second, unlike the previous studies aimed at revealing the global leaderboard, we are looking for the leading measures for each set of the LFR parameters.

We aim to explore the performance of the 25 most popular measures in the graph clustering problem on a set of generated graphs with different parameters. We assess the quality of clustering with every considered measure and determine the best measure for every region of the graph parameter space.

Our contributions are as follows:

  • We generate a dataset of 11780 graphs covering the entire parameter space of the LFR generator;

  • We consider a broad set of measures and rank them by clustering performance on this dataset;

  • We determine the graph features that are responsible for measure leadership;

  • We find the regions of each measure’s leadership in the graph parameter space.

Our framework for clustering with graph measures as well as the collected dataset are available at

2 Definitions

2.1 Kernel k-means

The original k-means algorithm [lloyd1982least, macqueen1967some] clusters objects in Euclidean space. It requires coordinates of the objects to determine the distances between them and centroids. The algorithm can be generalized to use the degree of closeness between the objects without defining a particular space. This technique is called the kernel trick; usually it is used to bring non-linearity to linear algorithms. The algorithm that uses the kernel trick is called kernel k-means (see, e.g., [fouss2016algorithms]). In the graph node clustering scenario, we can use graph measures as kernels for kernel k-means.

Initially, the number of clusters is known and we need to set the initial state of centroids. The results of clustering with k-means are very sensitive to the initial state. Usually, the algorithm runs several times with different initial states (trials) and chooses the “best” trial. There are different approaches to the initialization; we consider three of them: random data points, k-means++ [arthur2006k], and random partition. We combine all these strategies to reduce the impact of the initialization strategy on the result.
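The procedure above can be sketched as follows. This is a simplified illustration and not our implementation: it assumes a batch (Lloyd-style) update, takes the initial labelling of each trial as an explicit argument instead of drawing it with one of the three strategies, and chooses the best trial by inertia. The helper names and the toy kernel matrix are hypothetical.

```python
def centroid_distance(K, i, members):
    """Squared feature-space distance from node i to the centroid of `members`."""
    size = len(members)
    cross = sum(K[i][j] for j in members) / size
    within = sum(K[j][l] for j in members for l in members) / size ** 2
    return K[i][i] - 2.0 * cross + within


def kernel_kmeans(K, init_labels, k, max_iter=100):
    """Run one trial of batch kernel k-means from a given initial labelling."""
    labels = list(init_labels)
    for _ in range(max_iter):
        clusters = [[i for i, lab in enumerate(labels) if lab == c] for c in range(k)]
        new_labels = [
            min(
                (c for c in range(k) if clusters[c]),  # skip empty clusters
                key=lambda c: centroid_distance(K, i, clusters[c]),
            )
            for i in range(len(K))
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
    return labels


def inertia(K, labels, k):
    """Sum of squared distances of nodes to their own centroids (lower is better)."""
    clusters = [[i for i, lab in enumerate(labels) if lab == c] for c in range(k)]
    return sum(centroid_distance(K, i, clusters[labels[i]]) for i in range(len(K)))


# Hypothetical kernel matrix with two blocks of three nodes each.
toy_K = [
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 1, 1, 2],
]
trials = [[0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]]  # two imperfect initial states
results = [kernel_kmeans(toy_K, init, k=2) for init in trials]
best = min(results, key=lambda labels: inertia(toy_K, labels, 2))
```

Only the kernel matrix is needed: node coordinates never appear, which is what lets a graph measure take the place of a Euclidean embedding.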

2.2 Closeness measures

For a given graph $G$, $V$ is the set of its vertices and $A$ is its adjacency matrix. A measure on $G$ is a function $\kappa: V \times V \to \mathbb{R}$, which takes two nodes and returns their closeness (larger means closer) or distance (larger means farther).

A kernel on a graph is a graph nodes’ closeness measure that has an inner product representation. Any symmetric positive semidefinite matrix is an inner product matrix (also called Gram matrix). A kernel matrix is a square matrix that contains similarities for all pairs of nodes in a graph.

To use kernel k-means, we need kernels. Although not all closeness measures we consider are Gram matrices, we treat them as kernels. The applicability of this approach was confirmed in [fouss2016algorithms]. For the list of measures below, we use the word “kernel” only for the measures that satisfy the precise definition of kernel.

The classical measures, Shortest Path distance [buckley1990distance] (SP) and Commute Time distance [gobel1974random] (CT), are defined as distances, so we need to transform them into similarities to use them as kernels. We apply the following distance to closeness transformation [chebotarev2005duality, borg2005modern]:

$$K = -\frac{1}{2} H D H, \qquad H = I - \frac{1}{n} E, \tag{1}$$

where $D$ is a distance matrix, $E$ is the matrix of ones, $I$ is the identity matrix, and $n$ is the number of nodes.
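A minimal sketch of this transformation, assuming that (1) is the classical double-centering form $K = -\frac{1}{2} H D H$ with $H = I - E/n$. Entry-wise, $(HDH)_{ij}$ equals $D_{ij}$ minus the mean of row $i$, minus the mean of column $j$, plus the overall mean, so no matrix products are required. The function name and the toy distance matrix are hypothetical.

```python
def distance_to_kernel(D):
    """Double centering: K = -1/2 * H D H with H = I - E/n (assumed form of (1))."""
    n = len(D)
    row_mean = [sum(row) / n for row in D]
    col_mean = [sum(D[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [
        [-0.5 * (D[i][j] - row_mean[i] - col_mean[j] + total_mean) for j in range(n)]
        for i in range(n)
    ]


# SP distances on a hypothetical path graph 0-1-2.
D_sp = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
K_sp = distance_to_kernel(D_sp)
```

In the result, closer pairs of nodes receive larger values, and every row sums to zero because of the centering.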

In this paper, we examine 25 graph measures (or, more exactly, 25 parametric families of measures). We present these measures grouped by type similarly to [avrachenkov2017kernels]:

  • Adjacency Matrix based kernels and measures.

    • Katz kernel: $K^{\mathrm{Katz}}_\alpha = (I - \alpha A)^{-1}$, $0 < \alpha < \rho^{-1}$, where $\rho$ is the spectral radius of $A$ [katz1953new] (also known as Walk proximity [chebotarev2006proximity] or von Neumann diffusion kernel [kandola2003learning, shawe2004kernel]).

    • Communicability kernel: $K^{\mathrm{Comm}}_t = \operatorname{expm}(tA)$, $t > 0$, where $\operatorname{expm}$ means matrix exponential [fouss2006experimental, estrada2007statistical, estrada2008communicability].

    • Double Factorial closeness: $K^{\mathrm{DF}}_t = \sum_{k=0}^{\infty} \frac{t^k}{k!!} A^k$, $t > 0$ [estrada2017accounting].

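  • Laplacian Matrix based kernels and measures, where $L = D - A$ is the Laplacian matrix, $D = \operatorname{Diag}(A \cdot \mathbf{1})$ is the degree matrix of $G$, and $\operatorname{Diag}(x)$ is the diagonal matrix with vector $x$ on the main diagonal.

    • Forest kernel: $K^{\mathrm{For}}_t = (I + tL)^{-1}$, $t > 0$ (also known as Regularized Laplacian kernel) [chebotarev1995proximity].

    • Heat kernel: $K^{\mathrm{Heat}}_t = \operatorname{expm}(-tL)$, $t > 0$ [chung1998coverings].

    • Normalized Heat kernel: $K^{\mathrm{NHeat}}_t = \operatorname{expm}(-t\mathcal{L})$, $\mathcal{L} = D^{-1/2} L D^{-1/2}$, $t > 0$.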

    • Absorption kernel: $K^{\mathrm{Abs}}_t = (tA + L)^{-1}$, $t > 0$ [jacobsen2018generalized].

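  • Markov Matrix based kernels and measures, where $P = D^{-1} A$ is the Markov (transition) matrix.

    • Personalized PageRank closeness: $K^{\mathrm{PPR}}_\alpha = (I - \alpha P)^{-1}$, $0 < \alpha < 1$.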

    • Modified Personalized PageRank: $K^{\mathrm{MPPR}}_\alpha = (I - \alpha P)^{-1} D^{-1} = (D - \alpha A)^{-1}$, $0 < \alpha < 1$ [kirkland2012group].

    • PageRank heat closeness: $K^{\mathrm{HPR}}_t = \operatorname{expm}(-t(I - P))$, $t > 0$ [chung2007heat].

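    • Randomized Shortest Path distance. Using $P$ and the matrix of the SP distances $C$, first get $Z$ [yen2008family]:

      $$W = P \circ \exp(-\beta C), \qquad Z = (I - W)^{-1}. \tag{2}$$

      Then $S = (Z (C \circ W) Z) \div Z$; $\bar{C} = S - \mathbf{1} \operatorname{dg}(S)^T$, and finally, $D^{\mathrm{RSP}} = \frac{1}{2}(\bar{C} + \bar{C}^T)$. Here $\circ$ and $\div$ are element-wise multiplication and division, and $\operatorname{dg}(S)$ is the column vector of the diagonal elements of $S$. The kernel version can be obtained with (1).

    • Free Energy distance. Using $Z$ from (2): $Z^h = Z \operatorname{Diag}(\operatorname{dg}(Z))^{-1}$; $\Phi = -\frac{1}{\beta} \log Z^h$; $D^{\mathrm{FE}} = \frac{1}{2}(\Phi + \Phi^T)$ [kivimaki2014developments]. The kernel version can be obtained with (1).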

  • Sigmoid Commute Time kernels.

    • Sigmoid Commute Time kernel:



      is the matrix of Commute Time proximity, std is the standard deviation of all elements of a matrix,

      is the element-wise sigmoid function


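    • Sigmoid Corrected Commute Time kernel. First of all, we need the Corrected Commute Time kernel [luxburg2010getting]: $K^{\mathrm{CCT}} = H D^{-1/2} M (I - M)^{-1} M D^{-1/2} H$, where $M = D^{-1/2} \left( A - \frac{d d^T}{\operatorname{vol}(G)} \right) D^{-1/2}$, $H$ is the centering matrix $H = I - E/n$, $d$ is the vector of diagonal elements of $D$, and $\operatorname{vol}(G)$ is the sum of all elements of $A$. Then, apply (3) replacing $K^{\mathrm{CT}}$ with $K^{\mathrm{CCT}}$ to obtain $K^{\mathrm{SCCT}}$.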

Occasionally, element-wise logarithm is applied to the resulting kernel matrix [chebotarev2013studying, ivashkin2016logarithmic]. We apply it to almost all investigated measures and consider the resulting measures separately from their plain versions (see Table 1). For some measures, like Forest kernel, this is well-known practice [chebotarev2013studying], while for others, like Double Factorial closeness, this transformation, to the best of our knowledge, is applied for the first time. The considered measures and their short names are summarized in Table 1.

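Family | Short name (plain) | Short name (log) | Full name
Adjacency matrix based kernels | Katz | logKatz | Katz kernel
Adjacency matrix based kernels | Comm | logComm | Communicability kernel
Adjacency matrix based kernels | DF | logDF | Double Factorial closeness
Laplacian based kernels | For | logFor | Forest kernel
Laplacian based kernels | Heat | logHeat | Heat kernel
Laplacian based kernels | NHeat | logNHeat | Normalized Heat kernel
Laplacian based kernels | Abs | logAbs | Absorption kernel
Markov matrix based kernels and measures | PPR | logPPR | Personalized PageRank closeness
Markov matrix based kernels and measures | MPPR | logMPPR | Modified Personalized PageRank
Markov matrix based kernels and measures | HPR | logHPR | PageRank heat closeness
Markov matrix based kernels and measures | RSP | - | Randomized Shortest Path kernel
Markov matrix based kernels and measures | FE | - | Free Energy kernel
Sigmoid Commute Time | SCT | - | Sigmoid Commute Time kernel
Sigmoid Commute Time | SCCT | - | Sigmoid Corrected Commute Time kernel
Sigmoid Commute Time | SP-CT | - | Linear combination of SP and CT
Table 1: Short names of considered kernels and other measures.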
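To illustrate how a plain measure and its logarithmic version relate, the sketch below computes a Katz kernel and then applies the element-wise logarithm. It is an illustration under stated assumptions: the Katz kernel is taken in its usual form $(I - \alpha A)^{-1}$, approximated here by the truncated series $\sum_k \alpha^k A^k$ (valid when $\alpha$ is below the reciprocal of the spectral radius), and the function names and the toy graph are hypothetical.

```python
from math import log


def matmul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def katz_kernel(A, alpha, n_terms=200):
    """Approximate (I - alpha*A)^(-1) by the truncated series sum_k alpha^k A^k."""
    n = len(A)
    term = [[float(i == j) for j in range(n)] for i in range(n)]  # k = 0 term: I
    K = [row[:] for row in term]
    for _ in range(n_terms):
        term = [[alpha * x for x in row] for row in matmul(term, A)]
        K = [[k + t for k, t in zip(k_row, t_row)] for k_row, t_row in zip(K, term)]
    return K


def elementwise_log(K):
    """Log version of a measure; requires strictly positive entries."""
    return [[log(x) for x in row] for row in K]


# Hypothetical toy graph: a triangle, whose spectral radius is 2, so alpha < 0.5.
A_triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
K_katz = katz_kernel(A_triangle, alpha=0.25)
K_log_katz = elementwise_log(K_katz)
```

On a connected graph every entry of the Katz kernel is positive, so the logarithm is well defined; for measures that can produce non-positive entries, the transformation would need additional care.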

3 Dataset

We collected a paired dataset of graphs and the corresponding results of clustering with each measure mentioned in Table 1. In this section, we describe the graph generator, the sampling strategy, the graph characteristic features, and the pipeline for the measure score calculation.

We use the Lancichinetti–Fortunato–Radicchi (LFR) graph generator. It generates unweighted graphs with ground truth non-overlapping communities. The model has five mandatory parameters: the number of nodes ($n$), the power law exponent for the degree distribution ($\tau_1$), the power law exponent for the community size distribution ($\tau_2$), the fraction of inter-community edges incident to each node ($\mu$), and either minimum degree (min degree) or average degree (avg degree). There are also several extra parameters: maximum degree (max degree), minimum community size (min community), maximum community size (max community). Not the whole LFR parameter space corresponds to common real-world graphs; most of such graphs are described with and  (e.g., [fotouhi2019evolution]). However, there is also an interesting case of bipartite/multipartite-like graphs with . Our choice is to consider the entire parameter space to cover all theoretical and practical cases.

For the generation, we consider

. It is impossible to generate a dataset with a uniform distribution of all LFR parameters, because

and parameters are located on rays. We transform and to to bring their scope to the interval. In this case, “realistic” settings with take up 62% of the variable range. Also, as avg degree feature is limited by the number of nodes of a particular graph, we decided to replace it with density (). It belongs to . Using all these considerations, we collected our dataset by uniformly sampling parameters for LFR generator from the set {, , , , density} and generating graphs with these parameters. Additionally, we filter out all disconnected graphs.

In total, we generated 11780 graphs. It is worth noting that the generator fails for some sets of parameters, so the resulting dataset is not uniform (see Fig. 1). In our study, non-uniformity is not a critical issue, because we are interested in local effects rather than global leadership. Moreover, true uniformity for the LFR parameter space is impossible due to the unlimited scope of parameters.

Figure 1: Distribution of graph features in the dataset

[Figure 2 diagram: LFR parameters ($n$, $\tau_1$, $\tau_2$, $\mu$, etc.) → graph and ground truth partition → measure (Katz, Forest, etc.) and measure parameter → initialization strategy (k-means++, etc.), several trials → trial choosing criteria (inertia, modularity) → ARI score]

Figure 2: Measuring ARI clustering score for a particular graph, measure, and measure parameter

For our research, we choose a minimum set of the features that describe particular properties of graphs and are not interchangeable.

The LFR parameters can be divided into three groups by the graph properties they reflect:

  • The size of the graph and the communities: $n$, $\tau_2$, min community, max community;

  • The density and uniformity of the node degree distribution: $\tau_1$, min degree, avg degree, max degree. As avg degree depends on $n$, it is distributed exponentially, so we use instead;

  • The cluster separability: $\mu$. As the $\mu$ parameter considers only the ratio between the number of inter-cluster edges and the number of nodes but ignores overall density, we use modularity [newman2004finding] as a more appropriate measure for cluster separability.

Thus, the defined set of features {$n$, $\tau_1$, $\tau_2$, avg degree, modularity} is enough to consider all graph properties mentioned above. Although modularity is a widely used measure, it suffers from resolution limit problems [fortunato2007resolution]. We acknowledge that this may cause some limitations in our approach, which should be the topic of further research.
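As an illustration of the modularity feature, the sketch below evaluates Newman's modularity of a given partition on an unweighted, undirected graph, using the standard form $Q = \sum_c \left[ \frac{e_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right]$, where $e_c$ is the number of edges inside cluster $c$, $d_c$ is the total degree of its nodes, and $m$ is the number of edges. The function name, the edge-list representation, and the toy graph are assumptions made for the example.

```python
def modularity(edges, labels):
    """Newman modularity of a partition; `edges` lists each undirected edge once."""
    m = len(edges)
    intra = {}       # cluster -> number of edges with both ends inside it
    degree_sum = {}  # cluster -> total degree of its nodes
    for u, v in edges:
        degree_sum[labels[u]] = degree_sum.get(labels[u], 0) + 1
        degree_sum[labels[v]] = degree_sum.get(labels[v], 0) + 1
        if labels[u] == labels[v]:
            intra[labels[u]] = intra.get(labels[u], 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree_sum.items())


# Hypothetical toy graph: two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_triangles = modularity(toy_edges, [0, 0, 0, 1, 1, 1])    # clusters follow the triangles
q_alternating = modularity(toy_edges, [0, 1, 0, 1, 0, 1])  # clusters cut most edges
```

The first partition yields a positive value and the second a negative one, which is the sign distinction between well-separated and poorly separated ground truth communities.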

For every generated graph, we calculate the top ARI score [hubert1985comparing] for every measure. The popular NMI clustering score is known to be biased towards smaller clusters [gosgens2019systematic], so we choose ARI as a clustering score that is both well known and unbiased. Since every measure has a parameter, we perform clustering for a range of parameter values (we transform the parameter to lie in the [0, 1] interval and then choose 16 values linearly spaced from 0 to 1). For each value, we run 18 trials of k-means (6 trials for each of three initialization methods).
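For reference, the ARI of a predicted partition against the ground truth can be computed from the contingency table of the two labellings. The sketch below follows the standard pair-counting formula of [hubert1985comparing]; the function name and the example labellings are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index of two labellings of the same nodes."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))  # contingency table entries
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)  # expected index under chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


truth = [0, 0, 0, 1, 1, 1]
ari_same = adjusted_rand_index(truth, [1, 1, 1, 0, 0, 0])     # same partition, renamed
ari_partial = adjusted_rand_index(truth, [0, 0, 1, 1, 2, 2])  # partially correct
ari_bad = adjusted_rand_index(truth, [0, 1, 0, 1, 0, 1])      # worse than chance
```

The score does not depend on how clusters are named, equals 1 for a perfect match, and can drop below 0 when the agreement is worse than expected by chance.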

Fig. 2 shows the pipeline we use to calculate ARI score for a given LFR parameter set, a measure, and a measure parameter. Measure parameters are not the subject of our experiments, so for every measure we just take the result of the measure with the value of the parameter that gives the best ARI score.
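The selection of the best parameter value can be sketched as a simple grid search over 16 linearly spaced values in [0, 1]. This is an illustration only: `score_fn` is a hypothetical stand-in for the whole clustering pipeline (kernel computation, k-means trials, trial selection, and ARI), and the toy score function is invented for the example.

```python
def best_over_parameter_grid(score_fn, n_values=16):
    """Evaluate score_fn on a linear grid in [0, 1]; return (best value, best score)."""
    grid = [i / (n_values - 1) for i in range(n_values)]
    scores = [score_fn(p) for p in grid]
    best_index = max(range(n_values), key=lambda i: scores[i])
    return grid[best_index], scores[best_index]


def toy_score(p):
    """Hypothetical ARI-like curve that peaks at p = 0.4."""
    return 1.0 - (p - 0.4) ** 2


best_param, best_ari = best_over_parameter_grid(toy_score)
```

Because the maximum is taken per graph and per measure, each measure is compared at its most favourable parameter value.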

Because of the need to iterate over graphs, measures, parameter values, and initializations, the task is quite computationally complex. The total computation time was 20 days on 18 CPU cores and 6 GPUs.

4 Results

4.1 Global leadership in LFR space

As a rough estimate of the measures’ applicability, we calculate global leadership on our generated dataset. We divide the dataset into parts corresponding to two different cases of clustering: associative (graphs with modularity $> 0$ on the ground truth partition) and the complementary dissociative case. The negative modularity case basically corresponds to the setup of LFR generator.

We rank the measures by their ARI score on every graph of the dataset. The aggregated rank is defined as the position of the measure in this list, averaged over the dataset (see Tables 2(a) and 2(b); smaller rank is better). It is important to note that the global leadership does not give comprehensive advice on which measure is better to use, because for a particular graph, the global leader can perform worse than the local winner. Here, we consider the entire LFR space, not just its zone corresponding to common real-world graphs, so the ranking may differ from those obtained for restricted settings.
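The aggregated rank can be sketched as follows. The treatment of ties is an assumption of this example (tied measures share the average of the positions they occupy), and the function name and the ARI values are hypothetical.

```python
def mean_ranks(ari_by_graph):
    """Average per-graph rank of each measure (rank 1 = highest ARI).

    `ari_by_graph` is a list of dicts {measure name: ARI}, one dict per graph.
    Tied measures share the average of the positions they occupy (assumption).
    """
    totals = {}
    for scores in ari_by_graph:
        ordered = sorted(scores, key=scores.get, reverse=True)
        pos = 0
        while pos < len(ordered):
            end = pos
            while end + 1 < len(ordered) and scores[ordered[end + 1]] == scores[ordered[pos]]:
                end += 1
            shared_rank = (pos + end) / 2 + 1  # average 1-based rank of the tied group
            for name in ordered[pos:end + 1]:
                totals[name] = totals.get(name, 0.0) + shared_rank
            pos = end + 1
    return {name: total / len(ari_by_graph) for name, total in totals.items()}


# Hypothetical ARI scores of three measures on two graphs.
toy_scores = [
    {"SCCT": 0.9, "RSP": 0.8, "Katz": 0.3},
    {"SCCT": 0.7, "RSP": 0.7, "Katz": 0.2},
]
ranks = mean_ranks(toy_scores)
```

In this toy example, SCCT and RSP tie on the second graph and share rank 1.5 there, so their averaged ranks are 1.25 and 1.75.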

# Measure Rank Wins, % ARI
1 RSP 4.1 40.0 0.67
2 SCCT 5.1 50.5 0.68
3 logNHeat 5.3 34.7 0.66
4 logHeatPR 5.3 34.9 0.66
5 FE 5.5 35.8 0.66
6 logKatz 5.5 39.5 0.66
7 logPPR 6.2 35.1 0.65
8 logComm 6.3 40.5 0.64
9 logModifPPR 6.5 34.4 0.65
10 SCT 7.3 36.1 0.64
11 SP-CT 7.5 32.6 0.64
12 logAbs 8.1 33.8 0.63
13 logFor 8.8 33.8 0.60
14 logHeat 9.2 31.1 0.58
15 NHeat 9.6 35.1 0.56
16 HeatPR 10.3 32.2 0.59
17 Comm 11.5 26.4 0.52
18 logDF 12.4 22.7 0.46
19 Heat 13.7 27.2 0.46
20 Katz 15.2 10.1 0.43
21 DF 16.0 12.1 0.37
22 PPR 17.8 11.1 0.35
23 For 20.4 7.9 0.19
24 Abs 21.1 7.1 0.16
25 ModifPPR 22.1 4.7 0.12
(a) Associative graphs. The win percentage is calculated among 6777 graphs in the dataset.
# Measure Rank Wins, % ARI
1 SCCT 3.7 63.9 0.70
2 RSP 7.0 18.6 0.46
3 SP-CT 8.1 16.6 0.45
4 SCT 8.1 14.5 0.46
5 NHeat 8.5 10.1 0.40
6 logHeatPR 8.6 14.5 0.41
7 FE 8.9 15.5 0.43
8 logNHeat 9.0 13.9 0.39
9 logPPR 9.4 13.8 0.39
10 Katz 9.8 2.8 0.34
11 Comm 10.3 5.2 0.33
12 logModifPPR 10.8 13.5 0.37
13 logKatz 10.9 14.4 0.37
14 Abs 11.0 13.3 0.35
15 DF 12.2 2.6 0.27
16 logAbs 12.8 12.1 0.35
17 HeatPR 15.9 0.6 0.16
18 logFor 15.9 5.3 0.22
19 Heat 15.9 0.7 0.09
20 PPR 16.4 0.5 0.15
21 logHeat 16.6 0.7 0.10
22 logDF 18.1 3.7 0.08
23 logComm 18.2 0.7 0.07
24 For 21.4 0.0 0.02
25 ModifPPR 21.7 0.1 0.02
(b) Dissociative graphs. The win percentage is calculated among 5003 graphs in the dataset.
Table 2: Leaderboards for associative and dissociative cases. The ARI column shows the mean ARI across the dataset.

Table 2(a) shows that there are several leading measures whose quality is not much different. The best measures are RSP (by rank) and SCCT (by the number of wins and the mean ARI). The dissociative case has an undisputed leader, SCCT (Table 2(b)).

The above division into two cases does not exhaust the variety of graphs. For a more precise study of the measures’ applicability, we look for the leadership zones of each measure.

4.2 Feature importance study

First of all, we find out which graph features among the LFR parameters are important for the choice of the best measure and which are not. To do that, we use Linear Discriminant Analysis [mika1999fisher] (LDA). This method finds a new basis in the feature space to classify a dataset in the best way. It also shows how many basis components are required to fit the majority of data.


(a) Explained variance.

(b) Features’ contribution to LDA components.
Figure 3: The results of LDA analysis.

Fig. 3(a) shows that the first two components account for about 90% of the explained variance. Fig. 3(b) shows that these components include only $\tau_1$, avg degree, and modularity. The fact that $n$ is not included means that the size of the graph as well as the density are not of primary importance for choosing the best measure. The same holds for $\tau_2$, which measures the diversity of cluster sizes.

To detect the zones of measure leadership, we need to know the leadership on average in each area of space rather than the wins in particular points. To determine the local measure leadership, we need to introduce a filtering algorithm that for every point of the space returns the leading measure depending on the closest data points. As the choice of measure is mainly dependent on three features {$\tau_1$, avg degree, modularity}, we can limit our feature space to them.

4.3 Gaussian filter in feature space

Using a filter in the feature space, we can suppress the noise and reveal the actual zones of leadership for the measures. We use the Gaussian filter with a scale parameter . For every given point of the space, it takes the data points that are closer than and averages the ARIs of the chosen points with a weight . This gives larger weights to closer points. If there are fewer than three data points inside the sphere with a radius, the filter returns nothing, which lets us ignore the points with insufficient data in their vicinity.
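A filter of this kind can be sketched as below. Several details are assumptions of the example rather than the settings used in our experiments: the weight is taken as $\exp(-d^2 / (2\sigma^2))$ for a point at distance $d$, the cutoff radius is passed as a separate argument, and the function name, data layout, and toy values are hypothetical.

```python
from math import dist, exp


def gaussian_filter_leader(query, points, sigma, radius, min_points=3):
    """Return the measure with the highest Gaussian-weighted mean ARI near `query`.

    `points` is a list of (coordinates, {measure name: ARI}) pairs.
    Returns None when fewer than `min_points` data points lie within `radius`.
    """
    weighted_sums = {}
    weight_total = 0.0
    used = 0
    for coords, scores in points:
        d = dist(query, coords)
        if d >= radius:
            continue  # ignore points outside the sphere
        w = exp(-d * d / (2 * sigma * sigma))  # closer points get larger weights
        used += 1
        weight_total += w
        for name, ari in scores.items():
            weighted_sums[name] = weighted_sums.get(name, 0.0) + w * ari
    if used < min_points:
        return None
    return max(weighted_sums, key=lambda name: weighted_sums[name] / weight_total)


# Hypothetical data points in a 3D feature space with ARIs of two measures.
toy_points = [
    ((0.0, 0.0, 0.0), {"A": 0.9, "B": 0.5}),
    ((0.1, 0.0, 0.0), {"A": 0.8, "B": 0.6}),
    ((0.0, 0.1, 0.0), {"A": 0.7, "B": 0.75}),
    ((5.0, 5.0, 5.0), {"A": 0.0, "B": 1.0}),
]
leader_near_origin = gaussian_filter_leader((0.0, 0.0, 0.0), toy_points, sigma=0.5, radius=1.0)
leader_far_away = gaussian_filter_leader((5.0, 5.0, 5.0), toy_points, sigma=0.5, radius=1.0)
```

Near the origin the filter averages three neighbouring points, so a single win of another measure at one of them does not change the local leader; at the isolated point it returns nothing because too few data points are available.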

Before applying the filter, we prepare the dataset. First, we isolate the case when several measures reach into a separate measure called “several”. Also, we normalize the standard deviation of every feature distribution to one.

To choose , we apply the filter with different values of and look at the number of connected components in the feature space. The needed

should be large enough to suppress the noise, however, it should not suppress small zones. Guided by this heuristic, we choose












Wins 7874 1544 1043 444 265 10 4 3 2 2
Table 3: The leaderboard of measure wins after filtering with

After filtering with , the leaderboard of measure wins changes (see Table 3). Only four measures keep their positions: SCCT, RSP, logComm, and logHeatPR. There is also a special case of several winning measures (named “several”), when the predicted partition reaches for several measures. The presence of several winners makes it difficult to analyze zones of measure’s leadership, so we decided to exclude this case from the detailed analysis in this work. Filtering shows that these four measures do have zones of leadership, otherwise they would be filtered out. We can plot the entire feature space colored by the leadership zones of the measures (see Fig. 4). As the resulting space is 3D, we indicate its slices by their coordinates.

(a) Slices by $\tau_1$.
(b) Slices by avg degree.
(c) Slices by modularity.
Figure 4: The feature space {$\tau_1$, avg degree, modularity} divided into the leadership zones of six measures.

The zones of measure leadership can be described by the following approximate criteria:

  • SCCT: in many domains of the parameter space;

  • RSP: up to 5, modularity in ;

  • logComm: modularity in 0..0.3, avg degree up to 100;

  • logHeatPR: modularity above 0.3, avg degree up to 50;

  • several: high modularity or high avg degree.

5 Conclusions

In this work, we have shown that the global leadership of measures does not provide comprehensive knowledge about graph measure performance in clustering tasks. We demonstrated that among 25 measures, SCCT is the best measure for the LFR graphs both by winning rate and ranking. However, there are also smaller distinct zones of leadership for RSP, logComm, and logHeatPR. Other measures, including those with high rank, fail to form their leadership zones.

Our results do not contradict those of other experimental works and, moreover, refine them by providing new findings. LogComm was first introduced in [ivashkin2016logarithmic] and won in the competitions on graphs generated with a particular set of SBM parameters. This study confirms its leadership, but only for a certain type of graphs. Another interesting finding is logHeatPR, which shows unexpectedly good performance within its zone of leadership.

According to the LDA results, the leadership of a measure is determined mainly by {$\tau_1$, avg degree, modularity}. One of the interesting consequences is that the leadership does not depend on $n$. This effect could be caused by the fact that we limited the size of the graphs to . It is not guaranteed to be preserved for large graphs.

This study is based on the LFR benchmark data. More research is needed to determine how well the results for LFR carry over to real-world graphs. This would assess the applicability of our findings to practical cases.

It should be noted that our study is insensitive to the non-uniformity of the generated dataset. While manipulations with this dataset may affect the global leaderboard, they cannot change the local leadership, which is the focus of the present work.