1 Introduction
Graph node clustering is one of the central tasks in graph structure analysis. It provides a partition of nodes into disjoint clusters (communities), i.e., groups of nodes characterized by strong mutual connections or by similar external connections. It can be of practical use for graphs representing real-life systems, such as social networks or industrial processes. Clustering allows one to infer information about the system: the nodes of the same cluster are highly similar, while the nodes of different clusters are dissimilar. The technique can be applied without any labeled data to extract important insights about a network.
There are different approaches to clustering, including those based on modularity optimization [newman2004finding, blondel2008fast], the label propagation algorithm [raghavan2007near, barber2009detecting], the Markov cluster process [van2000graph, enright2002efficient], and spectral clustering [von2007tutorial]. In this work, we use an approach based on choosing a closeness measure on a graph, which allows one to use any metric clustering algorithm (e.g., [yen2009graph]). The choice of the measure significantly affects the quality of clustering. Classical measures are the Shortest Path [buckley1990distance] and the Commute Time [gobel1974random] distances. The former is the minimum number of edges in a path between a given pair of nodes. The latter is the expected number of steps from one node to the other and back in a random walk on the graph. There are a number of other measures, including recent ones (e.g., [estrada2017accounting, jacobsen2018generalized]); many of them are parametric. Although graph measures are compatible with any metric algorithm, in this paper, we restrict ourselves to the kernel k-means algorithm (e.g., [fouss2016algorithms]).
We base our research on a generated set of graphs. There are various algorithms to generate graphs with community structure. The well-known ones are the Stochastic Block Model [holland1983stochastic] and the Lancichinetti–Fortunato–Radicchi benchmark [lancichinetti2008benchmark] (hereafter, LFR). The first one is an extension of the Erdős–Rényi model with different intra- and inter-cluster probabilities of edge creation. The second one involves power-law distributions of node degrees and community sizes. There are other generation models, e.g., Naive Scale-free Clustering [pasta2017topology]. We choose the LFR model: although it misses some key properties of real graphs, like the diameter or the clustering coefficient, this model has been proven to be effective in meta-learning [prokhorenkova2019using].

There are many measure benchmarking studies considering node classification and clustering for both generated graphs and real-world datasets, including [fouss2012experimental, sommer2016comparison, sommer2017modularity, avrachenkov2017kernels, ivashkin2016logarithmic, guex2018randomized, guex2019covariance, aynulin2019efficiency, aynulin2019impact, courtain2020randomized, leleux2020sparse]. Despite a large number of experimental results, an exact theory is still a matter of the future. One of the most interesting theoretical results on graph measures is [luxburg2010getting], where some unattractive features of the Commute Time distance on large graphs were explained theoretically, and a reasonable amendment was proposed to fix the problem. Beyond the complexity of such proofs, there is still very little empirical understanding of what effects need to be proven. Our empirical work has two main differences from the previous ones. First, we consider a large number of graph measures, which for the first time gives a fairly complete picture. Second, unlike the previous studies aimed at revealing the global leaderboard, we look for the leading measures for each set of the LFR parameters.
We aim to explore the performance of the 25 most popular measures in the graph clustering problem on a set of generated graphs with various parameters. We assess the quality of clustering with every considered measure and determine the best measure for every region of the graph parameter space.
Our contributions are as follows:

We generate a dataset of 11780 graphs covering the entire parameter space of the LFR generator;

We consider a broad set of measures and rank them by clustering performance on this dataset;

We determine the graph features that are responsible for measure leadership;

We find the regions of each measure’s leadership in the graph parameter space.
Our framework for clustering with graph measures as well as the collected dataset are available at https://github.com/vlivashkin/pygkernels.
2 Definitions
2.1 Kernel k-means
The original k-means algorithm [lloyd1982least, macqueen1967some] clusters objects in Euclidean space. It requires coordinates of the objects to determine the distances between them and the centroids. The algorithm can be generalized to use only the degree of closeness between the objects, without defining a particular space. This technique is called the kernel trick; it is usually used to bring non-linearity to linear algorithms. The version of k-means that uses the kernel trick is called kernel k-means (see, e.g., [fouss2016algorithms]). For the graph node clustering scenario, we can use graph measures as kernels for kernel k-means.
We assume the number of clusters to be known in advance; what remains is to set the initial state of the centroids. The results of clustering with k-means are very sensitive to this initial state. Usually, the algorithm is run several times with different initial states (trials), and the “best” trial is chosen. There are different approaches to the initialization; we consider three of them: random data points, k-means++ [arthur2006k], and random partition. We combine all these strategies to reduce the impact of the initialization strategy on the result.
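As an illustration, kernel k-means with random-partition restarts can be sketched in a few lines of NumPy. This is a minimal sketch using the standard kernelized distance identity, not the implementation from our framework:

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_init=10, max_iter=100, rng=None):
    """Kernel k-means: the squared distance from object i to the centroid of
    cluster c is computed implicitly from the kernel (Gram) matrix K:
        d2(i, c) = K_ii - 2 * mean_{j in c} K_ij + mean_{j,l in c} K_jl
    """
    rng = np.random.default_rng(rng)
    n = K.shape[0]
    best_labels, best_obj = None, np.inf
    for _ in range(n_init):
        labels = rng.integers(n_clusters, size=n)   # random-partition init
        for _ in range(max_iter):
            dist = np.zeros((n, n_clusters))
            for c in range(n_clusters):
                mask = labels == c
                if not mask.any():                  # re-seed an empty cluster
                    labels[rng.integers(n)] = c
                    mask = labels == c
                dist[:, c] = (np.diag(K) - 2 * K[:, mask].mean(axis=1)
                              + K[np.ix_(mask, mask)].mean())
            new_labels = dist.argmin(axis=1)
            converged = (new_labels == labels).all()
            labels = new_labels
            if converged:
                break
        obj = dist[np.arange(n), labels].sum()      # within-cluster scatter
        if obj < best_obj:
            best_obj, best_labels = obj, labels
    return best_labels
```

With a graph kernel plugged in as K, this plays the role of the clustering step described above; in practice we also mix in the other initialization strategies, as noted.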
2.2 Closeness measures
For a given graph $G$, $V(G)$ is the set of its vertices and $A$ is its adjacency matrix. A measure on $G$ is a function $\kappa\colon V(G) \times V(G) \to \mathbb{R}$, which gets two nodes and returns their closeness (larger means closer) or distance (larger means farther).
A kernel on a graph is a closeness measure of graph nodes that has an inner product representation. Any symmetric positive semidefinite matrix is an inner product matrix (also called a Gram matrix). A kernel matrix is a square matrix that contains the similarities for all pairs of nodes in a graph.
To use kernel k-means, we need kernels. Although not all the closeness measures we consider produce Gram matrices, we treat them as kernels. The applicability of this approach was confirmed in [fouss2016algorithms]. In the list of measures below, we use the word “kernel” only for the measures that satisfy the precise definition of a kernel.
Classical measures Shortest Path distance [buckley1990distance] (SP) and Commute Time distance [gobel1974random] (CT) are defined as distances, so we need to transform them into similarities to use them as kernels. We apply the following distance-to-closeness transformation [chebotarev2005duality, borg2005modern]:

(1)  $K = -\frac{1}{2} H \Delta H$, $H = I - E/n$,

where $\Delta$ is a distance matrix, $E$ is the matrix of ones, $I$ is the identity matrix, and $n$ is the number of nodes.

In this paper, we examine 25 graph measures (or, more exactly, 25 parametric families of measures). We present these measures grouped by type, similarly to [avrachenkov2017kernels]:

Adjacency Matrix based kernels and measures.

Katz kernel: $K^{\mathrm{Katz}} = (I - \alpha A)^{-1}$, $0 < \alpha < \rho^{-1}$, where $\rho$ is the spectral radius of $A$ [katz1953new] (also known as Walk proximity [chebotarev2006proximity] or von Neumann diffusion kernel [kandola2003learning, shawe2004kernel]).

Communicability kernel: $K^{\mathrm{Comm}} = \exp(tA)$, $t > 0$, where $\exp(\cdot)$ means the matrix exponential [fouss2006experimental, estrada2007statistical, estrada2008communicability].

Double Factorial closeness: $K^{\mathrm{DF}} = \sum_{k=0}^{\infty} \frac{t^k A^k}{k!!}$, $t > 0$ [estrada2017accounting].


Laplacian Matrix based kernels and measures, where $L = D - A$ is the Laplacian of $G$, $D = \mathrm{Diag}(A\mathbf{1})$ is the degree matrix of $G$, and $\mathrm{Diag}(x)$ is the diagonal matrix with vector $x$ on the main diagonal.
Forest kernel: $K^{\mathrm{For}} = (I + tL)^{-1}$, $t > 0$ (also known as Regularized Laplacian kernel) [chebotarev1995proximity].

Heat kernel: $K^{\mathrm{Heat}} = \exp(-tL)$, $t > 0$ [chung1998coverings].

Normalized Heat kernel: $K^{\mathrm{NHeat}} = \exp(-t\mathcal{L})$, $t > 0$, where $\mathcal{L} = D^{-1/2} L D^{-1/2}$ [chung1997spectral].
Absorption kernel: $K^{\mathrm{Abs}} = (tA + L)^{-1}$, $t > 0$ [jacobsen2018generalized].


Markov Matrix based kernels and measures, where $P = D^{-1}A$ is the Markov (random walk transition) matrix.

Personalized PageRank closeness: $K^{\mathrm{PPR}} = (I - \alpha P)^{-1}$, $0 < \alpha < 1$ [page1999pagerank].
Modified Personalized PageRank: $K^{\mathrm{MPPR}} = (I - \alpha P)^{-1} D^{-1}$, $0 < \alpha < 1$ [kirkland2012group].

PageRank heat closeness: $K^{\mathrm{HPR}} = \exp(-t(I - P))$, $t > 0$ [chung2007heat].

Randomized Shortest Path distance. Using $P^{\mathrm{ref}} = D^{-1}A$ and the matrix $C$ of the SP distances, first get $W$ [yen2008family]:

(2)  $W = P^{\mathrm{ref}} \circ \exp(-\beta C)$.

Then $Z = (I - W)^{-1}$; $S = (Z(C \circ W)Z) \div Z$; $\bar{C} = S - e\,d_S^{T}$, and finally, $\Delta_{\mathrm{RSP}} = (\bar{C} + \bar{C}^{T})/2$. Here $\circ$ and $\div$ are element-wise multiplication and division, $e$ is the vector of ones, and $d_S$ is the column vector of the diagonal elements of $S$. The kernel version can be obtained with (1).


Sigmoid Commute Time kernels.

Sigmoid Commute Time kernel:

(3)  $K^{\mathrm{SCT}} = \sigma\big(\alpha K^{\mathrm{CT}} / \mathrm{std}(K^{\mathrm{CT}})\big)$, $\alpha > 0$,

where $K^{\mathrm{CT}}$ is the matrix of Commute Time proximity, $\mathrm{std}(\cdot)$ is the standard deviation of all elements of a matrix, and $\sigma(x) = 1/(1 + e^{-x})$ is the element-wise sigmoid function [yen2007graph].
Sigmoid Corrected Commute Time kernel. First of all, we need the Corrected Commute Time kernel [luxburg2010getting]: $K^{\mathrm{CCT}} = H D^{-1/2} M (I - M)^{-1} M D^{-1/2} H$, where $M = D^{-1/2}(A - d d^{T}/\mathrm{vol}(G)) D^{-1/2}$, $H = I - E/n$ is the centering matrix, $d$ is the vector of diagonal elements of $D$, and $\mathrm{vol}(G)$ is the sum of all elements of $A$. Then, apply (3) replacing $K^{\mathrm{CT}}$ with $K^{\mathrm{CCT}}$ to obtain $K^{\mathrm{SCCT}}$.

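The Randomized Shortest Path computation can be sketched as follows. This is our reading of [yen2008family], intended as an illustration rather than a reference implementation:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def rsp_distance(A, beta=1.0):
    """Randomized Shortest Path dissimilarity for a connected graph with
    adjacency matrix A (a sketch of eq. (2) and the subsequent steps)."""
    n = A.shape[0]
    P_ref = A / A.sum(axis=1, keepdims=True)  # reference transition matrix
    C = shortest_path(A, unweighted=True)     # SP cost matrix
    W = P_ref * np.exp(-beta * C)             # eq. (2), element-wise product
    Z = np.linalg.inv(np.eye(n) - W)
    S = (Z @ (C * W) @ Z) / Z                 # element-wise division
    C_bar = S - np.outer(np.ones(n), np.diag(S))
    return (C_bar + C_bar.T) / 2              # symmetrized distance
```

The returned matrix is a symmetric dissimilarity with zero diagonal; the kernel version is then obtained with (1).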
Occasionally, the element-wise logarithm is applied to the resulting kernel matrix [chebotarev2013studying, ivashkin2016logarithmic]. We apply it to almost all the investigated measures and consider the resulting measures separately from their plain versions (see Table 1). For some measures, like the Forest kernel, this is a well-known practice [chebotarev2013studying], while for others, like the Double Factorial closeness, this transformation, to the best of our knowledge, is applied for the first time. The considered measures and their short names are summarized in Table 1.
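To make the definitions concrete, here is a minimal NumPy/SciPy sketch of two of the kernels above, the distance-to-closeness transformation (1), and the element-wise logarithm. It is an illustration, not the implementation from our framework:

```python
import numpy as np
from scipy.linalg import expm

def katz_kernel(A, alpha):
    """Katz kernel (I - alpha*A)^(-1); valid for 0 < alpha < 1/rho(A)."""
    rho = max(abs(np.linalg.eigvals(A)))
    assert 0 < alpha < 1 / rho, "alpha must be below the inverse spectral radius"
    return np.linalg.inv(np.eye(A.shape[0]) - alpha * A)

def heat_kernel(A, t):
    """Heat kernel expm(-t*L) with L = D - A, t > 0."""
    L = np.diag(A.sum(axis=1)) - A
    return expm(-t * L)

def distance_to_kernel(D):
    """Transformation (1): K = -H @ D @ H / 2 with H = I - E/n."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return -H @ D @ H / 2

# The element-wise logarithm turns a plain measure into its "log" variant,
# e.g. Katz -> logKatz (entries must be positive, as for a connected graph).
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph
log_katz = np.log(katz_kernel(A, 0.3))
```

For a matrix of squared Euclidean distances, transformation (1) recovers exactly the Gram matrix of the centered points, which is why it converts distance measures into kernels.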
Family  Short name (Plain)  Short name (Log)  Full name
Adjacency matrix based kernels  Katz  logKatz  Katz kernel
  Comm  logComm  Communicability kernel
  DF  logDF  Double Factorial closeness
Laplacian based kernels  For  logFor  Forest kernel
  Heat  logHeat  Heat kernel
  NHeat  logNHeat  Normalized Heat kernel
  Abs  logAbs  Absorption kernel
Markov matrix based kernels and measures  PPR  logPPR  Personalized PageRank closeness
  MPPR  logMPPR  Modified Personalized PageRank
  HPR  logHPR  PageRank heat closeness
  RSP  —  Randomized Shortest Path kernel
  FE  —  Free Energy kernel
Sigmoid Commute Time  SCT  —  Sigmoid Commute Time kernel
  SCCT  —  Sigmoid Corrected Commute Time kernel
  SPCT  —  Linear combination of SP and CT
3 Dataset
We collected a paired dataset of graphs and the corresponding results of clustering with each measure mentioned in Table 1. In this section, we describe the graph generator, the sampling strategy, the graph characteristic features, and the pipeline for the measure score calculation.
We use the Lancichinetti–Fortunato–Radicchi (LFR) graph generator. It generates unweighted graphs with ground-truth non-overlapping communities. The model has five mandatory parameters: the number of nodes ($n$), the power-law exponent of the degree distribution ($\tau_1$), the power-law exponent of the community size distribution ($\tau_2$), the fraction of inter-community edges incident to each node ($\mu$), and either the minimum degree (min degree) or the average degree (avg degree). There are also several extra parameters: the maximum degree (max degree), the minimum community size (min community), and the maximum community size (max community). Not the whole LFR parameter space corresponds to common real-world graphs; most of such graphs are described with $2 < \tau_1 < 3$ and $1 < \tau_2 < 2$ (e.g., [fotouhi2019evolution]). However, there is also an interesting case of bipartite/multipartite-like graphs with large $\mu$. Our choice is to consider the entire parameter space to cover all theoretical and practical cases.
For the generation, we consider the parameter set {$n$, $\tau_1$, $\tau_2$, $\mu$, avg degree}. It is impossible to generate a dataset with a uniform distribution of all LFR parameters, because the $\tau_1$ and $\tau_2$ parameters are located on rays. We transform $\tau_1$ and $\tau_2$ to $1/\tau_1$ and $1/\tau_2$ to bring their scope to the $(0, 1]$ interval. In this case, “realistic” settings take up 62% of the variable range. Also, as the avg degree feature is limited by the number of nodes of a particular graph, we decided to replace it with density (avg degree divided by $n - 1$), which belongs to $(0, 1]$. Using all these considerations, we collected our dataset by uniformly sampling parameters for the LFR generator from the set {$n$, $1/\tau_1$, $1/\tau_2$, $\mu$, density} and generating graphs with these parameters. Additionally, we filtered out all disconnected graphs.

In total, we generated 11780 graphs. It is worth noting that the generator fails for some sets of parameters, so the resulting dataset is not uniform (see Fig. 2). In our study, non-uniformity is not a critical issue, because we are interested in local effects rather than global leadership. Moreover, true uniformity for the LFR parameter space is impossible due to the unlimited scope of the parameters.
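For reference, a similar generation step can be reproduced with the LFR implementation shipped with networkx. This is a sketch (the parameter values below are the networkx documentation example, not the settings of our dataset):

```python
import networkx as nx

# tau1 = degree exponent, tau2 = community-size exponent, mu = mixing fraction
G = nx.LFR_benchmark_graph(250, 3, 1.5, 0.1, average_degree=5,
                           min_community=20, seed=10)

# Ground-truth communities are stored as a node attribute.
communities = {frozenset(G.nodes[v]["community"]) for v in G}
print(G.number_of_nodes(), len(communities))
```

Note that the generator can raise an exception for infeasible parameter combinations, which is exactly why the resulting dataset cannot be perfectly uniform.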
For our research, we choose a minimal set of features that describe particular properties of graphs and are not interchangeable. The LFR parameters can be divided into three groups by the graph properties they reflect:

The size of the graph and the communities: $n$, $\tau_2$, min community, max community;

The density and uniformity of the node degree distribution: $\tau_1$, min degree, avg degree, max degree. As avg degree depends on $n$, it is distributed exponentially, so we use its logarithm instead;

The cluster separability: $\mu$. As the $\mu$ parameter considers only the ratio between the number of inter-cluster edges and the number of nodes but ignores the overall density, we use modularity [newman2004finding] as a more appropriate measure of cluster separability.
Thus, the defined set of features {$n$, $1/\tau_1$, $1/\tau_2$, avg degree, modularity} is enough to cover all the graph properties mentioned above. Although modularity is a widely used measure, it suffers from the resolution limit problem [fortunato2007resolution]. We acknowledge that this may cause some limitations of our approach, which should be the topic of further research.
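The modularity of a ground-truth partition is straightforward to compute, e.g., with networkx. A small sketch (the barbell graph here is only an illustrative example, not part of our dataset):

```python
import networkx as nx
from networkx.algorithms.community import modularity

G = nx.barbell_graph(5, 0)                  # two 5-cliques joined by one edge
parts = [set(range(5)), set(range(5, 10))]  # ground-truth communities
Q = modularity(G, parts)                    # Q = sum_c (e_c/m - (deg_c/(2m))^2)
print(round(Q, 3))                          # 0.452
```

A strongly associative partition like this one has modularity well above zero; a dissociative (multipartite-like) partition yields a negative value.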
For every generated graph, we calculate the top ARI score [hubert1985comparing] for every measure. The popular NMI clustering score is known to be biased towards smaller clusters [gosgens2019systematic]; we choose ARI as a clustering score that is both well known and unbiased. Since every measure has a parameter, we perform clustering for a range of parameter values (we transform the parameter to fit the [0, 1] interval and then take 16 values linearly spaced from 0 to 1). For each value, we run 18 trials of k-means (6 trials for each of the three initialization strategies).
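ARI is invariant to label permutation and is close to zero for partitions no better than chance, which is what makes it suitable here. For example, with scikit-learn:

```python
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 1, 0, 0, 0]               # same partition, labels swapped
print(adjusted_rand_score(y_true, y_pred))  # 1.0

# A partition that cross-cuts the ground truth scores below chance level:
print(adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```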
Fig. 2 shows the pipeline we use to calculate the ARI score for a given LFR parameter set, a measure, and a measure parameter value. Measure parameters are not the subject of our experiments, so for every measure we simply take the result with the parameter value that gives the best ARI score.
Because of the need to iterate over graphs, measures, parameter values, and initializations, the task is computationally demanding. The total computation time was 20 days on 18 CPU cores and 6 GPUs.
4 Results
4.1 Global leadership in LFR space
As a rough estimate of the measures’ applicability, we calculate the global leadership on our generated dataset. We divide the dataset into two parts corresponding to two different cases of clustering: the associative case (graphs with modularity $> 0$ on the ground-truth partition) and the complementary dissociative case. The negative-modularity case corresponds to large $\mu$ in the setup of the LFR generator.

We rank the measures by their ARI score on every graph of the dataset. The aggregated rank of a measure is defined as its position in this list, averaged over the dataset (see Tables 2(a) and 2(b); a smaller rank is better). It is important to note that the global leadership does not give comprehensive advice on which measure to use: for a particular graph, the global leader can perform worse than the local winner. Here we consider the entire LFR space, not just the zone corresponding to common real-world graphs, so the ranking may differ from those obtained in restricted settings.


Table 2(a) shows that there are several leading measures whose quality does not differ much. The best measures are RSP (by rank) and SCCT (by the number of wins and by the mean ARI). The dissociative case has an undisputed leader, SCCT (Table 2(b)).
The above division into two cases does not exhaust the variety of graphs. For a further, more precise study of the measures’ applicability, we look for the leadership zones of each measure.
4.2 Feature importance study
First of all, we find out which graph features among the LFR parameters are important for the choice of the best measure and which are not. To do that, we use Linear Discriminant Analysis (LDA) [mika1999fisher]. This method finds a new basis in the feature space that separates the classes in the best way. It also shows how many basis components are required to fit the majority of the data.
Fig. 2(a) shows that the first two components account for about 90% of the explained variance. Fig. 2(b) shows that these components are built mainly of $1/\tau_1$, avg degree, and modularity. The fact that $n$ is not included means that neither the size of the graph nor the density is of primary importance for choosing the best measure; the same holds for $1/\tau_2$, which measures the diversity of cluster sizes.
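The LDA step can be sketched with scikit-learn on synthetic stand-in data. The data below is hypothetical, chosen only to show how the explained-variance ratios of the discriminant components are read off (here, the "winning measure" label is driven by two of five features):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))               # 5 graph features (stand-in)
# Label driven by features 0 and 2 only -> 3 classes along one direction.
y = np.digitize(X[:, 0] + 2 * X[:, 2], [-1.0, 1.0])

lda = LinearDiscriminantAnalysis().fit(X, y)
# With 3 classes there are 2 discriminant components; their ratios show
# how much between-class variance each component captures.
print(lda.explained_variance_ratio_)
```

Because the classes here are separated along a single direction, the first component captures almost all of the between-class variance, mirroring how we read Fig. 2(a).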
To detect the zones of measure leadership, we need to know the leadership on average in each area of the space rather than the wins at particular points. To determine the local measure leadership, we introduce a filtering algorithm that, for every point of the space, returns the leading measure depending on the closest data points. As the choice of the measure mainly depends on three features, {$1/\tau_1$, avg degree, modularity}, we can limit our feature space to them.
4.3 Gaussian filter in feature space
Using a filter in the feature space, we can suppress the noise and reveal the actual zones of leadership of the measures. We use a Gaussian filter with a scale parameter $\sigma$. For every given point of the space, it takes the data points that are closer than $\sigma$ and averages the ARIs of the chosen points with Gaussian weights, giving larger weights to closer points. If there are fewer than three data points inside the sphere of radius $\sigma$, the filter returns nothing, which allows us to ignore the points with insufficient data in their vicinity.
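The filtering step can be sketched as follows. The weight formula $w = \exp(-d^2/(2\sigma^2))$ is the standard Gaussian weight and is our assumption here; the cutoff and the minimum-points rule follow the description above:

```python
import numpy as np

def gaussian_filter_vote(points, aris, query, sigma, min_points=3):
    """Local measure leadership at `query` in feature space.

    points: (n, d) graph feature vectors; aris: (n, m) ARI of each of the
    m measures on each graph. Returns the index of the measure with the
    highest Gaussian-weighted average ARI among the points closer than
    sigma, or None if fewer than `min_points` such points exist.
    """
    points, aris, query = map(np.asarray, (points, aris, query))
    d = np.linalg.norm(points - query, axis=1)
    mask = d < sigma
    if mask.sum() < min_points:
        return None                        # insufficient data in the vicinity
    w = np.exp(-d[mask] ** 2 / (2 * sigma ** 2))
    mean_ari = w @ aris[mask] / w.sum()    # weighted average ARI per measure
    return int(mean_ari.argmax())
```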
Before applying the filter, we prepare the dataset. First, we isolate the cases where several measures reach the top score into a separate class called “several”. Also, we normalize every feature distribution to unit standard deviation.
To choose $\sigma$, we apply the filter with different values of $\sigma$ and look at the number of connected components of the leadership zones in the feature space. The needed $\sigma$ should be large enough to suppress the noise; however, it should not suppress small zones. We choose $\sigma$ guided by this heuristic.
Measure  SCCT  RSP  logComm  logHeatPR  several  Abs  logHeat  logNHeat  NHeat  Comm
Wins  7874  1544  1043  444  265  10  4  3  2  2
After filtering with the chosen $\sigma$, the leaderboard of measure wins changes (see Table 3). Only four measures keep their positions: SCCT, RSP, logComm, and logHeatPR. There is also the special case of several winning measures (named “several”), when the predicted partition reaches the top score for several measures at once. The presence of several winners makes it difficult to analyze the zones of measure leadership, so we decided to exclude this case from the detailed analysis in this work. Filtering shows that these four measures do have zones of leadership; otherwise, they would have been filtered out. We can plot the entire feature space colored by the leadership zones of the measures (see Fig. 4). As the resulting space is 3D, we indicate its slices by their coordinates.



The zones of measure leadership can be described by the following approximate criteria:

SCCT: in many domains of the parameter space;

RSP: avg degree up to 5, modularity in an intermediate range;

logComm: modularity in 0..0.3, avg degree up to 100;

logHeatPR: modularity above 0.3, avg degree up to 50;

several: high modularity or high avg degree.
5 Conclusions
In this work, we have shown that the global leadership of measures does not provide comprehensive knowledge about graph measure performance in the clustering task. We demonstrated that among the 25 measures, SCCT is the best for LFR graphs both by winning rate and by ranking. However, there are also smaller distinct zones of leadership of RSP, logComm, and logHeatPR. The other measures, including highly ranked ones, fail to form their own leadership zones.
Our results do not contradict those of other experimental works and, moreover, refine them by providing new findings. LogComm was first introduced in [ivashkin2016logarithmic], where it won the competitions on graphs generated with a particular set of SBM parameters. This study confirms its leadership, but only for a certain type of graphs. Another interesting finding is logHeatPR, which shows unexpectedly good performance within its zone of leadership.
According to the LDA analysis, the leadership of a measure is determined mainly by {$1/\tau_1$, avg degree, modularity}. One of the interesting consequences is that the leadership does not depend on $n$. This effect could be caused by the fact that we limited the size of the graphs in our dataset; it is not guaranteed to be preserved for large graphs.
This study is based on the LFR benchmark data. More research is needed to determine how well the results for LFR transfer to real-world graphs. This would assess the applicability of our findings to practical cases.
It should be noted that our study is insensitive to the non-uniformity of the generated dataset. While manipulations with this dataset may affect the global leaderboard, they cannot change the local leadership, which is the focus of the present work.