Community detection in node-attributed social networks: a survey

12/20/2019 ∙ by Petr Chunaev, et al.

Community detection is a fundamental problem in social network analysis. It consists in the unsupervised division of social actors (nodes in a social graph) with certain social connections (edges in a social graph) into densely knit and highly related groups, with each group well separated from the others. Classical approaches to community detection usually deal only with the network structure and ignore the features of its nodes (called node attributes), although many real-world social networks provide additional information about the actors, such as interests. It is believed that the attributes may clarify and enrich the knowledge about the actors and give meaning to the communities. This belief has motivated progress in developing community detection methods that use both the structure and the attributes of a network (i.e. deal with a node-attributed graph) to yield more informative and higher-quality results. During the last decade many such methods based on different ideas have appeared. Although partial overviews of them exist, a recent survey is a necessity, as the growing number of methods may cause repetitions in methodology and uncertainty in practice. In this paper we aim to describe and clarify the overall situation in the field of community detection in node-attributed social networks. Namely, we perform an exhaustive search of known methods and propose a classification of them based on when and how structure and attributes are fused. We not only describe each class but also provide the general technical ideas behind each method in the class. Furthermore, we pay attention to the available information on which methods outperform others and which datasets and quality measures are used for their evaluation. Based on the information collected, we draw conclusions on the current state of the field and point out several problems that seem important to resolve in the future.

1. Introduction and the problem statement

1.1. Overview

Community detection is a fundamental problem in social network analysis. Roughly speaking, it consists in the (unsupervised) division of social actors into densely knit and highly related groups, with each group well separated from the others. Classical approaches to community detection mainly deal only with the structure of social networks and ignore the features of the social network actors. There exists a variety of methods for this task (see [Fortunato2010]) which have shown their efficiency in multiple experiments (see [Leskovec2010, Lancichinetti2009]). However, real-world networks clearly provide more information about social actors than just the connections between them. Usually, the actors fill in their profiles with age, gender, interests and other information that is traditionally called node attributes. According to [Wasserman1994], attributes form the second dimension, besides the structural one, in social network representation. There are other classical approaches, such as k-means, which may use only node attributes to detect communities but completely ignore the links between social actors, thus not exploiting all available information. A reasonable generalisation of the methods of both types is given by those that take into account both network structure and actors' attributes. This is a relatively novel direction [Bothorel2015] in social network analysis. It is quite promising, as the simultaneous use of structure and attributes may clarify and enrich the knowledge about the social actors, give meaning to the detected communities and describe the forces that form them.

During the last decade many methods based on different ideas and techniques have appeared in this direction. Although there exist some partial overviews of them, especially in Related Works sections of published papers and in the survey [Bothorel2015] published in 2015, a recent summary of the subject is a necessity as the growing number of the methods may cause uncertainty in practice.

In this paper we aim to clarify the overall situation by proposing a clear classification of the methods and providing a comprehensive survey of the available results in the area. We not only group and analyse the corresponding methods but also focus on practical aspects, including information on which methods outperform others and which datasets and quality measures are used for evaluation.

To make the exposition more formal, we first provide the reader with necessary notation and state the problem of community detection in social networks whose actors are equipped with attributes.

1.2. Notation and the problem statement

Traditionally, social networks such as online social networks or citation networks are modelled as graphs whose vertices (nodes) represent social actors (users or authors) and whose edges represent the relations between the actors (friendships, subscriptions or co-authorships). Actors’ attributes (also known as node features or semantic vectors) may be thought of as multidimensional node-attached vectors whose elements contain certain features describing the actors. In what follows, networks whose actors’ features are available are called node-attributed or simply attributed. Clearly, edges as relations between actors may be of different types in real networks. Such networks can be represented via multilayer graphs (each layer containing a certain relation type), but in this paper, for the sake of simplicity, we confine ourselves to one-layer networks (graphs), i.e. to those with edges of one type. We do, however, mention some papers considering community detection in attributed multilayer networks in remarks below.

Let us move to the required definitions. We represent a node-attributed social network as the triple (called a node-attributed graph) $G = (V, E, A)$, where $V = \{v_i\}$ is the set of nodes (vertices) representing social actors, $E = \{e_{ij}\}$ the set of edges representing the existing relations between the actors ($e_{ij}$ is an edge between nodes $v_i$ and $v_j$) such that $E \subseteq V \times V$, and $A$ the set of attribute vectors $A(v_i) = \{a_k(v_i)\}$ associated with nodes in $V$ and describing their features. The size of $V$ is denoted by $|V|$ or $n$. The size of $E$ is denoted by $|E|$ or $m$. The dimension of attribute vectors is $d$. The domain of $a_k$, i.e. the set of possible values of this attribute, is denoted by $dom(a_k)$. In these terms, the $k$th attribute of node $v_i$ is referred to as $a_k(v_i)$. The notation introduced is summarised in Figure 1.
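The triple described above maps naturally onto a small container type. The following is only a minimal sketch: the class name `AttributedGraph`, its method names and the toy data are hypothetical, and it assumes an undirected graph in which every node carries an attribute vector of the same length.

```python
from dataclasses import dataclass, field


@dataclass
class AttributedGraph:
    """Node-attributed graph: a node set, an edge set and per-node attribute vectors."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)         # one frozenset({u, v}) per undirected edge
    attributes: dict = field(default_factory=dict)  # node -> list of attribute values

    def add_node(self, v, attrs):
        self.nodes.add(v)
        self.attributes[v] = list(attrs)

    def add_edge(self, u, v):
        if u not in self.nodes or v not in self.nodes:
            raise KeyError("both endpoints must already be nodes of the graph")
        self.edges.add(frozenset((u, v)))

    @property
    def num_nodes(self):
        return len(self.nodes)

    @property
    def num_edges(self):
        return len(self.edges)

    @property
    def dim(self):
        # Assumption: all attribute vectors share one length.
        return len(next(iter(self.attributes.values()))) if self.attributes else 0

    def attribute(self, v, k):
        """Value of the k-th attribute (0-based index) of node v."""
        return self.attributes[v][k]


# Toy network: three actors described by (age, main interest).
g = AttributedGraph()
g.add_node("ann", [23, "music"])
g.add_node("bob", [31, "chess"])
g.add_node("eve", [27, "music"])
g.add_edge("ann", "bob")
g.add_edge("ann", "eve")
print(g.num_nodes, g.num_edges, g.dim, g.attribute("eve", 1))  # 3 2 2 music
```

Storing each edge as a `frozenset` makes the pair unordered, so adding the same relation in the opposite direction does not create a second edge.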

Figure 1. Notation used for a node-attributed network in the paper.

By community detection in a node-attributed network (graph) $G$ (also known as attributed graph clustering) we mean the unsupervised division of the attributed graph into disjoint or overlapping subgraphs (equivalently, clusters or communities) $C_k = (V_k, E_k, A_k)$, with $k = 1, \dots, N$, such that all vertices are included in the resulting division, i.e.

$$V = \bigcup_{k=1}^{N} V_k,$$

and such that a certain balance between the following two properties is achieved:

  • structural closeness, i.e. nodes within a community are structurally close to each other, while nodes in different communities are not;

  • attribute homogeneity, i.e. nodes within a community have similar attributes, while nodes in different communities do not.

The basis for these properties is discussed in the forthcoming subsection.

Measures for structural closeness and attribute homogeneity may vary from method to method. One can evaluate the quality of community detection via different structure- and attribute-aware measures if no ground truth is available or compare the results with the ground truth, otherwise. We will mention the corresponding measures below.

Structure-attribute fusion techniques and community detection methods for node-attributed social networks are the subject of this paper. Related datasets and quality measures are also of interest to us.

In what follows, the structural (topological) information contained in $(V, E)$ is referred to as network structure or topology. The attribute (semantic) information in $(V, A)$ is referred to as network attributes or semantics.

1.3. Basis for structural closeness and attribute homogeneity. Fusing topology and semantics

Structural closeness is related to the classical concept of a (structural) community in terms of the density of structural connections. According to [Girvan2002], communities are thought of as subsets of nodes with dense connections within the subsets and sparse connections between them. [NewmanGirvan2004] adopts the intuition that nodes within the same community should be better connected than they would be by chance and uses it to create the famous Modularity measure, which became an influential tool for topology-based community detection in social networks [Bothorel2015]. Multiple modifications of Modularity and other structural measures have been proposed to overcome several limitations of Modularity [Chakraborty:2017:MCA:3135069.3091106], but Modularity is still a de facto standard in community detection. [NewmanGirvan2004] observes that in social networks Modularity typically falls in the range from about 0.3 to 0.7, but there is no particular value that separates good community structure from bad. In fact, any positive Modularity may indicate the presence of a structural community [Clauset2004], as opposed to the zero Modularity associated with a random graph. High Modularity implies the structural closeness of the nodes within communities.
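To make the "better connected than by chance" intuition concrete, the sketch below computes Modularity for a given partition of an undirected, unweighted graph as the sum, over communities, of the fraction of edges inside the community minus the squared fraction of edge endpoints attached to it. The function name `modularity` and the toy graph are illustrative assumptions; the sketch assumes that each edge is listed once and that there are no self-loops.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once
    community_of: dict mapping every node to its community label
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        raise ValueError("Modularity is undefined for a graph without edges")
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {v: "A" for v in range(6)}
print(round(modularity(edges, two_groups), 3))  # 0.357
print(round(modularity(edges, one_group), 3))   # 0.0
```

Splitting the graph at the bridge gives a clearly positive value, while placing all nodes in one community gives zero, which matches the interpretation of positive Modularity given above.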

The attribute homogeneity requirement is based on the social science finding (see e.g. [Marsden1993, McPherson2001, FioreDonath2005, KossinetsWatts2009]) that node attributes in social networks can reflect and affect community structure. The well-known principle of homophily in social networks states that like-minded social actors have a higher likelihood of being connected [McPherson2001]. Thus a community detection process that takes attribute homogeneity into account may provide results of better quality [Bothorel2015].

According to many experiments, e.g. [Moser2009, Ye2017, Sheikh2019, Cohn2001, Getoor2003] and many other papers cited in this survey, topology and semantics often provide complementary information, and thus combining them usually leads to better performance in community detection. For example, the semantics may compensate for the sparseness of a real network [Jia2017]. At the same time, topological information may be helpful if there are missing or noisy attributes [Sheikh2019]. As observed by [Ding2011], topology-only or semantic-only community detection is often not as effective as when both sources of information are used. On the other hand, some experiments (see e.g. [Akbas2017, Zhou2009]) suggest that this is not always true and that topology and semantics may be orthogonal and contradictory in some cases. Moreover, the relations between network topology and semantics may be highly non-linear [Wang2016]. Consequently, how one should use network topology and semantics together is a challenging problem.

1.4. Applications of community detection in attributed social networks

Community detection in node-attributed networks not only has obvious applications in marketing (recommender systems, targeted advertisements and user profiling) [Alamsyah2014], but can also effectively support multiple other advanced applications. First of all, it may be used for search engine optimization and spam detection [Ruan2013, Muslim2016]. Furthermore, community detection methods may help in counter-terrorist activities and in disclosing fraudulent schemes [Muslim2016]. There also exist applications related to the analysis of networks of a different nature: protein-protein interactions, genes, epidemics and other biological networks [Muslim2016].

Another area where the ideas of community detection in attributed networks are generally applied is document network clustering. Note that this direction historically precedes community detection and is methodologically rich. For example, in [Neville2003], one of the first papers on community detection in attributed social networks, the following document clustering methods are mentioned: HyPursuit [Weiss1996, Modha2000, He2001], PLSA-PHITS [Cohn2001], Community-User-Topic model [Zhou2006] and Link-PLSA-LDA [Nallapati2008]. Since that time many others have appeared, see e.g. the surveys [Nail2016, Aggarwal2012, Saiyad2016].

Clearly, methods from document network clustering can be adapted for community detection in attributed social networks. However, although social communities have a formal description similar to that of document clusters, the forces that form and drive them are internal and more complicated. What is more, it has been shown that some methods for community detection in attributed social networks outperform preceding methods for document network clustering. In particular, Inc-Cluster [Zhou2010] has been shown to outperform k-SNAP [Tian2008]; PCL-DC [Yang2009] to outperform PLSA-PHITS [Cohn2001], LDA-Link-Word [Erosheva2004] and Link-Content-Factorization [Zhu2007]; and CESNA [Yang2013] and ASCD [Qin2018] to outperform Block-LDA [Balasubramanyan2011]. (Throughout the text, methods and datasets covered by the survey are written in bold.) Taking this into account, we do not consider methods focused on document network clustering in the present survey.

1.5. Note on multilayer networks

Generally speaking, we do not aim to consider community detection methods for attributed multilayer networks (see e.g. [Kivela2014]), where different types of vertices and edges may be present at different layers. However, we mention some such methods from time to time in the corresponding remarks. Although node-attributed single-layer networks may be considered a particular case of multilayer ones (or, more generally, of feature-rich networks [Interdonato2019]), the latter require special analysis to take into account the heterogeneity of attributes, edges and vertices on different layers. A separate survey and an extensive comparative study of such methods is an independent and useful task (see partial overviews e.g. in [Boutemine2017, Interdonato2019, Kivela2014]).

1.6. Note on subspace-based clustering

In accordance with the above-mentioned definition of community detection in attributed social networks, we mainly confine ourselves in the survey to methods that can use the full attribute space and find communities covering the whole network. However, there is a large class of special methods that explore subspaces of attributes and/or find significant subgraphs of the network graph, e.g. GAMer [Gunnemann2010, Gunnemann2014], DB-CSC [Gunnemann2011], SSCG [Gunnemann2013], FocusCO [Perozzi2014-2] and ACM [Wu2018]. The main idea behind subspace-based (also known as projection-based) attributed graph clustering is that not all available semantic information is relevant for obtaining good-quality communities [Gunnemann2013-1, GunnemannBoden2013]. Therefore one has to somehow choose an appropriate attribute subspace to avoid the so-called curse of dimensionality (see [Bothorel2015, Section 3.2]) and reveal significant communities that would not be detected if all available attributes were considered.

To be precise, some of the methods that we discuss below partly use this idea, e.g. WCru [Cruz2011, CruzBathorelPoulet2012] (cf. the definition of a point of view in the papers), DVil [Villavialaneix2013], SCMAG [Huang2015], UNCut [Ye2017], DCM [Pool2014], etc., but can still work with the full attribute space. In any case, a separate survey on subspace-based attributed graph clustering methods would be a very valuable complement to the current survey.

2. Related works and main problems in the area

There is a variety of surveys and comparative studies considering community detection in social networks without attributes, in particular, [SCHAEFFER2007, Yang2016-Survey, Fortunato2010, Coscia2011]. By contrast, the survey [Bothorel2015] seems to be the only one on community detection in attributed social networks. Obviously, since it was published in 2015, many new methods adapting different techniques have appeared in the area. Furthermore, a large number of the methods that had been available before 2015 are not covered by [Bothorel2015], in particular, some based on objective function modification, non-negative matrix factorisation, probabilistic models, ensembles, etc. The technique-based classification of attributed graph clustering methods in [Bothorel2015] is also sometimes confusing. For example, CODICIL [Ruan2013], a method based on assigning attribute-aware weights to graph edges, is included not in [Bothorel2015, Section 3.2. Weight modification according to node attributes] but in [Bothorel2015, Section 3.7. Other methods]. Although [Bothorel2015] is a valuable and highly cited survey in the area, a recent survey of community detection methods for attributed social networks is clearly required.

Besides [Bothorel2015], almost every paper on the topic contains a Related Works section. It typically has a short survey of preceding approaches and an attempt to classify them. We observed that many authors are only partly aware of the corresponding bibliography, and this sometimes leads to repetitions in approaches. Furthermore, the multiple classifications (usually technique-based) are mostly incomplete and sometimes even contradictory.

Another big problem in the area is the lack of a comparative study of known methods (in terms of scalability, complexity and quality). Individual papers contribute little to this, as they usually compare their own method with only a few known ones (see Figures 2 and 3), and the whole picture remains unclear. In fact, we are unaware of any comprehensive unified comparison of different attributed graph clustering methods. One more issue, related to the previous one, is that authors use different datasets (of various size and nature) and quality measures to evaluate their methods, so that direct comparison becomes impossible. What is more, datasets and source code remain unavailable for comparison experiments in the majority of cases.

Figure 2. The directed graph of existing method-method comparisons. Nodes (shown only those with degree ) represent the methods classified in the present survey.
Figure 3. The most influential methods (i.e. the ones with which newer methods are most often compared) among the methods classified in the present survey are shown in red.

Facing the above-mentioned problems, in the current survey we not only collect the existing methods but also propose a unified classification of them based on the moment when the topology and semantics of the network are fused and used in the corresponding algorithm. We also focus on the experimental part, so that one can see which networks (with the corresponding dataset link) and quality measures are used in each paper and which methods were compared in each study. Besides this, we also provide the reader with a short description of the most influential and interesting methods for community detection in attributed social networks.

The survey covers papers published in journals and conference proceedings before the middle of 2019. As an exception, we sometimes note preprints available on arXiv.

3. Classification of community detection methods for attributed social networks

In previous works, the classification of methods for community detection in attributed social networks was done mostly with respect to the techniques used (e.g. distance-based or random walk-based). We partly follow this methodology at a lower level but at the upper level we group the methods by the moment when topology and semantics are used and fused in the method (with respect to the community detection (clusterisation) step), see Figure 4. Namely, we distinguish

  • early fusion methods that fuse topology and semantics before the clusterisation step,

  • simultaneous fusion methods that fuse topology and semantics during the clusterisation step, and

  • late fusion methods that fuse topology and semantics after the clusterisation step.

Within each fusion type, we also divide the methods into technique-used subclasses.

Figure 4. All the methods classified in the present survey. Early fusion (41%), simultaneous fusion (51%) and late fusion (8%) methods are shown in green, blue and red boxes, respectively. Edges indicate existing method-method comparisons.

A subclassification that is applied to some subclasses of early fusion methods is by the modification of the initial network topology (structure). In fact, the existing topology may be preserved or modified depending on the heuristics used; therefore we distinguish

  • fixed topology methods that use the existing network topology without modifying it with respect to the semantics, and

  • non-fixed topology methods that modify the existing network topology with respect to the semantics, in particular, add/erase edges and/or vertices.

It is important to distinguish between these cases, as each one leads to certain advantages or disadvantages. For example, suppose one assigns edge weights between all pairs of nodes in the network, even where there are no structural connections (i.e. considers a non-fixed topology), and then removes edges with tiny weights. In the social network setting this may lead to the following: (a) nodes representing social actors who are highly related in terms of semantics may have negligible social connections, so that the resulting connection may seem unrealistic; (b) one may erase too many important connections. At the same time, the initial “fixed” topological structure of a network may be sparse or noisy, so that some kind of enrichment is required. In any case, a proper balance between purely non-fixed and fixed topologies is usually necessary.
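The non-fixed topology scenario described above can be reproduced on a toy example. The sketch below is hypothetical and is not taken from any of the surveyed methods: it assumes that the weight of a node pair is a convex combination of an edge indicator and the cosine similarity of the attribute vectors, and the names `rewire`, `alpha` and `threshold` are introduced only for illustration.

```python
from itertools import combinations
from math import sqrt


def cosine(x, y):
    """Cosine similarity of two numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def rewire(nodes, edges, attrs, alpha=0.25, threshold=0.5):
    """Weight every node pair, then keep only the pairs reaching the threshold.

    Assumed weight: alpha * [edge exists] + (1 - alpha) * cosine similarity.
    """
    linked = {frozenset(e) for e in edges}
    kept = {}
    for u, v in combinations(sorted(nodes), 2):
        structural = 1.0 if frozenset((u, v)) in linked else 0.0
        weight = alpha * structural + (1 - alpha) * cosine(attrs[u], attrs[v])
        if weight >= threshold:
            kept[(u, v)] = weight
    return kept


nodes = ["a", "b", "c"]
edges = [("a", "b")]  # the only real social connection
attrs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.0]}
result = rewire(nodes, edges, attrs)
print(result)  # {('a', 'c'): 0.75}
```

With these settings the only real connection (a, b) is erased, which is effect (b), while the unconnected but semantically identical pair (a, c) becomes an edge, which is effect (a). Raising `alpha` shifts the balance back towards the fixed topology.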

As we have already mentioned, the lowest level of classification is by fusion technique. For example, by “weight-based methods” we mean those that form a weighted graph while fusing topology and semantics. Some of these methods then apply weighted graph clusterisation algorithms (which is reasonable), while others transform the graph into a distance matrix and use distance-based clusterisation; the latter are nevertheless still called “weight-based”. On the other hand, “distance-based methods” are so called because they produce a distance matrix at the fusion step.
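As an illustration of what producing a distance matrix at the fusion step may look like, the sketch below combines a normalised hop-count distance with a normalised Euclidean attribute distance. This particular convex combination, the parameter `alpha` and the function names are assumptions made for the example, not the definition used by any specific surveyed method.

```python
from collections import deque
from itertools import combinations
from math import dist


def hop_counts(nodes, edges):
    """Breadth-first search from every node; returns {source: {target: hops}}."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    table = {}
    for s in nodes:
        seen = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen[w] = seen[u] + 1
                    queue.append(w)
        table[s] = seen
    return table


def fused_distances(nodes, edges, attrs, alpha=0.5):
    """Pairwise distance: alpha * structural part + (1 - alpha) * attribute part."""
    nodes = sorted(nodes)
    pairs = list(combinations(nodes, 2))
    hops = hop_counts(nodes, edges)
    # Assumption: disconnected pairs are placed one hop beyond the longest finite path.
    longest = max((hops[u][v] for u, v in pairs if v in hops[u]), default=0)
    structural = {(u, v): hops[u].get(v, longest + 1) for u, v in pairs}
    semantic = {(u, v): dist(attrs[u], attrs[v]) for u, v in pairs}
    # Scale both parts to [0, 1]; "or 1" avoids division by zero.
    s_max = max(structural.values(), default=0) or 1
    a_max = max(semantic.values(), default=0) or 1
    return {p: alpha * structural[p] / s_max + (1 - alpha) * semantic[p] / a_max
            for p in pairs}


# Path a - b - c, where a and c share the same attribute value.
D = fused_distances(["a", "b", "c"], [("a", "b"), ("b", "c")],
                    {"a": [0.0], "b": [2.0], "c": [0.0]})
print(D)  # {('a', 'b'): 0.75, ('a', 'c'): 0.5, ('b', 'c'): 0.75}
```

In the toy example, nodes a and c are two hops apart but end up closest after fusion because their attributes coincide. The resulting dictionary can be passed to any clusterisation routine that accepts precomputed pairwise distances.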

4. Most used attributed social networks and quality measures

4.1. Attributed social networks

It can be observed that “social networks” in many papers mean not only real social networks (like Google+, Facebook, Twitter) but also citation networks (like DBLP and CiteSeer). In fact, citations and blogs are the most popular examples in experiments, while real social networks (say, with friendship connections between users) are not.

By small, medium and large networks we mean those with $< 10^3$, from $10^3$ to $10^5$ and $> 10^5$ nodes, respectively. The most popular datasets used in experiments on community detection in node-attributed social networks are collected in Tables 1, 2 and 3. (An interested reader can find other attributed network datasets at Mark Newman page, HPI Information Systems Group, LINQS Statistical Relational Learning Group, Stanford Large Network Dataset Collection, Laboratory of Cell Trafficking and Signal Transduction, University of Verona, Marc Plantevit page, Tore Opsahl page, UCINET networks, Interactive Scientific Network Data Repository, Citation Network Dataset.) Below, the datasets used for the evaluation of each method are shown in Dataset columns. Recall that if a dataset name is written in bold, its description can be found in Tables 1, 2 or 3. Note also that different papers may in fact use other versions of the networks from Tables 1, 2 and 3; we mark such datasets by *. For example, a DBLP dataset with numbers of nodes and edges different from those of the described DBLP10K and DBLP84K is denoted by DBLP*.

In most cases, the attributes suitable for the methods discussed in the survey are represented by continuous numerical vectors. If one deals, say, with nominal, textual or graphical attributes, it is common to use TF-IDF or other similar frameworks to obtain continuous numerical vectors instead.
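For textual attributes, the conversion mentioned above can be sketched as follows. The weighting used here (relative term frequency multiplied by the logarithm of the inverse document frequency) is one common TF-IDF variant chosen as an assumption; the function name `tfidf_vectors` and the toy profiles are hypothetical.

```python
from collections import Counter
from math import log


def tfidf_vectors(docs):
    """Turn per-node token lists into numeric attribute vectors.

    docs: dict mapping node -> list of tokens (e.g. words from a user profile)
    Returns (vocabulary, dict mapping node -> list of TF-IDF weights).
    """
    vocab = sorted({t for tokens in docs.values() for t in tokens})
    n_docs = len(docs)
    # Document frequency: in how many nodes' texts each term occurs.
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    vectors = {}
    for node, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty profiles
        vectors[node] = [(counts[t] / total) * log(n_docs / df[t]) for t in vocab]
    return vocab, vectors


profiles = {
    "u1": ["graph", "cluster", "graph"],
    "u2": ["graph", "music"],
    "u3": ["music", "guitar"],
}
vocab, vectors = tfidf_vectors(profiles)
print(vocab)  # ['cluster', 'graph', 'guitar', 'music']
print([round(w, 3) for w in vectors["u1"]])
```

Every node receives a vector of the same length as the vocabulary, and a term that occurs in every profile gets zero weight, so it does not contribute to attribute similarity.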

Network Description Source
Political Books All books in this dataset are about U.S. politics; they were published during the 2004 presidential election and sold by Amazon.com. An edge between two books means that the books are frequently bought together by customers. Each book has only one attribute, termed political persuasion, with three values: 1) conservative; 2) liberal; and 3) neutral. Link
WebKB A classified network of 877 webpages (nodes) and 1608 hyperlinks (edges) gathered from the Web sites of four different universities (Cornell, Texas, Washington, and Wisconsin). Each webpage is associated with a binary vector whose elements take the value 1 if the corresponding word from the vocabulary is present in that webpage, and 0 otherwise. The vocabulary consists of 1703 unique words. Nodes are classified into five classes: course, faculty, student, project, or staff. Link [Craven1998]
Twitter A collection of several tweet networks: 1) Politics-UK dataset is collected from the Twitter accounts of 419 Members of Parliament in the United Kingdom in 2012. Each user has 3614-dimensional attributes, including a list of words repeated more than 500 times in their tweets. The accounts are assigned to five disjoint communities according to their political affiliation. 2) Politics-IE dataset is collected from 348 Irish politicians and political organizations; each user has 1047-dimensional attributes. The users are distributed into seven communities. 3) Football dataset contains 248 English Premier League football players active on Twitter, who are assigned to 20 disjoint communities, each corresponding to a Premier League club. 4) Olympics dataset contains 464 users (athletes and organizations) involved in the London 2012 Summer Olympics. The users are grouped into 28 disjoint communities, corresponding to different Olympic sports. Link 1 Link 2 [Greene2013]
Lazega A corporate law partnership in a Northeastern US corporate law firm; possible attributes: status (1: partner; 2: associate), office (1: Boston; 2: Hartford; 3: Providence); 71 nodes and 575 edges [Lazeda2001]
Research A research team of employees in a manufacturing company; possible attributes: location (1: Paris; 2: Frankfurt; 3: Warsaw; 4: Geneva), tenure (1: 1–12 months; 2: 13–36 months; 3: 37–60 months; 4: 61+ months); 77 nodes and 2228 edges [Cross2004]
Consult The relationships between employees in a consulting company; possible attributes: organisational level (1: Research Assistant; 2: Junior Consultant; 3: Senior Consultant; 4: Managing Consultant; 5: Partner), gender (1: male; 2: female); 46 nodes and 879 edges [Cross2004]
Table 1. Most popular small real-world attributed social networks
Network Description Source
Political Blogs A non-classified network of 1,490 weblogs (nodes) on US politics with 19,090 hyperlinks (edges) between the weblogs. Each node has an attribute describing its political leaning as either liberal or conservative (represented by 0 and 1). Link [Adamic2005]
DBLP10K A non-classified co-author network extracted from DBLP Bibliography (four research areas of database, data mining, information retrieval and artificial intelligence) with 10,000 authors (nodes) and their co-author relationships (edges). Each author is associated with two relevant categorical attributes: prolific and primary topic. For attribute “prolific”, authors with $\geq 20$ papers are labelled as highly prolific; authors with $\geq 10$ and $< 20$ papers are labelled as prolific and authors with $< 10$ papers are labelled as low prolific. Node-attribute values for “primary topic” (100 research topics) are obtained via topic modelling. Each extracted topic consists of a probability distribution of keywords which are most representative of the topic. Link [Zhou2010]
DBLP84K A larger non-classified co-author network extracted from DBLP Bibliography (15 research areas of database, data mining, information retrieval, artificial intelligence, machine learning, computer vision, networking, multimedia, computer systems, simulation, theory, architecture, natural language processing, human-computer interaction, and programming language) with 84,170 authors (nodes) and their co-author relationships (edges). Each author is associated with two relevant categorical attributes: prolific and primary topic, defined in a similar way as in DBLP10K. Link [Zhou2010]
Cora A classified network of machine learning papers with 2,708 papers (nodes) and 5,429 citations (edges). Each node is attributed with a 1,433-dimension binary vector indicating the absence/presence of words from the dictionary of words collected from the corpus of papers. The papers are classified into 7 subcategories: case-based reasoning, genetic algorithms, neural networks, probabilistic methods, reinforcement learning, rule learning and theory. Link 1 Link 2 [Sen2008]
CiteSeer A classified citation network in the field of machine learning with 3,312 papers (nodes) and 4,732 citations (edges). Each node is attributed with a binary vector indicating the absence/presence of the corresponding words from the dictionary of the 3,703 words collected from the corpus of papers. Papers are classified into 6 classes. Link 1 Link 2 [Sen2008]
Sinanet A classified microblog user relationship network extracted from the sina-microblog website (http://www.weibo.com) with 3,490 users (nodes) and 30,282 relationships (edges). Each node is attributed with 10-dimensional numerical attributes describing the interests of the user. Link [Jia2017]
PubMed Diabetes A classified citation network extracted from the PubMed database pertaining to diabetes. It contains 19,717 publications (nodes) and 44,338 citations (edges). Each node is attributed with a TF-IDF weighted word vector from a dictionary that consists of 500 unique words. Link
Facebook100 A non-classified Facebook users network with 6,386 users (nodes) and 435,324 friendships (edges). The network is gathered from Facebook users of 100 colleges and universities (e.g. Caltech, Princeton, Georgetown and UNC Chapel Hill) in September 2005. Each user has the following attributes: ID, a student/faculty status flag, gender, major, second major/minor (if applicable), dormitory(house), year and high school. Link [Traud2012, Traud2011]
ego-Facebook Dataset consists of ’circles’ (’friends lists’) from Facebook with 4039 nodes and 88234 edges. Facebook data was collected from survey participants using a Facebook app. The dataset includes node features (profiles), circles, and ego networks. Link [Leskovec2012]
LastFM A network gathered from the online music system Last.fm with 1,892 users (nodes) and 12,717 friendships on Last.fm (edges). Each node has 11,946-dimensional attributes, including a list of most listened music artists, and tag assignments. Link
Delicious A network of 1,861 nodes, 7,664 edges and 1,350 attributes. This is a publicly available dataset from the HetRec 2011 workshop that has been obtained from the Delicious social bookmarking system. Its users are connected in a social network generated from Delicious mutual fan relations. Each user has bookmarks, tag assignments, that is, [user, tag, bookmark] tuples, and contact relations within the social network. The tag assignments were transformed to attribute data by taking all tags that a user ever assigned to any bookmark and assigning those to the user. Link
Wiki A network with nodes as web pages. The link among different nodes is the hyperlink in the web page. 2,405 nodes, 12,761 edges, 4,973 attributes, 17 labels Link
ego-Twitter This dataset consists of ’circles’ (or ’lists’) from Twitter. Twitter data was crawled from public sources. The dataset includes node features (profiles), circles, and ego networks. Nodes 81306, Edges 1768149 Link [Leskovec2012]
Table 2. Most popular medium real-world attributed social networks
Network Description Source
Flickr A network with 100,267 nodes, 3,781,947 edges and 16,215 attributes collected from the internal database of the popular Flickr photo sharing platform. The social network is defined by the contact relation of Flickr. Two vertices are connected with an undirected edge if at least one directed edge exists between them. Each user has an associated list of tags that he or she has used at least five times. Tags are limited to those used by at least 50 users. Users are limited to those having a vocabulary of more than 100 and less than 5,000 tags. [Ruan2013] A version of the dataset
Patents A patent citation network with vertices representing patents and edges depicting the citations between. A subgraph containing all the patents from the year 1988 to 1999. Each patent has six attributes, grant year, number of claims, technological category, technological subcategory, assignee type, and main patent class. There are 1,174,908 vertices and 4,967,216 edges in the network. Link Larger dataset
ego-G+ This dataset consists of ’circles’ from Google+. Google+ data was collected from users who had manually shared their circles using the ’share circle’ feature. The dataset includes node features (profiles), circles, and ego networks. Nodes 107,614, Edges 13,673,453. Each node has four features: job title, current place, university, and workplace. A user pair (edge) is compared using knowledge graphs based on Category:Occupations, Category:Companies by country and industry, Category:Countries, Category:Universities and colleges by country. Link [Leskovec2012]
Table 3. Most popular large real-world attributed social networks

4.2. Measures for community detection quality

Given the set of detected communities (overlapping or not), one needs to evaluate their quality. There are two possible options depending on the network under consideration. If the network has no ground truth, one can measure structural closeness and attribute homogeneity directly. According to our observations, the most popular quality measures in this case are Modularity and Density for the former and Entropy for the latter. Many others, such as Conductance, Within-Cluster Sum of Squares and Intra-cluster distance, are also possible. If there is ground truth, it is sometimes reasonable to compare the detected communities with the known ones. This can be done, for instance, with the following popular measures: Accuracy, Normalised Mutual Information (denoted below by NMI), Adjusted Rand Index and Rand Index (denoted below by ARI and RI, respectively) and F-measure.

Due to space limitations, we refer the reader to the comprehensive survey [Chakraborty:2017:MCA:3135069.3091106] and to [Bothorel2015, Sections 2.2 and 4], where all the above-mentioned evaluation metrics and many others are precisely defined and discussed in detail.

5. Early fusion methods

These methods aim to fuse topological and semantic information before the clusterisation step so that the data obtained at the fusion step is suitable for conventional clusterisation methods.

Figure 5. The scheme of the weight-based methods (the attribute-aware weights are here calculated with the normalised matching coefficient).

5.1. Weight-based methods

The main characteristic of these methods is that the semantics is used to assign weights to the edges of the network graph (the topology may be fixed or non-fixed), see Figure 5. The resulting weighted graph can then be clustered by an algorithm for weighted graphs such as Weighted Louvain [Blondel2008], which takes the adjacency matrix of the weighted graph as input. Several algorithms instead compute a distance matrix for the weighted graph and apply distance-based clusterisation algorithms such as $k$-means and $k$-medoids. In other words, weight-based methods dispose of the attributes as a separate data source by storing the attribute information inside the structure, namely, on the edges of the graph.

The weights are usually assigned on edges as follows:

$$W_\alpha(v_i,v_j)=\alpha\,w_S(v_i,v_j)+(1-\alpha)\,w_A(v_i,v_j),\qquad \alpha\in[0,1], \tag{5.1}$$

where $w_S$ and $w_A$ are the chosen topological and semantic similarity functions for nodes $v_i$ and $v_j$, respectively. The parameter $\alpha$ controls the balance between the topological and semantic components so that $\alpha=1$ corresponds to the pure topological case and $\alpha=0$ to the pure semantic one. Generally speaking, one may introduce a non-linear fusing function instead of (5.1); however, such a choice clearly complicates the model and thus requires reasonable justification.

The fixed topology assumes that only existing edges are assigned the weight (5.1), while the non-fixed one assigns weights to the edges of the complete graph on the node set of $G$. In the latter case, usually $w_S(v_i,v_j)=0$ for edges that did not exist in the initial graph $G$, so that their weight is generated only by the semantic similarity component.

A very popular approach in the fixed-topology case is assuming $\alpha=0$ in (5.1), see Table 4. The same effect is achieved by letting $w_S$ take the same value on all existing edges. In both cases the weights are effectively based only on the semantic similarity. Clearly, this may lead to the dominance of the semantic component and to breaking the initial structural connection between nodes with dissimilar attributes.

The approach with $0<\alpha<1$ in (5.1), see Table 5, seems more adequate than the purely semantic one as it explicitly allows controlling the impact of both components. For unweighted graphs $G$, it is common to put $w_S(v_i,v_j)=1$ if the edge between $v_i$ and $v_j$ exists in $G$, and $w_S(v_i,v_j)=0$ otherwise.

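As for the semantic similarity $w_A$, there are several popular measures. Assume that we are given two $d$-dimensional attribute vectors $A(v_i)=(A_1(v_i),\dots,A_d(v_i))$ and $A(v_j)=(A_1(v_j),\dots,A_d(v_j))$. One can define $w_A(v_i,v_j)$ based on

  • the (normalised) matching coefficient

    $$w_A(v_i,v_j)=\frac{1}{d}\sum_{k=1}^{d}\delta\big(A_k(v_i),A_k(v_j)\big),\qquad \delta(a,b)=\begin{cases}1, & a=b,\\ 0, & a\neq b,\end{cases} \tag{5.2}$$
  • the cosine similarity

    $$w_A(v_i,v_j)=\frac{A(v_i)\cdot A(v_j)}{\|A(v_i)\|_2\,\|A(v_j)\|_2}, \tag{5.3}$$
  • Jaccard similarity coefficient

    $$w_A(v_i,v_j)=\frac{|A(v_i)\cap A(v_j)|}{|A(v_i)\cup A(v_j)|}\quad\text{or}\quad w_A(v_i,v_j)=\frac{\sum_{k=1}^{d}\min\{A_k(v_i),A_k(v_j)\}}{\sum_{k=1}^{d}\max\{A_k(v_i),A_k(v_j)\}}, \tag{5.4}$$

    where $A(v_i)$ and $A(v_j)$ are thought of as sets in the former case and as vectors with non-negative real values in the latter one,

  • Minkowski similarity

    $$w_A(v_i,v_j)=\frac{1}{1+\|A(v_i)-A(v_j)\|_p},\qquad \|x\|_p=\Big(\sum_{k=1}^{d}|x_k|^p\Big)^{1/p}, \tag{5.5}$$

    where one gets the city block norm in the denominator if $p=1$ and the Euclidean norm if $p=2$.

The choice of $w_A$ is usually unclear and is determined by the authors' preferences. Moreover, we are unaware of any systematic comparison of the above-mentioned measures for semantic similarity.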
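As a rough illustration, the measures (5.2)-(5.4) and a linear fusion of the kind described by (5.1) can be sketched in a few lines of Python. This is a minimal sketch rather than code from any surveyed method: the function names, the toy graph, the convention that `alpha = 1` means topology only and `alpha = 0` means attributes only, and the choice of a fixed topology with topological similarity 1 on every existing edge are all assumptions made here for illustration.

```python
import math


def matching(a, b):
    """Normalised matching coefficient, cf. (5.2): share of equal coordinates."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cosine(a, b):
    """Cosine similarity, cf. (5.3); returns 0.0 if one of the vectors is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard coefficient for non-negative real vectors, cf. (5.4)."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 1.0


def fused_weight(w_topo, w_sem, alpha):
    """Convex combination of topological and semantic similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * w_topo + (1.0 - alpha) * w_sem


def weight_edges(edges, attrs, alpha, semantic=matching):
    """Fixed topology (assumed): only existing edges are weighted, w_topo = 1."""
    return {
        (u, v): fused_weight(1.0, semantic(attrs[u], attrs[v]), alpha)
        for u, v in edges
    }


# Hypothetical toy data: three nodes with binary attribute vectors.
attrs = {0: (1, 0, 1), 1: (1, 1, 1), 2: (0, 1, 0)}
edges = [(0, 1), (1, 2)]
weights = weight_edges(edges, attrs, alpha=0.5)
```

The resulting dictionary `weights` can be passed to any clustering routine for weighted graphs; swapping `semantic=cosine` or `semantic=jaccard` changes only the attribute component.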

| Algorithm | Input for / Method of clusterisation | Number of clusters as input / Clusters overlap | Network size | Evaluation | Topology | Databases | Other attributed network clusterisation algorithms compared with |
|---|---|---|---|---|---|---|---|
| WNev [Neville2003] | Weighted graph / MinCut [Karger1993], MajorClust [SteinNiggemann1999], Spectral [ShiMalik2000] | No/No | Small | Accuracy | Fixed | Synthetic | |
| WSte1 [Steinhaeuser] | Weighted graph / Threshold | No/No | Large | Modularity | Fixed | Phone Network [Madey2007] | |
| WSte2 [Steinhaeuser2010] | Similarity matrix (via weighted graph and random walks) / Hierarchical clustering [Johnson1967, Fred2002] | No/No | Large | Modularity | Fixed | Phone Network [Madey2007] | |
| WCom1 [Combe2012] | Weighted graph / Weighted Louvain [Blondel2008] | Yes/No | Small | Accuracy | Fixed | DBLP* | WCom2 [Combe2012], DCom [Combe2012] |
| WCom2 [Combe2012] | Distance matrix (via weighted graph) / Hierarchical agglomerative clustering | Yes/No | Small | Accuracy | Fixed | DBLP* | WCom1 [Combe2012], DCom [Combe2012] |
| AA-Cluster [Akbas2017, Akbas2019] | Node embeddings (via weighted graph) / $k$-medoids | Yes/No | Small, Medium, Large | Density, Entropy | Fixed | Political Blogs, DBLP*, Patents*, Synthetic | SA-Cluster [Zhou2009], BAGC [Xu2014], CPIP [Liu2015] |
| PWMA-MILP [Alinezhad2019] | Weighted graph / Linear programming MILP [Alinezhad2019] | No/No | Small | RI, NMI | Fixed | WebKB | |
| KDComm [Bhatt2019] | Weighted graph / Iterative Weighted Louvain | No/No | Small, Medium, Large | F-measure, Jaccard measure, Rank Entropy measure | Fixed | ego-G+, Twitter*, DBLP* [Jia2017], Reddit link | CPIP [Liu2015], JCDC [Zhang2016], UNCut [Ye2017], SI [Newman2015] |

Table 4. Weight-based approaches with $\alpha=0$ (weights based only on the semantic similarity) in (5.1)
Algorithm in (5.1) Input for / Method of Clusterisation Number of clusters as input/ Clusters overlap Network size Evaluation Topology Databases Other attributed network clusterisation algorithms compared with
WWan [Wang2010] in theory
in experiments
Edge similarity matrix (via weighted graph)
EdgeCluster [Tang2009]
($k$-means variant)
Yes Small NMI
Micro-F1
Macro-F1
Non-fixed: removing edges Synthetic
BlogCatalog
Delicious
Non-overlapping co-clustering [Dhillon2003]
SAC2 [DangViennet2012] $k$NN (unweighted) graph (via weighted graph)
(Unweighted) Louvain [Blondel2008]
No/ No Small
Medium
Density
Entropy
Non-fixed: removing edges Political Blogs
Facebook100
DBLP10K
SAC1 [DangViennet2012]
WSte2 [Steinhaeuser2010]
Fast greedy [Clauset2004] for weighted graph
WCru [Cruz2011, CruzBathorelPoulet2012] in theory
Not specified in experiments
Weighted graph
Weighted Louvain [Blondel2008]
No Medium Modularity
Intracluster distance
Fixed Twitter
CODICIL [Ruan2013] in theory
in some experiments
Weighted graph
Metis [Karypis1998]
Markov Clustering [Satuluri2009]
No Small
Medium
Large
F-measure Non-fixed: adding and removing edges CiteSeer*
Flickr*
Wikipedia*
Inc-Cluster [Zhou2010]
PCL-DC [Yang2009]
Link-PLSA-LDA [Nallapati2008]
WMen [Meng2018] Not specified Weighted graph/Distance matrix for the weighted graph
SLPA [Xie2012]
Weighted Louvain [Blondel2008]
K-medoids [Yu2018]
Yes-No/
Yes-No
Small
Medium
NMI
F-measure
Accuracy
Fixed Lazega
Research
Consult
LFR benchmark [Lancichinetti2008]
CODICIL [Ruan2013]
SA-Cluster [Zhou2009]
PLCA-MILP [Alinezhad2019] Weighted graph
Linear programming MILP [Alinezhad2019]
No/No Small RI
NMI
Non-fixed: adding and removing edges WebKB SCD [Li2017]
ASCD [Qin2018]
SCI [Wang2016]
PCL-DC [Yang2009]
Block-LDA [Balasubramanyan2011]
kNN-enhance [Jia2017] May be thought as , NN by semantics Distance matrix (of the augmented graph)
$k$NN
$k$-means
No/No Medium Accuracy
NMI
F-Measure
Modularity
Entropy
Non-fixed: adding edges Cora
Citeseer
Sinanet
PubMed Diabetes
DBLP*
PCL-DC [Yang2009]
PPL-DC [Yang2010]
PPSB-DC [Chai2013]
CESNA [Yang2013]
cohsMix [Zanghi2010]
BAGC [Xu2012]
GBAGC [Xu2014]
SA-Cluster [Zhou2009]
Inc-Cluster [Zhou2010]
CODICIL [Ruan2013]
GLFM [Li2011]
IGC-CSM [Nawaz2015] source in theory
in comparison experiments
Distance matrix for the weighted graph
$k$-Medoids
Yes/ No Medium Density
Entropy
Fixed Political Blogs
DBLP10K
SA-Cluster [Zhou2009]
SA-Cluster-Opt [Cheng2011]
AGPFC [He2019] in theory, manually tuned in experiments Fuzzy equivalent matrix
-cut set method
No/Yes Small
Medium
Density
Entropy
Fixed Political Blogs
CiteSeer

Cora
WebKB
SA-Cluster [Zhou2009]
BAGC [Xu2012]
NMLPA [Huang2019] Weighted graph
A multi-label propagation algorithm
Yes/ Yes Medium F-score
Jaccard Similarity
Fixed ego-Facebook
Flickr*
[Ruan2013]
ego-Twitter
CESNA [Yang2013]
SCI [Wang2016]
CDE [Li2018]
Table 5. Weight-based approaches with in (5.1)

Now let us describe CODICIL [Ruan2013] and SAC2 [DangViennet2012], the most influential weight-based methods according to Figure 3.

5.1.1. CODICIL

The method CODICIL [Ruan2013] assigns semantic weights between all the nodes in $G$, i.e. employs the non-fixed topology scheme. To decrease complexity, for each node $v_i$ only the $k$ nodes with the highest cosine similarity to $v_i$ are selected as its top-$k$ neighbours, so that the semantic similarity for $v_i$ and $v_j$ is essentially (5.3). The topological similarity weight for two nodes is defined through the relative overlap of their respective structural neighbours; approximations of (5.3) and (5.4) are used for this purpose. Then the topological and semantic weights are combined similarly to (5.1). After that, a biased edge sampling procedure is applied that retains the edges locally relevant to each node (in other words, the edges with the highest similarity values). This makes the weighted graph sparse and enables both better runtime performance and lower memory usage in the subsequent community detection step, which is performed by Metis [Karypis1998] or Multi-level Regularised Markov Clustering [VanDongen2000, Satuluri2009]. The complexity of CODICIL is .
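The first step above, creating candidate content edges from the most similar nodes by cosine similarity and merging them with the structural edges, can be sketched as follows. This is only an illustrative brute-force sketch under our own assumptions: the names `top_k_content_edges` and `candidates`, the toy attribute vectors and the exhaustive pairwise comparison are hypothetical, and neither the biased edge sampling nor the final clustering of CODICIL is reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two attribute vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_content_edges(attrs, k):
    """For every node, link it to its k most cosine-similar nodes."""
    content = set()
    for u in attrs:
        others = [v for v in attrs if v != u]
        others.sort(key=lambda v: cosine(attrs[u], attrs[v]), reverse=True)
        for v in others[:k]:
            content.add((min(u, v), max(u, v)))  # store as an undirected edge
    return content


# Hypothetical toy data: nodes 0, 1 and nodes 2, 3 have similar attributes.
attrs = {0: (1.0, 0.0), 1: (1.0, 0.1), 2: (0.0, 1.0), 3: (0.1, 1.0)}
structural = {(1, 2)}
content = top_k_content_edges(attrs, k=1)
candidates = structural | content  # non-fixed topology: edges are added
```

Here the union `candidates` contains edges absent from the original graph, which is exactly what makes the topology non-fixed.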

5.1.2. SAC2

The method SAC2 [DangViennet2012] uses (5.2) as the semantic similarity $w_A$ for discrete attributes and (5.5) with $p=2$ if they are continuous. Textual attributes are first transformed into numeric values by the TF-IDF procedure; the corresponding $w_A$ is then (5.3) or (5.4). The obtained $w_A$ are used to assign the weights (5.1), where the topological similarity is $w_S(v_i,v_j)=1$ if $v_i$ and $v_j$ are directly connected and $w_S(v_i,v_j)=0$ otherwise. After that, the weights are used to construct an (unweighted) $k$-nearest neighbour graph, i.e. a directed graph in which each node has exactly $k$ edges connecting it to its $k$ most similar neighbours in $G$ (thus the topology is non-fixed). The parameter $k$ is set equal to the average node degree in $G$. The version [Dong2011] of the $k$NN algorithm with the empirical cost $O(n^{1.14})$ is applied to reduce complexity. At the community detection step, the Louvain algorithm [Blondel2008] is applied to find communities in the $k$NN graph.
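The construction of the nearest neighbour graph can be illustrated with a small brute-force sketch. It is not the SAC2 implementation: the exhaustive search stands in for the approximate algorithm of [Dong2011], the balance parameter `alpha`, the matching coefficient as semantic similarity, the integer (floor) average degree used for `k` and all names are assumptions made for this example.

```python
def matching(a, b):
    """Share of attribute positions on which two nodes agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def knn_graph(nodes, edges, attrs, alpha, k):
    """Directed graph: every node points to its k most similar nodes."""
    linked = {frozenset(e) for e in edges}

    def weight(u, v):
        # Topological part: 1 for directly connected nodes, 0 otherwise.
        w_topo = 1.0 if frozenset((u, v)) in linked else 0.0
        return alpha * w_topo + (1.0 - alpha) * matching(attrs[u], attrs[v])

    graph = {}
    for u in nodes:
        ranked = sorted((v for v in nodes if v != u),
                        key=lambda v: weight(u, v), reverse=True)
        graph[u] = ranked[:k]
    return graph


# Hypothetical toy data: a path 0-1-2-3 with two attribute profiles.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3)]
attrs = {0: ("a", "x"), 1: ("a", "x"), 2: ("b", "y"), 3: ("b", "y")}
k = max(1, (2 * len(edges)) // len(nodes))  # floor of the average degree
neighbours = knn_graph(nodes, edges, attrs, alpha=0.5, k=k)
```

In this toy case the structural edge between nodes 1 and 2 disappears from the neighbour lists because the attributes of its endpoints disagree, which shows how edges can be removed under a non-fixed topology.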

Remark 1.

The authors of [Berlingerio2011] consider attributed multilayer networks with different types of edges and use a similarity measure similar to (5.2) to flatten the network and put the corresponding weights on the resulting single-layer network. After this fusion, any weighted graph clustering algorithm, e.g. Weighted Louvain [Blondel2008], is actually suitable for community detection. In [Papadopoulos2015], the method called CAMIR is proposed for clustering attributed multilayer networks; it assigns different weights to each attribute and edge type. In particular, it ranks vertex properties by exploiting the information from edge types and attributes and further constructs a unified similarity matrix (taking into account all edge types and attributes). The clusterisation step is performed via spectral clustering.

Remark 2.

SANS [Parimala2015] works with weighted directed graphs (which are out of the scope of the present survey) using the matching coefficient (5.2) for semantic similarity and the so-called Weight Index (the sum of weights of incoming and outgoing edges) for topological similarity in a version of (5.1). SANS automatically determines the number of clusters via centroids and uses the threshold algorithm for clustering.

Remark 3.

Edge weighting similar to (5.1) is also applied in FocusCO [Perozzi2014-2]. Although it is not a purely unsupervised clustering approach (it requires the user's preferences on focus attributes), it allows one to solve two interesting problems simultaneously: the extraction of focused local clusters and the detection of outliers in an attributed network.

Remark 4.

Let us mention that there exist approaches that are ideologically similar (attributes → edge weights → embeddings → $k$-means) but precede the recent algorithm AA-Cluster [Akbas2017, Akbas2019]. For example, for a given network with numerical vector attributes, GraphEncoder [Tian2014] and GraRep [Cao2015] first obtain edge weights (5.1) based only on the semantic similarity, taken to be the cosine similarity (5.3), and then apply different techniques (a sparse autoencoder in [Tian2014] and matrix factorization in [Cao2015], partly based on the ideas of skip-gram [Mikolov2013] and DeepWalk [Perozzi2014]) to obtain embeddings (low-dimensional vector representations) for the nodes of the weighted graph. The resulting embeddings are further fed to the $k$-means algorithm to detect communities. However, in contrast to [Akbas2017, Akbas2019], [Tian2014] and [Cao2015] mostly focus on embedding techniques suitable for a weighted graph and consider their different applications, e.g. to classification and visualisation.

5.2. Distance-based methods

Algorithm in (5.6) Input for / Method of clusterisation Number of clusters as input/Clusters overlap Network size Evaluation Topology Databases Community detection methods for attributed graphs compared with
DCom [Combe2012] Distance matrix
Hierarchical agglomerative clustering
Yes/No Small Accuracy Non-fixed: added edges DBLP* WCom1 [Combe2012]
WCom2 [Combe2012]
DVil [Villavialaneix2013, Olteanu2013] Distance (or similarity) matrix
Stochastic kernel SOM algorithm [Villavialaneix2013, Olteanu2013]
No/No Small
Medium
NMI Non-fixed: added edges Synthetic
Medieval Notarial Deeds
SToC[Baroni2017] Formally but controlled via and Distance matrix
$\tau$-close clustering [Baroni2017]
No/No Medium
Large
Modularity
Within-Cluster Sum of Squares
Non-fixed: added edges DBLP10K
DIRECTORS*
DIRECTORS-gcc*
Inc-Cluster [Zhou2010]
GBAGC [Xu2014]
@NetGA [Pizzuti2018] in general
in experiments
Distance matrix
Genetic algorithm
No/No Medium NMI Non-fixed: added edges Synthetic SA-Cluster [Zhou2009]
CSPA [Strehl2003, Elhadi2013]
Selection [Elhadi2013]
ANCA [FalihGrozavu2018, Falih2018] Maybe thought as

for summing eigenvectors of distance and similarity matrices

Distance and similarity matrices
$k$-means for the sum of eigenvectors of the distance and similarity matrices
Yes/No Medium Adjusted Rand Index
NMI
Density
Modularity
Conductance
Entropy
Fixed Synthetic
DBLP10K
Anonymized Enron email corpus
SA-Cluster [Zhou2009]
SAC1-SAC2 [DangViennet2012]
IGC-CSM [Nawaz2015]
WSte1 [Steinhaeuser]
ILouvain [Combe2015]
Table 6. Distance-based methods

Methods considered in the previous subsection exchange the node attributes for edge weights so that one obtains a weighted graph with semantic information incorporated. Thus the topology of the network is to some extent preserved at the fusion step. Methods from this subsection intentionally discard the graph so that the topological and semantic information is fused by a distance function between nodes and stored in a distance matrix, see Figure 6. Distance-based clusterisation methods such as $k$-means and $k$-medoids can then be applied. The user of such methods should be aware that, in general, the resulting clusters may contain disconnected portions of the initial graph, as the graph structure is removed at the fusion step [Akbas2017, Section 3.3].

Figure 6. Distance-based approach (attributes distance is calculated with the Euclidean norm).

The usual form of the distance fusion function is

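$$D_\alpha(v_i,v_j)=\alpha\,d_S(v_i,v_j)+(1-\alpha)\,d_A(v_i,v_j),\qquad \alpha\in[0,1], \tag{5.6}$$

where $d_S$ and $d_A$ are a topological distance function and a semantic distance function for nodes $v_i$ and $v_j$, respectively. Clearly, one can introduce a more complicated fusion function based on distances. The parameter $\alpha$ influences the balance between topological and semantic information so that $\alpha=1$ corresponds to the pure topological case and $\alpha=0$ to the pure semantic one. It is common to define $d_S(v_i,v_j)$ as the shortest path length between $v_i$ and $v_j$. The possible options for $d_A$ are as follows if we are given attribute vectors $A(v_i)$ and $A(v_j)$ with components $A_k(v_i)$ and $A_k(v_j)$:

  • Jaccard distance

    $$d_A(v_i,v_j)=1-\frac{|A(v_i)\cap A(v_j)|}{|A(v_i)\cup A(v_j)|}\quad\text{or}\quad d_A(v_i,v_j)=1-\frac{\sum_{k}\min\{A_k(v_i),A_k(v_j)\}}{\sum_{k}\max\{A_k(v_i),A_k(v_j)\}}, \tag{5.7}$$

    where $A(v_i)$ and $A(v_j)$ are thought of as sets in the former case and as vectors with non-negative real values in the latter one,

  • Minkowski distance

    $$d_A(v_i,v_j)=\|A(v_i)-A(v_j)\|_p=\Big(\sum_{k}|A_k(v_i)-A_k(v_j)|^p\Big)^{1/p}, \tag{5.8}$$

    where one gets the city block norm if $p=1$ and the Euclidean norm if $p=2$.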
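A linear distance fusion of the kind described by (5.6) can be sketched as follows, with the shortest path length as the topological distance and the Minkowski distance as the semantic one. The sketch rests on our own assumptions: the convention that `alpha = 1` means topology only, the treatment of unreachable pairs as infinitely distant, the absence of any rescaling of the two components and all function names are hypothetical choices for illustration.

```python
import math
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search: hop distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def minkowski(a, b, p=2):
    """Minkowski distance: city block for p = 1, Euclidean for p = 2."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def fused_distance_matrix(adj, attrs, alpha, p=2):
    """Pairwise alpha * d_topological + (1 - alpha) * d_semantic."""
    fused = {}
    for u in adj:
        hops = shortest_path_lengths(adj, u)
        for v in adj:
            d_topo = hops.get(v, math.inf)  # unreachable nodes: infinite distance
            topo_part = alpha * d_topo if alpha > 0 else 0.0  # avoid 0 * inf
            fused[u, v] = topo_part + (1.0 - alpha) * minkowski(attrs[u], attrs[v], p)
    return fused


# Hypothetical toy data: a path 0-1-2 whose end nodes share the same attributes.
adj = {0: [1], 1: [0, 2], 2: [1]}
attrs = {0: (0.0, 0.0), 1: (3.0, 4.0), 2: (0.0, 0.0)}
D = fused_distance_matrix(adj, attrs, alpha=0.5)
```

In the toy example the end nodes of the path are closer to each other than to the middle node, although they are not adjacent. In practice the two components usually live on different scales, so some normalisation before fusing them is advisable.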

The distance-based methods are summarised in Table 6. Note that ANCA [FalihGrozavu2018, Falih2018] employs an approach slightly different from (5.6) but nevertheless still deals with distance matrices (with respect to certain chosen seed nodes).

There are no highly influential methods among the distance-based ones according to Figure 3, so we are going to describe the ones most interesting to us.

5.2.1. DVil

In DVil [Villavialaneix2013], (5.6) with the topological and semantic components $d_S$ and $d_A$ being different kernel functions is used to combine semantic information of different types (graph, numerical variables, factors, textual variables, etc.) and network topology between all the nodes in the network. In one of the experiments, $d_S$ is the shortest path length between two nodes. What is more, the topology is non-fixed there. The obtained distance matrix is then used in a stochastic kernel SOM algorithm (throughout the text, SOM stands for self-organising maps) at the community detection step. The usage of SOM also makes it possible to solve the visualisation problem by projecting the nodes onto a grid of small dimension. The method DVil [Villavialaneix2013] was later developed in [Olteanu2013], where the balance between topological and semantic information is tuned automatically.

5.2.2. SToC

Semantic-Topological Clustering SToC [Baroni2017] has time complexity , where and are the number of nodes and edges in the network, respectively. SToC uses a fusing function different from (5.6), namely,

(5.9)

One can formally think that in this scheme, however the impact of the semantic and topological components is still controlled by the parameters involved in and (see below).

The topological distance in SToC is defined via (5.7) and the notion of $l$-neighbourhood: the distance between $v_i$ and $v_j$ is the Jaccard distance between the sets $N_l(v_i)$ and $N_l(v_j)$,

$$1-\frac{|N_l(v_i)\cap N_l(v_j)|}{|N_l(v_i)\cup N_l(v_j)|},$$

where the $l$-neighbourhood $N_l(v)$ of $v$ is the set of nodes reachable from $v$ with a path of length at most $l$ (being a parameter), see [Gunnemann2011]. To reduce complexity, this Jaccard distance is approximated with a bounded error (being one more parameter) by bottom-$k$ sketch vectors [Cohen2007], i.e. compressed representations of the $l$-neighbourhoods in this case. The semantic distance for quantitative attributes (normalised to $[0,1]$) is calculated using the Euclidean distance (5.8), and for categorical attributes using the Jaccard distance (5.7). The resulting distance as in (5.9) is defined to be in $[0,1]$.
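The neighbourhood-based topological distance described above can be computed exactly on small graphs as in the following sketch. It deliberately ignores the sketch-vector approximation of [Cohen2007]; the names `l_neighbourhood` and `topological_distance`, and the assumption that a node belongs to its own neighbourhood (a path of length zero), are ours.

```python
def l_neighbourhood(adj, v, l):
    """Nodes reachable from v by a path of length at most l (v included)."""
    seen = {v}
    frontier = {v}
    for _ in range(l):
        # Expand one hop and keep only the newly discovered nodes.
        frontier = {w for u in frontier for w in adj[u]} - seen
        seen |= frontier
    return seen


def jaccard_distance(a, b):
    """Jaccard distance between two sets (0.0 for two empty sets)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def topological_distance(adj, u, v, l):
    """Jaccard distance between the l-neighbourhoods of u and v."""
    return jaccard_distance(l_neighbourhood(adj, u, l), l_neighbourhood(adj, v, l))


# Hypothetical toy data: a path 0-1-2-3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
d_near = topological_distance(adj, 0, 1, l=1)
d_far = topological_distance(adj, 0, 3, l=1)
d_far_wider = topological_distance(adj, 0, 3, l=2)
```

Increasing `l` makes the neighbourhoods of distant nodes overlap, so the distance between the end nodes of the path drops from 1 to 0.5 in this example.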

Using the distance (5.9), a cluster is defined by considering nodes that are within a maximum distance from a given node. Namely, for a given threshold $\tau$, a $\tau$-close cluster is a subset $C$ of the nodes in $G$ such that there exists a node $s\in C$ for which the distance (5.9) between $s$ and every $v\in C$ is at most $\tau$. A $\tau$-close clustering of $G$ is defined as a partition of its nodes into $\tau$-close clusters. At the clusterisation step, SToC iteratively extracts $\tau$-close clusters from $G$ starting from random seeds (chosen through a select node function) by partial traversal of $G$. Take into account that is contained in the set of nodes such that . Nodes assigned to a cluster are not used in further iterations, thus the clusters formed are not overlapping. Moreover, the approach does not require the number of clusters as input.
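The iterative extraction of clusters around seeds can be sketched as below. This is a simplified illustration, not SToC itself: the seed is simply the first unassigned node instead of a random one, every unassigned node is scanned instead of partially traversing the graph, and the distance is an arbitrary user-supplied function `dist` standing in for the fused semantic-topological distance. All names are hypothetical.

```python
def threshold_clustering(nodes, dist, tau):
    """Greedily partition nodes into clusters of radius tau around seeds."""
    unassigned = list(nodes)
    clusters = []
    while unassigned:
        seed = unassigned[0]  # simplification: deterministic seed choice
        # The seed itself is always included because dist(seed, seed) = 0 <= tau.
        cluster = [v for v in unassigned if dist(seed, v) <= tau]
        clusters.append(cluster)
        members = set(cluster)
        # Assigned nodes are never reused, so the clusters do not overlap.
        unassigned = [v for v in unassigned if v not in members]
    return clusters


# Hypothetical toy data: nodes placed on a line, distance = absolute difference.
position = {"a": 0, "b": 1, "c": 2, "d": 10, "e": 11}
clusters = threshold_clustering(
    list(position), lambda u, v: abs(position[u] - position[v]), tau=2
)
```

Note that the number of clusters is not an input: it follows from the threshold `tau` and from the order in which seeds are picked.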

As the choice of the parameters and in SToC can be non-trivial, the authors propose an autotuning procedure. It computes optimal and via approximating the cumulative distribution of and , taking into account parameters and , provided by the user and controlling the importance of semantic and topological component, respectively.

Remark 5.

There exist distance-based methods for multilayer networks. For example, CLAMP (CLustering Attributed Multi-graPhs) [Papadopoulos2017] is an approach for clustering attributed networks with heterogeneous (numerical and categorical) attributes and multiple types of edges that uses a unified distance measure similar to (5.6), in a sense. The distance measure takes into account the importance of the node properties and the balance between the sets of attributes and edges by assigning a different weight to each of them. The clustering process adopts gradient descent to produce fuzzy clusters. It is also worth mentioning that CLAMP is highly parallelisable.

| Algorithm | Graph augmentation | Input for / Method of clusterisation | Number of clusters as input / Clusters overlap | Network size | Evaluation | Topology | Databases | Community detection methods for attributed graphs compared with |
|---|---|---|---|---|---|---|---|---|
| SA-Cluster [Zhou2009], Inc-Cluster [Zhou2010, Cheng2012], SA-Cluster-Opt [Cheng2011] | Semantic vertexes and structure-semantics edges | Distance matrix (via neighbourhood random walks) / Modified $k$-medoids [Zhou2009] | Yes/No | Small, Medium | Density, Entropy | Non-fixed: adding edges | Political Blogs, DBLP10K, DBLP84K | W-Cluster [Zhou2009] (based on (5.6)), SA-Cluster [Zhou2009], Inc-Cluster [Zhou2010, Cheng2012], SA-Cluster-Opt |
| SCMAG [Huang2015] | Semantic vertexes and structure-semantics edges | Distance matrix (via neighbourhood random walks) / Subspace clustering algorithm based on ENCLUS [Cheng1999] | No/Yes | Medium | Density, Entropy | Non-fixed: adding edges | IMDB, Arnetminer bibliography | SA-Cluster [Zhou2009], GAMer [Gunnemann2014] |

Table 7. Node-augmented graph distance-based methods

5.3. Node-augmented graph distance-based methods

Figure 7. Distance approaches based on a node-augmented graph (attributes are converted to new attribute nodes thus forming an augmented graph together with the initial structural nodes).

Methods from this class transform the initial graph $G$ into a node-augmented graph with new semantic nodes representing distinct node attributes, see Figure 7 and Table 7. Edges between structural and semantic nodes are added according to the node attributes in $G$ (thus the topology is non-fixed). Take into account that the resulting graph is much larger than $G$ (especially if the dimension and the sets of possible values of node attributes are large), which substantially increases the time complexity of the methods.

According to Figure 3, SA-Cluster family is one of the most influential methods for community detection in attributed graphs and therefore we now give a short description of it.

5.3.1. SA-Cluster

The method SA-Cluster [Zhou2009] transforms $G$ into an augmented graph $G_a$ with new attribute nodes representing distinct node attribute values. Namely, an attribute node $v_{jk}$ represents an attribute-value pair $(a_j,a_{jk})$, where $a_{jk}$ is the $k$-th possible value of the attribute $a_j$. If $a_j(v_i)=a_{jk}$ for a node $v_i$, then an attribute edge is added between $v_i$ and $v_{jk}$ (this however cannot be applied to continuous attributes). In $G_a$, two vertices are close if they are connected through many structural and/or attribute edges. The neighbourhood random walk model is further used in SA-Cluster to estimate the node closeness in $G_a$.
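The augmentation step can be illustrated by the following sketch, which adds one new node per attribute-value pair and links it to every structural node carrying that value. It is a minimal sketch under our own assumptions: the function name `augment_graph`, the tuple encoding `("attr", name, value)` of attribute nodes and the toy data are hypothetical and are not taken from [Zhou2009].

```python
def augment_graph(edges, attrs):
    """Return the adjacency sets of a graph extended with attribute nodes."""
    adj = {}

    def link(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    for u, v in edges:  # structural edges of the initial graph
        link(u, v)
    for node, values in attrs.items():
        adj.setdefault(node, set())  # keep isolated structural nodes
        for name, value in values.items():
            # One attribute node per distinct (attribute, value) pair.
            link(node, ("attr", name, value))
    return adj


# Hypothetical toy data: nodes 1 and 3 are not adjacent but share a topic.
edges = [(1, 2)]
attrs = {1: {"topic": "db"}, 2: {"topic": "ml"}, 3: {"topic": "db"}}
augmented = augment_graph(edges, attrs)
```

In the augmented graph, nodes 1 and 3 become connected through the shared attribute node, so a random walk can move between them although no structural edge links them.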

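To proceed, we recall several definitions from [Zhou2009]. Let $P$ be the (one-step) transition probability matrix of a graph. Given $l$ as the length of a random walk and $c\in(0,1)$ as the restart probability, the neighbourhood random walk distance $d(v_i,v_j)$ from $v_i$ to $v_j$ in the graph is defined as

$$d(v_i,v_j)=\sum_{\tau:\,v_i\rightsquigarrow v_j,\ \mathrm{length}(\tau)\le l} p(\tau)\,c\,(1-c)^{\mathrm{length}(\tau)},$$

where $\tau$ is a path from $v_i$ to $v_j$ whose length is $\mathrm{length}(\tau)$ with transition probability $p(\tau)$. Moreover,

$$R^l=\sum_{\gamma=1}^{l} c\,(1-c)^{\gamma}P^{\gamma},$$

where $R^l$ is the neighbourhood random walk distance matrix. One can then measure the closeness between vertices $v_i$ and $v_j$ as

$$d(v_i,v_j)=R^l(i,j). \tag{5.10}$$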

If , the neighbourhood random walk is the same as the random walk with restart defined in [Tong2006].
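For a small graph, a neighbourhood random walk distance matrix can be computed directly by accumulating matrix powers. The sketch below assumes the matrix form $R^l=\sum_{\gamma=1}^{l}c(1-c)^{\gamma}P^{\gamma}$ with walk length `l` and restart probability `c`, uniform edge weights and a pure-Python matrix product; these choices and the function names are ours, and the weighted transition matrix of the augmented graph used in SA-Cluster is not reproduced.

```python
def transition_matrix(adj, order):
    """Row-stochastic matrix of a simple random walk (uniform edge weights)."""
    index = {v: i for i, v in enumerate(order)}
    P = [[0.0] * len(order) for _ in order]
    for u in order:
        for v in adj[u]:
            P[index[u]][index[v]] = 1.0 / len(adj[u])
    return P


def mat_mul(A, B):
    """Plain matrix product of two square lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def random_walk_distance(P, l, c):
    """Accumulate c * (1 - c)**gamma * P**gamma for gamma = 1, ..., l."""
    n = len(P)
    R = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in P]  # holds P**gamma, starting with gamma = 1
    for gamma in range(1, l + 1):
        coef = c * (1.0 - c) ** gamma
        for i in range(n):
            for j in range(n):
                R[i][j] += coef * power[i][j]
        power = mat_mul(power, P)
    return R


# Hypothetical toy data: two nodes joined by a single edge.
adj = {0: [1], 1: [0]}
P = transition_matrix(adj, order=[0, 1])
R = random_walk_distance(P, l=2, c=0.5)
```

Larger entries of `R` indicate closer nodes, so despite its name the quantity behaves as a closeness score rather than a metric distance.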

To combine the structural closeness and attribute similarity in the augmented graph $G_a$, [Zhou2009] constructs the transition probability matrix of $G_a$ and computes the corresponding distances (5.10). Notice that at this step weights are assigned to the edges of $G_a$: each topological edge has a weight of $\omega_0$, while semantic edges corresponding to the attributes $a_1,\dots,a_d$ have weights $\omega_1,\dots,\omega_d$, respectively.

The resulting distance matrix, based on (5.10) for the augmented graph, is then fed into a $k$-medoids type clustering algorithm. First, good initial centroids from the density point of view [Hinneburg1998] are chosen. Then a converging iterative process for the optimisation of an objective function (in order to maximize intra-cluster similarity and minimize inter-cluster similarity) is performed, with the corresponding adjustments of the edge weights.

As expected from the construction of , SA-Cluster is computationally expensive, namely, its time complexity is . In order to improve the efficiency and scalability of SA-Cluster, the methods Inc-Cluster [Zhou2010, Cheng2012] and SA-Cluster-Opt [Cheng2011] have been proposed. The main idea behind them is to reduce the number and the complexity of random walk distance computations.

5.4. Embedding-based (early fusion) methods

Figure 8. Embedding-based methods (in the simplest case, attribute vectors can be concatenated with the node embeddings and further fed to $k$-means).

As is well known, a graph as a traditional representation of a network brings several difficulties to network analysis. As mentioned in [Cui2019], graph algorithms suffer from high computational complexity, low parallelisability and inapplicability of machine learning methods. Novel network embedding techniques aim to tackle this by learning low-dimensional continuous vector representations (also known as embeddings) for the network nodes so that the main network information is efficiently encoded. (The embedding approach is an algorithmic framework for learning continuous feature representations for nodes in networks, initially proposed as node2vec [Grover2016]. Node2vec learns a mapping of nodes to a low-dimensional space of features by maximizing the likelihood of preserving network neighbourhoods of nodes. As a result, embeddings reflect the structural equivalence or homophily between network nodes [Grover2016].) Additionally, the embeddings aim not only at reconstructing the initial network but also at supporting network inference such as link prediction, node classification and node clustering (for more details, see [Cui2019, Cai2018]).

In the context of node-attributed social networks, the objective of network embedding is efficient low-dimensional encoding and combining both the network topology and semantics preserving proximities of different orders [Tang2015, Cao2015, Gao2018]. Having an embedding representation for the nodes, one can theoretically use traditional distance-based clusterisation methods such as -means and -medoids to further tackle the clusterisation problem, see Table 8.
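In the simplest case, the final clusterisation step amounts to concatenating each node embedding with the attribute vector of the node and running a distance-based algorithm on the result. The sketch below uses a small pure-Python Lloyd-type $k$-means with a farthest-point initialisation; the embeddings are made-up numbers rather than the output of any embedding method, and the initialisation rule, the fixed number of iterations and all names are assumptions made for this illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iterations=20):
    """Lloyd-type k-means; returns one cluster label per point."""
    # Farthest-point initialisation (an assumption, chosen for determinism).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move every non-empty centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy data: 2-dimensional embeddings and 1-dimensional attributes.
embeddings = {0: [0.0, 0.1], 1: [0.1, 0.0], 2: [5.0, 5.1], 3: [5.1, 5.0]}
attributes = {0: [1.0], 1: [1.0], 2: [0.0], 3: [0.0]}
nodes = sorted(embeddings)
features = [embeddings[v] + attributes[v] for v in nodes]  # concatenation
labels = kmeans(features, k=2)
```

The number of clusters `k` has to be supplied by the user, which is typical for methods that rely on $k$-means at the final step.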

Undoubtedly, there exists a rich bibliography on embedding techniques for networks with side information (node- and edge-attributed, heterogeneous in node and edge types) [Cui2019, Cai2018], but in fact not all of them are suitable for the community detection task. It is worth mentioning that the task of classification (i.e. a supervised learning task) is typically considered. At the same time, some authors use in clusterisation performance experiments embedding techniques that were applied only to classification in the original papers; e.g. in [Gao2018] the compared attributed network embedding methods include TADW [Yang2015], LANE [Huang2017b], GAE [Kipf2016], VGAE [Kipf2016] and GraphSAGE [Hamilton2017]. Taking all these facts into account, we confine ourselves in this survey to the methods that work with node-attributed social or citation networks, have been applied to community detection and have been compared with other clusterisation methods.

| Algorithm | Embeddings | Input for / Method of clusterisation | Number of clusters as input / Clusters overlap | Network size | Evaluation | Databases | Community detection methods for attributed graphs compared with |
|---|---|---|---|---|---|---|---|
| PLANE [Le2014] | Via a generative model and EM [Dempster1977] | Node embeddings / $k$-means | Yes/No | Small, Medium | Accuracy | Cora* | Relational Topic Model [Chang2009] + Topic Distributions Embedding [Iwata2007] |
| DANE [Gao2018] | Autoencoder | Node embeddings / $k$-means | Yes/No | Medium | Accuracy | Cora, Citeseer, PubMed Diabetes, Wiki | Embeddings obtained via TADW [Yang2015], LANE [Huang2017b], GAE [Kipf2016], VGAE [Kipf2016], GraphSAGE [Hamilton2017] |
| CDE [Li2018] | Topology embedding matrix | Topology embedding matrix and attribute matrix / Non-negative matrix factorisation | Yes/(Yes/No) | Small, Medium | Accuracy, NMI, Jaccard similarity, F1-score | Cora, Citeseer, WebKB, Flickr*, Philosophers [Hunter2004], ego-Facebook | PCL-DC [Yang2009], Circles [Leskovec2012], CESNA [Yang2013], SCI [Wang2016] |
| MGAE [Wang2017] | Autoencoder | Node embeddings / Spectral clustering | Yes/No | Medium | Accuracy, NMI, F-score, Precision, Recall, Average Entropy, Adjusted Rand Index | Cora, CiteSeer, Wiki | Circles [Leskovec2012], RTM [Chang2009], RMSC [Xia2014], embeddings obtained via TADW [Yang2015], VGAE [Kipf2016] |

Table 8. Embedding-based (early fusion) methods

There are no highly influential methods among the embedding-based ones according to Figure 3 but we provide a short description of each one from Table 8 due to the novelty and importance of embedding techniques in the clusterisation task.

5.4.1. PLANE

Probabilistic LAtent Document Network Embedding PLANE [Le2014] is a topic-based embedding method that aims to combine the following representations of each node with text attributes (e.g. in a citation network): the high-dimensional representations based on word occurrences and network topology, the representation in terms of a topic distribution (based on the Relational Topic Model [Chang2009]) and the low-dimensional representation for nodes. The representations are joined through a generative model, whose parameters (including the corresponding node embeddings) are estimated via maximum a posteriori estimation with the EM algorithm [Dempster1977]. It is interesting that not only observed positive links are incorporated but also virtual negative ones. For each node, the authors form a low-dimensional embedding to simultaneously solve the visualisation problem. To perform community detection, the embeddings obtained are fed to $k$-means.

5.4.2. Dane

Deep Attributed Network Embedding DANE [Gao2018] is an embedding-based algorithm that uses a deep model to preserve the first-order, high-order and semantic proximities in the attributed network. It has two branches, each composed of a multi-layer non-linear function: one captures the network topology and the other the semantics, and both map their input into a low-dimensional space. Each branch is an auto-encoder, i.e. an unsupervised deep model widely used in machine learning [Jiang2016]. The auto-encoders aim to minimize the reconstruction loss between their inputs and outputs in order to preserve the above-mentioned proximities. In addition, the consistency and complementarity of topology and semantics are preserved simultaneously in order to obtain a better structure-attribute fusion. Note that the loss function exploits an efficient most negative sampling strategy (its complexity is given in [Gao2018]). The resulting output is the concatenation of the embeddings obtained by each branch. As is typical for the methods of this class, community detection is performed by $k$-means on the embeddings.
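The last two steps (concatenating the outputs of the two branches and clustering the result) can be illustrated with a minimal sketch. It is not the DANE implementation: the embedding values are hypothetical toy numbers, and `concatenate` and `kmeans` are illustrative helper names. The clustering is plain Lloyd's algorithm with a deterministic farthest-point initialisation, chosen only to keep the example reproducible.

```python
import math

def concatenate(topology_emb, attribute_emb):
    """Join the per-node outputs of the two branches into one vector per node."""
    return [t + a for t, a in zip(topology_emb, attribute_emb)]

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centres = [points[0]]
    while len(centres) < k:
        # Pick the point that is farthest from its nearest already chosen centre.
        centres.append(max(points, key=lambda p: min(math.dist(p, c) for c in centres)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its closest centre.
        labels = [min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points]
        # Update step: each centre becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical embeddings of six nodes, two dimensions per branch.
topology_emb = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0], [0.1, 0.9], [0.0, 1.0], [0.2, 0.8]]
attribute_emb = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.0], [0.0, 1.0], [0.1, 0.9], [0.0, 0.8]]

embeddings = concatenate(topology_emb, attribute_emb)
labels = kmeans(embeddings, k=2)
print(labels)
```

In this toy input the first three nodes agree in both branches, so they end up in one cluster and the remaining three in the other.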

5.4.3. CDE

The method CDE (Community Detection in attributed graphs: an Embedding approach) [Li2018] uses a special function to measure community membership similarity. Its values are then input to a procedure based on skip-gram with negative sampling [Mikolov2013] to obtain a community structure embedding matrix that encodes the latent densely-connected subgraphs and reveals inherent community structures. After this, the embedding matrix is used instead of the adjacency matrix of the network. Having the structure embedding matrix and the attribute matrix at hand, the actual community detection is performed via a nonnegative matrix factorization procedure (with a unified topology- and semantics-aware objective function) that optimizes community membership with suitable iterative updating rules based on the Majorization-Minimization framework [Hunter2004] (cf. [Wang2016]). The impact of topology and semantics may be varied. The resulting communities (their number is an input) may overlap or not depending on the community membership rule chosen.

5.4.4. MGAE

Marginalized Graph Autoencoder for Graph Clustering MGAE [Wang2017] takes an attributed graph as input and learns a joint representation of topology and semantics with an augmented autoencoder that uses the graph convolutional network as a base. The authors propose to corrupt the semantics with noise and then marginalize over it in order to obtain a better representation from the autoencoder. By stacking multiple layers of the autoencoder, MGAE produces a deep representation of network nodes that is later fed into the spectral clustering algorithm.

Remark 6.

Community detection in heterogeneous and multilayer networks is considered e.g. in [Chang2015, Huang2017, Pei2018]. Other embedding approaches for different heterogeneous networks (in particular, node-attributed) which are used mostly for classification but theoretically can be applied for clusterisation are discussed e.g. in the comprehensive surveys [Cui2019, Cai2018].

| Algorithm | Attribute types | Patterns | Number of clusters as input/Clusters overlap | Network size | Evaluation | Databases | Community detection methods for attributed graphs compared with |
|---|---|---|---|---|---|---|---|
| AHMotif [Li2018Motif] | Binary, Numerical | Motif | Yes/No | Medium | NMI, Accuracy | Cora, WebKB | |

Table 9. Pattern mining-based methods

5.5. Pattern mining-based (early fusion) methods

Recall that a motif is a pattern of the interconnection occurring in real-world networks at numbers that are significantly higher than those in random networks [Milo2002] (note that a spanning tree pattern and a clique are representatives of motifs). Motifs are considered as building blocks for complex networks [Milo2002], and they may help to uncover useful information hidden in the network topology and semantics. We found just one community detection method for attributed social networks based on this idea, namely, AHMotif (Attribute Homogenous Motif-based method) [Li2018Motif], see Table 9. This method equips structural motifs identified for the network with the so-called homogeneity value based on attributes of the nodes involved in the motif. This information is then stored in a special adjacency matrix. Subsequently, the matrix is the input to the existing community detection algorithms such as Permanence [Chakraborty2014] and Affinity Propagation [Frey2007].
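The idea of storing attribute homogeneity of motifs in an adjacency-like structure can be sketched as follows. This is only an illustration under stated assumptions, not the AHMotif definition: the motif is fixed to a triangle, the `homogeneity` function (share of attribute positions on which all motif nodes agree) is hypothetical, and the weighting rule (summing homogeneity values over the triangles containing an edge) is chosen for simplicity.

```python
from itertools import combinations

def triangles(adj):
    """Enumerate triangles in an undirected graph given as {node: set(neighbours)}."""
    return [t for t in combinations(sorted(adj), 3)
            if t[1] in adj[t[0]] and t[2] in adj[t[0]] and t[2] in adj[t[1]]]

def homogeneity(nodes, attrs):
    """Hypothetical homogeneity value: share of attribute positions on which all nodes agree."""
    vectors = [attrs[v] for v in nodes]
    agree = sum(1 for column in zip(*vectors) if len(set(column)) == 1)
    return agree / len(vectors[0])

def motif_adjacency(adj, attrs):
    """Weight of an edge (u, v) = summed homogeneity of the triangles that contain it."""
    weights = {}
    for tri in triangles(adj):
        h = homogeneity(tri, attrs)
        for u, v in combinations(tri, 2):
            weights[(u, v)] = weights.get((u, v), 0.0) + h
    return weights

# Toy graph: two triangles sharing the edge (1, 2); binary attributes per node.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
attrs = {0: (1, 0), 1: (1, 0), 2: (1, 0), 3: (1, 1)}
W = motif_adjacency(adj, attrs)
print(W)
```

The resulting weighted edges could then be passed to any community detection algorithm that accepts a weighted graph.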

6. Simultaneous fusion methods

In contrast to the early fusion methods, the simultaneous fusion ones use and fuse topology and semantics in a unified process with community detection. Some of them are based on modifications of known clusterisation algorithms such as Louvain, Normalised Cut, $k$-means, $k$-medoids and $k$NN, or on attribute-aware adaptations of heuristic approaches such as evolutionary and genetic algorithms. A big subclass of simultaneous fusion methods uses the non-negative matrix factorisation framework to detect communities in attributed social networks, while another subclass, generative probabilistic models, aims to statistically infer a model of the attributed network under the assumption that topology and semantics are generated according to some parametric distributions.

| Algorithm | Modified method | Number of clusters as input/Clusters overlap | Network size | Evaluation | Databases | Other attributed network clusterisation methods compared with |
|---|---|---|---|---|---|---|
| OCru [Cruz2011] | Louvain [Blondel2008]; added attribute Entropy minimisation | No/No | Medium | Modularity, Entropy | Facebook100 | |
| SAC1 [DangViennet2012] | Louvain [Blondel2008]; added attribute similarity maximisation | No/No | Small, Medium | Density, Entropy | Political Blogs, Facebook100, DBLP10K | SAC2 [DangViennet2012], WSte2 [Steinhaeuser2010], Fast greedy [Clauset2004] for weighted graph |
| I-Louvain [Combe2015] (code) | Louvain [Blondel2008]; added maximisation of attribute-based measure Inertia | No/No | Small, Medium | NMI, Accuracy | DBLP+Microsoft Academic Search, Synthetic | ToTeM [Combe2012] (see footnote 6) |
| LAA/LOA [Asim2017] | Louvain [Blondel2008]; Modularity gain depends on attributes | No/No | Small | Density, Modularity | London gang [Grund2015], Italy gang, Polbooks, Adjnoun [Newman2006], Football [Girvan2002] | |
| UNCut [Ye2017] | Normalised Cut; added attribute homogeneity-aware measure Unimodality Compactness | Yes/No | Small, Medium | NMI, ARI | Disney [Muller2013], DFB [Gunnemann2013], ARXIV [Gunnemann2013], Political Blogs, 4area [Perozzi2014-2], Patents | SA-cluster [Zhou2009], SSCG [Gunnemann2013], NNM [Shiga2007] |
| DAEGC [Wang2019] | Graph attention network [Velickovic2018] + $k$-means for node embeddings + Stochastic Gradient Descent | Yes/No | Medium | ACC, NMI, $F$-measure, ARI | Cora, Citeseer, Pubmed | RMSC [Xia2014], TADW [Yang2015] + $k$-means, VGAE and GAE [Kipf2016] + $k$-means |
| NetScan [Ester2006, Ge2008] | An approximation algorithm for the connected $k$-Center optimization problem | Yes/Yes | Small, Medium | Accuracy | Professors*, Synthetic, DBLP*, BioGRID+Spellman | |
| JointClust [Moser2007] | An approximation algorithm for the Connected X Clusters problem | No/No | Medium | Accuracy | DBLP*, CiteSeer*, Corel stock photo collection | |
| MAM [Sanchez2015] (code) | Louvain-type algorithm with attribute-aware Modularity + Outlier detection | No/No | Small, Medium, Large | F1-score, Attribute-aware Modularity | Synthetic, Disney [Muller2013], DFB [Gunnemann2013], ARXIV [Gunnemann2013], IMDB [Gunnemann2013], DBLP*, Patents*, Amazon [Sanchez2013] | CODA [Gao2010] |
| SS-Cluster [Farzi2018] | $k$-Medoid based clustering algorithm with structural and attribute objective functions | Yes/No | Medium | Density, Entropy | Political Blogs, DBLP10K | SA-cluster [Zhou2009, Cheng2011], W-cluster [Cheng2011], SNAP [Tian2008] |
| Adapt-SA [Li2019] | Weighted $k$-means for $d$-dimensional representations of structure and attributes | Yes/No | Medium | Accuracy, NMI, F-measure, Modularity, Entropy | Synthetic, WebKB, Cora, Political Blogs, CiteSeer, DBLP10K | CODICIL [Ruan2013], SA-Cluster [Zhou2006], Inc-Cluster [Zhou2010], PPSB-DC [Chai2013], PCL-DC [Yang2009], BAGC [Xu2012] |
| kNAS [Boobalan2016] | $k$NN with added Semantic Similarity Score | Yes/Yes | Medium | Density, Tanimoto Coefficient | DBLP*, Facebook*, Twitter* | SA-Cluster-Opt [Cheng2011], CODICIL [Ruan2013], NISE [Whang2016] |

Table 10. Methods that modify Louvain, Normalised Cut, $k$-means, $k$-medoids and $k$NN algorithms

Footnote 6. The authors claim that they compare I-Louvain with ToTeM [Combe2012], “another community detection method designed for attributed graphs which exploits the two types of information”. However, this seems to be inaccurate as [Combe2012] does not contain any method called ToTeM.

6.1. Methods modifying Louvain, Normalised Cut, $k$-means, $k$-medoids and $k$NN algorithms

The list of the methods is given in Table 10. According to Figure 3, SAC1 is one of the most influential methods for community detection in attributed social networks. Besides SAC1, we will also provide short descriptions of several other interesting methods.

6.1.1. SAC1

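The method SAC1 [DangViennet2012] is based on a modification of Newman’s Modularity [Clauset2004] for a given partition $\{C_1,\dots,C_k\}$ of the node set into $k$ clusters:

$$Q_{Newman} = \sum_{l=1}^{k} \sum_{v_i, v_j \in C_l} S(v_i, v_j), \qquad S(v_i, v_j) = \frac{1}{2m}\left(A_{ij} - \frac{d_i d_j}{2m}\right),$$

where the normalised link strength $S(v_i, v_j)$ between nodes $v_i$ and $v_j$ is measured by comparing the existing network connection $A_{ij}$ with the expected number of connections $d_i d_j/(2m)$ ($d_i$ is the degree of $v_i$ and $m$ is the number of edges). To deal with the attributes, SAC1 uses the attribute modularity of a partition:

$$Q_{Attr} = \sum_{l=1}^{k} \sum_{v_i, v_j \in C_l} \mathrm{sim}(v_i, v_j),$$

where $\mathrm{sim}$ is an attribute similarity function. As for the above-mentioned SAC2 [DangViennet2012], if attributes are discrete, (5.2) is used for $\mathrm{sim}$, while if they are continuous, (5.5) with an appropriate parameter is applied. Textual attributes are first transformed into numeric values by the TF-IDF procedure. The similarity between the resulting representations is (5.3) or (5.4).

Next, a composite modularity is introduced as a weighted combination of structure modularity and attribute modularity:

$$Q = \alpha\, Q_{Newman} + (1-\alpha)\, Q_{Attr},$$

where $\alpha \in [0,1]$ is a fusion parameter. This function is then maximised in a way similar to that in Louvain [Blondel2008].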
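A minimal sketch of how such a composite criterion can be evaluated for a fixed partition is given below. It is an illustration under assumptions rather than the SAC1 code: the sums run over all ordered pairs of nodes within a community, the attribute similarity is a simple matching share (a hypothetical stand-in for the similarity functions of Section 5), and the fusion weight `alpha` multiplies the structural term. The Louvain-type maximisation itself is not shown.

```python
def newman_modularity(adj, communities):
    """Sum over same-community ordered pairs of (A_ij - d_i * d_j / 2m) / 2m."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # every edge is counted twice
    q = 0.0
    for community in communities:
        for i in community:
            for j in community:
                a_ij = 1.0 if j in adj[i] else 0.0
                q += (a_ij - len(adj[i]) * len(adj[j]) / two_m) / two_m
    return q

def matching_similarity(x, y):
    """Share of coinciding attribute values (an assumed choice for discrete attributes)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def attribute_modularity(attrs, communities, sim=matching_similarity):
    """Sum of pairwise attribute similarities over same-community ordered pairs."""
    return sum(sim(attrs[i], attrs[j])
               for community in communities for i in community for j in community)

def composite_modularity(adj, attrs, communities, alpha):
    """Weighted combination of the structural and the attribute term."""
    return (alpha * newman_modularity(adj, communities)
            + (1 - alpha) * attribute_modularity(attrs, communities))

# Two triangles joined by the edge (2, 3); attributes roughly follow the triangles.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
attrs = {0: (1, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1), 4: (0, 1), 5: (0, 1)}
good = [[0, 1, 2], [3, 4, 5]]
bad = [[0, 3], [1, 4], [2, 5]]

print(composite_modularity(adj, attrs, good, alpha=0.5))
print(composite_modularity(adj, attrs, bad, alpha=0.5))
```

Note that in this unnormalised form the attribute term grows with community size, so in practice the two terms would need comparable scales for `alpha` to act as a meaningful trade-off.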

6.1.2. I-Louvain

The method I-Louvain [Combe2015] (source code and datasets) is based on a local optimization of a global criterion that includes Modularity [Newman2006] and a new measure called Inertia. The measure is defined as the sum of squared Euclidean distances between the attribute vectors and their centre of gravity, i.e. the average attribute vector over all attribute vectors in the network. Using this notion, the authors define an Inertia-based modularity for a partition that compares, for each pair of elements from the same community, the expected distance with the observed distance between attributes. While Modularity considers the strength of the link between nodes in order to cluster strongly connected nodes, the Inertia-based modularity aims at clustering nodes whose attributes are the most similar. As in [DangViennet2012], the community detection process consists in the optimisation of a combination of the two measures in a way similar to Louvain’s.
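The notion of inertia can be made concrete with a short sketch. It assumes the usual definition via squared Euclidean distances to the centre of gravity and unit node weights; the attribute values are hypothetical, and the Inertia-based modularity of [Combe2015] itself is not reproduced here.

```python
def centre_of_gravity(vectors):
    """Component-wise mean of a list of attribute vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def inertia(vectors):
    """Sum of squared Euclidean distances from the vectors to their centre of gravity."""
    g = centre_of_gravity(vectors)
    return sum(sum((x - c) ** 2 for x, c in zip(v, g)) for v in vectors)

def within_inertia(attrs, communities):
    """Total inertia computed separately inside each community."""
    return sum(inertia([attrs[v] for v in community]) for community in communities)

# Four nodes with two numeric attributes each.
attrs = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]

print(inertia(attrs))                           # inertia of the whole network
print(within_inertia(attrs, [[0, 1], [2, 3]]))  # similar nodes grouped together
print(within_inertia(attrs, [[0, 2], [1, 3]]))  # dissimilar nodes grouped together
```

Grouping nodes with close attribute vectors leaves little inertia inside the communities, which is the behaviour an attribute-aware criterion rewards.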

6.1.3. DAEGC

Deep Attentional Embedded Graph Clustering DAEGC [Wang2019], in contrast to the embedding-based early fusion methods, uses a goal-directed deep learning approach with a unified framework for producing embeddings and clustering. Namely, DAEGC fuses network topology and semantics via an attentional autoencoder (a variant of the graph attention network [Velickovic2018] taking into account high-order proximity) to obtain node embeddings. Furthermore, based on the embeddings, soft labels are generated to guide a self-training graph clustering component. These two procedures are carried out jointly and iteratively to benefit both embedding and clusterisation quality. DAEGC produces $k$ non-overlapping clusters, where $k$ is an input.

Figure 9. The scheme of DAEGC [Wang2019].

6.1.4. kNAS

The method kNAS [Boobalan2016] starts with the identification of centroids for $k$ clusters ($k$ is an input) as nodes with a high Local Outlier Factor [Breunig2000], meaning that the node is a core one (a low Local Outlier Factor refers to outliers). Initial clusters are formed by the $k$NN algorithm (i.e. topological similarity is achieved). Furthermore, the so-called Similarity Score responsible for the semantic similarity of nodes within clusters is measured. Taking into account the Similarity Score obtained, the clusters are merged and the centroids updated in a certain way. The process of achieving topological and semantic similarity is repeated until the Similarity Score is maximized.

6.2. Metaheuristic-based methods

These methods adapt metaheuristic algorithms (in particular, evolutionary algorithms and tabu search) for optimisation of an objective function that quantifies the structural closeness and attribute homogeneity of an attributed network partition. The list of the methods is given in Table 11.

| Algorithm | Modified method | Number of clusters as input/Clusters overlap | Network size | Evaluation | Databases | Other attributed network clusterisation methods compared with |
|---|---|---|---|---|---|---|
| MOEA-SA [Li2017-2] | Multiobjective evolutionary algorithm (Modularity and Attribute Similarity are maximized) | No/No | Small, Medium | Density, Entropy | Political Books, Political Blogs, Facebook100, ego-Facebook | SAC1-SAC2 [DangViennet2012], SA-Cluster [Zhou2009] |
| MOGA-@Net [Pizzuti2019] | Multiobjective genetic algorithm (optimizing Modularity, Community score, Conductance, attribute similarity) | No/No | Small, Medium | NMI, Cumulative NMI, Density, Entropy | Synthetic, Cora, Citeseer, Political Books, Political Blogs, ego-Facebook | SA-cluster [Zhou2009], BAGC [Xu2012], OCru [Cruz2011Entropy], Selection [Elhadi2013], HGPA-CSPA [Elhadi2013, Strehl2003] |
| JCDC [Zhang2016] | Tabu search and gradient ascent for a structure-attribute-aware loss function | Yes/No | Small, Medium | NMI | Synthetic, World trade network [Nooy2004], Lazega | CASC [Binkiewicz2017], CESNA [Yang2013], BAGC [Xu2012] |

Table 11. Metaheuristic-based methods

6.3. Non-negative matrix factorisation and matrix compression

Non-negative matrix factorization (NMF) is a family of algorithms that aim to approximate a non-negative matrix of high rank by a product of non-negative matrices of lower ranks so that the approximation error in the Frobenius norm, denoted below by $\|\cdot\|_F$, is minimal. As is well known, NMF has an inherent clustering property, i.e. it is able to find clusters in the input data [Lee2001]. The approximating product usually contains two factors but some algorithms [Ding2006] include three or more. NMF is often regularised (e.g. by Lasso-type conditions) to avoid bad behaviour of the approximating matrices.

As for node-attributed social networks, NMF requires a proper adaptation to fuse both topology and semantics and this has been done in several papers, see Table 12. To proceed, we need additional notation. In what follows, $A \in \mathbb{R}^{n \times n}$ denotes the adjacency matrix for the initial network topology (as before, $n$ is the number of nodes), $S \in \mathbb{R}^{n \times d}$ the node attribute matrix for the initial network semantics ($d$ is the dimension of the attribute vectors), $k$ the number of required clusters (it is an input in NMF approaches), $U \in \mathbb{R}^{n \times k}$ the cluster membership matrix whose elements indicate the association of nodes with communities and finally $C \in \mathbb{R}^{d \times k}$ denotes the cluster membership matrix whose elements indicate the association of the attributes with the communities. Other auxiliary matrices will be introduced below.

The general idea of NMF methods for attributed networks is to use the known matrices $A$, $S$ and the number of clusters $k$ in order to determine the unknown matrices $U$ and $C$ in an iterative optimisation procedure, and thus to obtain simultaneously a community partition and the corresponding semantic description for each community. Note that each element of the normalised $U$ in fact contains the probability that a node belongs to a particular community (communities may overlap in these settings). One can instead assign a node to the community with the highest probability to obtain non-overlapping communities [Wang2016].
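How a learned membership matrix is turned into communities can be sketched as follows. The matrix values are hypothetical, the factorisation itself is not shown, and the probability threshold used for the overlapping case is an assumption made for illustration (individual methods use their own membership rules).

```python
def normalise_rows(U):
    """Scale each row of a non-negative membership matrix so that it sums to one."""
    return [[x / sum(row) if sum(row) > 0 else 0.0 for x in row] for row in U]

def overlapping_communities(P, threshold):
    """A node joins every community whose membership probability reaches the threshold."""
    k = len(P[0])
    return [{i for i, row in enumerate(P) if row[c] >= threshold} for c in range(k)]

def hard_assignment(P):
    """Non-overlapping variant: each node goes to its most probable community."""
    return [max(range(len(row)), key=row.__getitem__) for row in P]

# Hypothetical non-negative membership matrix for five nodes and two communities.
U = [[4.0, 0.0], [3.0, 1.0], [2.0, 2.0], [0.5, 1.5], [0.0, 5.0]]
P = normalise_rows(U)

print(overlapping_communities(P, threshold=0.4))
print(hard_assignment(P))
```

Here node 2 has equal membership in both communities, so it appears in both overlapping communities, while the hard assignment breaks the tie in favour of the first one.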

“Matrix compression” technique will be discussed below while describing PICS algorithm [Akoglu2012].

Note that SCI [Wang2016] is one of the most influential methods for community detection in attributed social networks according to Figure 3. We will give a short description of SCI and several other NMF-based methods below.

| Algorithm | Factorisation/compression type | Number of clusters as input/Clusters overlap | Network size | Evaluation | Databases | Community detection methods for attributed graphs compared with |
|---|---|---|---|---|---|---|
| NPei [Pei2015] | 3-factor NMF | Yes/Yes | Small, Medium | Purity | Twitter, DBLP* | Relational Topic Model [Chang2009] (for documents) |
| 3NCD [Nguyen2015] | -factor NMF | Yes/Yes | Medium, Large | F1-score, Jaccard similarity | ego-Facebook, ego-Twitter, ego-G+ | CESNA [Yang2013] |
| SCI [Wang2016] | 2-factor NMF | Yes/Yes | Medium | ACC, NMI, GNMI, $F$-measure, Jaccard similarity | Citeseer, Cora, WebKB, LastFM | PCL-DC [Yang2009], CESNA [Yang2013], DCM [Pool2014] |
| JWNMF [Huang2015] | 2-factor NMF | Yes/Yes | Small, Medium | Modularity, Entropy, NMI | Amazon Fail dataset, Disney dataset, Enron dataset, DBLP-4AREA dataset, WebKB, Citeseer, Cora | BAGC [Xu2012], PICS [Akoglu2012], SANS [Parimala2015] |
| SCD [Li2017] | 2- and 3-factor NMF | Yes/Yes-No | Small, Medium | Accuracy, NMI | Twitter, WebKB | SCI [Wang2016] |
| ASCD [Qin2018] | 2-factor NMF | Yes/Yes-No | Small, Medium | ACC, NMI, $F$-measure, Jaccard similarity | LastFM, WebKB, Cora, Citeseer, ego-Twitter*, ego-Facebook* | Block-LDA [Balasubramanyan2011], PCL-DC [Yang2009], SCI [Wang2016], CESNA [Yang2013], Circles [Mcauley2014] |
| CFOND [Guo2019] | - and -factor NMF | Yes/(Yes/No) | Medium | Accuracy, NMI | Cora, CiteSeer, PubMed, Attack, Synthetic | GNMF [Cai2008], DRCC [Gu2009], LP-NMTF [WangNie2011], iTopicModel [Sun2009] |
| MVCNMF [He2017] | -factor NMF | Yes/Yes | Small, Medium | Density, Entropy | Political Blogs, CiteSeer, Cora, WebKB, ICDM (DBLP*) | FCAN [Hu2016], SACTL [Xu2016], kNAS [Boobalan2016] |
| PICS [Akoglu2012] | Matrix compression (finding rectangular blocks) | No/No | Small, Medium | Anecdotal and visual study | Youtube [Mislove2007], Twitter*, Phonecall [Eagle2009], Device [Eagle2009], Political Books (link), Political Blogs | |

Table 12. Non-negative matrix factorisation and matrix compression approaches

6.3.1. NPei

The method NPei [Pei2015] uses a constrained nonnegative matrix tri-factorization framework [Ding2006] to cluster Twitter users and messages by fusing the relations between users (i.e. topology) and content (i.e. semantics). The starting point is a user-word-tweet tripartite network represented by several adjacency matrices similar to the adjacency and attribute matrices introduced above. The optimisation problem proposed by the authors includes, however, not only user-user, user-word and word-tweet adjacency matrices but also three types of network regularization [Smola2003] to model user similarity, message similarity and user interaction. The similarities are measured by a version of PageRank [Page1999] based on the cosine similarity of messages and the adjacency matrix for users. The optimisation is then performed by an iterative update algorithm [Ding2006] to obtain the user cluster matrix and the message cluster matrix. The authors also report the complexity of their approach with respect to the number of nodes in the network.

6.3.2. SCI

Semantic Community Identification SCI [Wang2016] adopts NMF for fusing topology and semantics as follows. With $A$ the adjacency matrix, $S$ the node attribute matrix, $U$ the node-community membership matrix and $C$ the attribute-community matrix, the consistency in topology is modelled as $\min_{U \ge 0} \|A - UU^T\|_F^2$, while the consistency in semantics as $\min_{C \ge 0} \|U - SC\|_F^2$. The authors also propose to select the most relevant attributes for each community by adding an $l_1$ norm sparsity term to each column of matrix $C$. This together with the models for topology and semantics leads to the following unified optimisation problem:

$$\min_{U \ge 0,\ C \ge 0}\ \|U - SC\|_F^2 + \alpha \sum_{j=1}^{k} \|C(:,j)\|_1^2 + \beta \|A - UU^T\|_F^2,$$

where $\beta$ controls the topology impact and $\alpha$ the sparsity penalty. Within SCI, a local minimum is found by the Majorization-Minimization framework [Hunter2004]. In particular, the algorithm iteratively updates $U$ with $C$ fixed and then $C$ with $U$ fixed so that the process is guaranteed to converge. Note that, instead of using $U$ directly, the authors consider $SC$ as the final community membership matrix.

6.3.3. JWNMF

Joint Weighted Nonnegative Matrix Factorization method for clustering attributed graphs JWNMF [Huang2015] models topology in the same way as [Wang2016] but uses a weighted factorization for semantics, where the weights are automatically determined and updated to reduce the influence of uninformative attributes. Namely, a normalised diagonal matrix $\Lambda$ is introduced to assign a weight to each attribute and is then used in the approximation in the Frobenius norm, inspired by SymNMF [Kuang2015]. With $A$ the adjacency matrix, $S$ the node attribute matrix, $U$ the node-community membership matrix and $C$ the attribute-community matrix, the corresponding optimisation problem thus takes the form:

$$\min_{U \ge 0,\ C \ge 0,\ \Lambda \ge 0}\ \|A - UU^T\|_F^2 + \lambda \|S\Lambda - UC^T\|_F^2,$$

where $\lambda$ is the fusion parameter. The optimisation is performed iteratively [Ding2006]. Finally, a $k$-means variant is performed on $U$ to identify clusters. The complexity of JWNMF is analysed in [Huang2015].

6.3.4. SCD

The Semantic Community Detection method SCD [Li2017] introduces an additional community relationship indicator matrix whose elements describe the relationships between the corresponding communities, and imposes a regularisation condition on it that aims to ensure the consistency of the community structure with respect to topology and semantics. The resulting optimisation problem (stated in [Li2017]) involves fusion parameters that balance these terms and is solved iteratively [Ding2006].

6.3.5. ASCD

Adaptive Semantic Community Detection ASCD [Qin2018] follows the general line of NMF modelling discussed above but additionally employs an adaptive parameter to control the mismatch between the topology and semantics components. According to the authors, the mismatch, i.e. the effect occurring when topology is not compatible with semantics, may happen for some networks and negatively affect the clustering performance (several of their experiments confirm this). For this reason, they introduce into the NMF objective a matching coefficient, updated at each iteration, that controls the trade-off between topology and semantics according to the mismatch degree. There are two versions of the coefficient: one is defined by an explicit formula with a tunable parameter and the other is based on the NMI between the network topology and semantics. The optimisation problem is solved by two-step block coordinate descent (one factor matrix is updated while the other is fixed, then vice versa).

Remark 7.

In [Ito2018], NMF-based community detection in multilayer attributed networks is considered. Let us also mention the method from [Maekawa2018] that captures the complicated relationship between topology and semantics using a nonlinear projection function between the different cluster assignments for topology and semantics and adopts the positive unlabelled learning [Liu2003] to take the effect of partially observed positive edges into the cluster assignment.

6.3.6. PICS

The method PICS [Akoglu2012] (source) is a parameter-free algorithm that not only finds clusters but also detects anomalies and bridges. It is worth mentioning, however, that the nodes in a cluster found by PICS are not necessarily densely connected due to the definition of clusters in [Akoglu2012]. As for the community detection process, PICS simultaneously “compresses” the network adjacency matrix and the binary attribute matrix by finding homogeneous rectangular blocks (considered further as clusters) of low and high densities in the matrices. The MDL principle [Grunwald2007], a criterion based on lossless compression principles, is adapted for this procedure.

6.4. Pattern mining-based (simultaneous fusion) methods

Pattern mining in attributed social networks focuses on finding and extracting patterns, e.g. subsets of specific attributes or connections, in network topology and semantics [Atzmueller2019]. This in turn helps to make sense of a network and to understand why the corresponding connections could be formed. Pattern mining methods for community detection typically use local patterns and optimisation criteria for finding informative communities not in the whole network but only in a part of it (e.g. [Pool2014, Atzmueller2016]). Note that there are many papers devoted to pattern and semantic subgraph mining in social networks (see the survey in [Atzmueller2019]) but the majority of them do not deal with the task of community detection.

Community detection may be also based on cliques, according to a natural assumption that a community is a subset of well-connected nodes [Bothorel2015, Khediri2017]. Recall that in graph theory, a clique is a subset of nodes in an undirected graph such that every two nodes are adjacent, i.e. the corresponding subgraph is complete. A clique is called maximal if there is no other clique that contains it.
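The two definitions just recalled can be checked directly on a small graph. The sketch below is illustrative only (the helper names and the toy graph are assumptions) and uses a brute-force maximality test that is adequate for small examples.

```python
from itertools import combinations

def is_clique(adj, nodes):
    """Every two distinct nodes of the subset are adjacent."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def is_maximal_clique(adj, nodes):
    """A clique that cannot be extended by any node outside it."""
    nodes = set(nodes)
    if not is_clique(adj, nodes):
        return False
    # The clique is maximal if no outside node is adjacent to all of its members.
    return not any(nodes <= adj[w] for w in set(adj) - nodes)

# Undirected toy graph given as {node: set(neighbours)}.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}

print(is_clique(adj, {1, 2}))             # a clique, but contained in larger ones
print(is_maximal_clique(adj, {1, 2}))
print(is_maximal_clique(adj, {0, 1, 2}))
```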

The list of the corresponding methods is presented in Table 13.

| Algorithm | Patterns/Cliques | Number of clusters as input/Clusters overlap | Network size | Evaluation | Databases | Community detection methods for attributed graphs compared with |
|---|---|---|---|---|---|---|
| DCM [Pool2014] | Semantic patterns (queries) | Yes/Yes | Small, Medium | Evaluation | Delicious, LastFM, Flickr | |
| COMODO [Atzmueller2016] | Semantic patterns | Yes/Yes | Small, Medium | Description complexity, Community size | BibSonomy [Benz2010], Delicious, LastFM | DCM [Pool2014] |
| ACDC [Khediri2017] | Maximal cliques | Yes/Yes | Medium | Density | Political Blogs | SA-Cluster [Zhou2009], SAC1-SAC2 [DangViennet2012] |

Table 13. Pattern mining-based (simultaneous fusion) methods

Note that DCM [Pool2014] is a rather influential method according to Figure 3 and therefore we provide its main ideas below.

6.4.1. DCM

Description-Driven Community Detection DCM [Pool2014] (code source) searches for patterns in binary attributes to form overlapping communities together with their proper descriptions. More precisely, each iteration of the algorithm consists of two steps aimed at reshaping the community by optimising first a certain structural quality function (the so-called community score based on local topology) and then a description complexity function that is based on mined concise patterns in attributes best describing the community (the patterns are called queries). Pattern mining is based on the ReMine algorithm [Zimmermann2010] that recursively splits the data into the most informative patterns. The authors underline that DCM is able to grow communities starting from small seeds of nodes or from preliminary descriptions (depending on what information is available at the beginning). At the same time, DCM was not designed to provide complete coverage of the network.

Remark 8.

In [Berlingerio2013], ABACUS (frequent pAttern mining-BAsed Community discoverer in mUltidimensional networkS) is proposed to extract communities based on the extraction of patterns from multilayer attributed social networks.

6.5. Probabilistic model-based methods

| Algorithm | Model features | Number of clusters as input/Clusters overlap | Network size | Evaluation | Databases | Community detection methods for attributed graphs compared with |
|---|---|---|---|---|---|---|
| PCL-DC [Yang2009] | Conditional Link Model; Discriminative Content model | Yes/No | Medium | NMI, Pairwise $F$-measure, Modularity, Normalized cut | Cora, CiteSeer | PHITS-PLSA [Cohn2001], LDA-Link-Word [Erosheva2004], Link-Content-Factorization [Zhu2007] |
| CohsMix [Zanghi2010] | MixNet model [Snijders1997] | Yes/No | Small | Rand Index | Synthetic, Exalead.com search engine dataset | Multiple view learning [Zhang2006], Hidden Markov Random Field [Ambroise1997] |
| BAGC [Xu2012], GBAGC [Xu2014] | A Bayesian treatment on distribution parameters | Yes/No | Medium | Modularity, Entropy | Political Blogs, DBLP10K, DBLP84K | Inc-Cluster [Zhou2010], PICS [Akoglu2012] |
| VEM-BAGC [Cao2014] | Based on BAGC [Xu2012] | Yes/No | Medium | Modularity, Entropy | Political Blogs, Synthetic networks | BAGC [Xu2012] |
| PPSB-DC [Chai2013] | Popularity-productivity stochastic block model and discriminative content model | Yes/No | Medium | Normalized mutual information (NMI), Pairwise F measure (PWF), Accuracy | Cora, CiteSeer, WebKB | PCL-DC [Yang2009], PPL-DC [Yang2010] |
| CESNA [Yang2013] | A probabilistic generative model assuming communities generate network structure and attributes | No/Yes | Medium, Large | Evaluation | ego-Facebook, ego-G+, ego-Twitter, Wikipedia* (philosophers), Flickr | CODICIL [Ruan2013], Circles [Mcauley2014], Block-LDA [Balasubramanyan2011] |
| Circles [Mcauley2014] | A generative model for friendships in social circles | Yes/Yes | Medium, Large | Balanced Error Rate | ego-Facebook, ego-G+, ego-Twitter | Block-LDA [Balasubramanyan2011], Adapted Low-Rank Embedding [Yoshida2010] |
| SI [Newman2015] | A modified version of a stochastic block model [Holland1983] | Yes/No | Small, Medium | Normalized mutual information (NMI) | Synthetic, High school friendship network, Food web of marine species in the Weddell Sea, Harvard Facebook friendship network, Malaria HVR 5 and 6 gene recombination network | |
| NEMBP [HeFeng2017] | A generative model with learning method using a nested EM algorithm with belief propagation | Yes/(Yes/No) | Small, Medium | Accuracy, NMI, GNMI, F-score, Jaccard | WebKB, ego-Twitter*, ego-Facebook*, CiteSeer, Cora, Wikipedia*, Pubmed | Block-LDA [Balasubramanyan2011], PCL-DC [Yang2009], CESNA [Yang2013], DCM [Pool2014], SCI [Wang2016] |
| NBAGC-FABAGC [Xu2017] | A nonparametric and asymptotic Bayesian model selection method based on BAGC [Xu2012] | No/No | Medium | NMI, Modularity, Entropy | Synthetic, Political Blogs, DBLP10K, DBLP84K | PICS [Akoglu2012] |

Table 14. Probabilistic model-based methods

Methods from this class statistically infer a model of a clustered attributed network under the assumption that its structure and attributes are generated according to certain parametric distributions. Generative and stochastic block models are mainly used [Alinezhad2019]. Note that it is a non-trivial task to properly choose a priori distributions for topology and semantics [Akbas2017].

According to [Yang2009], there are many probabilistic models combining both topology and semantics: PHITS-PLSA [Cohn2001] combines PHITS with PLSA for community detection, [Erosheva2004] combines LDA with LDA-Link for network analysis to obtain the LDA-Link-Word model, and [Nallapati2008] combines the mixed membership stochastic block model with LDA and extends the LDA-Link-Word model by separating the citing and the cited documents, with the LDA-Link-Word model used for the former and the PLSA model for the latter. However, the majority of the methods that appeared before [Yang2009] focused on document clustering, which is generally out of the scope of the present survey. For this reason we consider only community detection methods published after the seminal paper [Yang2009], see Table 14. We will also describe PCL-DC [Yang2009], BAGC [Xu2012], GBAGC [Xu2014], CESNA [Yang2013] and Circles [Mcauley2014] as the most influential methods for community detection in attributed social networks according to Figure 3.

6.5.1. PCL-DC

The method PCL-DC (Popularity-based Conditional Link Model-Discriminative Content) [Yang2009] is based on a discriminative model of combining topology and semantics for community detection. It adapts a conditional model for network structure analysis taking into account the popularity of the nodes. The impact of irrelevant attributes is reduced by the usage of a discriminative content model where attributes are automatically assigned with proper weights, depending on their discriminative power. The above-mentioned models are further combined in a unified framework with the maximum likelihood inference performed in a two-stage EM-based optimization algorithm.

6.5.2. BAGC-GBAGC

The method BAGC (Bayesian Attributed Graph Clustering) [Xu2012] employs a Bayesian probabilistic model for detecting non-overlapping communities in networks with categorical attributes. BAGC uses a generative process similar to that in CohsMix [Zanghi2010]: community labels for the nodes are modelled independently via a multinomial distribution, then attributes are modelled by a multinomial distribution and edges by a Bernoulli one based on the labels modelled earlier. However, in contrast to CohsMix [Zanghi2010], BAGC works with categorical attributes and does not treat the parameters of the distributions as fixed values. More precisely, BAGC takes a Bayesian treatment on the parameters and thus considers all their possible values, which leads, according to the authors, to better community detection quality. The probabilistic inference is then performed by the variational approach from [Jordan1999] together with a certain approximating procedure. GBAGC (General Bayesian framework to Attributed Graph Clustering) [Xu2014], a generalisation of BAGC for weighted attributed networks, is further proposed by the same authors.
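The generative process described above (labels, then attributes and edges conditioned on the labels) can be sketched by forward sampling. This is a simplified illustration, not the BAGC model: all distribution parameters are fixed, hypothetical values, whereas BAGC places priors on them, each node has a single categorical attribute, and no inference step is shown.

```python
import random

def sample_network(n, cluster_probs, attr_probs, edge_probs, seed=0):
    """Forward-sample labels, attributes and edges of a clustered attributed network."""
    rng = random.Random(seed)
    k = len(cluster_probs)
    # 1. Community label of each node: independent draws from a categorical distribution.
    labels = rng.choices(range(k), weights=cluster_probs, k=n)
    # 2. One categorical attribute per node, drawn from the distribution of its community.
    attrs = [rng.choices(range(len(attr_probs[z])), weights=attr_probs[z])[0] for z in labels]
    # 3. Each undirected edge: a Bernoulli trial whose probability depends on the two labels.
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < edge_probs[labels[i]][labels[j]]}
    return labels, attrs, edges

# Hypothetical parameters: two communities, dense inside and sparse between them.
labels, attrs, edges = sample_network(
    n=10,
    cluster_probs=[0.5, 0.5],
    attr_probs=[[0.9, 0.1], [0.2, 0.8]],
    edge_probs=[[0.8, 0.05], [0.05, 0.8]],
)
print(labels, attrs, len(edges))
```

Community detection with such a model amounts to the reverse task: given the observed attributes and edges, infer the labels (and, in the Bayesian setting, the posterior over the parameters).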

6.5.3. CESNA

The method CESNA (Communities from Edge Structure and Node Attributes) [Yang2013] simultaneously uses the probabilistic generative model of BIGCLAM [Yang2013bigclam] for generating connections and the logistic model for attributes to infer the distribution of community memberships. The resulting communities are overlapping. Furthermore, a block-coordinate ascent method is used to update all model parameters in time linear in the size of the network [Yang2013], which makes CESNA suitable for large attributed networks.
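The two model components can be sketched as plain functions of non-negative community membership strengths. The sketch follows the standard BIGCLAM link probability and a logistic attribute model; the membership vectors, the weight vector `w` and the `bias` parameter are hypothetical inputs chosen for illustration, and parameter fitting is not shown.

```python
import math

def edge_probability(F_u, F_v):
    """BIGCLAM-type link model: nodes sharing strong memberships are likely to be connected."""
    return 1.0 - math.exp(-sum(a * b for a, b in zip(F_u, F_v)))

def attribute_probability(F_u, w, bias=0.0):
    """Logistic model: probability that a binary attribute of the node equals one."""
    return 1.0 / (1.0 + math.exp(-(sum(a * b for a, b in zip(F_u, w)) + bias)))

# Hypothetical membership strengths of three nodes in two communities.
F = {"u": [1.5, 0.0], "v": [1.0, 0.2], "x": [0.0, 2.0]}
w = [2.0, -2.0]  # assumed weights: the attribute is typical for the first community

print(edge_probability(F["u"], F["v"]))  # shared community, high probability
print(edge_probability(F["u"], F["x"]))  # no shared community, probability zero
print(attribute_probability(F["u"], w))
print(attribute_probability(F["x"], w))
```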

6.5.4. Circles

The method Circles [Mcauley2014] detects users’ social circles in attributed user’s ego networks via a multimembership node clustering. Its generative model is based on hard assignment of a node to multiple circles and learns the circle-specific user profile similarity metric. To maximize the corresponding likelihood, the coordinate ascent by [MacKay2002] is used.

Remark 9.

TUCM (Topic User Community Model) [Sachan2012] proposes generative Bayesian models for detecting overlapping communities in multilayer attributed networks where different types of interactions between users are possible.

6.6. Dynamical system-based and agent-based methods

Methods from this class treat a network as a dynamical system and assume that its community structure is a consequence of certain interactions among nodes, see Table 15. Some methods assume that the interactions occur in an information propagation process, i.e. while information is sent to or received from every node. Others regard each node as an autonomous agent and develop a multiagent system to detect communities. These methods are not among the most influential in Picture 3, but this is probably due to their novelty. In any case, these contemporary approaches seem to be very efficient for large attributed social networks as they can be easily parallelised.

| Algorithm | Description | Number of clusters as input / Clusters overlap | Network size | Evaluation | Databases | Community detection methods for attributed graphs compared with |
|---|---|---|---|---|---|---|
| CPIP-CPRW [Liu2015] | Content (information) propagation models: a linear approximate model of influence propagation (CPIP) and content propagation with the random walk principle (CPRW) | Yes/Yes | Medium | F-score, Jaccard Similarity, Normalized Mutual Information (NMI) | CiteSeer, Cora, ego-Facebook, PubMed Diabetes | Adamic Adar [Adamic2003], PCL-DC [Yang2009], Circles [Leskovec2012], CODICIL [Ruan2013], CESNA [Yang2013] |
| CAMAS [Bu2017] | Each node with attributes as an autonomous agent with influence in a cluster-aware multiagent system | No/Yes | Medium, Large | Coverage Rate, Normalized Tightness, Normalized Homogeneity, F1-Score, Jaccard, Adjusted Rand Index | Synthetic, ego-Facebook, ego-Twitter*, ego-G+ | CESNA [Yang2013], EDCAR [GunnemannBoden2013] |
| SLA [Bu2019] | A dynamic cluster formation game played by all nodes and clusters in a discrete-time dynamical system | Yes/No | Medium, Large | Density, Entropy, F1-score | Delicious, LastFM, ego-Facebook, ego-Twitter*, ego-G+ | CESNA [Yang2013], EDCAR [GunnemannBoden2013] |

Table 15. Dynamical system-based and agent-based methods

7. Late fusion methods

Late fusion methods fuse topology and semantics after the clustering step. Usually, clusterings produced separately for topological information (e.g. by the Louvain method [Blondel2008]) and for semantic information (e.g. by k-means [Hartigan1979]) are then fused via consensus (ensemble-based) clustering techniques [Lancichinetti2012, Strehl2003, Tagarelli2017, Tandon2019, Gullo2013].

Figure 10. The scheme of late fusion methods.

7.1. Consensus-based methods

Given an ensemble of clusterings, the goal is to find a consensus clustering, i.e. a single, prototypical clustering solution that optimizes a certain objective function defined over the information available from the clusterings in the ensemble. A recent survey on general-purpose ensemble-based clustering methods can be found in [Boongoen2018]. Besides the above-mentioned general-purpose approaches, which have not been compared with the methods discussed in this survey, we could only find the methods in Table 16 that focus specifically on community detection in attributed social networks. Further study in this direction and comparison with other attributed network clustering methods is clearly necessary.

According to Picture 3, there are no consensus-based methods among the most influential ones for community detection in attributed social networks, but we nevertheless provide details on two of them, namely, Selection [Elhadi2013] and WCMFA [Luo2019].

7.1.1. Selection

The Selection method [Elhadi2013] switches from topology-based to semantics-based clustering when the graph structure is ambiguous. More precisely, the method relies on the topology-based clusters when the so-called estimated mixing parameter of the topology-based clustering is less than a threshold, namely the experimental value of the mixing parameter in the LFR benchmark with ground truth [Lancichinetti2008] at which the NMI of the topology-based method drops significantly (the graph structure is then called ambiguous). The threshold for the Louvain method, for instance, is reported in [Lancichinetti2008, Elhadi2013]. Otherwise, the semantics-based clustering (obtained e.g. by k-means) is used. The performance of Selection is compared, in particular, with that of HGPA (HyperGraph Partitioning Algorithm) and CSPA (Cluster-based Similarity Partitioning Algorithm), general-purpose ensemble clustering methods from [Strehl2003], in combining the topology-based Louvain clustering and the semantics-based k-means clustering of the network. It is observed that Selection is able to outperform (in some sense) the tested methods by switching from the Louvain clustering to the k-means one.
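The switching rule can be sketched as follows, under stated assumptions. The estimate of the mixing parameter used here (the fraction of edges that join different clusters) and the value `threshold=0.5` are illustrative choices only, and the function names are hypothetical; the actual estimator and the experimental limit value should be taken from [Elhadi2013].

```python
def estimated_mixing(edges, labels):
    """Fraction of edges whose endpoints lie in different clusters."""
    if not edges:
        return 0.0
    external = sum(1 for u, v in edges if labels[u] != labels[v])
    return external / len(edges)

def select_clustering(edges, topo_labels, sem_labels, threshold):
    """Keep the topology-based clustering unless the structure looks ambiguous."""
    mu = estimated_mixing(edges, topo_labels)
    if mu < threshold:
        return topo_labels, "topology"
    return sem_labels, "semantics"

# Two triangles joined by one edge: the structure is clear, so the
# topology-based clustering is kept.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
topo = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}   # e.g. a Louvain result
sem = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}    # e.g. a k-means result
chosen, source = select_clustering(edges, topo, sem, threshold=0.5)
```

Note that the rule never merges the two clusterings: it returns one of them unchanged, which is why the number of clusters and their overlap depend on the input partitions.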

7.1.2. WCMFA

The Weighted Co-association Matrix-based Fusion Algorithm (WCMFA) [Luo2019] takes as input an ensemble of several clusterings based separately on topology and semantics, with weights depending on the topological and semantic similarity of the initial nodes. A weighted co-association matrix is then constructed so that both the co-occurrence of two nodes in the same cluster and the degree of their similarity, if the pair is indeed in the same cluster, are taken into account. The matrix is then treated as a similarity matrix for the node set and can be passed to the Single Link, Complete Link or Average Link clustering algorithms to find a consensus community structure.
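A minimal sketch of this idea is given below. It is a simplification rather than the algorithm of [Luo2019]: the weighting scheme (averaging a given pairwise similarity over the clusterings that co-cluster the pair), the function names and the cut level are all assumptions, and only the Single Link variant is shown, implemented as connected components above a similarity cut.

```python
def weighted_coassociation(clusterings, similarity):
    """Average, over the ensemble, of sim(i, j) for pairs in the same cluster."""
    n = len(clusterings[0])
    m = [[0.0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += similarity[i][j] / len(clusterings)
    return m

def single_link_cut(m, cut):
    """Single-link consensus: connected components of pairs with m[i][j] >= cut."""
    parent = list(range(len(m)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for i in range(len(m)):
        for j in range(i + 1, len(m)):
            if m[i][j] >= cut:
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(len(m))]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

topology_based = [0, 0, 0, 1, 1]
semantics_based = [0, 0, 1, 1, 1]
sim = [[1.0] * 5 for _ in range(5)]   # uniform similarity: weights reduce to counts
coassoc = weighted_coassociation([topology_based, semantics_based], sim)
consensus = single_link_cut(coassoc, cut=1.0)
```

With the strict cut, node 2, on which the two input clusterings disagree, ends up in a cluster of its own; lowering the cut to 0.5 merges all nodes into one community.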

Remark 10.

We refer the interested reader to [Tagarelli2017, Tandon2019, Gullo2013] for general-purpose clustering methods for multilayer networks.

| Algorithm | Combining the partitions | Number of clusters as input / Clusters overlap | Network size | Evaluation | Databases | Community detection methods for attributed graphs compared with |
|---|---|---|---|---|---|---|
| LCru [Cruz2013] | Row-manipulation in the contingency matrix for the clusterings | No/No | Small, Medium | ARI, Density, Entropy | Facebook, DBLP10K | - |
| Selection [Elhadi2013] | Switching between the clusterings | Depends on the partitions | Medium | NMI, Modularity | Synthetic LFR benchmark [Lancichinetti2008], DBLP84K | BAGC [Xu2012], OCru [Cruz2011], SA-Cluster [Zhou2009], HGPA-CSPA [Strehl2003] |
| Multiplex [HuangWangg2016] | Multiplex representation scheme (attributes and structure are clustered separately as layers and then combined via consensus [Tepper2015]) | No/Yes | Medium, Large | F1-score | Synthetic, ego-Twitter, ego-Facebook, ego-G+ | CESNA [Yang2013], 3NCD [Nguyen2015] |
| WCMFA [Luo2019] | Association matrix with weighting based on topology and semantics similarity | Depends on the partitions | Small | Rand index, Adjusted RI, NMI | Consult [Cross2004], London Gang [Grund2015], Montreal Gang [Descormiers2011] | WMen [Meng2018] |

Table 16. Consensus-based methods

8. Conclusion

It is shown in the survey that there exists a large number of methods for community detection in node-attributed social networks based on different fusion techniques. In particular, 77 methods are grouped and analysed using the proposed classification criterion, and many more are mentioned as related to the topic under consideration. Moreover, we indicated the most influential methods and gave short descriptions of them.

According to our analysis, several essential problems exist in the area. For example, a comprehensive comparative study is urgently needed, as the existing partial contributions do not allow one to see the overall picture of the methods’ community detection quality. Figure 2 confirms that the method-method comparison graph is very sparse. What is more, even if some methods are compared with others, it does not generally mean that one clearly outperforms another, as the hyperparameters of the methods under consideration are usually not tuned for a particular task. Moreover, different authors use different datasets and quality metrics in their experiments, which does not add clarity to the question. Another issue is that many authors are simply unaware of the state-of-the-art methods and continue to compare their approaches with rather inefficient pioneering methods.

We hope that our survey is an important step in resolving the above-mentioned methodological and experimental problems.

9. Competing Interests Statement

There are no competing interests in publication of this survey paper.

10. Acknowledgements

This research is financially supported by the Russian Science Foundation, Agreement 17-71-30029, with co-financing of Bank Saint Petersburg.

References