    # Optimizing an Organized Modularity Measure for Topographic Graph Clustering: a Deterministic Annealing Approach

This paper proposes an organized generalization of Newman and Girvan's modularity measure for graph clustering. Optimized via a deterministic annealing scheme, this measure produces topologically ordered graph clusterings that lead to faithful and readable graph representations based on clustering induced graphs. Topographic graph clustering provides an alternative to more classical solutions in which a standard graph clustering method is applied to build a simpler graph that is then represented with a graph layout algorithm. A comparative study on four real world graphs ranging from 34 to 1 133 vertices shows the interest of the proposed approach with respect to classical solutions and to self-organizing maps for graphs.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Large and complex graphs are natural ways of describing real world systems that involve interactions between objects: persons and/or organizations in social networks, articles in citation networks, web sites on the world wide web, proteins in regulatory networks, etc. Newman (2003). However, the complexity of real world graphs limits the possibilities of exploratory analysis: while many graph drawing and graph visualization methods have been proposed Di Battista et al. (1999); Herman et al. (2000), displaying a graph with a few hundred vertices in a meaningful way remains difficult.

A way to tackle this scalability issue is to simplify a graph prior its drawing. This can be done by finding clusters of vertices via a graph clustering method Schaeffer (2007): rather than representing the original graph, the visualization is restricted to the clusters themselves. More precisely, the graph induced by the clustering is used: each cluster forms a vertex, while edges between clusters are induced by edges between the vertices they contain. As illustrated in the survey paper Herman et al. (2000), numerous implementations of this simple idea can be done, namely all the pairwise combinations between graph clustering methods and graph visualization methods (see also, e.g., Eades et al. (2006) for recent work on planar clustered graphs and Bourqui et al. (2007) for weighted graphs).

Rather than relying on a generic graph clustering algorithm, we propose in this paper to build a clustering of a graph that is adapted to a visualization of the graph induced by the clustering. We follow the general principal of topographic mapping initiated by Prof. Kohonen with the Self Organizing Map (SOM, Kohonen (1995)): a SOM builds a clustering of a dataset in homogeneous and separated clusters that are in addition arranged on a geometrical structure chosen a priori; points assigned to close clusters in the prior structure are close in the original space.

Our method is built on an organized generalization of the popular modularity measure Newman and Girvan (2004) for graph clustering. Following the SOM rationale, the organized modularity uses a prior structure to build topologically aware clusters: nodes assigned to close clusters in the prior structure are more likely to be connected in the original graph than nodes assigned to distant clusters. As for the SOM Vesanto (1999, 2002), using a two dimensional regular grid as a prior structure enables straightforward visualization of the graph induced by the clustering.

While the modularity has interesting properties over other graph cut measures Newman (2006a, 2004, b); Lehmann and Hansen (2007), especially in the visualization context Noack (2009), it has two drawbacks shared with its organized generalization. Firstly, maximizing the (organized) modularity is a combinatorial problem and there is no simple prototype based alternating optimization scheme as available for the SOM and its non Euclidean variants Hammer et al. (2007); Boulet et al. (2008); Villa and Rossi (2007); Villa et al. (2008). Following Lehmann and Hansen (2007) for the modularity, we rely on a deterministic annealing approach Rose (1998) to solve this problem. Secondly, maximizing the (organized) modularity measure tends to miss small clusters, i.e., to aggregate them in large clusters Fortunato and Barthélemy (2007). To limit this effect, we also propose to use the intermediate states of the deterministic annealing algorithm to build finer clusterings than the one with maximal organized modularity. Combined with the prior structure, those clusterings give refined views of the graph induced by the clustering. In summary our contribution is twofold: we introduce an organized modularity measure and we use deterministic annealing to build fair and simplified visualizations of a graph, that are based on a clustering of the vertices.

The rest of this paper is organized as follows: Section 2 details the clustering approach to graph visualization, recalls the modularity definition and introduces its topographic generalization (the organized modularity). Section 3 presents the deterministic annealing scheme used to optimize the organized modularity. Section 4 describes the proposed graph visualization methodology and shows in particular how to leverage the intermediate results of the optimization algorithm to produce fuzzy layouts that limit the resolution effect induced by modularity maximization. Section 5 analyses in detail the behavior of the proposed method on a simple graph, Zachary’s Karate club social network Zachary (1977). Finally, Section 6 demonstrates the practical interests of the method on three larger real world graphs. Technical derivations are gathered in an appendix.

## 2 Topographic graph clustering

### 2.1 Graph clustering for visualization

Cognitive scalability can be brought to graph visualization methods Herman et al. (2000) via graph clustering Schaeffer (2007): the rationale is to draw the smaller and hopefully simpler graph induced by the clustering instead of the original graph. Let us define more precisely this idea.

We consider given a non oriented graph with vertices (or nodes), and weighted edges ( is the set of edges). The (symmetric) weight matrix is denoted by , and is the non negative weight between vertex and vertex ( when ). A clustering of into clusters induces a new non oriented graph defined as follows: each cluster is associated to a vertex in , i.e., and a pair belongs to the edge set if and only if there is such that and . is weighted and the corresponding weight matrix is denoted . The weights are defined by (for ).

The present paper focuses on the following graph visualization strategy: a clustering of the original graph is constructed and a graph visualization method is applied to . In general, the visual representation is completely unaware of the way has been produced: this is a simpler setting than the one of e.g. Eades et al. (2006); Bourqui et al. (2007) in which the proposed algorithms aim at representing a so-called clustered graph. In this latter context, the ultimate goal is to visualize the complete

original graph in a way that respects the (hierarchical) clustering structure. The clustering is used in this case both as a way to circumvent algorithmic difficulties (e.g., the cost of some force directed methods) and as a structural organization principle. In the present paper, the clustering method makes “a meaningful coarse graining of the graph” (to quote

Newman and Girvan (2004)) and produces a new graph which will be the main target of the visualization method.

The solution chosen in this paper is also related to but quite different from the clustered layout strategy Herman et al. (2000). In this framework, one aims at producing a visualization of a graph in which vertices that belong to a given cluster are close, while distinct clusters remain separated. Clusters are generally a by product of those algorithms rather than given beforehand (see, e.g., Noack (2007) for a recent example of such a method).

In practice, we target visualization methods in which each vertex of the new graph is represented by a glyph (e.g., a disk) while each edge is drawn as a straight segment between the corresponding vertices’ glyphs (we do not consider more sophisticated drawing, for instance with bends). Rendering hints can be used to convey information about the original graph to the user: the surface occupied by a glyph can be proportional to the size of the corresponding cluster, while the thickness of a segment can encode the weight associated to the corresponding edge.

### 2.2 Quality measures

In order to be useful, a graph visualization produced via the chosen clustering framework must be both readable and faithful. The readability issue is common to all graph visualization algorithms and has been a matter of research and debate in the graph visualization community with a particular focus on aesthetic criteria, such as bend minimization, symmetry, etc. Di Battista et al. (1999); Herman et al. (2000); Purchase (2002). Experimental studies show that a minimal number of edge crossings is one of the most important quality criteria for readability (see e.g. Ware et al. (2002)). Please note however that there is no consensus on what other readability criteria should be used to evaluate the quality of a graph representation beyond direct task based user studies. In particular, neighborhood rank preservation measures used to assess the quality of nonlinear projection methods Lee and Verleysen (2009) are likely to be irrelevant in the graph context: as showed in Fabrikant et al. (2004), displaying the edges of a graph has a major influence on the way proximities between nodes are inferred by users from the visualization.

The faithfulness issue is more specific to the cluster based visualization: as the final representation of the graph hides most of its structure, inference on the drawing can be misleading. More precisely, there are two potential problems induced respectively by the vertices and the edges in the graph. The first difficulty is that the internal connectivity structure of clusters is hidden by the glyph based representation: while color and/or shape hints could be used to give an idea of the density of connections between the vertices that have been put in a given cluster, the actual connection pattern is lost. A similar problem happens for connection patterns between vertices of distinct clusters: as the associated edges are summarized by a single edge between the induced vertices, there is no way to infer how individual vertices are connected in the original graph.

Therefore, building a faithful clustering of a graph amounts to balancing two criteria that are somewhat contradictory: on the one hand, the density of each cluster should be maximized, but on the other hand the connectivity pattern between two vertices of distinct clusters should be independent of the vertices. Intuitively, a dense cluster with a low outside connectivity is likely to have a rather complex outside connectivity pattern: if all vertices in a cluster were connected to e.g., a single outside vertex, then it might make sense to assign this vertex to the cluster. In addition, enforcing a simple between cluster connectivity has no reason to reduce edge crossing, while within cluster density maximization should on the contrary reduce the number of between cluster connection and therefore improve the readability of the graph.

It seems therefore natural to focus on within cluster density maximization as this solution should lead to a readable graph with a controlled impact on the faithfulness of the representation. Among the clustering quality criteria that favor within cluster density (see Schaeffer (2007)), we focus on the modularity introduced by Newman and Girvan in Newman and Girvan (2004). Let us denote the degree of vertex and the total weight of the graph. Then the modularity of the clustering of is

 Q((Ck)1≤k≤C)=12mC∑k=1∑i,j∈Ck(Wij−Pij), (1)

where is a symmetric matrix given by .

The rationale of the measure is to compare the weight of a link between two vertices in a cluster, , to a simple random model, , in which the weights are proportional to the degrees of the vertices and independent of the clusters. A good clustering tends to cluster vertices that are more connected that one expects based solely on the degrees of the vertices: this corresponds to and to higher values of . Maximizing the modularity tends therefore to produce dense clusters, but only when they are meaningful as measured by a deviation from the simple random model.

The modularity is a rather successful quality measure for graph clustering, especially compared to other graph cut like measures Newman (2006a, 2004, b); Lehmann and Hansen (2007). In addition, Noack (2009) has recently shown that there is a strong link between clusterings induced by modularity maximization and those obtained with some versions of the clustered layout strategy described previously. The modularity seems therefore to be a good candidate for the topographic extension studied in this paper. However, this measure suffers from a resolution problem Fortunato and Barthélemy (2007): maximizing the modularity may fail to identify small but meaningful clusters of nodes. We will tackle this problem by leveraging both the chosen optimization algorithm and the prior structure introduced below (see Section 4.2 for details). It should be noted in addition that the main concepts and techniques used in this paper are quite independent from the actual quality measure. In particular most graph cut based measures Schaeffer (2007) could be handled in a similar way as we proceed with modularity.

### 2.3 Organized modularity

The main limitation of graph clustering for visualization is that the two phases of the methodology are generally completely independent: the clustering method is not explicitly designed to help the subsequent visualization method. We propose in this paper to address this limitation via a specific quality measure for the clustering phase.

Following the SOM rationale, we derive an organized version of the modularity. We assume given a prior structure (in ) which is represented by a symmetric by matrix of prior similarities between clusters such that for all . For instance where is the prior position of cluster in . Then the organized modularity of the clustering of is

 O((Ck)1≤k≤C)=12m∑i,jSc(i)c(j)(Wij−Pij), (2)

where is the index of the cluster to which the vertex is assigned.

In the standard modularity, the term is taken into account only when and belong to the same cluster. In the organized version, this term is always taken in account, but with a weight equal to the prior similarity between and . This favors proximity in the prior structure for connected clusters. If there are indeed significant connections between vertices in two clusters and (i.e., ), then the value of will be higher if is high than if it is low. This is similar to the SOM principle in which a prototype has to be close to observations assigned to its unit but also, to a lesser extent, to observations assigned to neighboring units in the prior structure.

Maximizing gives a graph clustering that can be used to coarsen the original graph prior a standard visualization algorithm, as explained in Section 2.1. In addition, the prior structure gives a natural layout for the clustered graph: nodes of this new graph can be drawn at their corresponding positions on the grid.

## 3 Organized modularity maximization

### 3.1 Modularity maximization

The main difficulty with (organized) modularity maximization is that it is a discrete optimization problem for which there is no simple alternate minimization scheme (contrarily to the SOM or the

-means, for instance). Deriving fast modularity maximization algorithms that produce acceptable solutions has been an important research topic since the introduction of this quality measure (this is generally the case for all graph clusteringing measures). The best compromise between computational load and quality seems to be currently achieved by some type of heuristic algorithms that coarsen the graph in an iterative and multi-level way

Blondel et al. (2008); Noack and Rotta (2008).

In the present paper, we use a quite different approach which is more adapted to the organized modularity defined in the previous Section. Following Lehmann and Hansen (2007), we propose to maximize the (organized) modularity via a deterministic annealing (DA) approach Rose (1998). As pointed out in Lehmann and Hansen (2007), the main advantage of DA over simulated annealing (as used by, e.g., Reichardt and Bornholdt (2006) in the graph clustering context) is the speed of the former: deterministic annealing can use aggressive annealing schedules with a relatively small number of iterations compared to simulated annealing. In addition, intermediate results of DA can be used to limit the resolution effect induced by modularity maximization, as will be shown in Section 4.2.

To ease the derivation of the DA algorithm for modularity maximization, we use an assignment matrix notation for clusterings. An assignment matrix for a clustering of in clusters is a matrix with entries in such that for all . In other words, is equal to 1 if and only if is assigned to cluster . We denote the set of all valid assignment matrices for a clustering of in clusters ( and will be given by the context). Please note that we do not constrain an assignment matrix to have non empty clusters.

The modularity and the organized modularity of an assignment matrix are given respectively by:

 Q(M) = 12m∑i,j∑kMikMjk(Wij−Pij) (3) O(M) = 12m∑i,j∑k,lMikSklMjl(Wij−Pij) (4)

If is the symmetric matrix defined by

 Bij={0if i=j12m(Wij−Pij)if i≠j. (5)

then, maximizing is equivalent to maximizing

 F(M)=∑i,j∑k,lMikSklMjlBij (6)

as shown in Section A.1

. Finally, it should be noted that by using the identity matrix for

, one recovers the modularity: the algorithm derived for the maximization of will therefore apply to both the standard modularity and the organized version.

### 3.2 Deterministic annealing and mean field approximation

Deterministic annealing tries to solve the complex combinatorial problem of maximizing a function defined on a finite (but large) space via an analysis of the Gibbs distribution obtained as the asymptotic regime of a classical simulated annealing Kirkpatrick (1983); Cerny (1985) or via the principal of maximum entropy Rose (1998). In our case, the Gibbs distribution for temperature is

 P(M)=1ZPexp(βF(M)), (7)

where the normalization constant is given by (the sum ranges over the full space ).

The main idea of DA is to compute the expectations of the assignment matrix at a fixed temperature with respect to the Gibbs distribution, i.e., the , and then to decrease the temperature while tracking the evolution of the expectations (in simulated annealing, the Gibbs distribution is sampled rather than studied via expectation).

Unfortunately, is not linear with respect to and computing and is therefore difficult: one should resort on an exhaustive exploration of which is computationally infeasible. Following previous work on similar topics, we approximate by a distribution that factorizes (see e.g., Hofmann and Buhmann (1997); Graepel et al. (1998)). This corresponds to approximating the interaction between say and all the other variables via a mean field . Then we compute the expectation of the assignment matrix with respect to the approximating distribution.

More precisely, we consider the bi-linear cost function where is the mean field, a by matrix of partial assignment costs. For a fixed temperature , we look for a mean field that gives a distribution close to

, in the sense that the Kullback-Leibler divergence

between and is minimal, with (this corresponds to a variational approximation of Jaakkola (2000)).

At a minimum, the gradient of with respect to is zero. This leads to the following classical mean field equations (see Appendix A.2 for details):

 ∂ER(F(M))∂Ejl=∑k∂ER(Mjk)∂EjlEjk, ∀j,l. (8)

They are obtained using the main consequence of the mean field approximation, namely the independence between and for under the distribution , i.e., the fact that for .

To solve the mean field equations, we use a EM-like approach. We consider the fixed and solve the equations for (maximization phase). Then we compute the new values of the (expectation phase). This latter phase leads to the very simple standard deterministic annealing update rule:

 ER(Mik)=exp(βEik)∑lexp(βEil). (9)

Moreover, the independence property recalled above gives

 ER(F(M))=∑i≠j∑k,lER(Mik)SklER(Mjl)Bij.

Then, some straightforward calculations (see Appendix A.3 for details) show that equation (8) is fulfilled if the mean field is given by

 Ejk=2∑i≠j∑lER(Mil)SklBij, (10)

or, in matrix notations, , using the symmetry of and .

### 3.3 Phase transition and final algorithm

It is well known that deterministic annealing goes through several phase transitions when the temperature is decreased

Rose (1998); Graepel et al. (1997); Hofmann and Buhmann (1997). In order to choose an adapted annealing schedule, i.e., an increasing series of that is used to track the evolution of , it is important to find at least the first phase transition.

At the limit of infinite temperature (), it can be shown (see Appendix A.4) that the following mean field

 E0jk=2C∑i≠jBij∑lSkl, (11)

is a fixed point of the like scheme (i.e., of equations (9) and (10)). The corresponding assignment expectations are uniform, i.e.

 ER(M0ik)=1C. (12)

An analysis of the stability of the EM-like scheme (conducted in Appendix A.4) shows that the fixed point is stable when the temperature is higher than the critical temperature , where and

are the spectral radii, i.e., the largest eigenvalues in absolute value, respectively of

and

. The first phase transition will therefore happen when the temperature becomes lower than this limit. It should be noted that the critical temperature is a very conservative estimation of the temperature of the first phase transition, as it corresponds to a worst case analysis of the stability of the fixed point.

To derive the final Algorithm 1, we follow Rose (1998) rather than Lehmann and Hansen (2007). We use in particular the perturbation idea of Rose (1998): each time the temperature is lowered, some noise is injected in the mean field before running the EM-like scheme. This method favors phase transitions and symmetric breaks that are needed for a proper convergence.

Deterministic annealing is very robust and quite insensitive to the parameters that appear in Algorithm 1:

• is the number of outer iterations, i.e., the number of temperatures considered during the annealing process (a typical value is , the size of the graph);

• is the relative starting temperature above the critical temperature (typically );

• is the damping factor of the temperature (typically chosen such that the final temperature is );

• for the noise , we used a random multiplicative factor chosen uniformly in for each component of ;

• the EM-like phase is considered to be stable when the mean squared difference between the components of two values of is below the square root of the machine precision (or after 500 iterations).

Finally, the computation cost of the algorithm remains reasonable: the costly operation is the product . As explained in Newman (2006a), one can leverage the structure of to obtain a fast matrix multiplication. Indeed is made from the weight matrix and the degree matrix. Computing costs operations (where is the number of edges of the graph), while can be computed in operations ( is the number of vertices) exploiting the definition of . Therefore computing costs . Then computing the product of by is a operation while applying equation (9) costs . The total cost of one iteration of Algorithm 1 is therefore . Computing the critical temperature can be done quite efficiently via the Lanczos method Golub and van Loan (1996) in but in fact only a very rough estimate of the spectral radius is needed. Therefore, a few iterations of a power method should be enough to give a reasonable initial temperature. In practice, the computational cost of the method makes it suitable for graphs with a few thousands vertices as long as there are not too dense.

## 4 Graph visualization methodology

In principle, the proposed graph visualization methodology is rather straightforward and strongly related to e.g., exploratory data analysis with the SOM. We choose a regular grid as a prior structure and use Algorithm 1 to find an optimal clustering with respect to the organized modularity defined by equation (4). Then, as explained in Section 2.1, we display the clustering induced graph: we do not need a graph layout algorithm in this phase as each cluster has a dedicated position in the grid.

In practice, some parameters need to be carefully chosen to provide a meaningful visualization. In addition, (organized) modularity maximization faces a problem of limited resolution and leads sometimes to oversimplified clustering induced graph. We describe in this Section the proposed solutions to those issues.

### 4.1 Parameter tuning

The internal parameters of Algorithm 1 described in Section 1 do not request any particular tuning and the guidelines provided should in general give satisfactory results. On the contrary, the quality of the visual representation strongly depends on the prior structure, both in terms of the size of grid itself and in terms of the matrix. Those parameters should be optimize as automatically as possible. The general principle consists first in defining a finite set of acceptable prior structures and then in selecting the best one via some quality criterion.

Our practical goal is to end up with readable graphs: we consider therefore small prior structures, for instance square grids with an absolute maximum of 10 nodes per dimension (i.e., up to 100 clusters). The entries of the matrix are calculated via a SOM like equation

 Sij=H(λ∥xi−xj∥), (13)

where is the position of cluster in the prior structure and is either (exponential decrease) or (linear decrease). The scaling parameter is used to tune the neighboring influence and can be chosen so as to include from zero neighbor to a complete influence of all clusters on each other.

Unfortunately, as recalled in Section 2.2, there is no universally accepted quality criterion for graph visualization. Moreover, the case of clustered visualization naturally corresponds to a trade-off between internal and external connectivity uniformity. To handle this trade-off we propose to rely on dual objectives optimization principles. More precisely, we use the standard modularity measure to assess clustering quality and the number of edge crossing to assess visual quality. Rather than combining the measures to select one value of the parameters of the prior structure, we consider Pareto optimal prior structures: each selected structure is better than all others for at least one of the two criteria.

In practice, this means that the parameter tuning process is only semi-automatic: it selects some interesting visualizations from the explored set of parameters. Those visualizations can then be presented to the user for selection. As explained in Section 2.2, while the modularity is a clustering quality criterion, it favors both dense clusters and low connectivity between clusters. It is therefore reasonable to sort Pareto optima in decreasing modularity order prior user analysis. However, limiting the results to the best prior structure according to modularity will generally lead to oversimplified graphs because of the limited resolution of this measure Fortunato and Barthélemy (2007): this justifies presenting to the user Pareto optima with sub-optimal modularity but with more non empty clusters and less edge crossings.

### 4.2 Fuzzy layout

In addition, one can leverage the annealing process to produce intermediate results which can be considered as compromise between the main strategy and the clustered layout strategy Noack (2007). Indeed, as will be shown in Section 5, the expectation of the assignment matrix does not contain 0/1 values, even after some phase transitions. The fuzzy layout strategy introduced in the present section takes advantage of this fact.

Let us denote the positions associated to the clusters in the prior structure. The position of vertex in the prior structure based layout associated to the assignment matrix is given by

 pi=∑kMikxk,

and therefore, the expected position is

 ER(pi)=∑kER(Mik)xk. (14)

At the limit of zero temperature,

contains only 0 and 1, and the expected positions are exactly positions in the prior structure. However, during annealing, vertices that are difficult to classify will have in-between positions reflecting non peaked values of

.

As a by product of the annealing scheme, one can therefore provide an animated rendering of the evolution of by displaying for different values of the temperature the points (the display is somewhat similar in principle to the posterior mean projection used in the Generative Topographic Mapping Bishop et al. (1998)). To avoid cluttering the layout with overlapping vertices, we rely on an elementary simplification scheme: a complete linkage hierarchical clustering is applied to the and the dendrogram is cut at an appropriate level (for instance 5% of the minimal distance between the points of the grid). This leads to a finer clustering than the final one, with associated positions for the clusters. Many of those clusters are positioned near or on cluster positions on the grid, especially when the annealing scheme nears completion. This induces generally some overlapping between edges. To limit this effect, we use the positions computed above as initial positions of a force directed placement algorithm (such as the Fruchterman-Reingold algorithm Fruchterman and Reingold (1991)). The algorithm is used only for a few iterations that move slightly the clusters and reduce overlapping.

In practice, the user first browses through the Pareto optimal points sorted in decreasing modularity order to select some interesting visualizations. Then he/she can request an animated rendering of the most interesting graphs to assess whether some sub-structure can be found in the clusters.

## 5 Detailed analysis of a small graph

This section provides a detailed analysis of the behavior of the proposed method on a small graph. The method has been implemented in R R Development Core Team (2009), using the igraph package Csardi and Nepusz (2006) complemented by the network package Butts (2008).

We study Zachary’s Karate club social network Zachary (1977). The graph represents the friendship social network between the 34 members of a Karate club at a US university in the 70s. The graph contains 78 unweighted edges (its global density, the fraction of connected pairs of nodes, is thus equal to 13.9 %) and is represented on Figure 1 obtained with the Fruchterman-Reingold force directed algorithm Fruchterman and Reingold (1991) as implemented in igraph Csardi and Nepusz (2006)

. The transitivity of the graph, that is a measure of the probability that the adjacent vertices of a vertex are connected (see e.g.,

Wasserman and Faust (1994)), is equal to 25.6 %. The gap between the global density and the transitivity is a good indication for a larger local density and thus a relevant clustering.

### 5.1 Modularity maximization

It is well known from previous work on this graph that in terms of modularity, the optimal number of clusters is four (see e.g., Duch and Arenas (2005)). This is a consequence of the structure of the modularity which tends to peak at a certain graph specific number of clusters and then decreases (this is another manifestation of the limited resolution of this measure Fortunato and Barthélemy (2007)). For this graph, the social analysis conducted by Zachary leads to the definition of two clusters which correspond to the split between members of the club during the course of the study. Both clusters are split into two sub-clusters by modularity maximization (one ground truth cluster corresponds to clusters 1 and 4 in Figure 6 and the other to clusters 2 and 3). We have investigated the behavior of the deterministic annealing algorithm with ranging from to . The parameters of the algorithm were the following ones: , is chosen so that the final temperature is and (this relatively large value was used to obtained a convergence for each EM-like phase; a faster annealing introduced instabilities on this graph). Figure 2: Maximal modularity achieved by DA as a function of C Figure 3: Evolution of the modularity during annealing (the solid line corresponds to the theoretical temperature for the first phase transitions, dashed lines correspond to detected transitions)

As shown on Figure 2, the maximal modularity achieved by deterministic annealing (without a prior structure) reaches its maximum at clusters and then remains constant (the value of the maximum is , as reported in e.g., Duch and Arenas (2005)). This is an interesting feature of DA: the algorithm is able to produce empty clusters if this increases the objective function. In the present case, the algorithm produces a maximum of four non empty clusters, even if we ask for eight. While this is interesting in terms of maximizing the modularity, this tends to increase the resolution problems faced by this quality measure Fortunato and Barthélemy (2007). The fuzzy layout strategy proposed in Section 4.2 is therefore expected to be very useful in this context.

Figure 3 shows the evolution of the modularity during annealing for the case . More precisely, we compute the expectation of the modularity with respect to the mean field distribution:

 ER(Q(M))=∑i≠j∑kER(Mik)ER(Mjk)Bij+12m∑i(Wii−Pii) (15)

The critical temperature appears, as expected, as a very conservative estimate of the temperature of the first phase transition. This first phase transition corresponds to the detection of two well identified clusters (the ones discovered by Zachary in Zachary (1977)), while the second phase transition introduces two additional clusters. The evolution of the expectation of the assignment matrix is represented by Figures 4 and 5: each figure includes four copies of the graph. In copy number , the gray level of the node encodes the value of , i.e., of the probability for node to belong to cluster according to the approximating distribution . Figure 4: Assignment probabilities after the first phase transition: black corresponds to probabilities close to one, white to probabilities close to zero

After the first phase transition, clusters 1 and 3 start to take (partial) ownership of some of the vertices (the black ones on Figure 4), even if only 6 vertices out of 34 have a maximal assignment probability higher than 0.75. More than 40% of the vertices (14 out of 34) have rather fuzzy assignment probabilities (i.e., less than 0.5 for the maximal value). Figure 5: Assignment probabilities after the second phase transition: black corresponds to probabilities close to one, white to probabilities close to zero

After the second phase transition (shown on Figure 5), the algorithm has identified four distinct clusters accounting for 23 vertices (which are all assignments to a cluster with more than 0.75 probability). Four vertices still have a maximal assignment probability below 0.5. If the annealing is brought to the limit of a very low temperature (below the threshold of chosen here), the expectation of the assignment matrix is almost a 0/1 matrix, but such effort is not needed: in the case of the Karate club graph, all vertices have a maximal assignment probability higher than 0.5 at and a winner-take-all approach leads to a well defined clustering of the graph. Figure 6: Left: original layout numbered according to the clustering; right: clustering induced graph

Figure 6 compares the original layout numbered via the clustering to the layout of the clustering induced graph (obtained by the Fruchterman-Reingold force directed algorithm Fruchterman and Reingold (1991)). The induced graph emphasizes clearly the relation between the clusters: cluster 2 is not directly connected to clusters 1 and 4, while the connection between 3 and 4 is less dense than the ones between 1 and 3 or 4. The summary is therefore quite informative, even if, as expected, most of the structure is lost. For instance, cluster 1 exhibits a start shape which is of course lost in the glyph based representation.

### 5.2 Organized modularity maximization

For the organized modularity, we use the simplest setting compatible with the four clusters identified in the previous section: the prior structure consists in a square in which each vertex is associated to a cluster. We use the following influence matrix:

 S=⎛⎜ ⎜ ⎜⎝1λλ0λ10λλ01λ0λλ0⎞⎟ ⎟ ⎟⎠,

where specifies the amount of local influence (there is no diagonal influence). Interestingly, the value of has an impact on the number of non empty clusters obtained by deterministic annealing. We use the same parameters for the algorithm as in the previous section, while varying between and . As shown on Figure 7, a large influence tends to reduce the number of effective clusters produced by the algorithm. Figure 7: Number of non empty clusters as a function of the influence parameter λ

This can be explained by an analysis of the optimal four clusters clustering obtained before. As shown on Figure 6, there is no direct connection between cluster 2 and clusters 1 and 4. As a consequence, the corresponding entries are negative: the only way to prevent the organized modularity to be smaller than the standard one is to put cluster 2 far away from clusters 1 and 4 in the prior structure. This is not strictly feasible as there are only two pairs of cluster positions with no direct influence in (they corresponds to the diagonal of the square). As a consequence, one might put cluster 1 far away from cluster 1 or from cluster 4, but not from both. Therefore, the modularity will decrease by e.g.,

 −12m∑i∈C2,j∈C4λPij,

if cluster has no influence on cluster . Then, when is large enough, the reduction of the four clusters organized modularity will be high enough to bring its value below the one of the two or three clusters standard modularity. Those latter configurations are easier to arrange on the prior structure in a way that minimizes negative contributions: they will be preferred by the algorithm over the four clusters solution.

In fact, the behavior of the algorithm is comparable to the one of a force directed method: when there is no connection between two vertices, they only repel each other and the algorithm tries to maximize their distances. As shown on Figure 8, the representation of the graph obtained via the prior structure is quite similar to the one provide by Figure 6. The underlying clusters are completely identical and thus use the same numbering to ease comparison between the figures (this is also the case when the value of induces a clustering with two or three clusters: the obtained clusterings are the ones that maximize the standard modularity for or ).

### 5.3 Fuzzy layout Figure 9: Evolution of the fuzzy layout during annealing (the Fruchterman-Reingold algorithm was used to reduce edge overlapping)

It might seem from Figure 8 and more generally from the discussion above that the organized modularity offers no particular interest compared to a two phases approach with a maximal modularity clustering followed by a force directed layout. The behavior of the DA algorithm on the Karate graph emphasizes the fact that the modularity does not increase with the number of clusters: it tends to peak at an optimal graph specific number. As explained before, there is a complex interaction between the prior structure and the values of the modularity for different numbers of non empty clusters. In some situations, the prior structure introduces a too strong coupling between clusters and leads this way to a reduced number of non empty clusters. This might lead to a better visualization of the graph, to an over simplification or to nothing more than the two phases approach (Section 6 will illustrate this further).

The first way to address this problem is to test several prior structures and to select optimal graphs with respect to different quality measures as proposed in Section 4. A second (and complementary) approach consists in leveraging the annealing process to limit the impact of the peaking behavior of the modularity by means of the fuzzy layout strategy described in Section 4.2. An example of such layouts is given by Figure 9. The representations give a more complete picture of the graph than Figure 6 (or Figure 8). It appears for instance quite clearly that two vertices are not easy to assign to the final four clusters (they are at the boundary of cluster 1 in Figure 8). If we study carefully the original representation on Figure 1 with the added knowledge provided by Figures 6 and 9, those vertices (number 10 and 24) appear quite clearly in between two clusters. However, it would be almost impossible to obtain this information directly from the original representation. Figure 9 shows also that the composition of cluster 3 (in Figure 8, bottom left in Figure 9) is quite obvious compared to others. This is easily confirmed on Figure 6.

### 5.4 Comparison with SOM variants

As the proposed method is based on the SOM rationale, it seems natural to compare it to SOM variants that can handle graph nodes, mainly the kernel SOM with adapted kernels Villa and Rossi (2007) and the spectral SOM Villa et al. (2008). Details on those methods and on parameters optimization are postponed to Section 6.2 in order to keep the present Section focused on a detailed analysis of the Karate graph.

Following the methodology described in Section 6.2, we have built Pareto optimal points with respect to the modularity of the clustering and the number of edge crossings in the induced layout. The only global Pareto optimal point among SOM variants has been obtained by the heat kernel (see Kondor and Lafferty (2002)). It has zero edge crossing and a modularity of 0.4188. The obtained layout is identical to one of Figure 8, up to a rotation. However, the underlying clustering is slightly different, as shown by the reduced modularity. In fact, node number 10 on Figure 1 is assigned to the same cluster as node number 3 rather than to node number 34’s cluster in the optimal clustering (in terms of modularity). It turns out that node number 10 should be assigned to node number 34’s cluster according to Zachary’s analysis. Therefore, the best SOM variant fails to recover exactly the ground truth clustering, while modularity optimization does recover a finer clustering.

Other variants make similar or worse mistakes. For instance, the best result obtained by the modularity kernel SOM has a modularity of 0.409 which corresponds to two differences with the optimal modularity clustering. Node number 10 is misclassified compared to Zachery’s two clusters ground truth, while node number 24 is assigned to the correct Zachery’s cluster but to a different cluster than in the highest modularity clustering. Figure 10: Original layout numbered according to the clustering obtained by the Laplacian’s inverse kernel SOM

The best solution for the Laplacian’s inverse kernel SOM has an even worse modularity of 0.391. In this case, the differences between the proposed clustering and the ground truth are more important. Figure 10 gives the clustering obtained by the Laplacian’s inverse kernel SOM on the full layout of the Karate graph. Node number 3 is assigned to a wrong cluster according to the ground truth. While the mismatch corresponds to only one node, the status of this node is much more obvious than the one of node 10: it is connected to node number 1, the central actor one of the clusters identified sociologically. In addition, cluster number 4 seems to consist in boundary nodes rather than in a dense subset of nodes. While this SOM variant manages to reach a fair approximation of the ground truth clustering, it splits one of those clusters in a misleading way: cluster 4 is not dense and its members are more connected to outside nodes than between themselves.

In summary, the SOM variants are unable to recover exactly the ground truth clustering obtained by Zachary on this graph. In addition, the SOMs give an example of a clustering with reduced modularity (compared to optimal ones) with a possibly misleading cluster: the subgraph is not dense and nodes have more connections outside of the cluster than inside. Limitations of graph SOMs compared to our proposal will be confirmed in the next Section.

## 6 Results on larger graphs

### 6.1 Datasets

In order to confirm the conclusions reached in the previous section, the following experiments provide a comparison between the proposed method, the classical two phases approach and SOM variants on three larger graphs. More precisely, the studied graphs are:

• a coappearance network of characters in chapters of the novel Les Misérables (Victor Hugo), introduced by Knuth (1993) and available at http://www-personal.umich.edu/~mejn/netdata/lesmis.zip. This graph is undirected and weighted by the number of chapters in which each pair of characters appear together. It is larger than the graph described in Section 5 with 77 nodes representing the characters of the novel. The global density of this graph is equal to 8.7% and its transitivity is equal to 49.9%. This indicates that the local connectivity of the graph is very high compared to its global one and thus that this graph is likely to be clustered into dense subgroups. This graph is represented in Figure 11 (this layout has been obtained via the Fruchterman-Reingold force directed algorithm Fruchterman and Reingold (1991) as implemented in igraph);

• a directed, weighted network representing the neural network of the worm “C. Elegans”, introduced by

Watts and Strogatz (1998) and available at http://cdg.columbia.edu/cdg/datasets. The graph has been used as an undirected weighted graph: the weights, , of the undirected graph are simply defined from the weights of the directed graph by . This graph is connected and contains 453 nodes. Its density is equal to 2.0% and its transitivity to 12.4%. Compared to the graph from “Les Misérables”, this network has a much smaller local connectivity indicating that the clustering task should be harder. Hence, a smaller modularity is expected for the resulting clustering. This graph is represented in Figure 12 (left). While the graph from “Les Misérables” was readable, the “C. Elegans” graph is impossible to decipher when rendered by the standard Fruchterman-Reingold algorithm;

• a network representing the e-mail exchanges between members of the University Rovira i Virgili (Tarragona), introduced by Guimera et al. (2003) and available at http://deim.urv.cat/~aarenas/data/xarxes/email.zip. This graph is also connected and contains 1 133 nodes. Its global density is equal to 0.9% and its transitivity to 16.6%. As in the previous case, the transitivity of the graph is not very high but the gap between the global density and the transitivity is much larger and a larger modularity than for “C. Elegans” can then be expected. This graph is represented in Figure 12 (right) and is as undecipherable as the “C. Elegans” one when rendered with the Fruchterman-Reingold algorithm. Figure 11: Coappearance network from “Les Misérables” (left: with characters’ names; right: without label) Figure 12: Two real-world networks: Neural network of the worm C. Elegans (left) and E-mail network between the members of the University Rovira i Virgili (right)

### 6.2 Experimental setting

#### 6.2.1 Reference methods

The experiments compare the proposed method with two other approaches:

1. as explained in Section 2, the proposed approach tries to improve the standard two phases approach by building topographically ordered clusters. The reference approach consists therefore in a two phases method already used in Section 5.1: we build a good clustering via deterministic annealing maximization of the modularity and the graph induced by the clustering is rendered via a force directed placement algorithm (the Fruchterman-Reingold algorithm as implemented in igraph);

2. an alternative organized approach is the one described in Villa and Rossi (2007): Kohonen’s Self Organizing Map algorithm is used to build a topographically ordered clustering of the graph (on a prior grid) via a well chosen graph kernel. Several popular kernels are defined for graphs, including regularized versions of the Laplacian111We recall that the Laplacian of a graph is the matrix, , defined by (see, e.g., Smola and Kondor (2003)), the generalized inverse of the Laplacian Fouss et al. (2007) or the modularity kernel Zhang and Mao (2008). The latter one is defined from the positive part of the matrix involved in the definition of the modularity (see equation (5)).

A slightly different way to use the Self Organizing Map for clustering the vertices of a graph is what is called “spectral SOM” in Villa et al. (2008)

. This approach is inspired by spectral clustering (see e.g.,

von Luxburg (2007)) but using a SOM instead of the usual

-means algorithm. More precisely, the vertices of the graph are represented by vectors of

(where

is the initial number of clusters of the prior grid) that are the coordinates of the eigenvectors associated to the

smallest non zero eigenvalues of the Laplacian of the graph. Then, a usual vector SOM is applied to these vectors.

Those four SOM variants were used in Section 5.4 for the Karate graph.

#### 6.2.2 Parameters optimization

The parameters optimization procedure outlined in Section 4.1 is applied to tunable parameters of the reference methods (e.g., the number of clusters, the kernel and its parameters, etc.). Random initialization is included in the procedure for methods that are not deterministic.

In the specific case of the two phases approach, we rely on a two phases optimization: the best clustering is selected by maximization of the modularity over the number of clusters and then the layout with the lowest number of crossing edges is kept among 10 different random initializations for the Fruchterman-Reingold algorithm.

The other solutions are “all-in-one” and provide at the same time a clustering and a layout. We select therefore on Pareto optimal parameter sets. The following parameters were optimized:

• the number of clusters for the clustering algorithm was optimally chosen in (please note that all methods can produce empty clusters);

• the prior structure was encoded via a square grid selected among 3 sizes: , and (i.e., 9 clusters, 16 clusters and 25 clusters). The vertices of the grid have integer coordinates;

• the entries of the matrix were calculated via equation (13). We compared exponential decrease and linear decrease, and used at least four different scaling values, chosen according to the induced radius of influence in the prior structure;

• in the case of the SOM (kernel version or “spectral SOM”), the optimal configuration (size of the prior structure and neighborhood function) was selected in the same set of configurations as for the organized modularity;

• as the batch SOM used in this paper exhibits some dependence to the initial configuration, we used 5 different initializations for each run of the algorithm. Among them 4 were random ones and the last one was a PCA based initialization in which the prior structure is positioned on the plane spanned by the first two principal components (this is done with kernel PCA Schölkopf et al. (1998) in the case of kernel SOM and standard PCA in for the spectral SOM);

• as explained above, we used three different kernels. Among them, only the heat kernel has a parameter: it is defined as where is the Laplacian of the graph and a temperature parameter. Four values of were used;

We used the three kernels only for the analysis of “Karate” and “Les Misérables”. For the two other graphs (C. Elegans and E-mail), we restricted the analysis to the generalized inverse that has achieved almost the best results in the first experiment (comparable with the best ones obtained for the heat kernel) and that does not require the tuning of an additional parameter.

#### 6.3.1 Numerical results

We first compare the proposed method to the SOM variants with respect to the chosen quality criteria: the modularity of the clustering and number of edge crossings in the associated representation. As explained in Section 2 and 4.1, those quality criteria have been chosen in order to obtain a good compromise between a readable clustering based representation of the graph (with small number of edges crossing) and the fairness of this representation (dense clusters, i.e., high modularity). Figures 13 and 14 give the values of the quality criteria for all the experiments made on the dataset “Les Misérables”, while Tables 1, 2 and 3 give the pairs of values associated to Pareto points obtained in for each graph. We first focus on “Les Misérables” and then give more general comments. Please note that comments and conclusions about “Les Misérables” apply to the Karate graph with the only exception that the modularity kernel SOM performs in an acceptable way on the Karate graph. Detailed results are not included to avoid lengthening the article. Figure 13: Quality criteria for the kernel and spectral SOM for “Les Misérables”: x axis is the modularity and y axis is the number of edge crossings. The gray level of the points gives the initial size of the square grid and Pareto points are indicated by a black star. Solutions with a modularity lower than 0 or a number of edge crossings greater than 500 are omitted on this figure. Figure 14: Quality criteria for the optimization of the organized modularity by deterministic annealing for “Les Misérables”. The gray level of the points gives the initial size of the square grid and Pareto points are indicated by a black star.

As shown in Figures 13 and 14, the clusterings obtained on “Les Misérables” by optimizing the soft modularity have generally larger modularity values than those produced by SOM variants. In addition, only a limited subset of the parameter space leads to high modularity with SOM variants. Both outcomes were expected for kernel SOMs using the heat kernel and the generalized inverse of the Laplacian. Indeed those kernels induce feature spaces and associated clustering objectives that have no simple relation with the modularity criterion. As shown in e.g. Boulet et al. (2008); Yen et al. (2009) both kernels lead to interesting clustering results, but it is clear from the present results that one cannot expect to get high modularity clustering from them in general.

The poor results for the spectral SOM are slightly less expected as this method is related to some form of graph cut measure optimization, using the relaxation proposed in spectral clustering von Luxburg (2007). However, as recalled in Section 2.2, graph cut measures are quite different from the modularity, which probably explains the low modularity clusterings obtained by this method.

Finally, the results obtained by the modularity kernel are quite disappointing. Indeed, this kernel is strongly related to the modularity measure Zhang and Mao (2008) and should lead to some form of modularity maximization. In practice, it seems that taking only the positive part of the matrix does not capture all the complexity of the modularity, at least in the SOM context.

Of course, as our method aims at maximizing an organized version of the modularity, the higher quality of the results in terms of a closely related measure (the classical modularity) is rather natural. However, the Figures show that it also leads to a low number of edge crossings and therefore to more readable layouts. Figure 13 shows in particular that SOM variants generally fail to balance modularity and edge crossings: high modularity clusterings have frequently a significant number of edge crossings (see also Table 1).

Combining both criteria lead to the search of Pareto points. The case of “Les Misérables” is summarized by Table 1: when considered all together, the SOM approaches (kernels and spectral) lead to 4 Pareto points which are of lesser quality than the one obtained by the proposed approach: none of the Pareto points for kernel SOM is a global Pareto point for the dataset mainly because of low modularity values. In addition, and for similar reasons, the spectral SOM and the modularity kernel SOM do not produce any Pareto point. In fact, the only realistic competitive method in the SOM family is the kernel SOM based on Laplacian’s inverse. Indeed, it gives comparable results as the ones obtained with the heat kernel but without the need for a time consuming kernel parameter tuning. The two larger datasets, “C. Elegans” and “E-mail” were therefore investigated only with this kernel SOM, excluding all other variants. Pareto points are described in Tables 2 and 3.

Results from “C. Elegans” and “E-mail” generally agree with those obtained on “Les Misérables”: the kernel SOM does not produce clusterings with high modularity, but it manages sometimes to reach a low number of edge intersections in the induced layout. That said, the kernel SOM does not produce global Pareto points for “Les Misérables” and “E-mail”: on those datasets, it is therefore strictly less efficient than the proposed method on the dual objectives point of view (this was also the case for the Karate graph). The best situation for the kernel SOM is “C. Elegans” where the very low numbers of edge crossings of configurations CE1, CE2 and CE3 make them Pareto points within the full set of solutions. However, those points have very low modularities compared to other solutions.

In summary, those experiments show that the spectral SOM and the modularity kernel SOM should be avoided as they cannot produce satisfactory results in terms of the chosen quality criteria. In addition, while the heat kernel can produce interesting results, its additional kernel parameter induces a large computational cost: as shown on Figure 13 most of the computing efforts are wasted as they lead to poor solutions on both criteria. The only competitive method is the Laplacian’s inverse kernel SOM. But even if it seems to be the best SOM based method, it still produces sub-optimal solutions on both criteria for three out of four datasets.

The comparison between the classical two phases approach and our method does not lead to a clear winner as far as the chosen quality criteria are concerned. As expected, the modularities of the clusterings obtained in the two phases approach are among the highest. Interestingly, maximizing the organized modularity can lead to a higher modularity than a more direct approach (in a similar way as the SOM which can overcome the

-means in term of within cluster variance): this is the case for “Les Misérables” (see Table

1). However, in general, the two phases approach gives the clusterings with the highest modularity. In term of edge crossings, the two phases approach gives also satisfactory results: this confirms the analysis from Section 2.2, where we argued that looking for dense clusters should reduce the number of edges between clusters. On complex graphs however, our method leads to lower numbers of edges cuts: again, this confirms the somewhat contradictory nature of the two quality criteria explained in Section 2.2.

By construction, the two phases approach should give a Pareto optimal result, up to sub-optimal optimization results caused by the combinatorial nature of both optimization problems. In practice, we obtained Pareto points on all graphs. In general, those points have higher modularity and higher number of edge crossings than the ones produced by our method, as expected: our clusters are somewhat adapted to the clustering induced layout. There is therefore no obvious superiority of one method over the other. In fact, we are more interested in the final layout of the graphs than in the exact numerical results. It is therefore interesting to study the visual representations obtained different approaches, as long as they correspond to Pareto optimal points. This is done in the next subsection.

#### 6.3.2 Drawing the graph from its clustering

The aim of the proposed method is to provide a simplified but relevant representation of a large graph through the graph induced by the clustering. The present section analyses the visual representation obtained via the proposed method and reference methods. We first start with a detailed analysis of “Les Misérables”: Figure 15 gives the representation of the graph of clusters for clusterings M5 and M8 (See Table 1). Figure 15: Clustering induced graphs. Left: displayed on the prior structure (for M5 obtained by optimizing the organized modularity) and Right: displayed by Fruchterman and Reingold algorithm (for M8 obtained by optimizing the ordinary modularity).

As the graph is small enough, those layouts can be compared to the original layout numbered according to the clustering: this is provided by Figure 16. Figure 16: Original layout of “Les Misérables” labeled according to the clustering; left: M5 (optimization of the organized modularity) and right: M8 (optimization of the ordinary modularity). The number on the vertices is the cluster’s number.

The density of the graph limits the possibilities of analysis, but it appears clearly that cluster 1a from M5 corresponds to a quite isolated character (Mother Plutarch who is only connected to Mabeuf), while up to a single character (Sister Simplice), the union of clusters 2a and 2b in M5 gives cluster 2 in M8. In both cases, the summary of the graph given by the clustering seems therefore reasonable and a more detailed analysis is needed.

It turns out that both representations give a clear understanding of the story of the novel. It is based on a central group of characters (clusters 2a and 2b for M5 and 2 for M8) which includes Valjean, Cosette and Marius (among others). Several sub-stories are narrated in the novel: the story of the Bishop Myriel who is spiritual guide of Valjean (clusters 5); the story of the street children Gavroche (clusters 1 in M5 and 1b in M8); and finally the story of Fantine, poor Cosette’s mother (clusters 4). But representation M5 gives another information by separating the main characters into the ones related to Valjean (Cosette, Marius, Javert, etc. in cluster 2b) and the ones related to the Thénardier family (mister and misses Thénardier, Éponine, their daughter, etc. in cluster 2a). In addition, cluster 2b of M5 (the main characters) has more connections to cluster 1b than cluster 2a (Thenardier family) to cluster 1b. Therefore, the central position of cluster 2b remains clearly emphasized in M5, even if the representation is arguably slightly less readable than representation M8 (it should be noted that the graph induced by the clustering M5 is planar, but because the organized layout is not optimized by a graph drawing algorithm, the actual representation has one edge crossing). Hence, with a small (and avoidable) increase in the number of edge crossings but with a higher quality of clustering, the information given by clustering M5 provides a more complete picture of the original network. Figure 17: Fuzzy layout of the final configuration of “Les Misérables” (the Fruchterman-Reingold algorithm was used to reduce edge overlapping)

Additional insights can be gained with the help of the fuzzy layout methodology described in Section 4.2. Figure 17 gives an example of such layout for the final configuration of the deterministic annealing. At first, this representation might seem very different from M5 on Figure 15. In fact most of the differences can be explained by two phenomenons. Firstly, as on Figure 9 for the Karate graph, some nodes are difficult to assign to a given cluster and appear therefore in intermediate position. This explains the presence of some small clusters in between larger ones : this is the case, for example, of the small cluster at the extreme right of the Figure who is Javert, the policeman who is pursuing Valjean (one of the character of the larger cluster just above the small isolated one). But this character is also very related to the Thénardiers’ family (larger cluster below the small isolated one) because they capture and imprison him at the end of the novel. The small cluster situated between the large central cluster and the cluster at the bottom left part of the map is also interesting. It contains 3 characters that act as connections between several secondary characters and the rest of the graph: these characters belong to cluster number 5 in clusterings M5 and M8.

The second source of differences between M5 and the fuzzy layout is that several characters are extremely difficult to assign to a cluster and have therefore almost flat assignment probabilities even at low temperature. Then, the corresponding nodes assume an almost barycentric position in the fuzzy layout: this explains the presence of a central node in Figure 17. The characters assigned to this central node are not connected and this cluster should not be seen as a meaningful one. On the contrary, it contains very isolated characters who belong to clusters 5, 2 or even 1 in clustering M8. All these characters have in common to share only a (small weighted) connection with another character of the network. They are very secondary characters in the novel. None of those ambiguities could have been detected on a standard two phases approach or on the SOM variants. Figure 18: Representation of the M1 clustering obtained with the heat kernel SOM on the original layout (left) and via the prior structure (right)

In addition, the SOM variants obtain poor results on this graph. As shown in Table 1 they do not provide any global Pareto optimal points. Therefore it is not very surprising to obtain unsatisfactory visual representations from those methods. Figure 18 represents the results associated to clustering M1 in Table 1 (obtained with the heat kernel SOM). The resulting clustering has numerous non empty clusters and leads to a quite cluttered visual representation. In addition, some clusters seems completely arbitrary: for instance clusters number 5, 10, 11, 13 and 16 are a seemingly random clustering of the characters related only to the Myriel characters. The global impression is that the heat kernel SOM suffers from a tendency to build too fine clusterings on a somewhat arbitrary basis. Figure 19: Representation of the M4 clustering obtained with the Laplacian’s inverse kernel SOM on the original layout (left) and via the prior structure (right)

The cluster and layout obtained by the Laplacian’s inverse kernel SOM are given on Figure 19. Results are quite different from those of the heat kernel SOM. Indeed, the clustered layout is very easy to read as there are no edge crossing. However, the rather low modularity of M1 (see Table 1) is an indication that clusters are not as dense as for clusterings M5 and M8. In fact, the representation of M1 clustering on the original layout of the graph shows that the associated visualization is quite misleading. For instance, cluster 3 contains Mother Plutarch while its unique connection in the graph, Mabeuf, is in cluster 6. Therefore cluster 3 is not even connected. A similar problem can be seen in cluster 2. Then the main underlying assumption of the clustering induced layout is not fulfilled: some clusters are not dense at all and are then meaningless.

This thorough analysis of “Les Misérables” confirms results on the Karate graph (Section 5) and the numerical analysis from the previous section. The analysis leads to two important conclusions. Firstly, the relevance of modularity and of edge crossings is confirmed: the rather poor results obtained by SOM variants in terms of numerical results correspond either to layouts that are difficult to read (see Figure 18) and/or to misleading clustering (see Figures 18 and 19). Secondly, the proposed methodology leads to a better understanding of the graph than the two phases approach by means of different clusterings (induced by the organized modularity maximization) and via the fuzzy layout methodology.

The sizes of “C. Elegans” and “E-mail” graphs make them difficult to analyze in details. We provide therefore only some qualitative comparisons of the layouts obtained by the investigated methods.

Figure 20 gives four layouts of the graph “C. Elegans” obtained from a subset of Pareto points of Table 2:

• CE1 is the layout obtained by kernel SOM with the highest modularity (chosen to avoid misleading clusterings);

• the CE8 is the layout obtained by the two phases approach and is induced by the clustering with the highest modularity;

• CE6 and CE7 are the two Pareto points obtained by our method: CE7 has the smallest modularity but also a smaller number of edge crossings within this two solutions. These two solutions have a high modularity (larger than 0.4) and the modularity of CE6 is very close to those of CE8 but with a smaller number of edge crossings. Figure 20: Clustering induced graphs for the “C. Elegans” dataset. Top left: displayed on the prior structure (for CE1 obtained by kernel SOM); Top right: displayed by Fruchterman and Reingold algorithm (for CE8 obtained by maximization of the modularity); Bottom: displayed on the prior structure (for CE6 and CE7 obtained by maximization of the organized modularity).

Despite a very small number of edge crossings, CE1 is poorly informative: three clusters (3, 6 and 8) contain more than 80 % of the vertices of the graph and the other clusters are thus very small compared to the three largest ones. The small clusters are then not very informative and their existence probably explains the poor modularity of the clustering. In fact, all solutions provided by kernel SOM for this dataset share the same problem with some large clusters and some very small ones. This type of behavior is neither specific to the chosen graph, nor to the Laplacian’s inverse kernel as shown on Figure 19 and in Boulet et al. (2008) for the heat kernel.

CE8 shows another type of layout problems: the graph induced by the clustering is almost a complete graph (up to cluster 10, the graph is complete). The only information conveyed by this layout is that nodes in cluster 10 are only connected to nodes in cluster 9. The width of the edges can be used to infer that some clusters are only loosely connected (e.g., 12 and 14), but grasping the general organization of the graph is quite difficult with this layout.

CE6 and CE7 both have a slightly smaller modularity than CE7 but they are easier to read because of a smaller number of edge crossings. Their clustering qualities are higher than the one of CE1 (with a modularity larger than 0.4 and less imbalance between the clusters’ sizes), comparable for CE6 to the one of the clustering obtained in the two phases method CE8. Therefore, CE6 (and to a lesser extent CE7) is almost as faithful as CE8 but gives some understanding of the structure of the graph because the resulting simplified graph is not complete. As expected, a small reduction in clustering quality can lead, in some situations, to clusters that are more adapted to visual exploration. This validates again the principle of integrating the clustering process and the layout process. Figure 21: Fuzzy layout during the final cooling phase for the solution CE6 of “C. Elegans”

In addition, fuzzy layouts such as the one provided in Figure 21 can be used to analyze how clusters are built during annealing. In the particular case of solution CE6, the layout shows that, e.g., cluster 8 is formed at the end of the annealing. The analyst can investigate first the associated nodes and their connection pattern to outside clusters, or on the contrary, focus his/her attention on well established clusters.

Finally, Figure 22 provides two layouts of the graph “E-mail”. We chose to represent the solution obtained by the two phases approach (E11, see Table 3) and the Pareto point obtained by our method with the smallest number of edge crossings (E10). Figure 22: Clustering induced graphs for the “E-mail” dataset. Left: displayed on the prior structure (E10 obtained by maximization of the organized modularity); Right: displayed by Fruchterman and Reingold algorithm (E11 obtained by maximization of the modularity).

As in the previous example, the solution provided by the optimization of the organized modularity, despite a clustering quality that is slightly worse, gives a more understandable simplification of the graph than the dual phases approach. In fact, the graph associated to E11 is complete and one has to rely on the width of the edges to try to infer the importance of the relations between clusters. The layout associated to E10 seems therefore easier to grasp. In addition, the associated fuzzy layout (see Figure 23) shows that most of the clusters are well defined and identifies a few nodes that are difficult to assign to clusters. Further investigations could target those nodes. Figure 23: Fuzzy layout during the final cooling phase for the solution E10 of “E-mail”

#### 6.3.3 A brief summary of the experiments

Several conclusions can be drawn from the experiments conducted on the four real world graphs. Firstly, maximizing the organized modularity leads to good solutions with a high modularity and a acceptable number of edge crossings. In some situations, maximizing the organized modularity actually gives a higher modularity clustering, in a similar way to what can be observed for the SOM versus the -means. However, the main gain is a better trade off between modularity and readability of the clustering induced graph: we trade a small reduction in modularity for either a larger number of clusters (limiting the general tendency of the modularity to pick up a small number of clusters) or a reduced number of edge crossings (or both). All in one, organized modularity maximization brings therefore simplified representations of graphs that seem more informative than the standard two steps approach. In addition, as the modularity is almost optimal, the representations provided by maximization of the organized modularity are reasonably faithful, especially compared to the low quality results generally obtained by graph adapted SOM. While the final layout obtained by the organized modularity are not perfect (it is quite clear for instance that the left graph on Figure 15 is planar and could be rendered with no edge crossing), their ordered and regular nature makes them very readable. In conclusion, we consider that a good practice consists in combining our method to the two phases approach to get different and complementary views on the same graph. A future implementation challenge is to provide linked multi-views Becker and Cleveland (1987) of a graph that would give the analyst a visual comparison method between those views.

In addition, relying on deterministic annealing provides a path in the clustering space that is worth exploring via intermediate fuzzy layouts. The main advantage of those layouts is to increase the number of clusters and to providing hints on the final assignment of vertices. Atypical vertices are clearly pinpointed with this approach and can be further studied by the analyst. Our use of the fuzzy layout was quite limited in this paper as we believe they should be provided in an interactive environment in which the analyst can navigate in the algorithm results. Further studies are needed to validate the interest of the concept. In addition, as those layouts are closer to clustered layouts than the rest of our work, comparisons with those latter methods (e.g., Noack (2007)) would be interesting.

Those results show that the combination of modularity maximization and edge crossings minimization is very well adapted to graph visualization. Then, it is quite natural to observe that SOM variants are not adapted to the targeted application. The analysis of the Karate graph has shown that they are not able to recover the ground truth clustering. The graph “Les Misérables” showed quite strong limitations of the clustering obtained by SOM variants (with meaningless clusters) without any particular gain in term of readability of the clustering induced graph. This was confirmed on “C. Elegans” in terms of layout and on “E-mail” via quality criteria.

## 7 Conclusion

We have proposed in this paper a new organized modularity quality measure for graph clustering inspired by the topographic mapping paradigm initiated by Prof. Kohonen’s SOM. The organized modularity aims at producing a clustering of a graph that respects constraints coming from a prior structure. This prior structure is then used to display the clusters in an ordered way. A deterministic annealing scheme has been derived to maximize efficiently the organized modularity. Its iterative nature can be leveraged to provide intermediate layouts of the graph that emphasize the progressive construction of the final clustering, pinpointing sub clusters, atypical vertices, etc.

An experimental study conducted on four real world graphs ranging from 34 to 1 133 vertices has shown that the proposed method outperforms similar approaches based on adaptation of the SOM to graph data both in term of clustering quality and in term of readability of the clustering induced graphs. The proposed method gives also better or complementary results compared to a two steps approach in which one first build a graph clustering that maximizes the standard modularity and then uses a graph visualization algorithm to display the clustering induced graph. Finally, the computational cost of the whole approach (which includes optimization of an influence parameter in the prior structure) remains acceptable and is compatible with graphs with a few thousands of nodes.

## Acknowledgment

The authors thank the anonymous referees for their valuable comments that helped improving this paper.

## Appendix A Derivations of the deterministic annealing algorithm

### a.1 Equivalence between O and F

We have

 O(M)=F(M)+12m∑i∑k,lMikSklMil(Wii−Pii).

Then, as is an assignment matrix, is non zero (and equal to one) only when and therefore

 O(M) = F(M)+12m∑iSc(i)c(i)(Wii−Pii) = F(M)+12m∑i(Wii−Pii),

as for all . Therefore is independent of and maximizing is equivalent to maximizing .

### a.2 Mean field equations

Let us denote . At a minimum of , the partial derivatives must be equal to zero. Those derivatives can be computed easily from the definition of . We fist note that

 KL(E) = ∑MR(M,E)lnR(M,E)−β∑MR(M,E)(F(M)−logZF) = ∑MR(M,E)lnR(M,E)−βER(F(M))+βlogZF.

We therefore need to compute . We have

 ∂U(M,E)∂Ejl=Mjl,

and

 ∂exp(βU(M,E))∂Ejl=βMjlexp(βU(M,E))

As , we have

 ∂ZR(E)∂Ejl = β∑MMjlexp(βU(M,E)) = βZR(E)ER(Mjl).

Therefore, recalling , we have

 ∂R(M,E)∂Ejl = −exp(βU(M,E))Z2R(E)∂ZR(E)∂Ejl+1ZR(E)βMjlexp(βU(M,E)) = β(Mjl−ER(Mjl))R(M,E).

Then

 ∑MlnZR(E)∂R(M,E)∂Ejl = lnZR(E)β∑M(Mjl−ER(Mjl))R(M,E) = 0,

because does not depend on and by definition of . Moreover,

 ∑M∂lnZR(E)∂EjlR(M,E) = ∑MβZR(E)ER(Mjl)ZR(E)R(M,E) = βER(Mjl),

and therefore

 ∂∂Ejl(∑MlnZR(E)R(M,E))=βER(Mjl),

 ∂∂Ejl(∑MR(M,E)lnR(M,E))=∂∂Ejl(β∑MU(M,E)R(M,E))−βER(Mjl).

By definition of , we have

 ∑MU(M,E)R(M,E) = ∑M∑ikEikMikR(M,E) = ∑ikEikER(Mik).

Then, by independence, when . Indeed we have :

 ∂ER(Mik)∂Ejl = ∑MMik∂R(M,E)∂Ejl = β∑MMik(Mjl−ER(Mjl))R(M,E) = β(ER(MikMjl)−ER(Mik)ER(Mjl)) = 0.

The final equality comes from independence of and under when (this simplification is the motivation for using a bi-linear cost function and therefore a distribution that factorizes).

Finally, we have

 ∂∂Ejl(∑MU(M,E)R(M,E))=ER(Mjl)+∑k∂ER(Mjk)∂EjlEjk,

and therefore

 ∂KL(E)∂Ejl=β(∑k∂ER(Mjk)∂EjlEjk−∂ER(F(M))∂Ejl),

from which we obtain the mean field equations (8).

### a.3 Expectation minimization scheme

We need to compute starting from

 ER(F(M))=∑u≠v∑k,tER(Muk)SktER(Mvt)Buv.

As shown above, when , and therefore when and ,

 ∂(ER(Muk)ER(Mvt))∂Ejl=0.

Then

 ∂ER(F(M))∂Ejl = ∑u≠vBuv∑k,tSkt∂(ER(Muk)ER(Mvt))∂Ejl = ∑u≠jBuj∑k,tSktER(Muk)∂ER(Mjt)∂Ejl +∑v≠jBjv∑k,tSktER(Mvt)∂ER(Mjk)∂Ejl = 2∑u≠jBuj∑k,tSktER(Muk)∂ER(Mjt)∂Ejl,

using the symmetry of and for the last equation. Then if for all and , we set the values of to

 Ejk=2∑i≠j∑lER(Mil)SklBij,

we have obviously

 ∑k∂ER(Mjk)∂EjlEjk = 2∑k∂ER(Mjk)∂Ejl∑i≠j∑tER(Mit)SktBij = 2∑i≠jBijBij∑k,tSktER(Mit)∂ER(Mjk)∂Ejl = ∂ER(F(M))∂Ejl,

and the mean field equations are fulfilled.

### a.4 Fixed point analysis

Finding the mean field via the EM-like scheme given by equations (9) and (10) corresponds to looking for a fixed point of the following matrix valued function:

 Gjk(E)=2∑i≠jBij∑lSklexp(βEil)∑pexp(βEip). (16)

The stability of a fixed point can be analyzed with a first order Taylor expansion (where is the Frobenius norm)

 Gjk(E1)=Gjk(E0)+∑u∑t∂Gjk∂Eut(E0)(E1ut−E0ut)+o(∥E1−E0∥F),

which recalls that the stability is governed by the eigenvalues of the Jacobian matrix .

Obviously, . When , we have

 ∂Gjk∂Eut = 2Buj∑lSkl∂∂Eut(exp(βEul)∑pexp(βEup)) = 2Bujβexp(βEut)(∑pexp(βEup))2(Skt∑pexp(βEup)−∑lSklexp(βEul)) = 2βBujER(Mut)(Skt−∑lSklER(Mul))

At the limit of infinite temperature (when ), equation (9) leads to , and therefore for the high temperature fixed point ,

 ∂Gjk∂Eut(E0)=2BujβC(Skt−1C∑lSkl),

while equation (10) gives in addition

 E0jk=2C∑i≠jBij∑lSkl.

Let us denote the centering matrix ( is the