Refactoring Software Packages via Community Detection from Stability Point of View

11/26/2018
by   Mohammad Raji, et al.

As the complexity and size of software projects increase in real-world environments, maintaining and creating dependable, maintainable code becomes harder and more costly. Refactoring is a method for enhancing the internal structure of code to improve many software properties, such as maintainability. In this thesis, the subject of refactoring software packages using community detection algorithms is discussed, with a focus on the notion of package stability. The proposed algorithm starts by extracting a package dependency network from Java byte code, and a community detection algorithm is used to find possible changes in package structures. This work also discusses why dependency directions matter when modeling package dependencies with graphs, and presents a proof of the relationship between package stability and the modularity of package dependency graphs, showing that modularity is in favor of package stability. To evaluate the proposed algorithm, a tool for live analysis of software packages is implemented and two software systems are tested. Results show that modeling package dependencies with directed graphs and applying the presented refactoring method leads to a greater increase in package stability than the undirected graph modeling approaches that have been studied in the literature.



1.1 Well-known refactoring techniques

  • Rename method. This technique may be the simplest refactoring method one can use. Simply renaming identifiers and variables makes the code clearer and more understandable, and can reduce the need for comments. An appropriate name for a method, variable or class is descriptive enough that a new programmer can understand its purpose at a glance.

  • Inline temp. Temporary variables can make methods longer and more complicated. It is suggested that temporary variables that are used only once, or that merely hold the result of a method call, be removed entirely and the value assigned to them be used directly in the code. An example is provided below.

    Incorrect:

        def add_something():
            return 1 + 2

        def foo():
            temp_variable = add_something()
            print("The result is " + str(temp_variable))

    Correct:

        def foo():
            print("The result is " + str(add_something()))
  • Extract method. Known as arguably the most important refactoring technique, Extract method aims at reducing the size of long methods by breaking them into smaller methods with descriptive names. Many refactoring and simplification techniques in software engineering involve breaking code and algorithms into smaller, more understandable chunks; this method is one of them.

    1    class Foo:
    2        username = ""
    3        def __init__(self):
    4            # Some initialization code
    5            self.username = "Some username"
    6
    7        def func1(self):
    8            print("Welcome")
    9            print("You have logged in as " + self.username)
    10            print("Something else")
    11
    12        def func2(self):
    13            print("Welcome")
    14            print("You have logged in as " + self.username)
    15            print("Some reports")

    In the provided example, lines 8 and 9 are identical to lines 13 and 14 and can be extracted into a new method that greets the user. Extract method is an important and basic refactoring technique that strongly affects the cohesion of the classes from which methods are extracted. Extract method suggests extracting pieces of code that are used more than once (duplicate code). If this condition is met while extracting a piece of code A from methods B and C, then after refactoring both B and C will be using A, thus reducing the cohesion in their class. However, one must realize that if appropriate interfaces are not used in the code and other classes in a package use method A, then instead of reducing cohesion, coupling will be increased. A thorough study of this issue, and a metric for finding appropriate pieces of code to extract while considering the notion of cohesion, is provided in [5].

    Considering our focus on graph clustering methods in refactoring, it is worth noting that some work has been done on determining the class a method belongs to with the help of community detection techniques [6]. However, introducing new methods and extracting them with community detection still needs attention.

  • Inline method. In some cases, the opposite of Extract method should be applied. Suppose method A is simple and clear, and is used only once, possibly in a stable class whose content is not likely to change. In this case, using an identifier for the code in method A only results in an extra call for no benefit. Such a method can be removed and its content used inline.

  • Replace method with Method object. This technique can be considered an aid in situations where Extract method becomes difficult because of the high number of temporary variables in a long method. When the number of temporary variables is high, Extract method can become cumbersome: passing all the temporary variables between the extracted methods gets messy, and finding the temporary variables needed by a piece of extracted code can take a lot of time.

    To resolve this issue, one approach is to move the long method into a new class, turn the local temporary variables into class attributes, and then apply Extract method. This provides a better starting state from which refactoring can continue with Extract method or other techniques, as sketched below.
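    As a minimal sketch of this technique (the Order and PriceCalculator classes are hypothetical, invented for illustration), a long method's temporaries become attributes of a method object, after which each step can be extracted freely:

        class Order:
            def __init__(self, quantity, item_price):
                self.quantity = quantity
                self.item_price = item_price

            def compute_price(self):
                # Delegate to a method object instead of juggling temporaries here.
                return PriceCalculator(self).compute()

        class PriceCalculator:
            """Method object: the former local variables become attributes."""
            def __init__(self, order):
                self.base = order.quantity * order.item_price
                self.discount = 0.0
                self.shipping = 0.0

            def compute(self):
                # Each step is now easy to extract, since state lives on self.
                self.apply_discount()
                self.add_shipping()
                return self.base - self.discount + self.shipping

            def apply_discount(self):
                if self.base > 1000:
                    self.discount = self.base * 0.05

            def add_shipping(self):
                self.shipping = min(self.base * 0.1, 100.0)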

  • Pull up method. If a piece of code is duplicated in two different classes, it is best to pull that code up into a superclass of those two classes.

    Before refactoring:

        class Person:
            firstname = None
            lastname = None
            def __init__(self):
                # Some initialization code
                pass

        class Student(Person):
            studentNo = None
            def __init__(self):
                # Some initialization code
                pass

            def makeFullName(self):
                return self.firstname + " " + self.lastname

            def getStudentNo(self):
                return self.studentNo

        class Employee(Person):
            salary = None
            def __init__(self):
                # Some initialization code
                pass

            def makeFullName(self):
                return self.firstname + " " + self.lastname

            def getSalary(self):
                return self.salary

    After refactoring:

        class Person:
            firstname = None
            lastname = None
            def __init__(self):
                # Some initialization code
                pass

            def makeFullName(self):
                return self.firstname + " " + self.lastname

        class Student(Person):
            studentNo = None
            def __init__(self):
                # Some initialization code
                pass

            def getStudentNo(self):
                return self.studentNo

        class Employee(Person):
            salary = None
            def __init__(self):
                # Some initialization code
                pass

            def getSalary(self):
                return self.salary
  • Extract surrounding method. Imagine a case in which several methods are almost identical except for a slight difference in the middle of each one. In some languages, one can pull the duplicated code up into a new method and pass the differing middle section as a block that the new method yields to. This ability is provided in languages like Ruby and can be simulated in other languages by passing callback functions. A Ruby example is given below.

        def testMethod
            puts "Something printed from inside testMethod"
            yield
            puts "Something printed from inside testMethod"
        end

        testMethod { puts "Something printed from the block" }
  • Replace conditional with polymorphism. This refactoring removes the complexity and code smell of conditional logic that switches on an object's type, replacing it with polymorphic method dispatch, one of the core ideas of object-oriented design. A sketch is given below.
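    As a minimal sketch (the shape classes are hypothetical, invented for illustration), a conditional that switches on a type tag is replaced by polymorphic dispatch:

        # Before: conditional logic that switches on a type tag
        def area(shape):
            if shape["kind"] == "circle":
                return 3.14159 * shape["r"] ** 2
            elif shape["kind"] == "rect":
                return shape["w"] * shape["h"]

        # After: each class carries its own behavior
        class Circle:
            def __init__(self, r):
                self.r = r
            def area(self):
                return 3.14159 * self.r ** 2

        class Rect:
            def __init__(self, w, h):
                self.w = w
                self.h = h
            def area(self):
                return self.w * self.h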

2.1 Coupling and cohesion

Coupling is one of the best-known internal product attributes. Generally, two pieces of code are said to be coupled if a change in one causes the other to change. In the object-oriented paradigm, coupling between two classes is considered an unwanted attribute; however, a system with no coupling between its classes would mean that no interaction occurs between the classes, and such a system would simply fail to function.

Cohesion, which almost always comes up alongside coupling, is another important internal product attribute. In an object-oriented system, a class is said to have high cohesion if its internal structures and methods are highly connected to one another. The goal of a good design is high cohesion and low coupling: classes should be cohesive, and therefore fully related to their responsibility, while having low coupling with other classes so that they can change without causing too many changes in other parts of the system. Designs with high cohesion and low coupling make the system more reliable and maintainable [9], [10].

The notions of coupling and cohesion have been extensively studied in the literature, and many metrics have been proposed for measuring them. This thesis briefly surveys different approaches from the literature.

2.1.1 Basic definitions by Myers

Myers, Stevens and Constantine introduced the concept of coupling in procedural programming. Based on this, Fenton defined six different levels of coupling [11]. These levels of coupling are shown below from worst to best.

  • Content coupling. If one element branches into or changes the internal statements of another element, they are said to have content coupling.

  • Common coupling. If two elements refer to the same global variable, they are said to have common coupling.

  • Control coupling. If the data that one element sends to the other controls its behavior, then control coupling is implied.

  • Stamp coupling. Two elements are stamp coupled if they send more information to each other than necessary.

  • Data coupling. If two elements communicate with each other by parameters with no control coupling, then they are data coupled.

  • No coupling. If two elements have no communication with each other then they are not coupled.

2.1.2 Fenton and Melton’s metric

Fenton and Melton proposed a metric for the coupling between two components x and y, expressed as

C(x, y) = i + n / (n + 1)    (2.1)

where n is the number of interactions between the two components x and y, and i is the level of the worst coupling type found between x and y. In their metric, the coupling level is based on Myers’ classification: no coupling is given a coupling level of 0, and subsequent (worse) levels have higher numeric values.
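Eq. 2.1 transcribes directly into code. The sketch below assumes the six Myers levels are encoded as the integers 0 through 5, an encoding chosen here only for illustration:

    # Coupling levels from Myers' classification, best (none) to worst (content).
    LEVELS = {"none": 0, "data": 1, "stamp": 2, "control": 3, "common": 4, "content": 5}

    def fenton_melton(n_interactions, worst_level):
        """Eq. 2.1: C(x, y) = i + n / (n + 1)."""
        i = LEVELS[worst_level]
        return i + n_interactions / (n_interactions + 1)

    # Example: 3 interactions, the worst of which is control coupling.
    print(fenton_melton(3, "control"))  # 3.75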

Alghamdi discusses several important points about this metric [11].

  • All types of interconnections are considered equal, with equal effects on coupling.

  • The Fenton and Melton metric is an example of an inter-modular metric, meaning that it calculates the coupling between a pair of components, in contrast with intrinsic metrics that measure the coupling of a component individually.

  • Coupling values approach the next level as the number of interconnections between two components increases.

Alghamdi also proposes a new coupling metric based on a description matrix of the system [11].

2.1.3 Chidamber and Kemerer’s suite

Chidamber and Kemerer [12] give the first formal definition of coupling, defining it as any evidence that a method of one class uses a method or variable of another class. In their proposed suite, known as the CK suite, Chidamber and Kemerer define six metrics, as follows.

  • Weighted Method per Class (WMC)

  • Number Of Children (NOC)

  • Depth of Inheritance Tree (DIT)

  • Coupling Between Objects (CBO)

  • Lack of Cohesion in Methods (LCOM)

  • Response For a Class (RFC)

Among the six metrics, CBO (Coupling Between Objects) is proportional to the number of non-inheritance-related couples a class has with other classes. For measuring coupling, CBO aggregates the total number of couples a class has to other classes, which implies that different couples have the same strength and effect. Hitz and Montazeri [13] argue that the CK suite does not fully conform to measurement theory.

2.1.4 Alghamdi’s coupling metric

Alghamdi’s approach is based on the idea of generating a description matrix of all the factors that affect coupling, and then calculating a coupling matrix from the collected data. An overview of this approach is depicted in Fig. 2.1.

Figure 2.1: An overview of Alghamdi’s coupling metric

The description matrix is an m × n matrix, where m is the number of system components and n is the number of component members. In an object-oriented system, components are represented by classes, and members are class variables and methods. An example of a description matrix is given in Table 2.1.

Table 2.1: An example of Alghamdi’s description matrix

2.1.5 A qualitative approach to coupling and cohesion

While many quantitative approaches for measuring coupling and cohesion have been proposed in the literature, few qualitative approaches have been discussed. Kelsen [14] proposes an interesting information-based method for analyzing coupling and cohesion and finding refactoring suggestions. Kelsen’s approach considers a special type of coupling, namely representational coupling. When an object calls a method of another object, some information about the callee is exposed. If the information concerns the low-level implementation of the callee, then representational coupling is high; if the call exposes higher-level information, then representational coupling is low. Many metrics in the literature, including the work of Chidamber and Kemerer [12], cannot capture representational coupling [14]. The main reason is that many works simply count different types of interactions and assign ordinal numbers to these interactions. Kelsen also presents a minimum for the representational coupling inherently contained in a system, known as the intrinsic representational coupling.

Kelsen’s approach is based on the idea that if one can find two states in a system, namely witness states, that yield different messages between objects but do not affect the states of other objects, this indicates that coupling can be improved and that representational coupling is higher than it needs to be. The elevator example below is borrowed from [14].

Suppose that the behavior of some elevators in a building is modeled using two classes, ElevatorControl and Elevator. Every elevator has two methods, direction() and position(), which return the direction and position of the elevator. The ElevatorControl class is responsible for handling requests, and checks every elevator’s direction and position to find the closest elevator for a request. Two different implementations of ElevatorControl’s handleRequest method can be written.

1:procedure handleRequest(Request r)
2:     minDist ← ∞
3:     closest ← null    // reference to closest elevator
4:     for every elevator e do
5:         dist ← distance of e to r, computed using e.position() and e.direction()
6:         if dist < minDist then
7:             minDist ← dist
8:             closest ← e
9:     if closest ≠ null then
10:         assign request r to closest
Algorithm 1 Implementation 1 for ElevatorControl.handleRequest

1:procedure handleRequest(Request r)
2:     minDist ← ∞
3:     closest ← null    // reference to closest elevator
4:     for every elevator e do
5:         dist ← e.distance(r)
6:         if dist < minDist then
7:             minDist ← dist
8:             closest ← e
9:     if closest ≠ null then
10:         assign request r to closest
Algorithm 2 Implementation 2 for ElevatorControl.handleRequest

In the second implementation, the task of computing an elevator’s distance is given to each elevator through its distance method. The ElevatorControl class does not need to know the direction and position of every elevator; it only needs their distances. Therefore, less information about the Elevator class is exposed in the second implementation, and representational coupling is decreased.

Kelsen’s qualitative approach may be considered a precise method for measuring representational coupling; however, because of its non-quantitative nature it is not clear whether it can be applied to large, real-life software systems with many classes, and its utilization in real-life scenarios is currently an open problem.

2.2 Stability

Stability is the likelihood that a class or a package will not change. Stability is inherently difficult to measure, because the future changes and needs of a project are not well known; however, some metrics exist that try to measure it. The importance of stability in software metrics was first mentioned by Hitz and Montazeri [13].

Some methods for measuring the stability of a software package utilize the history of past changes and try to predict the future. The changes of a class or a package are typically accessed through version control systems such as Git (http://git-scm.com) and Subversion (http://subversion.apache.org); however, these approaches cannot be used in the early stages of software design because of the lack of change history available at the time.

Robert Martin [15] takes a different approach to measuring the stability of a software package. He argues that stability is proportional to responsibility: a package is responsible and independent if many other entities depend on it while it does not depend on others itself. A package is irresponsible, and thus unstable, if it depends on many other entities, meaning that when they change, they cause it to change as well. By Martin’s definition, X in Fig. 2.2 is an example of a stable package, and Y in Fig. 2.3 represents an unstable package.

Figure 2.2: An example of a stable package
Figure 2.3: An example of an unstable package

As a metric for stability, Martin defines the instability of a package as given in Eq. 2.2, where I is instability, Ca is the number of afferent couplings, and Ce is the number of efferent couplings. Afferent couplings count the classes outside the package that depend on classes within the package; efferent couplings count the classes within the package that depend on outside classes.

I = Ce / (Ca + Ce)    (2.2)

If a package has an instability of 0, then it has maximum stability. If a package holds a value of 1 for instability, then its number of afferent couplings is 0: it depends on other packages while no other package depends on it, making it an extremely unstable package.
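Eq. 2.2 translates directly into code. The sketch below is illustrative; in particular, the treatment of an isolated package (Ca + Ce = 0) is an assumption:

    def instability(ca, ce):
        """Martin's instability metric, Eq. 2.2: I = Ce / (Ca + Ce).

        ca: afferent couplings (incoming dependencies from other packages)
        ce: efferent couplings (outgoing dependencies to other packages)
        """
        if ca + ce == 0:
            return 0.0  # an isolated package; treated here as maximally stable
        return ce / (ca + ce)

    print(instability(ca=8, ce=2))  # 0.2 -> a fairly stable package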

Martin also proposes the Stable Dependencies Principle (SDP), which aids the software design process by ensuring that modules that should be easily changeable do not depend on modules that are harder to change [15]. Under this principle, packages should always have a higher instability metric I than the packages they depend on. Adhering to this principle, one can draw a tree of packages in which the stable ones are placed at the bottom and the most unstable ones at the top. The benefit of this arrangement is that packages violating SDP can be easily spotted: any package depending on a package above it violates the principle.

It is important to note that not all packages should or could be fully stable, as this would make the system unchangeable and inflexible. Nor can all packages be unstable, as this would create an irresponsible system with a large number of connections and high coupling. Clearly, pieces of code that are likely to change should be placed in unstable packages, and pieces of code that are unlikely to change should be placed in stable packages. Martin argues that high-level design cannot be placed in unstable packages, because it represents the architectural decisions of the project; however, if high-level code is placed in stable packages, it becomes almost impossible to change once the project matures and more code starts depending on it. The solution to this dilemma is the use of abstract classes, which introduce the needed flexibility and flow of stability. The basic idea behind the Stable Abstractions Principle (SAP) is that a package should be as abstract as it is stable. This principle ensures that the stability of a package does not contradict its flexibility. The SAP comes with a metric for measuring the abstractness of a package, the simple ratio shown in Eq. 2.3, in which Na is the number of abstract classes inside the package and Nc is the total number of classes inside the package.

Figure 2.4: The relationship between abstractness and instability

Martin defines three important zones in the relationship between abstractness and stability. If we set abstractness (A) as the vertical axis and instability (I) as the horizontal axis of a Cartesian graph, the three zones depicted in Fig. 2.4 are as follows.

  • Zone of pain. The zone of pain is where a package is highly stable and yet its abstractness is zero. Such a package is hardly changeable.

  • Zone of uselessness. A package in this zone is highly abstract and also highly unstable and not depended on. This means that its abstractness is useless.

  • The main sequence. This is the ideal place for a package. A package near the main sequence conforms to the SAP and is as abstract as it is stable. The main sequence is an ideal, and thus not many packages can truly be placed on this line; however, the distance of a package from this ideal line can be measured.

A = Na / Nc    (2.3)

D = |A + I − 1| / √2    (2.4)

In Eq. 2.4, D is the distance from the main sequence; its normalized version, D′ = |A + I − 1|, ranges from 0 to 1.
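Combining Eqs. 2.2, 2.3 and 2.4, the position of a package relative to the main sequence can be computed from four simple counts (a sketch; the function and parameter names are illustrative):

    def main_sequence_distance(ca, ce, n_abstract, n_classes):
        """Returns (A, I, D') for a package.

        Eq. 2.2: I  = Ce / (Ca + Ce)
        Eq. 2.3: A  = Na / Nc
        Eq. 2.4: D' = |A + I - 1|, normalized distance from the main sequence
        """
        i = ce / (ca + ce)
        a = n_abstract / n_classes
        d_norm = abs(a + i - 1)
        return a, i, d_norm

    # A concrete, highly depended-on package: zone of pain (A = 0, I = 0, D' = 1).
    print(main_sequence_distance(ca=10, ce=0, n_abstract=0, n_classes=12))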

3.1 Classification of clustering methods

Graph clustering methods are normally difficult to classify; however, Wiggerts [16] believes that they can generally be divided into the following categories.

  • Hierarchical methods. Hierarchical approaches are among the early solutions to the problem. These methods provide a hierarchy of partitions in the form of a tree, known as a dendrogram; a sample dendrogram is depicted in Fig. 3.2. Hierarchical methods are themselves divided into agglomerative and divisive approaches. Agglomerative algorithms start by placing every node in a separate cluster and then repeatedly merge clusters based on their similarity. It is important to note that the algorithm will not stop unless told to; therefore, knowing the desired number of partitions in advance is crucial. Divisive algorithms start with a single cluster containing all the nodes of the graph, and then split clusters based on the similarity between nodes, keeping similar nodes in the same cluster. Different hierarchical algorithms are distinguished by their distance function, which determines the similarity between two given nodes.

  • Optimization based methods. These algorithms generally take an initial, inaccurate clustering and, with the help of a quality measure, try to improve the clusters and optimize the quality. One of the most common and famous quality measures in the literature is the modularity measure proposed by Girvan and Newman [17]. Various kinds of optimization techniques are applicable in this category of graph clustering algorithms, such as genetic algorithms, particle swarm methods, etc. A simple genetic algorithm approach can look like the following [18].

    1. Select a random population of partitions.

    2. Generate a new population by selecting the best partitions according to a quality measure, such as Newman’s modularity.

    3. Repeat step 2 until a stopping criterion is met.

  • Graph theoretical methods. Graph theoretical algorithms utilize the formal descriptions and properties of graphs and their subgraphs; various subgraphs and properties are used to extract meaningful clusters from the original graph. Two common types exist, namely aggregation algorithms and minimal spanning tree algorithms. Aggregation algorithms repeatedly reduce the graph by merging nodes at each step, choosing merge candidates with techniques such as neighborhood relations and strong connections. Minimal spanning tree algorithms use the minimal spanning tree of the graph; they are normally not considered accurate, as they tend to create large clusters, although some enhanced versions have been suggested in the literature [18].

  • Construction algorithms. These algorithms assign nodes into clusters in one pass. The bisection algorithm and density search techniques are considered as examples of such methods.

Figure 3.2: A dendrogram for the Zachary karate club network

The minimum cut approach is the most obvious way of tackling the problem of community detection. In this method, one tries to find two groups/partitions of a graph such that the number of edges connecting the two is minimal. This approach mostly falls in the area of graph partitioning, because the number of partitions in the end result must be known a priori, so that one knows how many times the algorithm should be applied. It is worth noting that if the minimum cut approach were used with no constraint, a trivial solution would be to place all vertices in a single partition, thus minimizing the number of edges between partitions. Clearly this solution gives no information about the communities in a network. In the software engineering sense, the result of such a method would be a system with zero coupling and maximum cohesion, which might seem to be the goal; however, many important aspects of the software, such as reusability, separation of concerns, object orientation and flexibility, would be lost. This raises the idea that another measure, apart from coupling and cohesion, is needed to find an optimal trade-off between the two. This measure must be able to truly model and represent the different objects in a software dependency network. In graph theory, a measure that models the goodness of a partition is known as a quality measure. Using a community quality measure in the field of software engineering has only recently been discussed in the literature [6], [19].

3.2 Quality measures

The quality of a partition found by a community detection algorithm is determined with a quality measure, which should show how good a partition is. Many algorithms produce multiple partitions of unequal quality; therefore, it is necessary to measure the quality of the produced partitions and detect the best one. Quality functions assign a number to each partition so that partitions can be ranked and compared to one another. Arguably the most famous quality function is Newman and Girvan’s modularity [20].

Modularity is based on the idea that a random graph contains no meaningful community structure. Based on this idea, if one can build a graph similar to the one being analyzed, with the same number of vertices, edges and degrees but with edges placed at random, then by comparing it to the original graph one can find the differences that create communities. To understand the notion of modularity, we start with another measure of the goodness of a partition and build on it. Let G be a graph whose adjacency matrix has elements A_ij, where A_ij is 1 if nodes i and j are connected and 0 otherwise, and let c_i be the community to which vertex i belongs. The following measure gives the fraction of edges in G that fall within communities:

(1 / 2m) Σ_ij A_ij δ(c_i, c_j)    (3.1)

where δ is the Kronecker delta function and m is the number of edges in the graph.

This fraction takes the value 1 when all edges fall within a single community, and hence is not a good enough measure on its own.

The idea behind modularity is that a random graph does not have meaningful community structure and thus, if generated carefully, provides a good point of comparison. Carefully generating a random graph that preserves the features and properties of the original graph but has no meaningful communities is known as providing a null model in the area of complex systems. Here, one uses a graph with the same number of vertices and edges and the same vertex degrees, whose edges are rewired at random so that the graph loses its community structure. In such a graph, the probability of an edge falling between vertices i and j, if connections are made at random, is

P_ij = k_i k_j / 2m    (3.2)

where k_i and k_j are the degrees of vertices i and j respectively. Now, combining Eqs. 3.1 and 3.2, one can calculate the modularity measure as

Q = (1 / 2m) Σ_ij [ A_ij − k_i k_j / 2m ] δ(c_i, c_j)    (3.3)

By looking at Eq. 3.3, one can see some important properties of this measure. The Kronecker delta ensures that a connection between two nodes in different communities makes no contribution to modularity. Two connected nodes inside the same community make a positive contribution, and that contribution shrinks as the degrees of the two nodes grow. Two nodes that are not connected, yet reside in the same community, make a negative contribution to the overall modularity of the clustering.
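Eq. 3.3 translates into a few lines of code. The sketch below assumes a dense 0/1 adjacency matrix and one community label per node (names are illustrative):

    import numpy as np

    def modularity(adj, communities):
        """Newman-Girvan modularity, Eq. 3.3, for an undirected graph.

        adj: symmetric 0/1 adjacency matrix (numpy array)
        communities: communities[i] is the community label of node i
        """
        m = adj.sum() / 2.0          # number of edges
        k = adj.sum(axis=1)          # vertex degrees
        q = 0.0
        n = adj.shape[0]
        for i in range(n):
            for j in range(n):
                if communities[i] == communities[j]:   # Kronecker delta
                    q += adj[i, j] - k[i] * k[j] / (2 * m)
        return q / (2 * m)

    # Two triangles joined by a single edge form two obvious communities.
    adj = np.array([[0,1,1,0,0,0],
                    [1,0,1,0,0,0],
                    [1,1,0,1,0,0],
                    [0,0,1,0,1,1],
                    [0,0,0,1,0,1],
                    [0,0,0,1,1,0]])
    print(modularity(adj, [0,0,0,1,1,1]))  # ~0.357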

3.3 A brief discussion of well known clustering methods

In this section, several common graph clustering methods are briefly studied.

3.3.1 The fast greedy method

A typical greedy method for clustering a graph while utilizing Newman’s modularity consists of the following steps.

  1. Start with each vertex in its own community, giving n communities for n vertices.

  2. In each step, merge the two communities whose join produces the highest increase in modularity Q.

  3. After n − 1 joins, a single community remains and a dendrogram can be created.

  4. Take the clustering from the dendrogram that has the highest Q.

The simple greedy method can waste a good deal of time on sparse graphs: in its implementation, one has to merge many columns and rows of the sparse adjacency matrix, and consequently time and space are wasted on merging elements with the value 0. For this reason, Clauset and Newman presented an enhanced version of the greedy method, namely the fast greedy method [21], which performs much better than many other algorithms in the literature. The fast greedy method uses data structures such as max-heaps and balanced binary trees, with some alterations to the algorithm, resulting in a runtime of O(m d log n), where d is the depth of the dendrogram; for sparse graphs this is essentially O(n log² n).
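For experimentation, greedy modularity clustering of this kind is available off the shelf, e.g., in NetworkX (shown here only as a convenience; this is not the tooling used in the thesis):

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Zachary's karate club, the classic community detection benchmark.
    graph = nx.karate_club_graph()
    communities = greedy_modularity_communities(graph)
    for idx, community in enumerate(communities):
        print(f"community {idx}: {sorted(community)}")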

3.3.2 The edge-betweenness based method

The edge betweenness based method, proposed by Girvan and Newman [22] before they presented the modularity measure, is a graph clustering algorithm that focuses on the edges between communities, in contrast to many older algorithms that focus on the connections inside a community. The betweenness of an edge is the number of shortest paths between pairs of vertices that run along that edge. The algorithm is as follows.

  1. Calculate edge betweenness for all edges.

  2. Remove the edge with the highest betweenness value

  3. Recalculate edge betweenness for the rest of the edges

  4. Repeat from step 2 until no edges remain.

Betweenness for all edges of a graph can be computed using Newman’s betweenness algorithm [22] in O(mn) time. Since edge betweenness has to be recalculated after every edge removal, the whole algorithm runs in O(m²n) time.
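NetworkX also ships an implementation of this divisive algorithm; each iteration of its girvan_newman generator yields the partition after the next community split (again shown only as a convenience, not the thesis tooling):

    import itertools
    import networkx as nx
    from networkx.algorithms.community import girvan_newman

    graph = nx.karate_club_graph()
    # Take the first two levels of the divisive hierarchy.
    for partition in itertools.islice(girvan_newman(graph), 2):
        print([sorted(block) for block in partition])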

3.3.3 The walktrap based method

The walktrap method is based on the notion of random walks [23]. The main idea is that random walks on a graph tend to get trapped in its dense parts, which may represent communities. In the walktrap method, a distance between communities is calculated based on the properties of random walks; an agglomerative algorithm is then typically used to merge communities and create a dendrogram, much like other methods. This algorithm has a worst-case runtime of O(mn²), or O(n² log n) on sparse graphs.

3.3.4 The leading eigenvector based method

The leading eigenvector algorithm utilizes the eigenvalues of the modularity matrix: one computes the eigenvector corresponding to the most positive eigenvalue of the modularity matrix and divides the network into two groups according to the signs of this vector’s elements.

3.4 Community detection for directed graphs

Community detection in directed networks is a difficult task [24]. Various algorithms for community detection in undirected graphs have been presented in the literature, but methods for directed graphs are less common. A comprehensive survey of community detection methods for directed graphs is given by Malliaros et al. [24], who propose the following classification of approaches.

  1. Naive graph transformation approach. In this method, directions are simply removed from the graph and undirected community detection techniques are applied.

  2. Transformations maintaining directionality. In this category of methods, the graph is transformed into an undirected version while directionality is preserved by other means; the original graph can be transformed into a unipartite weighted graph or a bipartite graph. An overview of such transformations is depicted in Fig. 3.3.

  3. Extending objective functions and methodologies to directed graphs. Many objective functions and quality measures used in undirected graphs can be extended to directed versions, e.g., modularity, spectral clustering, PageRank and random walk methods, and local density clustering.

  4. Alternative approaches. Methods that do not fit the first three categories also exist, such as information theoretic approaches and blockmodeling approaches.

Figure 3.3: An example of a transformation that preserves directionality.

Although some algorithms exist for this purpose, many clustering algorithms for undirected graphs can be extended to directed graphs with the help of a direction-compliant quality measure. Several extensions of modularity for directed graphs have been proposed in the literature. Arenas et al. [25] proposed one such extension. Their idea is based on the fact that in a directed graph, if vertex i has mostly out-links and vertex j has mostly in-links, then in a random rewiring it is more probable that a link is found from i to j than the opposite. Considering the original idea of modularity, this suggests that an edge from j to i contributes more to community structure than an edge from i to j would, simply because it is more surprising and less random. By this reasoning, modularity can be adapted to directed networks by changing the null model to a graph with the same number of vertices and edges, and the same out-degrees and in-degrees, as the original graph. The modularity of a directed graph with adjacency matrix A and m edges can then be expressed as

Q_d = (1 / m) Σ_ij [ A_ij − k_i^out k_j^in / m ] δ(c_i, c_j)    (3.4)

where δ is the Kronecker delta function, c_i and c_j denote the communities that nodes i and j belong to, and k_i^out and k_j^in are the numbers of out-links of vertex i and in-links of vertex j respectively.
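The directed variant changes only the null-model term of the earlier modularity sketch. The code below assumes a dense matrix with adj[i, j] = 1 meaning an edge from i to j:

    import numpy as np

    def directed_modularity(adj, communities):
        """Directed modularity, Eq. 3.4: the null model uses out- and in-degrees."""
        m = adj.sum()                # total number of directed edges
        k_out = adj.sum(axis=1)      # out-degree of each node
        k_in = adj.sum(axis=0)       # in-degree of each node
        q = 0.0
        n = adj.shape[0]
        for i in range(n):
            for j in range(n):
                if communities[i] == communities[j]:
                    q += adj[i, j] - k_out[i] * k_in[j] / m
        return q / m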

3.5 Applications of community detection in software engineering

Graph clustering is widely used in the literature as a method for finding meaning in a structure. This need for finding meaning in a complex system is generally used in four main areas of software engineering.

3.5.1 Reflexion

Reflexion is the art of bridging the gap between software and humans when analyzing a legacy system. Reflexion analysis tries to build an understandable high-level abstraction of a large system, given its source code: the source code is analyzed and mapped to a new, higher-level model. This cumbersome task is typically done manually; however, graph clustering can be used for semi-automated mapping of source code to entities, with the help of the user’s knowledge of the system. Some related work has been presented in the literature [26], [27].

3.5.2 Refactoring

There are many properties associated with good code. Sommerville describes good code as highly maintainable, dependable, efficient and usable [1]. Truly reusable code is considered gold in the software industry, as it significantly affects productivity and thus lowers costs [2]; and without a doubt, good code is backed by a good design. Refactoring is the art of improving the internal structure of code while leaving its external behavior intact [3]. One problem tackled in the literature is refactoring large, complicated legacy systems, along with analyzing the structure of new code. Graph clustering techniques are a good method for finding the correct structure and packages of a large system by analyzing the relationships in a software dependency graph. Some work has been done on refactoring at the class level using graph clustering algorithms [6], and recently some work has also been presented at the package level [19]; however, the lack of an accurate package analysis tool that considers important object-oriented aspects, such as stability and reusability, is strongly felt in the literature.

3.5.3 Parallel computing

Mapping tasks to processors is an important problem in parallel environments. The two general strategies used in such problems are placing tasks that can run concurrently on different processors, while keeping tasks that need much communication on the same processor in order to increase locality. Graph partitioning tools have been used in some cases to map tasks to hypercube structures [28].

3.5.4 Ontologies and concept grouping

One of the areas that heavily utilizes graph clustering methods is ontologies and the semantic web, and various applications have been presented in the literature. One important application is extracting new concepts and taxonomies from ontologies; extracting more generalized concepts and relations is one of the outputs of ontology clustering. Tang et al. present a survey of such methods [29]. Modularization is also important for the problem of ever-growing, overgrown ontologies; the work in [30] is one of the most recent methods in this specific area.

3.6 Partition stability

In some works, the notion of partition stability, also known as robustness, is considered an important property of a good clustering algorithm. The idea is that a stable partition can be recovered even if the structure of the graph is modified, as long as the change is not too extensive. It is important to stress that this thesis only studies stability in the software package sense of the word and does not cover cluster stability.

4.1 Basics of modeling packages with graphs

As discussed in previous chapters, many metrics have been proposed for different software properties at the class level. At the package level, which is in a higher level in the abstraction hierarchy compared to a class, the most important property in the literature is the dependency between two packages. When a class inside a package depends on a class from another package, the former package is said to depend on the latter.

Let G be a graph with adjacency matrix A. Vertices in G represent classes, and an edge between vertices i and j represents a dependency between the two classes. Communities in this graph represent the package structure. A dependency between two classes can be any usage of methods or variables, or inheritance. Classes are modeled as graph vertices for the sole purpose of using community detection methods to find appropriate clusters, which represent packages; different kinds of relationships between classes are not treated differently.

A thorough metric for package dependencies has been proposed by Gupta et al. [32], which takes into account the different types of connections between packages when sub-packages also exist in the software. The metric is validated using Briand’s evaluation criteria [33]. Gupta et al. consider two classes of two packages connected if any of the following relationships holds between them.

  • Aggregation relationships between two classes, i.e., one class’s attribute has the type of another class

  • Class inheritance or interface implementation

  • Method invocation of one class by the method of another class

  • A class’s method referencing an attribute from another class

  • A class’s method has a parameter of the type of another class

  • A class’s method has a local variable of the type of another class

  • A class’s method invoking a method having a parameter of the type of another class

By Gupta et al.’s metric, the coupling between two packages P and Q at hierarchical level h is expressed as

C_h(P, Q) = Σ_{i=1..n_P} Σ_{j=1..n_Q} c(e_i, e_j)

where n_P and n_Q are the numbers of elements of packages P and Q respectively at hierarchy level h, and c(e_i, e_j) is the binary connection between elements. An example of different hierarchical levels, taken from [32], is depicted in Fig. 4.1. The binary connection between elements, c(e_i, e_j), can be calculated as

c(e_i, e_j) = 1 if element e_i depends on element e_j, and 0 otherwise.
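Read this way, the metric at a fixed hierarchy level is a straightforward count over element pairs. A sketch, with data structures invented for illustration:

    def package_coupling(elements_p, elements_q, depends):
        """Count binary connections between two packages' elements.

        elements_p, elements_q: lists of element identifiers at one hierarchy level
        depends: function (a, b) -> bool, True if element a depends on element b
        """
        return sum(1
                   for a in elements_p
                   for b in elements_q
                   if depends(a, b))

    # Hypothetical example: a tiny dependency relation given as a set of pairs.
    edges = {("p.Main", "q.Matrix"), ("p.View", "q.Matrix")}
    coupling = package_coupling(["p.Main", "p.View"], ["q.Matrix", "q.Reader"],
                                lambda a, b: (a, b) in edges)
    print(coupling)  # 2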

Figure 4.1: An example of different hierarchical levels

4.2 Basics of refactoring with community detection

The use of community detection methods for refactoring packages has only recently been studied in the literature, by Pan et al. [19]. An overview of their method is as follows.

  1. Gather software information and dependencies from Java classes and jar files.

  2. Construct an undirected weighted dependency network based on the information gathered in the first step.

  3. Apply community detection to the dependency network to find the optimal placement of classes in packages.

  4. Compare the optimized clustering with the original packages structure of the code and suggest a list of possible refactoring candidates.

In the first step of their algorithm, Pan et al. take into account two types of dependencies: method-accessing-attribute dependencies and method call dependencies. Either of these between members of two classes implies a dependency between the two classes.

Pan et al. model package structure with the help of two different networks, namely the undirected Feature Dependency Network (uFDN) and the undirected Weighted Class Dependency Network (uWCDN). Nodes in the uFDN represent features inside the software, and edges represent dependencies between features. By this definition, the uFDN can be expressed as

uFDN = (V_f, E_f, A_f)    (4.1)

where V_f and E_f represent the sets of vertices and edges in the uFDN respectively, and A_f is the adjacency matrix of the network. The subscript f indicates that the two sets and the adjacency matrix are at the feature level. An example of a uFDN presented in [19], consisting of two communities, is shown in Fig. 4.2.

Figure 4.2: A sample uFDN

The code corresponding to the network in Fig. 4.2 is given below.

public class X
{
    private int a;
    public void c() {}
    public void b() { c(); }
    public void d() { a++; b(); c(); }
}

public class Y
{
    public void f()
    {
        X x = new X();
        x.c();
    }
    public void e() { f(); }
}

In the uWCDN, only the relationships among classes are shown. A weight is attached to every class dependency, representing the number of connections between the attributes and methods of the two classes involved in the relationship. The uWCDN can be defined as

uWCDN = (V_c, E_c, W_c)    (4.2)

where V_c denotes the set of all vertices at the class level, E_c denotes the set of all edges, and W_c is the weighted adjacency matrix of the network. Every entry of W_c can be written as w_ij, the weight between the two classes i and j, denoting the strength of the dependency between nodes i and j. This weight can be calculated as

w_ij = Σ_{u ∈ F_i} |R_1(u) ∩ F_j|    (4.3)

where R_d(u) denotes the set of all nodes reachable from u within a distance of d, and F_i is the set of all features of class i. It is important to note that w_ij is equal to w_ji. The difference between a uFDN and a uWCDN is shown in Fig. 4.3.

Figure 4.3: A sample uWCDN compared to its respective uFDN

The community detection algorithm used by Pan et al utilizes an older definition of modularity [34].

4.3 The importance of directed graphs in modeling package relationships

Many studies in the literature have utilized undirected community detection methods for various applications. Fortunato [35] presents a comprehensive review of undirected community detection methods. Many studies that involve a directed model of a problem simply discard the information that the directions in the graph provide and use a naive graph transformation approach, in which graph directions are dropped and ordinary undirected community detection methods are applied. This can discard much important information. We briefly discuss three main problems that an undirected approach can cause and how they affect refactoring and package stability.

4.3.1 Citation based cluster models

Using naive transformation approaches with undirected community detection introduces inaccuracy in certain graphs, such as the citation-based model depicted in Fig. 4.4. In this graph, the two middle vertices clearly form a meaningful community: they have in-links from the same set of vertices, and the vertices they have out-links to are also the same. In the package sense, the middle community represents a package that is more stable than the package containing the vertices on the left. Many utility packages and libraries contain packages with a similar structure: there is little or no connection between the vertices inside the package, yet they belong to the same community because they are used in similar situations.

Figure 4.4: Citation based cluster

After applying the naive transformation and trying to find optimal communities in the graph of Fig. 4.4, the output simply loses the intended community structure. The output is given in Fig. 4.5: black vertices have been put into one community by the algorithm, and white vertices have been placed in another. In this clustering, it is clear that the SDP (Stable Dependencies Principle) is violated, since both communities depend on each other. Using a community detection algorithm intended for undirected graphs has destroyed the SDP-compliant structure that the programmer had intended.

Figure 4.5: Citation based cluster after naive transformation

4.3.2 Bidirected graphs and loss of information

As discussed in [24], the information needed for correct community detection is simply lost in certain graphs such as the bidirected graph shown in Fig. 4.6.

Figure 4.6: An example of a bidirected graph with two communities
Figure 4.7: An example of a bidirected graph after naive transformation

From a stability perspective, the dependency graph in Fig. 4.6 shows two packages that fully conform to SDP. The community formed by the four vertices on the right represents a very stable package upon which the left community depends. After the naive transformation, the graph looks like Fig. 4.7. This graph has lost its community structure, and the two leftmost vertices and the two rightmost vertices will be treated the same way when the graph is given to a community detection method. Fig. 4.8 shows this graph after applying community detection while optimizing Newman’s modularity.

Figure 4.8: A clustered version of the graph in Fig 4.7

4.4 Stability and modularity

In this section, the relationship between the directed version of modularity and the Stable Dependencies Principle (SDP) in refactoring packages is discussed. In a scenario where a class is chosen to be moved from one package to another using community detection methods, we show that modularity is in favor of SDP: hiding dependencies that violate SDP inside packages contributes more to modularity than hiding non-violating dependencies. To show this behavior, some definitions are needed first.

Definition 1.

A movement of a class c from package P to package Q is denoted by the tuple (c, P, Q).

Definition 2.

A border node in a package is defined as a node that has connections with nodes in other packages, and thus directly affects the package’s instability metric.

SDP is generally satisfied when no stable package depends on an unstable package. When considering the movement of only two border classes, with all other classes and packages left intact, the only dependencies affecting the two packages’ instability metrics are the dependencies of the two border nodes. If a border node of a stable package depends on a node of an unstable package, then SDP is clearly violated.

Remark 4.4.1.

Let k_u^out and k_v^out be the out-link degrees of vertices u and v respectively, and k_u^in and k_v^in be their in-link degrees, where u depends on v. If k_u^out > k_u^in and k_v^in > k_v^out, and nodes u and v are border nodes, then SDP is satisfied.

Remark 4.4.2.

Let k_u^out and k_v^out be the out-link degrees of vertices u and v respectively, and k_u^in and k_v^in be their in-link degrees, where u depends on v. If k_u^out < k_u^in and k_v^in < k_v^out, and nodes u and v are border nodes, then SDP is not satisfied.

Proposition 4.4.3.

Let u and v be two classes in a dependency graph G, with v in package Q. If a movement (u, P, Q) exists, then the increase in modularity obtained when the conditions of Remark 4.4.2 hold (the dependency between u and v violates SDP) is greater than the increase obtained when the conditions of Remark 4.4.1 hold.

Proof.

Let Q_1 denote the modularity gain when the conditions of Remark 4.4.1 hold true, and Q_2 the gain when the conditions of Remark 4.4.2 hold true. Considering only the dependency (u, v) that the movement internalizes, Q_1 and Q_2 can be calculated using Eq. 3.4 as

Q_1 = (1 / m) [ A_uv − k_u^out k_v^in / m ],    Q_2 = (1 / m) [ A_uv − k̄_u^out k̄_v^in / m ]

The bar on an in-link or out-link degree denotes that it is calculated in the scenario of Remark 4.4.2, and is therefore equivalent to the corresponding out-link or in-link degree in the scenario of Remark 4.4.1 respectively, i.e., k̄_u^out = k_u^in and k̄_v^in = k_v^out. Thus one can write

Q_2 − Q_1 = (k_u^out k_v^in − k_u^in k_v^out) / m²

By the conditions of Remarks 4.4.1 and 4.4.2 it is clear that

k_u^out k_v^in > k_u^in k_v^out, and therefore Q_2 > Q_1. ∎

The above proposition shows how modularity is compatible with the notion of SDP. Modularity favors non-random structure in a network. Violating SDP means that a stable package depends on an unstable package; in this scenario, the above proof shows that keeping two nodes whose dependency previously violated SDP inside a single package is better for Q than keeping two nodes that did not violate SDP. It is also important to note that if u and v end up in two different packages, then their dependency makes no contribution to modularity, and this case is therefore not discussed.

As an example of the proved proposition, suppose that a system contains two packages P and Q, where P is an unstable package and Q is a stable one. Two slightly different versions of this system are depicted in Fig. 4.9. In both versions, vertices 1, 2, 3 and 4 are members of P, and vertices 5, 6, 7 and 8 belong to Q. In condition (b), the edge from Q to vertex 1 violates SDP. Based on Proposition 4.4.3, moving node 1 from P to Q makes a larger positive contribution to package modularity in condition (b) than in condition (a). If the movement happens, four new edges positively contribute to the overall modularity of the dependency graph, while one edge’s contribution is eliminated. The reason is that edges between two communities contribute nothing to modularity, because the Kronecker delta in Eq. 3.4 becomes zero: the edges connecting node 1 to nodes 5, 6, 7 and 8 gain contributions to modularity, while the edge connecting node 1 to the rest of P no longer has any. Computing the change in modularity for condition (b) with Eq. 3.4, substituting the number of edges for m, and doing the same for condition (a), the change for condition (b) comes out larger. The results clearly indicate that the graph gains more modularity when an SDP violation is being suppressed than when one is not.

Figure 4.9: Two different graph dependency conditions.

4.5 Proposed refactoring method

By considering the discussed importance of directed graphs in refactoring software packages, and the package coupling metric proposed by Gupta et al. [32], we present a package refactoring algorithm.

For calculating the dependencies, we use the package coupling metric of Gupta et al. [32] at hierarchy level h. This is a crucial point that must be noted: hierarchy level h is used because it gives access to the elements inside packages at level h + 1. The classes and sub-packages at this level of the hierarchy are the ones that will be analyzed for possible refactorings. In this study, only one package level is analyzed for refactoring, as deeper levels raise many open problems that still need to be tackled. The most basic problem with optimizing software metrics such as coupling and cohesion at many levels of abstraction simultaneously is that cohesion inside one level can be considered coupling at a deeper level; thus the problem of minimizing coupling contradicts the problem of maximizing cohesion at a higher level of abstraction, i.e., the package level. Therefore, in this work, only packages at level h and their respective elements at level h + 1 are considered.

For calculating the package dependency graph’s modularity, we use the directed and weighted version of modularity [25], expressed as

Q_d = (1 / W) Σ_ij [ w_ij − w_i^out w_j^in / W ] δ(c_i, c_j)

where w_i^out and w_j^in are respectively the total output weight of node i and the total input weight of node j, and W is the total weight of all edges in the network.

The weight of an edge is its coupling value under the metric of Gupta et al. These weights play the same role in the package dependency network as the weights in the uWCDN (Eq. 4.2) of Pan et al. [19]. Considering the directedness of the network, we can define an enhanced version of the uWCDN, namely the DWPDN (Directed, Weighted Package Dependency Network), expressed as

DWPDN = (V_h, E_h, W_h)    (4.4)

where V_h denotes the set of all vertices at hierarchy level h, E_h denotes the set of all edges at hierarchy level h, and W_h is the asymmetric, weighted adjacency matrix of the network at hierarchy level h. Every element of W_h can be calculated as

w_ij = C_h(i, j)    (4.5)

where i and j are two packages at level h and C_h is the coupling function of Gupta et al. given in Section 4.1.

The main phases of the proposed package refactoring algorithm are presented in Alg. 3.

Input: A DWPDN G
Output: A list of package movement suggestions and the optimal Q_d that can be gained

1:procedure Refactor(G)
2:     suggestedMovements ← empty list
3:     Q ← Q_d(G)
4:     bestGain ← 0
5:     bestCommunity ← null
6:     for every node u do
7:         C_u ← node u’s community
8:         for every node v do
9:             C_v ← node v’s community
10:             ΔQ ← gain in Q_d from moving u to C_v
11:             if ΔQ > bestGain then
12:                 bestGain ← ΔQ
13:                 bestCommunity ← C_v
14:         if bestGain > 0 then
15:             Add movement (u, C_u, bestCommunity) to suggestedMovements
16:             Move node u to bestCommunity
17:             Q ← Q + bestGain
18:             bestGain ← 0
19:     return suggestedMovements and Q
Algorithm 3 Proposed refactoring algorithm
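A compact Python rendering of Algorithm 3 is sketched below, under stated assumptions: directed_modularity is an Eq. 3.4-style function such as the one sketched earlier, and the DWPDN is given as a dense weight matrix. Computing each gain by full recomputation is naive but keeps the sketch short:

    def refactor(weights, communities, directed_modularity):
        """Greedy package refactoring: move each node to the community
        that yields the largest modularity gain, recording each move.

        weights: asymmetric weighted adjacency matrix (the DWPDN)
        communities: mutable list, communities[i] = package of node i
        """
        suggested_movements = []
        q = directed_modularity(weights, communities)
        n = len(communities)
        for u in range(n):
            best_gain, best_community = 0.0, communities[u]
            for v in range(n):
                if communities[v] == communities[u]:
                    continue
                trial = communities.copy()
                trial[u] = communities[v]       # try moving u into v's community
                gain = directed_modularity(weights, trial) - q
                if gain > best_gain:
                    best_gain, best_community = gain, trial[u]
            if best_gain > 0:
                suggested_movements.append((u, communities[u], best_community))
                communities[u] = best_community
                q += best_gain
        return suggested_movements, q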

5.1 Subjects

The two subjects analyzed in this chapter are the same as those in [19], namely Trama (http://trama.sourceforge.net) and FrontEndForMySQL (http://frontend4mysql.sourceforge.net).

Trama is a graphical tool for manipulating and working with matrices. FrontEndForMySQL is a graphical front end for the MySQL database system, providing an easier and more user-friendly environment for working with MySQL queries. Some details of the two subjects are shown in Table 5.1.

System             Version   Number of packages   Number of classes
Trama              1.0       6                    58
FrontEndForMySQL   1.0       10                   56
Table 5.1: Details of the systems analyzed

The original packaging structure of Trama is depicted in Fig. 5.1. The modularity of Trama’s default packaging is calculated as 0.28, and the list of its packages is as follows.

  • visao

  • visao.renderizador

  • persistencia

  • negocio

  • negocio.leitor.Interface

  • negocio.leitor

Figure 5.1: Original packaging structure of Trama

FrontEndForMySQL is a larger system compared to Trama, with an initial package modularity of 0.21. The system’s default packaging structure is depicted in Fig. 5.2 and it contains the following packages.

  • frontendformysql

  • frontendformysql.domain.BackEnd

  • frontendformysql.domain.BackEndData

  • frontendformysql.domain.BackEndComponent.Editor

  • frontendformysql.domain.BackEndInterfaces

  • frontendformysql.domain.BackEnd.System

  • frontendformysql.domain.BackEndComponent.DriverModule

  • frontendformysql.domain.BackEndComponent.XMLutil

  • frontendformysql.domain.BackEndComponent.IO

  • frontendformysql.domain.BackEndComponent.DataStructures

Figure 5.2: Original packaging structure of FrontEndForMySQL

5.2 Case studies and results

After applying the proposed refactoring algorithm, which takes edge directions into account, the clustering of Trama changes to the structure depicted in Fig. 5.3; the suggested movements are given in Table 5.2. The new packaging of Trama has a directed modularity of 0.43, an improvement over the original 0.28. It is important to note that not all movements are necessarily acceptable, and the suggestions should be handed to a programmer for final analysis.

Order   Class name            Old package      Suggested package
1       Main                  negocio          visao
2       Matriz                negocio          persistencia
3       ModeloTabela          visao            persistencia
4       JTableCustomizado     visao            visao.renderizador
5       JTableCustomizado$1   visao            visao.renderizador
6       JTableCustomizado$2   visao            visao.renderizador
7       LeitorDeModelo        negocio.leitor   negocio
8       Tela$23               visao            persistencia
9       Tela$22               visao            persistencia
10      Tela$24               visao            visao.renderizador
11      Tela$3$1              visao            visao.renderizador
Table 5.2: Suggested movements for Trama classes
Figure 5.3: New packaging of the Trama system after refactoring

As a comparison, an undirected version of the algorithm, using the naive transformation, was applied to the Trama system. The produced clustering is shown in Fig. 5.4 and has a modularity of 0.41. It is important to note that directly comparing the modularity values of the two approaches would not be correct, as the formulas of the two quality measures are inherently different. However, a comparison of package instability is shown in Table 5.3, in which OI is the original instability of a package, DI is its instability after applying the proposed refactoring algorithm with edge directions, and UI is its instability after applying the undirected version of the algorithm.

Figure 5.4: New packaging of the Trama system after refactoring with naive transformation
Package name               OI      DI      UI
negocio                    0.478   0.529   0.6
persistencia               0       0.368   0.409
visao.renderizador         0.428   0.538   0
negocio.leitor             0       0       0
visao                      0.64    0.578   0.5
negocio.leitor.Interface   0       0       0
Table 5.3: Comparison of Trama’s instability metric for different approaches

Table 5.3 shows that two packages, negocio and persistencia, end up more stable under the proposed directed clustering algorithm than under the undirected one, while the stability of package visao is lower by 0.078. From Fig. 5.4 it is also clear that visao.renderizador is merged into other packages by the undirected algorithm and is therefore not taken into account in this comparison.
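For reference, the instability values in these tables follow Martin’s instability metric I = Ce / (Ce + Ca), where Ce and Ca are the efferent and afferent couplings of a package. A minimal Python sketch of this computation over a directed class dependency graph (mirroring getInstabilityForEachPackage in Appendix A; the edges and package assignment below are illustrative, not Trama’s real graph):

    # Martin's instability per package: I = Ce / (Ce + Ca)
    def package_instability(edges, package_of):
        ce, ca = {}, {}                     # efferent / afferent couplings
        for src, dst in edges:
            ps, pd = package_of[src], package_of[dst]
            if ps != pd:                    # only cross-package edges count
                ce[ps] = ce.get(ps, 0) + 1
                ca[pd] = ca.get(pd, 0) + 1
        result = {}
        for p in set(package_of.values()):
            total = ce.get(p, 0) + ca.get(p, 0)
            result[p] = float(ce.get(p, 0)) / total if total else 0.0
        return result

    packages = {"Main": "negocio", "Tela": "visao", "Matriz": "negocio"}
    print(package_instability([("Tela", "Matriz"), ("Main", "Tela")], packages))
    # -> {'visao': 0.5, 'negocio': 0.5}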

The implementation of the proposed algorithms was also applied to the FrontEndForMySQL system. The original package structure of FrontEndForMySQL and its structure after refactoring are depicted in Fig. 5.5 and Fig. 5.6, respectively. The original modularity of FrontEndForMySQL is calculated as 0.21.

Figure 5.5: Original packaging of the FrontEndForMySQL system
Figure 5.6: New packaging of the FrontEndForMySQL system after refactoring

Similar to the previous case study, an undirected version of the algorithm, using the naive transformation to remove edge directions, was applied to FrontEndForMySQL; the resulting clustering is depicted in Fig. 5.7. The comparison of the package instability measures is given in Table 5.4.

Figure 5.7: New packaging of the FrontEndForMySQL system after refactoring with naive transformation
Package name                      OI      DI      UI
BackEndInterfaces                 0       0       0.375
BackEnd                           0.969   1       0.714
BackEnd.System                    0.2     0       0
BackEndComponent.IO               0       0.2     0
BackEndComponent.XMLutil          0       0       0
BackEndComponent.Editor           0       0       0
BackEndComponent.DriverModule     0.818   0.25    0.25
BackEndComponent.DataStructures   0       0       0
frontendformysql                  0.666   0       0.6
BackEndData                       0.238   0.125   0.5
Table 5.4: Comparison of FrontEndForMySQL’s instability metric for different approaches

Table 5.4 shows that the overall instability of the packages is higher when edge directions are not taken into account by the refactoring algorithm.

6.1 Picasso overview

Figure 6.1: Picasso: A tool for live package dependency analysis

Picasso applies the proposed refactoring algorithm to software packages and provides a list of class movement suggestions. An example of the suggestions Picasso presents is depicted in Fig. 6.2. Every suggestion is a class movement from a source package to a target package.

Figure 6.2: An example of some suggestions provided by Picasso

Picasso provides several additional features, listed below.

  • Imports Java jar files and class files.

  • Imports UML structures.

  • Provides an option to choose well-known example graphs such as the Zachary karate club network.

  • Calculates modularity and provides a refactored solution for a software system using Alg. 3.

  • Calculates Martin’s instability metric for software packages.

  • Hierarchically generates cluster graphs of a given graph.

  • Provides an extensible messaging system for future work.

  • Provides an edited version of JSNetworkX’s force layout graph visualization algorithm.

  • Provides functions for adding and removing graph edges and nodes.

  • Provides the ability to lock graph nodes in one position for better viewing.

Picasso’s top menu provides the main functionalities of the tool. The menu bar is depicted in Fig. 6.3; in the figure, the tool is in working mode, awaiting a response from the Picasso server. The gray section of the top bar shows information such as the modularity of the current clustering and the name of the currently selected class in the dependency graph. The top buttons consist of two main groups. The left, green buttons trigger directed refactoring, undirected refactoring, and the original clustering of the software system being analyzed. The right, blue buttons provide options for viewing the graph’s cluster graph, viewing the movement suggestions after refactoring, and viewing the instability measures of the different packages. An example of the instability measures window is shown in Fig. 6.4.

Figure 6.3: Picasso’s top menu bar
Figure 6.4: An example of Picasso’s instabilities window
Figure 6.5: Picasso’s sequence diagram

6.2 Picasso’s 3rd party dependencies

Picasso utilizes many diverse 3rd party libraries, some of which have been customized and tweaked specifically for Picasso. The following list contains brief information on these libraries.

  • The Coffea (https://github.com/sbilinski/coffea) Java analysis tool. Coffea is an open-source static code analyzer for Java byte code that can export package dependency graphs in various graph file formats. Coffea is written in Python and therefore integrates well with Picasso.

  • The D3 (http://d3js.org) visualization library.

    D3 stands for Data-Driven Documents and is arguably one of the best JavaScript data visualization libraries; it utilizes HTML5, SVG (Scalable Vector Graphics), CSS3, and JavaScript to provide an extremely flexible platform for data visualization.

  • The JSNetworkX (http://felix-kling.de/JSNetworkX) network visualization library. This library is a port of the popular NetworkX Python graph library and is built upon the D3 platform.

  • Python’s igraph (http://igraph.org) library. The igraph library is used in Picasso for creating and manipulating graphs on the server side.

  • The SockJS (https://github.com/sockjs/sockjs-client) library. SockJS is used by Picasso to create a WebSocket-style messaging system that passes graph and graph cluster information between the server and client sides of the program.

The sequence diagram in Fig. 6.5 shows how Picasso interacts with these dependencies.
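As the server code in Appendix A shows, the client and server exchange plain, slash-delimited command/argument strings over this messaging channel. A simplified Python sketch of the parsing convention (condensed from parseAndApplyMessage; the payload shown is a hypothetical example):

    # Messages have the form "command" or "command/argument".
    def parse_message(msg):
        parts = msg.split("/", 1)
        command = parts[0]
        argument = parts[1] if len(parts) > 1 else None
        return command, argument

    # "refactor" runs the directed algorithm, "urefactor" the undirected one,
    # and "getoriginal" requests the initial clustering and measures.
    print(parse_message('suggestions/[["Main", "negocio", "visao"]]'))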

7.1 Refactoring

The refactoring method presented in this work utilizes a directed and weighted version of Newman’s modularity. This requires modularity to be recalculated at every step of the proposed algorithm, so it runs slower than the algorithm of Pan et al. [19]. This may be considered one of the problems to be tackled in future work. Also, some rare problems have been found with the directed version of modularity [24], so alternative approaches should be considered as well, e.g., random-walk-based methods such as LinkRank.

The importance of directed dependency graphs can also be analyzed at the class level, given an appropriate metric for class coupling and cohesion.

7.2 Tool improvements

Some improvements can be applied to the tool proposed in this work. Currently, a force-directed layout is used for visualizing graphs. Force-directed layouts simulate physical forces between nodes and edges to draw a graph aesthetically; spring-like attractive forces based on Hooke’s law are typically used. The force-directed layout could be enhanced with collision detection algorithms, so that nodes belonging to the same community are grouped together instead of being mixed in with nodes from other communities. Several problems with force-directed layouts on large graphs have also been pointed out in the literature [36], and radial tree layouts have been proposed as alternatives; these can be considered in future implementations of the tool. An example of a radial tree layout from the Barrio tool presented in [36] is depicted in Fig. 7.1, and a single iteration of the force-directed scheme is sketched after the figure.

Figure 7.1: An example of a radial tree layout
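The following minimal Python sketch (constants chosen for readability; this is not Picasso’s actual JSNetworkX/D3 layout) shows one force-directed iteration: Hooke’s-law springs pull connected nodes toward a rest length, while a pairwise repulsive force pushes all nodes apart.

    import math

    def force_step(pos, edges, k_spring=0.05, rest_len=1.0, k_repel=0.5, dt=0.1):
        # pos: {node: (x, y)}, edges: [(node_a, node_b)]
        forces = {v: [0.0, 0.0] for v in pos}
        for a in pos:                                 # pairwise repulsion
            for b in pos:
                if a == b:
                    continue
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k_repel / (d * d)
                forces[a][0] += f * dx / d
                forces[a][1] += f * dy / d
        for a, b in edges:                            # Hooke's-law springs
            dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
            d = math.hypot(dx, dy) or 1e-9
            f = k_spring * (d - rest_len)
            forces[a][0] += f * dx / d
            forces[a][1] += f * dy / d
            forces[b][0] -= f * dx / d
            forces[b][1] -= f * dy / d
        # Euler step: move every node along its accumulated force
        return {v: (pos[v][0] + dt * fx, pos[v][1] + dt * fy)
                for v, (fx, fy) in forces.items()}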

Being able to pin a node to a certain community while calculating the resulting modularity of the graph clustering can be considered an important option for future versions of the application: some library classes might need to be kept in their original package even though doing so decreases modularity.

Appendix A Server-side code for Picasso

# -*- coding: utf-8 -*-
"""
Picasso, ver 0.1
Author: Mohammad A.Raji
Depends on:
    -sockjs-tornado for the asynchronous python server
    -D3.js for visualizing graphs
    -JSNetworkX for visualizing graphs
    -igraph for community detection algorithms
    -Coffea for extracting java dependencies
"""
from __future__ import division
import os
import sys
import tornado.ioloop
import tornado.web
import sockjs.tornado
import igraph
import time
import json
import hashlib
from igraph import *

# Request handler class for the index page
class IndexHandler(tornado.web.RequestHandler):
    def get(self):
        self.render('picasso.html')

# Connection class: responsible for all the client/server connections
class Connection(sockjs.tornado.SockJSConnection):
    participants = set()

    def on_open(self, info):
        # Add client to the clients list
        self.participants.add(self)
        if len(sys.argv) > 1:
            if len(sys.argv) > 2:
                if sys.argv[1] == "--famous":
                    # Load a well-known example graph into the refactoring object
                    refactoring.g = Graph.Famous(sys.argv[2])
                    #refactoring.g.to_undirected()
                    self.broadcast(self.participants, "graph/" + refactoring.graphToString())
            else:
                refactoring.parseCode(sys.argv[1]);
                self.broadcast(self.participants, "graph/" + refactoring.graphToString())
                self.broadcast(self.participants, "labels/" + refactoring.getVertexLabels())

    def on_message(self, message):
        # Take appropriate action when a message arrives from the client
        self.parseAndApplyMessage(message)

    def on_close(self):
        # Remove client from the clients list and broadcast leave message
        self.participants.remove(self)

    def parseAndApplyMessage(self, msg):
        global refactoring
        message = msg.split("/")
        command = message[0]
        if (len(message) > 1):
            argument = message[1];

            if command == "clusters":
                refactoring.parseGraph(argument)
                refactoring = Refactoring(refactoring.detectCommunities().cluster_graph())
                self.broadcast(self.participants, "graph/" + refactoring.graphToString())
            elif command in ["addnode", "removenode", "addedge", "removeedge"]:
                refactoring.parseChange(command, argument)
            elif command == "getoriginal":
                self.broadcast(self.participants, "membership/" + str(refactoring.original_membership).strip("[]"))
                self.broadcast(self.participants, "measures/" + Refactoring.formatMeasures(refactoring.original_modularity));
                self.broadcast(self.participants, "instability/" + Refactoring.formatInstability(refactoring.original_package_instability));
            elif command == "refactor":
                refactored_results = refactoring.refactor()
                go_membership = refactored_results[1]
                self.broadcast(self.participants, "membership/" + str(go_membership).strip("[]"))
                self.broadcast(self.participants, "measures/" + Refactoring.formatMeasures(refactored_results[0]));
                self.broadcast(self.participants, "suggestions/" + Refactoring.formatSuggestions(refactored_results[2]));
                self.broadcast(self.participants, "instability/" + Refactoring.formatInstability(refactoring.getInstabilityForEachPackage(refactoring.g, go_membership, refactoring.packages)));
            elif command == "urefactor":
                refactored_results = refactoring.refactor(False)
                go_membership = refactored_results[1]
                self.broadcast(self.participants, "membership/" + str(go_membership).strip("[]"))
                self.broadcast(self.participants, "measures/" + Refactoring.formatMeasures(refactored_results[0]));
                self.broadcast(self.participants, "suggestions/" + Refactoring.formatSuggestions(refactored_results[2]));
                self.broadcast(self.participants, "instability/" + Refactoring.formatInstability(refactoring.getInstabilityForEachPackage(refactoring.g, go_membership, refactoring.packages)));

            elif command == "fastgreedy":
                go_membership = refactoring.detectCommunities().membership
                self.broadcast(self.participants, "membership/" + str(go_membership).strip("[]"))
                self.broadcast(self.participants, "measures/" + refactoring.getClusterMeasures());

        else:
            refactoring.parseGraph(command)
            go_membership = refactoring.detectCommunities().membership
            self.broadcast(self.participants, "membership/" + str(go_membership).strip("[]"))
            self.broadcast(self.participants, "measures/" + refactoring.getClusterMeasures());

# All refactoring and graph related capabilities
class Refactoring():
    g = None;
    gc = None;
    parsed_code_filename = None;
    original_membership = None;
    original_modularity = None;

    def __init__(self, graph=None):
        self.packages = dict()
        self.original_package_instability = dict()
        self.g = graph

    def parseChange(self, command, arg):
        if command == "addnode":
            self.g.add_vertex(arg)
        elif command == "removenode":
            self.g.delete_vertices(arg)
        elif command == "addedge":
            from_edge = int(arg.split(",")[0])
            to_edge = int(arg.split(",")[1])
            self.g.add_edge(from_edge, to_edge)
        elif command == "removeedge":
            from_edge = int(arg.split(",")[0])
            to_edge = int(arg.split(",")[1])
            self.g.delete_edges((from_edge, to_edge))

    def parseGraph(self, st):
        self.g = Graph()
        st_graph = st.split("|")
        vertices = st_graph[0].split(";")
        for v in vertices:
            self.g.add_vertex(v)

        edges = st_graph[1].split(";")
        for e in edges:
            from_edge = e.split(",")[0]
            to_edge = e.split(",")[1]
            self.g.add_edge(from_edge, to_edge)
        return self.g

    def graphToString(self):
        vertexlist = []
        for v in self.g.vs:
            vertexlist.append(v.index)
        vertex_str = str(vertexlist).strip("[]");
        vertex_str = vertex_str.replace(" ", "");
        s = str(self.g.get_edgelist()).strip("[]");
        s = s.replace("(", "");
        s = s.replace("),", ";");
        s = s.replace(")", "");
        s = s.replace(" ", "");
        s = vertex_str + "|" + s
        return s

    @staticmethod
    def formatMeasures(measure):
        measures = "";
        measures += "modularity:" + str(round(measure, 2)) # + ","
        return measures;

    @staticmethod
    def formatSuggestions(suggestions):
        return json.dumps(suggestions)

    @staticmethod
    def formatInstability(package_instability):
        return json.dumps(package_instability.items())

    def getClusterMeasures(self):
        if (self.gc == None):
            self.gc = self.detectCommunities()
        measures = "";
        measures += "modularity:" + str(round(self.gc.modularity, 2)) # + ","
        return measures;

    def getVertexLabels(self):
        msg = "";
        for v in self.g.vs:
            msg = msg + v['label'] + ",";

        msg = msg.strip(",")
        return msg

    def detectCommunities(self):
        # Remove self-loops and duplicate edges before clustering
        self.g = self.g.simplify(loops=True, multiple=True)
        gc = self.g.as_undirected().community_fastgreedy()
        gc = gc.as_clustering()
        self.gc = gc
        return gc

    # This function works independently from local graph g
    def makeDwpdnMembership(self, graph):
        # If this graph has no label attribute at all
        if "label" not in graph.vs.attribute_names():
            custom_package_index = 0
            for v in graph.vs:
                v['label'] = str(custom_package_index) + "."
                custom_package_index += 1

        self.packages = dict()
        membership = []
        recent_package = 0;
        for v in graph.vs:
            if v['label'] == None:
                # Make a random package name if this package is a new isolated
                # node with no name
                random_package_name = hashlib.md5(str(time.time())).hexdigest()[0:5] + "."
                v['label'] = random_package_name
            package_name = v['label'].rsplit(".", 1)[0];
            if (self.packages.has_key(package_name)):
                membership.append(self.packages[package_name]);
            else:
                self.packages[package_name] = recent_package;
                membership.append(recent_package);
                recent_package += 1;

        return membership

    # This function works independently from local graph g
    def calculateQ(self, graph, membership):
        Q = 0.0;
        graph = graph.simplify(loops=True, multiple=True)
        m = graph.ecount();
        edge_count_factor = 2*m;
        if graph.is_directed() == True:
            edge_count_factor = m
        for i in graph.vs:
            for j in graph.vs:
                if membership[i.index] != membership[j.index]:
                    continue
                else:
                    Aij = 0
                    if graph.are_connected(i, j):
                        Aij = 1
                    wi_out = i.outdegree()
                    wj_in = j.indegree()
                    Q += Aij - (wi_out*wj_in) / edge_count_factor

        Q *= 1/edge_count_factor
        return Q;

    def getInstabilityForEachPackage(self, graph, membership, packages):
        package_in = packages.fromkeys(packages.iterkeys(), 0);
        package_out = packages.fromkeys(packages.iterkeys(), 0);
        package_instability = packages.fromkeys(packages.iterkeys(), 0);
        for e in graph.es:
            if (membership[e.target] != membership[e.source]):
                package_in[self.getPackageNameFromIndex(membership[e.target])] += 1;
                package_out[self.getPackageNameFromIndex(membership[e.source])] += 1;

        print package_out
        print package_in
        for package in packages:
            if package_out[package] + package_in[package] != 0:
                package_instability[package] = package_out[package] / (package_out[package] + package_in[package])
            else:
                package_instability[package] = 0;

        return package_instability

    def getPackageNameFromIndex(self, index):
        for name, i in self.packages.iteritems():
            if i == index:
                return name

    def refactor(self, directed=True):
        if directed == True:
            graph = self.g;
        else:
            graph = self.g.as_undirected()
        suggested_movements = [];
        Q_prime = -1;
        membership = self.makeDwpdnMembership(graph);
        # Check if there is only one package
        if membership.count(0) == len(membership):
            membership = range(len(membership))
        Q = self.calculateQ(graph, membership);
        selected_community = 0;
        v_range = range(graph.vcount())
        while True:
            restart_loop = False
            for index in v_range:
                i = graph.vs[index]
                for j in graph.vs:
                    temp_membership = list(membership)
                    temp_membership[i.index] = temp_membership[j.index];
                    temp_Q = self.calculateQ(graph, temp_membership);
                    if (temp_Q > Q_prime):
                        Q_prime = temp_Q;
                        selected_community = membership[j.index];
                if (Q_prime > Q):
                    suggested_movements.append((i['label'], self.getPackageNameFromIndex(membership[i.index]), self.getPackageNameFromIndex(selected_community)));
                    membership[i.index] = selected_community;
                    Q = Q_prime;
                    restart_loop = True
                    break;
            if not restart_loop:
                break;

        print "Done refactoring"
        return [Q_prime, membership, suggested_movements]

    def parseCode(self, filename):
        os.system("coffea -R -i " + filename + " -f gml -o temp.gml")
        self.parsed_code_filename = filename
        self.g = read('temp.gml')
        self.original_membership = self.makeDwpdnMembership(self.g)
        self.original_modularity = self.calculateQ(self.g, self.original_membership);
        self.original_package_instability = self.getInstabilityForEachPackage(self.g, self.original_membership, self.packages)

if __name__ == "__main__":
    import logging
    logging.getLogger().setLevel(logging.DEBUG)

    # Instantiate the main refactoring object
    refactoring = Refactoring();

    # Create the router
    Router = sockjs.tornado.SockJSRouter(Connection, '/picasso')

    # Create Tornado application
    app = tornado.web.Application(
            [(r"/", IndexHandler)] + Router.urls
    )

    # Make Tornado app and listen on port 8081
    port = 8081
    app.listen(port)
    print "Listening on port " + str(port);

    # Start IOLoop
    tornado.ioloop.IOLoop.instance().start()

Appendix B Client-side JavaScript of Picasso

last = 1;
conn = null;
labels = [];
$(function() {
  colors = ['#FF7F0E', '#AEC7E8', '#2CA02C', '#D62728', '#1F77B4']
  color = window.d3.scale.category20();
  function log(msg)
  {
      console.log(msg);
  }

  function parseAndApplyMessage(msg)
  {
      var message = msg.split("/");
      var command = message[0];
      if (message.length > 1)
      {
          var argument = message[1];
          if (command == "membership")
          {
              applyMembership(argument);
          }
          else if (command == "graph")
          {
              applyGraph(argument);
          }
          else if (command == "labels")
          {
              saveLabels(argument);
          }
          else if (command == "measures")
          {
              updateMeasures(argument);
          }
          else if (command == "suggestions")
          {
              setSuggestions(argument);
              $("#refactor-btn").button("reset");
              $("#urefactor-btn").button("reset");
          }
          else if (command == "instability")
          {
              setInstabilities(argument);
          }
      }
  }

  function setSuggestions(msg)
  {
      var suggestions = JSON.parse(msg);