1.1 Well-known refactoring techniques

Rename method. This may be the simplest refactoring technique one can use. Renaming identifiers such as methods, variables and classes makes the code clearer and more understandable, and can reduce the need for comments. An appropriate name for a method, a variable or a class is one that is descriptive enough for a new programmer to understand its purpose at a glance.

Inline temp. Temporary variables can make methods longer and more complicated. A temporary variable that is used only once, or that merely holds the result of a method call, can be removed completely and the expression assigned to it used directly in the code. An example is provided below.
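
A minimal sketch of Inline temp is given here in Python, the language of the other examples in this section. The Order class, its fields and the threshold of 1000 are hypothetical names and values chosen only for illustration.

```python
class Order:
    """Hypothetical class used only to illustrate Inline temp."""

    def __init__(self, quantity, item_price):
        self.quantity = quantity
        self.item_price = item_price

    def base_price(self):
        return self.quantity * self.item_price

    # Before: the temporary variable only holds the result of a method call.
    def is_expensive_before(self):
        base_price = self.base_price()
        return base_price > 1000

    # After: the temporary is removed and the call is used directly.
    def is_expensive_after(self):
        return self.base_price() > 1000
```

Both versions return the same result for any order, which is the property a refactoring must preserve.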

Extract method. Arguably the most important refactoring technique, Extract method aims at reducing the size of long methods by breaking them into smaller methods with descriptive names. Many refactoring and simplification techniques in software engineering involve breaking code and algorithms into smaller, more understandable chunks, and this technique is one of them.
```python
class Foo:
    username = None

    def __init__(self):
        # Some initialization code
        self.username = "Some username"

    def func1(self):
        print("Welcome")
        print("You have logged in as " + self.username)
        print("Something else")

    def func2(self):
        print("Welcome")
        print("You have logged in as " + self.username)
        print("Some reports")
```

In the provided example, the first two print statements of func1 are identical to the first two of func2 and can be extracted into a new method that greets the user. Extract method is considered an important and basic refactoring technique that strongly affects the cohesion of the classes from which methods have been extracted. Extract method suggests extracting pieces of code that are used more than once (duplicate code). If this condition is met while extracting a piece of code A from methods B and C, then after refactoring both B and C will be using A, thus increasing the cohesion of their class. However, one must realize that if appropriate interfaces are not used in the code and other classes in a package use method A, then coupling will be increased instead of cohesion. A thorough study of this issue, together with a metric for finding appropriate pieces of code to extract while considering the notion of cohesion, is provided in [5].
Considering our focus on graph clustering methods in refactoring, it is worth noting that some work has been done on determining the class a method belongs to with the help of community detection techniques [6]. However, introducing new methods and extracting them with community detection still needs attention.

Inline method. In some cases, the opposite of Extract method should be applied. Suppose method A is simple, clear and is being used only once, possibly in a stable class whose content is not likely to change. In this case, using an identifier for the code in method A only results in an extra call for no benefit. This method can be removed and its content can be used inline.

Replace method with Method object. This technique can be considered an aid in situations where Extract method becomes difficult because a long method has a high number of temporary variables. In such a case, passing all the temporary variables around between the extracted methods is messy, and finding the temporary variables needed by a piece of extracted code can take a lot of time.
To resolve this issue, one approach is to move the long method into a new class, turn its local temporary variables into attributes of that class and then apply Extract method. This provides a better starting point from which the refactoring can continue using Extract method or other techniques.
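
The idea can be sketched in Python as follows. The class names, the pricing rules and the attribute names are hypothetical and serve only to show how former temporaries become attributes of the method object, so that the extracted steps no longer need to pass them around.

```python
class PriceCalculator:
    """Method object: a former long method moved into its own class."""

    def __init__(self, quantity, item_price):
        self.quantity = quantity
        self.item_price = item_price
        # Former local temporaries of the long method, now attributes.
        self.base = 0.0
        self.discount = 0.0
        self.tax = 0.0

    def compute(self):
        # Each step below was extracted from the original long method.
        self._compute_base()
        self._compute_discount()
        self._compute_tax()
        return self.base - self.discount + self.tax

    def _compute_base(self):
        self.base = self.quantity * self.item_price

    def _compute_discount(self):
        # Hypothetical rule: 10% off when the base price exceeds 100.
        self.discount = 0.1 * self.base if self.base > 100 else 0.0

    def _compute_tax(self):
        # Hypothetical rule: 20% tax on the discounted amount.
        self.tax = 0.2 * (self.base - self.discount)


class Order:
    def __init__(self, quantity, item_price):
        self.quantity = quantity
        self.item_price = item_price

    def price(self):
        # The long method is reduced to a delegation to the method object.
        return PriceCalculator(self.quantity, self.item_price).compute()
```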

Pull up method. When a piece of code is duplicated in two different classes, it is best to pull that code up into a common superclass of the two.
Before refactoring:

```python
class Person:
    firstname = None
    lastname = None

    def __init__(self):
        # Some initialization code
        pass


class Student(Person):
    studentNo = None

    def __init__(self):
        # Some initialization code
        pass

    def makeFullName(self):
        return self.firstname + " " + self.lastname

    def getStudentNo(self):
        return self.studentNo


class Employee(Person):
    salary = None

    def __init__(self):
        # Some initialization code
        pass

    def makeFullName(self):
        return self.firstname + " " + self.lastname

    def getSalary(self):
        return self.salary
```

After refactoring:

```python
class Person:
    firstname = None
    lastname = None

    def __init__(self):
        # Some initialization code
        pass

    def makeFullName(self):
        return self.firstname + " " + self.lastname


class Student(Person):
    studentNo = None

    def __init__(self):
        # Some initialization code
        pass

    def getStudentNo(self):
        return self.studentNo


class Employee(Person):
    salary = None

    def __init__(self):
        # Some initialization code
        pass

    def getSalary(self):
        return self.salary
```

Extract surrounding method. Imagine a case in which several methods are almost identical and differ only slightly in the middle. In some languages, one can pull the duplicated code up into a new method and pass the differing middle section to it as a block, to which the new method yields. This ability is provided by languages such as Ruby and can be simulated in other languages by passing callback functions. A Ruby example is given below.
```ruby
def testMethod
  puts "Something printed from inside testMethod"
  yield
  puts "Something printed from inside testMethod"
end

testMethod { puts "Something printed from the block" }
```

Replace conditional with polymorphism. This refactoring helps remove the complexity and code smell of conditional logic and better reflects the principles of object-oriented design.
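
As a small illustration in Python, a conditional on a type code can be replaced by subclasses that each override one method. The shape classes and the area computation are hypothetical and chosen only because they are short.

```python
import math


# Before: behaviour is selected by a conditional on a type code.
def area_before(shape_type, size):
    if shape_type == "square":
        return size * size
    elif shape_type == "circle":
        return math.pi * size * size
    raise ValueError("unknown shape type: " + shape_type)


# After: each branch becomes a subclass that overrides area().
class Shape:
    def __init__(self, size):
        self.size = size

    def area(self):
        raise NotImplementedError


class Square(Shape):
    def area(self):
        return self.size * self.size


class Circle(Shape):
    def area(self):
        return math.pi * self.size * self.size
```

Adding a new shape now means adding a subclass rather than editing every conditional that inspects the type code.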
2.1 Coupling and cohesion
Coupling is one of the best-known internal product attributes. Generally, two pieces of code are said to be coupled if changes in one cause the other to change. In the object-oriented paradigm, coupling between two classes is considered an unwanted attribute. However, a system with no coupling between its classes would have no interaction between them and would therefore simply fail to function.
Cohesion, which is almost always discussed together with coupling, is another important internal product attribute. In an object-oriented system, a class is said to have high cohesion if its internal structures and methods are highly connected to one another. The goal of a good design is high cohesion and low coupling: classes should be cohesive, and therefore fully related to their responsibility, while having low coupling with other classes so that they can change without causing too many changes in other parts of the system. Designs with high cohesion and low coupling make the system more reliable and maintainable [9], [10].
The notions of coupling and cohesion have been extensively studied in the literature and many metrics have been proposed for measuring them. This thesis briefly surveys different approaches in the literature.
2.1.1 Basic definitions by Myers
Myers, Stevens and Constantine introduced the concept of coupling in procedural programming. Based on this, Fenton defined six different levels of coupling [11]. These levels of coupling are shown below from worst to best.

Content coupling. If one element branches into or changes the internal statements of another element, they are said to have content coupling.

Common coupling. If two elements refer to the same global variable, they are said to have common coupling.

Control coupling. If the data that one element sends to the other controls its behavior, then control coupling is implied.

Stamp coupling. Two elements are stamp coupled if they send more information to each other than necessary.

Data coupling. If two elements communicate with each other by parameters with no control coupling, then they are data coupled.

No coupling. If two elements have no communication with each other then they are not coupled.
2.1.2 Fenton and Melton’s metric
Fenton and Melton proposed a metric for coupling which is expressed as
$$C(x, y) = i + \frac{n}{n + 1} \tag{2.1}$$
where $n$ is the number of interactions between the two components $x$ and $y$, and $i$ is the level of the worst coupling found between $x$ and $y$. In their metric, the coupling level is based on Myers’s classification. No coupling is given a coupling level of 0 and the next levels have higher numeric values.
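
A short Python sketch of this metric follows. It assumes the commonly cited form of the metric, the worst coupling level plus n / (n + 1) for n interconnections, and a numbering of Myers’s levels from 0 (no coupling) to 5 (content coupling); the dictionary and function names are hypothetical.

```python
# Assumed numeric levels for Myers's classification, from best to worst.
COUPLING_LEVELS = {
    "none": 0,
    "data": 1,
    "stamp": 2,
    "control": 3,
    "common": 4,
    "content": 5,
}


def fenton_melton_coupling(level, interconnections):
    """Return level + n / (n + 1) for n interconnections between two components."""
    if level < 0 or interconnections < 0:
        raise ValueError("level and interconnections must be non-negative")
    return level + interconnections / (interconnections + 1)
```

For instance, two components with three control-coupled interconnections obtain 3 + 3/4 = 3.75, a value that stays below the next level however many interconnections are added.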
Alghamdi discusses several important points about this metric [11].

All types of interconnections are considered equal, with equal effects on coupling.

The Fenton and Melton metric is an example of an intermodular metric, meaning that it calculates the coupling between a pair of components, in contrast with intrinsic metrics that measure the coupling of a component individually.

Coupling values approach the next level as the number of interconnections between two components increases.
Alghamdi also proposes a new coupling metric based on a description matrix of the system [11].
2.1.3 Chidamber and Kemerer’s suite
Chidamber and Kemerer [12] give the first formal definition of coupling, defining it as any evidence that a method of one class uses a method or variable of another class. In their proposed suite, known as the CK suite, Chidamber and Kemerer provide six metrics, which are as follows.

Weighted Methods per Class (WMC)

Number Of Children (NOC)

Depth of Inheritance Tree (DIT)

Coupling Between Objects (CBO)

Lack of Cohesion in Methods (LCOM)

Response For a Class (RFC)
Among their six metrics, CBO (Coupling Between Objects) is proportional to the number of non-inheritance-related couples with other classes. To measure coupling, CBO counts the total number of couples a class has with other classes, which implies that all couples have the same strength and effect. Hitz and Montazeri [13] argue that the CK suite does not fully conform to measurement theory.
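
The counting nature of CBO can be sketched in Python as follows. The input format (a dictionary from a class name to the set of classes whose methods or variables it uses, with inheritance already excluded) and the choice to count a couple in both directions are assumptions made for this illustration, not a reference implementation of the CK suite.

```python
def cbo(uses):
    """Count, for every class, the distinct other classes it is coupled to.

    uses: dict mapping a class name to the set of class names it uses.
    A couple is counted for both classes involved, whatever its direction.
    """
    coupled = {name: set() for name in uses}
    for source, targets in uses.items():
        for target in targets:
            if target == source:
                continue  # a class is not coupled to itself
            coupled.setdefault(source, set()).add(target)
            coupled.setdefault(target, set()).add(source)
    return {name: len(others) for name, others in coupled.items()}
```

Every couple adds exactly one to the count, regardless of how many calls or variable accesses it represents, which is the equal-strength assumption noted above.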
2.1.4 Alghamdi’s coupling metric
Alghamdi’s approach is based on the idea of generating a description matrix of all the factors that affect coupling and then calculating a coupling matrix based on the collected data. An overview of this approach is depicted in Fig. 2.1.
The description matrix is a matrix whose dimensions are the number of system components by the number of component members. In an object-oriented system, components are represented by classes and members are class variables and methods. An example of a description matrix is depicted in Table 2.1.
Table 2.1: An example of a description matrix.
2.1.5 A qualitative approach to coupling and cohesion
While many quantitative approaches for measuring coupling and cohesion have been proposed in the literature, few qualitative approaches have been discussed. Kelsen [14] proposes an information-based method for analyzing coupling and cohesion and finding refactoring suggestions. Kelsen’s approach considers a special type of coupling, namely representational coupling. When an object calls a method of another object, some information about the callee is exposed. If the information concerns the low-level implementation of the callee, then representational coupling is high. If the call exposes higher-level information, then representational coupling is low. Many metrics in the literature, including those of Chidamber and Kemerer [12], cannot capture representational coupling [14]. The main reason is that many works simply count different types of interactions and assign ordinal numbers to them. Kelsen also presents a minimum for the representational coupling inherently contained in a system, which is known as intrinsic representational coupling.
Kelsen’s approach is based on the idea that if one can find two states in a system, called witness states, that yield different messages between objects but cannot affect the states of other objects, then coupling can be improved and representational coupling is higher than it should be. The elevator example below is borrowed from [14].
Suppose that the behavior of some elevators in a building is modeled using two classes, ElevatorControl and Elevator. Every elevator has two methods, direction() and position(), which return the direction and position of the elevator. The ElevatorControl class is responsible for handling requests and checks every elevator’s direction and position to find the closest elevator for a request. Two different implementations of ElevatorControl’s handleRequest method can be written.
In the second implementation, the task of computing an elevator’s distance is delegated to each elevator through a method of its own. The ElevatorControl class does not need to know the direction and position of every elevator and only needs their distances. Therefore, less information from the Elevator class is exposed in the second implementation and representational coupling is decreased.
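
The difference between the two implementations can be sketched in Python as follows. This is an illustrative assumption rather than the code of [14]: the method name distance_to, the constructor arguments and the use of the absolute floor difference as the distance (which ignores direction) are hypothetical, and handleRequest is written as two separately named methods so that both variants can be compared.

```python
class Elevator:
    def __init__(self, position, direction):
        self._position = position    # current floor
        self._direction = direction  # +1 going up, -1 going down, 0 idle

    def position(self):
        return self._position

    def direction(self):
        return self._direction

    def distance_to(self, floor):
        """Hypothetical method used by the second implementation."""
        return abs(self._position - floor)


class ElevatorControl:
    def __init__(self, elevators):
        self.elevators = elevators

    def handle_request_v1(self, floor):
        # First variant: the controller reads each elevator's position
        # and computes the distance itself, so low-level state is exposed.
        return min(self.elevators, key=lambda e: abs(e.position() - floor))

    def handle_request_v2(self, floor):
        # Second variant: each elevator reports only its distance.
        return min(self.elevators, key=lambda e: e.distance_to(floor))
```

Both variants select the same elevator, but the second one only learns a distance from each Elevator, which is the reduction in exposed information described above.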
Kelsen’s qualitative approach may be considered a precise method for analyzing representational coupling. However, because of its non-quantitative nature, it is not clear whether it can be applied to large, real-life software systems with many classes, and its use in real-life scenarios is currently an open problem.
2.2 Stability
Stability is the likelihood that a class or a package will not change. Stability is inherently difficult to measure because the future changes and needs of a project are not well known; however, some metrics exist that try to measure it. The importance of stability in software metrics was first mentioned by Hitz and Montazeri [13].
Some methods for measuring the stability of a software package use the history of the package’s past changes to predict its future. The changes of a class or a package are typically accessed through version control systems such as Git (http://git-scm.com) and Subversion (http://subversion.apache.org). However, these approaches cannot be used in the early stages of software design because no change history is available at that time.
Robert Martin [15] takes a different approach to measuring the stability of a software package. He argues that stability is proportional to responsibility: a package is said to be responsible and independent if many other entities depend on it while it does not depend on others itself. A package is said to be irresponsible, and thus unstable, if it depends on many other entities, meaning that if they change they cause it to change as well. By Martin’s definition, X in Fig. 2.2 is an example of a stable package and Y in Fig. 2.3 is an example of an unstable package.
As a metric for stability, Martin defines the instability of a package as given in Eq. 2.2, where $I$ is instability, $C_a$ is the number of afferent couplings and $C_e$ is the number of efferent couplings. Afferent couplings is the number of classes outside the package that depend on classes within the package, and efferent couplings is the number of classes within the package that depend on outside classes.
$$I = \frac{C_e}{C_a + C_e} \tag{2.2}$$
If a package has an instability of 0, then it has maximum stability. If a package has an instability of 1, then its number of afferent couplings is 0: the package depends on other packages while no other package depends on it, which makes it an extremely unstable package.
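
The two extreme cases can be checked with a few lines of Python. The function name and the decision to raise an error for a package with no couplings at all are assumptions of this sketch; the ratio itself is efferent couplings divided by the sum of afferent and efferent couplings.

```python
def instability(afferent, efferent):
    """Instability of a package: efferent / (afferent + efferent)."""
    total = afferent + efferent
    if total == 0:
        # Assumption: the ratio is left undefined for an isolated package.
        raise ValueError("instability is undefined for a package with no couplings")
    return efferent / total
```

A package with three incoming and no outgoing dependencies is maximally stable (instability 0), whereas one with four outgoing and no incoming dependencies is maximally unstable (instability 1).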
Martin also proposes the Stable Dependencies Principle (SDP), which helps the software design process by ensuring that modules that should be easy to change do not depend on modules that are harder to change [15]. Under this principle, packages should always have a higher instability than the ones they depend on. Conforming to this principle, one would be able to see a tree of packages in which the stable ones are placed at the bottom and the most unstable ones at the top. The benefit of this arrangement is that packages violating the SDP can be easily spotted: any package depending on a package above it violates the principle.
It is important to note that not all packages should or could be fully stable, as this would result in an unchangeable and inflexible system. Likewise, not all packages can be unstable, as this would create an irresponsible system with a large number of connections and high coupling. Pieces of code that are likely to change should be placed in unstable packages and pieces of code that are not very likely to change in the future should be placed in stable packages. Martin argues that high-level design cannot be placed in unstable packages because it represents the architectural decisions of the project. However, if high-level code is placed in stable packages, then it becomes almost impossible to change once the project matures and more pieces of code start depending on it. The solution to this dilemma is the use of abstract classes, which introduce the flexibility that is needed. The basic idea behind the Stable Abstractions Principle (SAP) is that a package has to be as abstract as it is stable. This principle ensures that the stability of a package does not contradict its flexibility. The SAP proposes a metric for measuring the abstractness of a package, which is a simple ratio shown in Eq. 2.3, in which $N_a$ is the number of abstract classes inside the package and $N_c$ is the total number of classes inside the package.
Martin defines three important areas in the relationship between abstractness and stability. If abstractness (A) is set as the vertical axis and instability (I) as the horizontal axis of a Cartesian graph, then the three areas depicted in Fig. 2.4 are as follows.

Zone of pain. The zone of pain is where a package is highly stable and yet its abstractness is zero. Such a package is hardly changeable.

Zone of uselessness. A package in this zone is highly abstract and also highly unstable and not depended on. This means that its abstractness is useless.

The main sequence. This is the ideal line for a package. A package near the main sequence conforms to the SAP and is as abstract as it is stable. Since the sequence is an ideal, not many packages can truly be placed on this line; however, the distance of a package from it can be measured.
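$$A = \frac{N_a}{N_c} \tag{2.3}$$
$$D = \frac{|A + I - 1|}{\sqrt{2}}, \qquad D' = |A + I - 1| \tag{2.4}$$
In Eq. 2.4, $D$ is the distance from the main sequence and $D'$ is its normalized version, which ranges over $[0, 1]$.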
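
A brief Python sketch of these two metrics follows. It assumes the usual form of Martin's definitions, in which the main sequence is the line A + I = 1, the distance is |A + I - 1| divided by the square root of 2, and the normalized distance is |A + I - 1|; the function names are hypothetical.

```python
import math


def abstractness(abstract_classes, total_classes):
    """Ratio of abstract classes to all classes in a package."""
    return abstract_classes / total_classes


def distance_from_main_sequence(a, i):
    """Return (distance, normalized distance) from the line A + I = 1."""
    normalized = abs(a + i - 1)
    return normalized / math.sqrt(2), normalized
```

A fully concrete and fully stable package (A = 0, I = 0) lies in the zone of pain and has a normalized distance of 1, while a package with A = 0.5 and I = 0.5 lies exactly on the main sequence.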
3.1 Classification of clustering methods
Graph clustering methods are normally difficult to classify; however, Wiggerts [16] believes that they can generally be divided into the following methods.
Hierarchical methods. Hierarchical approaches are among the earliest solutions to the problem. These methods provide a hierarchy of partitions in the form of a tree, known as a dendrogram. A sample dendrogram is depicted in Fig. 3.2. Hierarchical methods are themselves divided into agglomerative approaches and divisive approaches. In agglomerative approaches, the algorithm starts by placing every node in a separate cluster and then merges clusters based on their similarity. It is important to note that the algorithm will not stop unless told to, therefore knowing the desired number of partitions in the result is crucial. In divisive hierarchical approaches, the algorithm starts with a single cluster that contains all the nodes of the graph. The algorithm then splits the cluster based on the similarity between the nodes, keeping similar ones in the same cluster. Different hierarchical algorithms are distinguished by their distance function, which is responsible for determining the similarity between two given nodes.

Optimization based methods. These algorithms generally take an initial, inaccurate clustering and, with the help of a quality measure, try to improve the clustering and optimize its quality. One of the most common quality measures in the literature is the modularity measure proposed by Girvan and Newman [17]. Various kinds of optimization techniques are applicable in this category of graph clustering algorithms, such as genetic algorithms and particle swarm methods. A simple genetic algorithm approach can proceed as follows [18].
1. Select a random population of partitions

2. Generate a new population by selecting the best according to a quality measure, such as Newman’s modularity

3. Repeat step 2 until a certain criterion is met


Graph theoretical based methods. Graph theoretical algorithms utilize the formal descriptions and properties of graphs and their subgraphs to extract meaningful clusters from the original graph. Two important and common types of graph theoretical algorithms exist, namely aggregation algorithms and minimal spanning tree algorithms. Aggregation algorithms apply a reduction to nodes, merging them in each step. Candidate nodes for merging are chosen using different techniques, such as neighbourhood or strong connections. Minimal spanning tree algorithms use the minimal spanning tree of the graph. These algorithms are normally not considered accurate as they tend to create large clusters; however, some enhanced versions have been suggested in the literature [18].

Construction algorithms. These algorithms assign nodes into clusters in one pass. The bisection algorithm and density search techniques are considered as examples of such methods.
The minimum cut approach is the most obvious and easiest way of tackling the problem of community detection. In this method, one tries to find two groups (partitions) in a graph for which the number of edges connecting the two is smallest. This approach mostly falls in the area of graph partitioning, because the number of partitions in the end result must be known a priori so that one knows how many times the algorithm should be applied. It is worth noting that if the minimum cut approach were used with no constraint, a trivial solution would be to place all vertices in one partition, thus minimizing the number of edges between partitions. Clearly, this solution gives no information about the communities in a network. In the software engineering sense, the result of such a method would be a system with zero coupling and maximum cohesion, which seems to be the goal. However, many important aspects of the software such as reusability, separation of concerns, object orientation and flexibility would be lost. This suggests that another measure apart from coupling and cohesion is needed to help find an optimum point between the two. This measure must be able to truly model and represent the different objects in a software dependency network. In graph theory, a measure that models the goodness of a partition is known as a quality measure. Using a community quality measure in the field of software engineering has only recently been discussed in the literature [6], [19].
3.2 Quality measures
The quality of a partition found by a community detection algorithm is determined with a quality measure, which should show how good a partition is. Many algorithms produce many partitions that are not equally good, therefore it is necessary to measure the quality of the produced partitions and select the best. Quality functions assign a number to each partition so that the partitions can be ranked and compared to one another. Arguably, the most common quality function is Newman and Girvan’s modularity [20].
Modularity is based on the idea that a random graph contains no meaningful community. Based on this idea, if one can build a graph similar to the one being analyzed, with the same number of vertices, edges and degrees but with edges placed at random, then by comparing it to the original graph one can find the major differences that have created communities. To understand the notion of modularity, we start with another measure for the goodness of a partition and build on it. Let $G$ be a graph whose adjacency matrix has elements $A_{ij}$, where $A_{ij}$ is 1 if nodes $i$ and $j$ are connected and 0 otherwise, and let $c_i$ be the community to which vertex $i$ belongs. The following measure shows the fraction of edges in graph $G$ that fall within communities.
$$\frac{\sum_{ij} A_{ij}\,\delta(c_i, c_j)}{\sum_{ij} A_{ij}} = \frac{1}{2m} \sum_{ij} A_{ij}\,\delta(c_i, c_j) \tag{3.1}$$
where $\delta$ is the Kronecker delta function and $m$ is the number of edges in the graph.
This fraction takes the value of 1 when all edges fall in one community and hence is not a good enough measure.
The idea behind modularity is that a random graph does not have a meaningful community structure and thus, if generated carefully, should provide a good point of comparison. Carefully generating a random graph that preserves the features and properties of the original graph but has no meaningful community structure is known as providing a null model in the area of complex systems. In this case, one can provide a graph which has the same number of vertices, edges and vertex degrees while its edges are rewired randomly, so that the graph loses its community structure. In such a graph, the probability of an edge existing between vertices $i$ and $j$, if connections are made at random, is calculated as below.
$$\frac{k_i k_j}{2m} \tag{3.2}$$
where $k_i$ and $k_j$ are the degrees of vertices $i$ and $j$ respectively. Now, by using equations 3.1 and 3.2, one can calculate the modularity measure as
$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \tag{3.3}$$
By looking at Eq. 3.3, one can see some important aspects of this measure. The Kronecker delta function ensures that a pair of nodes in two different communities makes no contribution to modularity. Two connected nodes inside a community typically make a positive contribution to modularity, and this contribution decreases as the degrees of the two nodes increase. Also, two nodes that are not connected, yet reside in the same community, make a negative contribution to the overall modularity of the clustering.
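
The measure can be computed directly from its usual definition, in which every ordered pair of nodes in the same community contributes its adjacency entry minus the product of the two degrees divided by twice the number of edges. The Python sketch below does so; the input format (a list of undirected edges without duplicates or self-loops and a dictionary from node to community label) and the function name are assumptions made for this illustration.

```python
def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    degree = {}
    adjacent = set()
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        adjacent.add((u, v))
        adjacent.add((v, u))
    q = 0.0
    for i in community:
        for j in community:
            if community[i] != community[j]:
                continue  # the Kronecker delta is zero for this pair
            a_ij = 1 if (i, j) in adjacent else 0
            q += a_ij - degree.get(i, 0) * degree.get(j, 0) / (2 * m)
    return q / (2 * m)
```

For two triangles joined by a single edge, splitting the graph into the two triangles yields a modularity of 5/14, while placing all six nodes in one community yields 0.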
3.3 A brief discussion of well-known clustering methods
In this section, several common graph clustering methods are briefly studied.
3.3.1 The fast greedy method
A typical greedy method for clustering a graph while utilizing Newman’s modularity consists of the following steps.

1. Start with each vertex in its own community, thus having $n$ communities for $n$ vertices.

2. In each step, merge the two communities whose join produces the highest increase in modularity $Q$.

3. After $n - 1$ joins, one community remains and a dendrogram can be created.

4. Take the clustering that has the highest $Q$.
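
The steps above can be sketched naively in Python. This is the simple greedy method, not the fast greedy method of [21]: every candidate join is evaluated by recomputing modularity from scratch, which is only practical for small graphs. Modularity is computed here in its equivalent per-community form, and the function names and input format are assumptions of the sketch.

```python
def community_modularity(edges, labels):
    """Modularity as a sum over communities of L_c / m - (d_c / 2m)^2,
    where L_c counts internal edges and d_c sums the degrees in c."""
    m = len(edges)
    internal = {}
    degree_sum = {}
    for u, v in edges:
        degree_sum[labels[u]] = degree_sum.get(labels[u], 0) + 1
        degree_sum[labels[v]] = degree_sum.get(labels[v], 0) + 1
        if labels[u] == labels[v]:
            internal[labels[u]] = internal.get(labels[u], 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


def greedy_modularity(nodes, edges):
    """Return (labels, Q) for the best level of the greedy dendrogram."""
    labels = {node: node for node in nodes}  # each vertex in its own community
    best_q = community_modularity(edges, labels)
    best_labels = dict(labels)
    while len(set(labels.values())) > 1:
        communities = sorted(set(labels.values()))
        best_merge = None
        for index, a in enumerate(communities):
            for b in communities[index + 1:]:
                # Tentatively join community b into community a.
                trial = {n: (a if c == b else c) for n, c in labels.items()}
                q = community_modularity(edges, trial)
                if best_merge is None or q > best_merge[0]:
                    best_merge = (q, trial)
        q, labels = best_merge  # apply the join with the highest modularity
        if q > best_q:          # remember the best level seen so far
            best_q, best_labels = q, dict(labels)
    return best_labels, best_q
```

On two triangles joined by a single edge, the sketch recovers the two triangles as communities.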
The simple greedy method can waste a good deal of time when dealing with sparse graphs. In its implementation, one has to merge many rows and columns of the sparse adjacency matrix, and consequently time and space are wasted on merging elements with the value of 0. For this reason, Clauset and Newman have presented an enhanced version of the greedy method, namely the fast greedy method [21], which performs much better than many other algorithms in the literature. In the fast greedy method, data structures such as max-heaps and balanced binary trees are used, together with some alterations to the algorithm, resulting in a runtime of $O(md \log n)$ for a graph with $n$ vertices, $m$ edges and a dendrogram of depth $d$.
3.3.2 The edge-betweenness based method
The edge-betweenness based method, proposed by Girvan and Newman [22] before they presented the modularity measure, is a graph clustering algorithm that focuses on the edges that lie between communities, in contrast to many older algorithms that focus on the connections inside a community. The betweenness of an edge is defined as the number of shortest paths between pairs of vertices that run along it. The algorithm for this method is as follows.

1. Calculate edge betweenness for all edges

2. Remove the edge with the highest betweenness value

3. Recalculate edge betweenness for the rest of the edges

4. Repeat from step 2 until no edges remain
Betweenness for all edges of a graph can be calculated using Newman’s algorithm for betweenness [22] in time $O(mn)$, where $m$ is the number of edges and $n$ is the number of vertices. Edge betweenness has to be recalculated after every edge removal, and thus the whole algorithm runs in time $O(m^2 n)$.
3.3.3 The walktrap based method
The walktrap method is based on the notion of random walks [23]. The main idea behind the method is that random walks in a graph tend to get trapped in dense parts of the graph, which could represent communities. In the walktrap method, a distance between communities is calculated based on the properties of random walks. After this step, an agglomerative algorithm is typically used to merge communities and create a dendrogram, much like other methods. This algorithm has a worst-case runtime of $O(mn^2)$ and typically runs in $O(n^2 \log n)$ on sparse real-world graphs.
3.3.4 The leading eigenvector based method
The leading eigenvector algorithm utilizes the eigenvalues and eigenvectors of the modularity matrix. In this algorithm, one determines the eigenvector corresponding to the most positive eigenvalue of the modularity matrix and divides the network into two groups according to the signs of the elements of this vector.
3.4 Community detection for directed graphs
Community detection in directed networks is a difficult task [24]. Various algorithms for community detection in undirected graphs have been presented in the literature; however, methods for directed graphs have been less common. A comprehensive survey of community detection methods for directed graphs can be found in [24] by Malliaros et al. They propose the following classification of community detection approaches in directed graphs.

Naive graph transformation approach. In this method, directions are simply removed from the graph and undirected community detection techniques are applied.

Transformations maintaining directionality. In this category of methods, the graph is transformed to an undirected version while directionality is maintained using other methods. The original graph can be transformed to a unipartite weighted graph or a bipartite graph for this approach. An overview of such transformations is depicted in Fig. 3.3.

Extending objective functions and methodologies to directed graphs. Many objective functions and quality measures used in undirected graphs can be extended to directed versions, e.g. modularity, spectral clustering, PageRank and random walk methods, and local density clustering.

Alternative approaches. Other methods that cannot be placed in the first three categories also exist, such as information theoretic approaches and blockmodeling approaches.
Although some algorithms exist specifically for this purpose, many clustering algorithms for undirected graphs can be extended to directed graphs with the help of a direction-compliant quality measure. Several extensions of modularity for directed graphs have been proposed in the literature. Arenas et al. [25] proposed one such extension. Their idea is based on the fact that in a directed graph $G$, if vertex $i$ has more outlinks and vertex $j$ has more inlinks, then in a random rewiring it is more probable to find a link from $i$ to $j$ than the opposite. Considering the original idea of modularity, this suggests that if an edge is found from $j$ to $i$, then this edge contributes to a community structure more than an edge from $i$ to $j$ would, simply because it is more surprising and less random. By this definition, modularity can be altered for directed networks by changing the null model to a graph with the same number of vertices, edges, outlinks and inlinks as the original graph. The equation for modularity in a graph with adjacency matrix $A$ and $m$ edges can then be expressed as
$$Q = \frac{1}{m} \sum_{ij} \left[ A_{ij} - \frac{k_i^{out} k_j^{in}}{m} \right] \delta(c_i, c_j) \tag{3.4}$$
where $\delta$ is the Kronecker delta function, $c_i$ and $c_j$ denote the communities that nodes $i$ and $j$ belong to, and $k_i^{out}$ and $k_j^{in}$ are the number of outlinks of vertex $i$ and the number of inlinks of vertex $j$ respectively.
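
A direct Python computation of this directed variant is sketched below. It assumes the convention that the adjacency entry for the ordered pair (i, j) is 1 when there is an edge from i to j, that the null-model term is the out-degree of i times the in-degree of j divided by the number of edges, and that the sum is normalized by the number of edges. The function name and input format are hypothetical.

```python
def directed_modularity(edges, community):
    """Modularity of a partition of a directed, unweighted graph.

    edges: list of (source, target) tuples without duplicates.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    out_degree = {}
    in_degree = {}
    for source, target in edges:
        out_degree[source] = out_degree.get(source, 0) + 1
        in_degree[target] = in_degree.get(target, 0) + 1
    edge_set = set(edges)
    q = 0.0
    for i in community:
        for j in community:
            if community[i] != community[j]:
                continue  # the Kronecker delta is zero for this pair
            a_ij = 1 if (i, j) in edge_set else 0
            q += a_ij - out_degree.get(i, 0) * in_degree.get(j, 0) / m
    return q / m
```

For two directed three-cycles joined by one edge, the partition into the two cycles yields 18/49, and a single community containing all nodes yields 0.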
3.5 Applications of community detection in software engineering
Graph clustering is widely used in the literature as a method for finding meaning in a structure. This ability to find meaning in a complex system is generally applied in four main areas of software engineering.
3.5.1 Reflexion
Reflexion is the art of bridging the gap between software and humans when it comes to analyzing a legacy system. Reflexion analysis tries to build an understandable high level abstraction of a large system, given the source code. In the process, the source code is analyzed and mapped to a new higher level model. This cumbersome task is typically done manually; however, graph clustering can be used in semi-automated mappings of source code to entities with the help of the user’s knowledge about the system. Some related work has been presented in the literature [26], [27].
3.5.2 Refactoring
There are many properties that can be associated with good code. Sommerville describes good code as code that is highly maintainable, dependable, efficient and usable [1]. Truly reusable code is considered gold in the software industry, as it significantly affects productivity and thus lowers costs [2], and good code is without a doubt backed by a good design. Refactoring is the art of improving the internal structure of code while leaving its external behavior intact [3]. One of the problems tackled in the literature is refactoring large and complicated legacy systems, as well as analyzing the structure of new code. Graph clustering techniques are a good method for finding the correct structure and packages of a large system by analyzing the relationships in the software’s dependency graph. Some work has been done on refactoring at the class level using graph clustering algorithms [6]. Recently, some work has also been presented at the package level [19]; however, the literature still lacks an accurate package analysis tool that considers important object oriented aspects such as stability and reusability.
3.5.3 Parallel computing
Mapping tasks to processors is considered an important problem in parallel environments. The two general strategies used in such problems are placing tasks that can run concurrently on different processors, and keeping tasks that communicate heavily on the same processor in order to increase locality. Graph partitioning tools have been used in some cases to map tasks to hypercube structures [28].
3.5.4 Ontologies and concept grouping
One of the areas that makes heavy use of graph clustering methods is ontologies and the semantic web. Various applications have been presented in the literature. One important application is extracting new concepts and taxonomies from ontologies. Extracting more generalized concepts and relations is one of the outputs of ontology clustering. Tang et al. present a survey of such methods [29]. Modularization is also considered important for the problem of ever growing and overgrown ontologies. The work in [30] is one of the most recent methods in this specific area.
3.6 Partition stability
In some works the notion of partition stability, also known as robustness, is considered an important property of a good clustering algorithm. The idea is that a stable partition is one that can be recovered even if the structure of the graph is modified, as long as the change in the graph is not too extensive. It is important to stress that this thesis only studies stability in the software package sense of the word and does not cover cluster stability.
4.1 Basics of modeling packages with graphs
As discussed in previous chapters, many metrics have been proposed for different software properties at the class level. At the package level, which is at a higher level in the abstraction hierarchy than a class, the most important property in the literature is the dependency between two packages. When a class inside a package depends on a class from another package, the former package is said to depend on the latter.
Let $G$ be a graph with the adjacency matrix $A$. Vertices in $G$ represent classes, and an edge between vertex $i$ and vertex $j$ represents a dependency between the two classes. Communities in this graph represent package structure. A dependency between two classes can be any usage of methods or variables, or inheritance. Classes are modeled as graph vertices for the sole purpose of using community detection methods to find appropriate clusters that represent packages, so different kinds of relationships between classes are not distinguished.
A thorough metric for package dependencies has been proposed in [32] by Gupta et al, which takes into account the different types of connections between packages when subpackages also exist in the software. The metric is validated using Briand’s evaluation criteria [33]. Gupta et al consider two classes of two packages connected if any of the following relationships are found between them.

Aggregation relationships between two classes, i.e., one class’s attribute has the type of another class

Class inheritance or implementing interfaces

Method invocation of one class by the method of another class

A class’s method referencing an attribute from another class

A class’s method has a parameter of the type of another class

A class’s method has a local variable of the type of another class

A class’s method invoking a method having a parameter of the type of another class
By Gupta et al’s metric, coupling between two packages and , where denotes the hierarchical level, is expressed as
where and are the number of elements of package and respectively at hierarchy level , and is the binary connection between elements. An example of different hierarchical levels given in [32] is depicted in Fig. 4.1. The binary connection between elements () can be calculated as
where denotes that element depends on element .
4.2 Basics of refactoring with community detection
The use of community detection methods for refactoring packages has only recently been studied in the literature by Pan et al [19]. An overview of their method is as follows.

Gather software information and dependencies from Java classes and jar files.

Construct an undirected weighted dependency network based on the information gathered in the first step.

Apply community detection to the dependency network to find the optimal placement of classes in packages.

Compare the optimized clustering with the original packages structure of the code and suggest a list of possible refactoring candidates.
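The last step of the overview above, comparing the detected clustering with the original packages, can be sketched as follows. This is only an illustration under a stated assumption: each detected community is identified with the original package that contributes most of its classes. That rule, the function name and the class and package names are hypothetical and are not taken from [19].

```python
from collections import Counter, defaultdict


def suggest_moves(original_package, detected_community):
    """Return sorted (class, old_package, suggested_package) tuples.

    original_package   -- dict: class name -> package it currently lives in
    detected_community -- dict: class name -> community id from clustering
    """
    members = defaultdict(list)
    for cls, comm in detected_community.items():
        members[comm].append(cls)

    moves = []
    for classes in members.values():
        # Assumption: a community stands for the package most of its classes came from.
        counts = Counter(original_package[c] for c in classes)
        target, _ = counts.most_common(1)[0]
        for cls in classes:
            if original_package[cls] != target:
                moves.append((cls, original_package[cls], target))
    return sorted(moves)


# Hypothetical system: class C clusters with the classes of package p2.
packages = {"A": "p1", "B": "p1", "C": "p1", "D": "p2", "E": "p2"}
communities = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 1}
moves = suggest_moves(packages, communities)
```

A majority rule is only one way to align communities with packages; ties and communities that match no existing package would need an explicit policy in a real tool.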
In the first step of their algorithm, Pan et al take into account two types of dependencies between code features: a method accessing an attribute, and a method calling another method. Either kind of dependency between members of two classes implies a dependency between the two classes.
Pan et al model package structure with the help of two different networks, namely the undirected Feature Dependency Network (uFDN) and the undirected Weighted Class Dependency Network (uWCDN). Nodes in uFDN represent features inside the software and edges represent dependencies between features. By this definition, uFDN can be expressed as
(4.1) 
where and represent the set of vertices and edges in uFDN respectively and is the adjacency matrix for the network. The subscript shows that the two sets and the adjacency matrix are at the feature level. An example of a uFDN presented in [19], consisting of two communities, is shown in Fig. 4.2.
The code resembling the network in Fig. 4.2 is given below.
In uWCDN, only the relationships among the classes are shown. A weight is used for every class dependency that represents the number of connections between the attributes and methods of the two classes involved in the relationship. uWCDN can be defined as
(4.2) 
where denotes the set of all vertices at the class level, denotes the set of all edges and is the weighted adjacency matrix of the network. Every entry in can be shown as which is the weight between the two elements and and is used to denote the strength of a dependency between nodes and . This weight can be calculated as
(4.3) 
where denotes the set of all nodes reachable from within a distance of and is the set of all the features of class . It is important to note that is equal to .
The difference between uFDN and uWCDN is shown in Fig. 4.3.
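As a rough illustration of how a feature-level network collapses into a weighted class-level network, the sketch below counts, for every pair of classes, the feature-level edges that connect them. It follows the prose description of the weight as a number of connections and does not reproduce the exact weight formula of [19]; the function name, the input shapes and the feature names are assumptions.

```python
from collections import Counter


def class_weights(feature_edges, owner):
    """Collapse undirected feature-level edges into weighted class-level edges.

    feature_edges -- iterable of (feature_a, feature_b) dependencies
    owner         -- dict: feature -> class that declares it
    Returns a Counter keyed by frozenset({class_a, class_b}).
    """
    weights = Counter()
    for fa, fb in feature_edges:
        ca, cb = owner[fa], owner[fb]
        if ca != cb:  # dependencies inside one class add no class-level edge
            weights[frozenset((ca, cb))] += 1
    return weights


# Hypothetical features of two classes A and B.
owner = {"A.m1": "A", "A.x": "A", "B.m2": "B", "B.y": "B"}
feature_edges = [("A.m1", "B.m2"), ("A.m1", "B.y"),
                 ("A.m1", "A.x"), ("B.m2", "A.x")]
weights = class_weights(feature_edges, owner)
```

Using a `frozenset` as the key keeps the class-level network undirected, since the pair (A, B) and the pair (B, A) map to the same entry.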
The community detection algorithm used by Pan et al utilizes an older definition of modularity [34].
4.3 The importance of directed graphs in modeling package relationships
Many studies in the literature have utilized undirected community detection methods for various applications. Fortunato [35] presents a comprehensive review of undirected community detection methods. Many studies that include a directed model of a problem use a naive graph transformation approach, in which graph directions are simply discarded and normal undirected community detection methods are applied to the graph. This can cause much important information to be lost. We briefly discuss the main problems that an undirected approach can cause and how they affect refactoring and package stability.
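The information loss caused by the naive transformation can be seen in a very small sketch. The function below simply forgets edge directions; the two hypothetical dependency chains used in the example point in opposite directions, yet they become indistinguishable after the transformation.

```python
def naive_transformation(directed_edges):
    """Drop edge directions: (u, v) and (v, u) both become the pair {u, v}."""
    return {frozenset((u, v)) for u, v in directed_edges if u != v}


# Two hypothetical dependency chains with opposite directions.
chain_forward = [("a", "b"), ("b", "c")]   # a depends on b, b depends on c
chain_backward = [("b", "a"), ("c", "b")]  # b depends on a, c depends on b

same_after_transformation = (
    naive_transformation(chain_forward) == naive_transformation(chain_backward)
)
```

Any undirected community detection method run on the transformed graphs receives identical input for both systems, so it cannot tell which package is the depended-upon one.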
4.3.1 Citation based cluster models
Using naive transformation approaches for undirected community detection introduces inaccuracy in certain graphs, such as the citation based model that is depicted in Fig. 4.4. In this graph, the two middle vertices can clearly form a meaningful community. The two vertices have inlinks from the same set of vertices, while the vertices that they have outlinks to are also the same. In the package sense, the middle community resembles a package that is more stable than the package containing the vertices on the left. Many utility packages and libraries contain packages with a similar structure. There is little or no connection between the vertices inside the package, yet they belong to the same community because they are used in similar situations.
After applying naive transformation and trying to find optimal communities in the graph in Fig. 4.4, the output simply loses the intended community structure. The output is given in Fig. 4.5. Black vertices have been put into one community by the algorithm and white vertices have been placed in another community. In this clustering, it is clear that SDP (Stable Dependencies Principle) is violated and both communities depend on each other. Using a community detection algorithm intended for undirected graphs has changed the SDP compliant structure that the programmer had intended.
4.3.2 Bidirected graphs and loss of information
As discussed in [24], the information needed for correct community detection is simply lost in certain graphs such as the bidirected graph shown in Fig. 4.6.
From a stability perspective, the dependency graph in Fig. 4.6 shows two packages that fully conform to SDP. The community created by the four vertices on the right represents a very stable package that the left community depends upon. After naive transformation, the graph would look like Fig. 4.7. This graph has lost its community structure, and the two leftmost vertices and the two rightmost vertices will be treated in the same way when the graph is given to a community detection method. Fig. 4.8 shows this graph after applying community detection while optimizing Newman’s modularity.
4.4 Stability and modularity
In this section, the relationship between the directed version of modularity and the Stable Dependencies Principle (SDP) in refactoring packages is discussed. In a scenario where a class is chosen to be moved from one package to another using community detection methods, we show that modularity favors SDP, and that hiding dependencies that violate SDP inside packages contributes more to modularity than hiding nonviolating dependencies. To show this behavior, some prior definitions are needed.
Definition 1.
A movement of class from package to package is shown as the tuple .
Definition 2.
A border node in a package is defined as a node that has connections with nodes in other packages and thus directly affects the package’s instability metric.
SDP is generally satisfied when no stable package depends on an unstable package. When considering the movement of only two border classes, while all other classes and packages are left intact, the only dependencies affecting the two packages’ instability metrics are the dependencies of the two border nodes. If a border node from a stable package depends on a node from an unstable package, then clearly SDP is violated.
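To make the notions of instability and SDP violation concrete, the following sketch computes Martin's instability, I = Ce / (Ca + Ce), for every package and lists the cross-package dependencies that run from a more stable package to a less stable one. It is an illustration under assumptions: Ce and Ca are counted here per dependency edge, which is one of several counting conventions, and the function names and the example system are hypothetical.

```python
from collections import defaultdict


def instability(class_deps, package_of):
    """Martin's instability I = Ce / (Ca + Ce) for every package.

    class_deps -- iterable of (dependent_class, used_class) pairs
    package_of -- dict: class -> package
    Assumption: Ce counts outgoing and Ca incoming cross-package dependency edges.
    """
    ce = defaultdict(int)
    ca = defaultdict(int)
    for src, dst in class_deps:
        p, q = package_of[src], package_of[dst]
        if p != q:
            ce[p] += 1
            ca[q] += 1
    result = {}
    for pkg in set(package_of.values()):
        total = ca[pkg] + ce[pkg]
        result[pkg] = ce[pkg] / total if total else 0.0
    return result


def sdp_violations(class_deps, package_of):
    """Dependencies that point from a more stable to a less stable package."""
    class_deps = list(class_deps)
    inst = instability(class_deps, package_of)
    return [
        (src, dst)
        for src, dst in class_deps
        if package_of[src] != package_of[dst]
        and inst[package_of[src]] < inst[package_of[dst]]
    ]


# Hypothetical system: package "ui" uses package "core", but C also uses A.
package_of = {"A": "ui", "B": "ui", "C": "core"}
deps = [("A", "C"), ("B", "C"), ("C", "A")]
inst = instability(deps, package_of)
violations = sdp_violations(deps, package_of)
```

In this example the dependency from C to A is reported, because the more stable package (core) depends on the less stable one (ui). Both endpoints of that edge are border nodes in the sense of Definition 2.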
Remark 4.4.1.
Let and be the outlink degree of vertices and respectively, and and be the inlink degree of vertices and . If and and node and node are border nodes, then SDP is satisfied.
Remark 4.4.2.
Let and be the outlink degree of vertices and respectively, and and be the inlink degree of vertices and . If and and node and node are border nodes, then SDP is not satisfied.
Proposition 4.4.3.
Proof.
Let denote modularity while the conditions in Remark 4.4.1 hold and denote modularity while the conditions in Remark 4.4.2 hold. and can be calculated using Eq. 3.4 as
The bar on an inlink or outlink denotes that it is calculated in the scenario of Remark 4.4.2, and is therefore equivalent to the outlink and inlink in the scenario of Remark 4.4.1 respectively. Thus one can write
∎
The above proposition shows how modularity is compatible with the notion of SDP. Modularity favors nonrandom structure in a network. Violating SDP means that a stable package depends on an unstable package. In this scenario, the above proof shows that keeping two nodes that previously violated SDP inside a single package is better for modularity than keeping two nodes that did not violate SDP. It is also important to note that if and belong to two different packages, then the condition will have no contribution to modularity and is therefore not discussed.
As an example of the proved proposition, suppose that a system contains two packages and , where is unstable and is a stable package. Two slightly different versions of this system are depicted in Fig. 4.9. In both of these versions, vertices 1, 2, 3 and 4 are members of and vertices 5, 6, 7 and 8 belong to . It is clear that in condition (b), edge is violating SDP. Based on Proposition 4.4.3, we show that moving node 1 from to makes a larger positive contribution to package modularity than in the case of condition (a). If movement happens, then four new edges positively contribute to the overall modularity of the dependency graph while one edge’s contribution is eliminated. The reason for this is that edges between two communities provide no contribution to modularity because the Kronecker delta function in Eq. 3.4 becomes zero. Therefore edges , , and will have new contributions to modularity and edge will no longer have any contribution. The changes in modularity for condition (b) can be calculated using Eq. 3.4 as
By replacing with the number of edges, we have
Changes in modularity for condition (a) can be calculated the same way as follows.
The results clearly indicate that the graph gains more modularity when the movement suppresses an SDP violation than when it does not.
4.5 Proposed refactoring method
By considering the discussed importance of directed graphs in refactoring software packages and the package coupling metric proposed by [32], we present a package refactoring algorithm.
For calculating the dependencies, we use the package coupling metric provided by Gupta et al [32] at hierarchy level . This is a crucial point that must be noted. Hierarchy level is being used because it gives access to elements inside packages at level . The classes and subpackages in this level of hierarchy are the ones that will be analyzed for possible refactorings. In this study, only one package level is analyzed for refactoring, as deeper levels raise many open problems that need to be tackled. The most basic problem with optimizing software metrics such as coupling and cohesion at many levels of abstraction simultaneously is that cohesion inside one level can be considered as coupling in a deeper level; thus the problem of minimizing coupling conflicts with the problem of maximizing cohesion at a higher level of abstraction, i.e., the package level . Therefore, in this work, only packages at level and their respective elements at level are considered.
For calculating the package dependency graph’s modularity, we use the directed and weighted version of modularity [25] expressed as
where and are respectively the output and input weights of nodes and and
The weight of an edge is equal to the edge’s coupling metric given in Eq. 4.1. These weights are used in the package dependency network, similar to the weights in uWCDN (Eq. 4.2) provided by Pan et al [19]. Considering the directedness of the network, we can define an enhanced version of uWCDN, namely DWPDN (Directed, Weighted Package Dependency Network), that can be expressed as
(4.4) 
where denotes the set of all vertices at hierarchy level , denotes the set of all edges at hierarchy level and is the asymmetric and weighted adjacency matrix of the network at hierarchy level . Every element of can be calculated as
(4.5) 
where and are two elements and is the coupling function from Eq. 4.1.
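The construction of a directed, weighted dependency network from a coupling function can be sketched as below. The `coupling` callable is a stand-in supplied by the caller and is not Gupta et al's metric itself; the function name, the returned triple and the example weights are assumptions made for illustration.

```python
def build_dwpdn(elements, coupling):
    """Build (vertices, edges, weight matrix) for a directed, weighted network.

    elements -- ordered list of element names at one hierarchy level
    coupling -- callable (a, b) -> non-negative weight of the dependency a -> b
                (hypothetical stand-in for the coupling function)
    """
    n = len(elements)
    matrix = [[0.0] * n for _ in range(n)]
    edges = []
    for r, a in enumerate(elements):
        for c, b in enumerate(elements):
            if a == b:
                continue
            w = coupling(a, b)
            if w > 0:
                matrix[r][c] = float(w)  # row = dependent, column = dependee
                edges.append((a, b))
    return list(elements), edges, matrix


# Hypothetical coupling values between three elements.
table = {("a", "b"): 2, ("b", "a"): 1, ("b", "c"): 3}
vertices, edges, matrix = build_dwpdn(
    ["a", "b", "c"], lambda x, y: table.get((x, y), 0)
)
```

Because the weight from a to b is stored separately from the weight from b to a, the matrix is in general asymmetric, which is what distinguishes this network from its undirected counterpart.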
The main phases of the proposed package refactoring algorithm are presented in Alg. 3.
5.1 Subjects
The two subjects being analyzed in this chapter are the same as the subjects in [19], namely Trama (http://trama.sourceforge.net) and FrontEndForMySQL (http://frontend4mysql.sourceforge.net).
Trama is a graphical tool for manipulating and working with matrices. FrontEndForMySQL is a graphical front end for the MySQL database system and provides an easier and more user friendly environment for working with MySQL queries. Some details of the two subjects are shown in Table 5.1.
System  Version  Number of packages  Number of classes 

Trama  1.0  6  58 
FrontEndForMySQL  1.0  10  56 
The original packaging structure for Trama is depicted in Fig. 5.1. The modularity of the default packaging of Trama is calculated as 0.28 and the list of its packages is as follows.

visao

visao.renderizador

persistencia

negocio

negocio.leitor.Interface

negocio.leitor
FrontEndForMySQL is a larger system compared to Trama, with an initial package modularity of 0.21. The system’s default packaging structure is depicted in Fig. 5.2 and it contains the following packages.

frontendformysql

frontendformysql.domain.BackEnd

frontendformysql.domain.BackEndData

frontendformysql.domain.BackEndComponent.Editor

frontendformysql.domain.BackEndInterfaces

frontendformysql.domain.BackEnd.System

frontendformysql.domain.BackEndComponent.DriverModule

frontendformysql.domain.BackEndComponent.XMLutil

frontendformysql.domain.BackEndComponent.IO

frontendformysql.domain.BackEndComponent.DataStructures

5.2 Case studies and results
After applying the proposed refactoring algorithm, which takes edge directions into account, the clustering of Trama changes to the structure depicted in Fig. 5.3, and the suggested movements are given in Table 5.2. The new packaging of Trama has a directed modularity of 0.43, an improvement over the original 0.28. It is important to note that not all movements are acceptable and the suggestions should be given to a programmer for final analysis.
Order  Class name  Old package  Suggested package 

1  Main  negocio  visao 
2  Matriz  negocio  persistencia 
3  ModeloTabela  visao  persistencia 
4  JTableCustomizado  visao  visao.renderizador 
5  JTableCustomizado$1  visao  visao.renderizador 
6  JTableCustomizado$2  visao  visao.renderizador 
7  LeitorDeModelo  negocio.leitor  negocio 
8  Tela$23  visao  persistencia 
9  Tela$22  visao  persistencia 
10  Tela$24  visao  visao.renderizador 
11  Tela$3$1  visao  visao.renderizador 
As a comparison, an undirected version of the algorithm, using naive transformation, was applied to the Trama system. The produced clustering is shown in Fig. 5.4. In this clustering, modularity has a value of 0.41. It is important to note that comparing the modularity of the two approaches would not be correct, as the formulas for the two quality measures are inherently different. However, a comparison of package instability is shown in Table 5.3, in which OI is the original instability of a package, DI is the instability of a package after the proposed refactoring algorithm, with edge directions, is applied and UI is the instability of a package after applying the undirected version of the algorithm.
Package name  OI  DI  UI 
negocio  0.478  0.529  0.6 
persistencia  0  0.368  0.409 
visao.renderizador  0.428  0.538  0 
negocio.leitor  0  0  0 
visao  0.64  0.578  0.5 
negocio.leitor.Interface  0  0  0 
Table 5.3 shows that two packages, negocio and persistencia, are more stable after the proposed directed clustering algorithm than after the undirected version, while the stability of package visao is lower by 0.078. From Fig. 5.4, it is also clear that visao.renderizador is merged with other packages in the undirected clustering and is thus not taken into account for comparison.
The implementation of the proposed algorithms was also applied to the FrontEndForMySQL system. The original package structure for FrontEndForMySQL and its structure after refactoring are depicted in Fig. 5.5 and Fig. 5.6 respectively. The original modularity for FrontEndForMySQL is calculated as 0.21.
Similar to the previous case study, an undirected version of the algorithm, using a naive transformation for removing edge directions, was applied to FrontEndForMySQL and the clustering result is depicted in Fig. 5.7. The comparison of package instability measures is given in Table 5.4.
Package name  OI  DI  UI 
BackEndInterfaces  0  0  0.375 
BackEnd  0.969  1  0.714 
BackEnd.System  0.2  0  0 
BackEndComponent.IO  0  0.2  0 
BackEndComponent.XMLutil  0  0  0 
BackEndComponent.Editor  0  0  0 
BackEndComponent.DriverModule  0.818  0.25  0.25 
BackEndComponent.DataStructures  0  0  0 
frontendformysql  0.666  0  0.6 
BackEndData  0.238  0.125  0.5 
Table 5.4 clearly shows that the overall instability of packages is higher when edge directions are not taken into account in the refactoring algorithm.
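The claim about overall instability can be checked directly from the table. The short sketch below copies the rows of Table 5.4 and averages each column; the use of the arithmetic mean as the summary of "overall instability" is an assumption made for this illustration, as are the variable and function names.

```python
# Rows copied from Table 5.4: package -> (OI, DI, UI)
TABLE_5_4 = {
    "BackEndInterfaces": (0, 0, 0.375),
    "BackEnd": (0.969, 1, 0.714),
    "BackEnd.System": (0.2, 0, 0),
    "BackEndComponent.IO": (0, 0.2, 0),
    "BackEndComponent.XMLutil": (0, 0, 0),
    "BackEndComponent.Editor": (0, 0, 0),
    "BackEndComponent.DriverModule": (0.818, 0.25, 0.25),
    "BackEndComponent.DataStructures": (0, 0, 0),
    "frontendformysql": (0.666, 0, 0.6),
    "BackEndData": (0.238, 0.125, 0.5),
}


def column_means(table):
    """Average each of the three instability columns over all packages."""
    n = len(table)
    return tuple(sum(row[k] for row in table.values()) / n for k in range(3))


mean_oi, mean_di, mean_ui = column_means(TABLE_5_4)
```

With these rows the mean instability is about 0.289 for the original packaging, about 0.158 after the directed refactoring and about 0.244 after the undirected one.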
6.1 Picasso overview
Picasso applies the proposed refactoring algorithm on software packages and provides a list of class moving suggestions. An example of the suggestions that Picasso presents is depicted in Fig. 6.2. Every suggestion is a class movement from a source package to a target package.
Picasso provides the following additional features.

Imports Java jar files and class files.

Imports UML structures.

Provides an option to choose well-known graphs such as the Zachary karate club network.

Calculates modularity and provides a refactored solution for a software system using Alg. 3.

Calculates Martin’s instability metric for software packages.

Hierarchically provides cluster graphs of a graph.

Provides an extendible messaging system for future works.

Provides an edited version of JSNetworkX’s force layout graph visualization algorithm.

Provides functions for adding and removing graph edges and nodes.

Provides the ability to lock graph nodes in one position for better viewing.
Picasso’s top menu provides the main functionalities of the tool. The menu bar is depicted in Fig. 6.3, which shows the tool in working mode, awaiting a response from the Picasso server. The gray section of the top bar shows information such as the modularity measure of the current clustering and the name of the currently selected class in the dependency graph. The top buttons consist of two main groups. The green buttons on the left provide directed refactoring, undirected refactoring and the original clustering of the software system being analyzed. The blue buttons on the right provide the options for viewing the graph’s clustering graph, viewing the movement suggestions after refactoring and viewing instability measures for different packages. An example of the instability measure window is shown in Fig. 6.4.
6.2 Picasso’s 3rd party dependencies
Picasso utilizes many diverse 3rd party libraries. Some of these libraries have been customized and tweaked specially for Picasso. The following list contains some brief information on these libraries.

Coffea (https://github.com/sbilinski/coffea) Java analysis tool. Coffea is an open source static code analyzer for Java byte code that can export package dependency graphs in various graph file formats. Coffea is written in Python and therefore can be integrated well with Picasso.

D3 (http://d3js.org) visualization library. D3 stands for Data-Driven Documents and is arguably one of the best JavaScript data visualization tools. It utilizes HTML5, SVG (Scalable Vector Graphics), CSS3 and JavaScript capabilities and provides an extremely flexible platform for data visualization.

JSNetworkX (http://felixkling.de/JSNetworkX) network visualization library. This library is a port of the popular NetworkX Python graph library and is built upon the D3 platform.

Python’s igraph (http://igraph.org) library. Python’s igraph library is used in Picasso for creating and manipulating graphs on the server side.

Python’s SockJS (https://github.com/sockjs/sockjsclient) library. The SockJS library is used by Picasso for creating a web socket messaging system that can pass graph and graph cluster information between the server and client sides of the program.
The sequence diagram in Fig. 6.5 shows how Picasso interacts with these dependencies.
7.1 Refactoring
The refactoring method presented in this work utilizes a directed and weighted version of Newman’s modularity. This requires modularity to be calculated in every step of the proposed algorithm, so the method runs slower than the algorithm of Pan et al [19]. This may be considered one of the problems that can be tackled in future work. Also, some rare problems have been found with the directed version of modularity [24], and alternative approaches should also be considered, e.g., random walk based methods such as LinkRank.
The importance of directed dependency graphs can also be analyzed at the class level, using an appropriate metric for class coupling and cohesion.
7.2 Tool improvements
Some improvements can be applied to the tool proposed in this work. Currently a force directed layout is used for visualizing graphs. Force directed layouts simulate physical forces between nodes and edges to draw a graph aesthetically. Spring-like attractive forces based on Hooke’s law are typically used. The force directed layout can be enhanced with collision detection algorithms, so that nodes that are members of the same community can be grouped together instead of being mixed in with nodes from other communities. Also, several problems with force directed layouts in large graphs have been pointed out in the literature [36], and radial tree layouts have been proposed as alternatives. Radial tree layouts can be considered in future implementations of the tool. An example of a radial tree layout from a tool named Barrio, provided in [36], is depicted in Fig. 7.1.
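The idea behind a force directed layout can be illustrated with a single iteration written in plain Python. This is a generic sketch and not the layout code used by the tool: edges act as springs following Hooke's law, every pair of nodes repels with an inverse-square force, and all constants and names are illustrative assumptions.

```python
import math


def spring_step(pos, edges, rest_length=1.0, k_spring=0.1, k_repel=0.05):
    """Return new positions after one force-directed iteration.

    pos   -- dict: node -> (x, y)
    edges -- iterable of (u, v) pairs drawn as springs
    All constants are illustrative defaults, not values used by the tool.
    """
    force = {v: [0.0, 0.0] for v in pos}
    nodes = list(pos)

    # Pairwise repulsion keeps nodes apart (inverse-square falloff).
    for idx, a in enumerate(nodes):
        for b in nodes[idx + 1:]:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            push = k_repel / (dist * dist)
            force[a][0] += push * dx / dist
            force[a][1] += push * dy / dist
            force[b][0] -= push * dx / dist
            force[b][1] -= push * dy / dist

    # Hooke's law: the pull is proportional to the stretch beyond rest_length.
    for u, v in edges:
        dx = pos[v][0] - pos[u][0]
        dy = pos[v][1] - pos[u][1]
        dist = math.hypot(dx, dy) or 1e-9
        pull = k_spring * (dist - rest_length)
        force[u][0] += pull * dx / dist
        force[u][1] += pull * dy / dist
        force[v][0] -= pull * dx / dist
        force[v][1] -= pull * dy / dist

    return {v: (pos[v][0] + force[v][0], pos[v][1] + force[v][1]) for v in pos}


# Two connected nodes that start farther apart than the rest length.
start = {"a": (0.0, 0.0), "b": (4.0, 0.0)}
after = spring_step(start, [("a", "b")])
```

Repeating the step until the positions stop changing gives the final drawing; grouping by community, as suggested above, would amount to adding a further force or constraint per community.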
Being able to force a node to be a member of a certain community while calculating the resulting modularity of the graph cluster can be considered as one of the important options in future versions of the application. Some library classes might need to be kept in their original package even though modularity is decreased by doing so.