Refactoring Graphs: Assessing Refactoring over Time

Refactoring is an essential activity during software evolution. Frequently, practitioners rely on such transformations to improve source code maintainability and quality. As a consequence, this process may produce new source code entities or change the structure of existing ones. Sometimes, the transformations are atomic, i.e., performed in a single commit. In other cases, they generate sequences of modifications performed over time. To study and reason about refactorings over time, in this paper, we propose a novel concept called refactoring graphs and provide an algorithm to build such graphs. Then, we investigate the history of 10 popular open-source Java-based projects. After eliminating trivial graphs, we characterize a large sample of 1,150 refactoring graphs, providing quantitative data on their size, commits, age, refactoring composition, and developers. We conclude by discussing applications and implications of refactoring graphs, for example, to improve code comprehension, detect refactoring patterns, and support software evolution studies.




I Introduction

Refactoring is a key activity to preserve and evolve the internal design of software systems. Due to the importance of the practice in modern software development, there is a large body of papers and studies about refactoring, shedding light on aspects such as usage of refactoring engines [37, 38], documentation of refactorings using commit messages [37], motivations for performing refactorings [43, 34, 50], benefits and challenges of refactoring [27, 28], among many others.

However, time seems to be an underinvestigated dimension in refactoring studies. The notable exceptions are studies on refactoring tactics, particularly on repeated refactoring operations, often called batch refactorings. For example, Murphy-Hill et al. [37] define batch refactorings as operations that execute within 60 seconds of each other. They report that 40% of refactorings performed using a refactoring tool occur in batches, i.e., programmers repeat refactorings. But the authors also mention that “the main limitation of [our] analysis is that, while we wished to measure how often several related refactorings are performed in sequence, we instead used a 60-second heuristic”. Bibiano et al. [6] investigate the characteristics and impact of batch refactorings on code elements affected by smells. The authors rely on a heuristic to retrieve batches [8], which groups refactorings performed by the same author in a single code element. Thus, their heuristic focuses on single methods or classes, in most cases resulting in batches with a single commit (93%).

Interestingly, in his seminal book on refactoring [15], Fowler dedicates a chapter, co-authored with Kent Beck, to big refactorings. They claim that, when studied individually, refactorings do not provide a whole picture of the “game” played by developers when improving software design, i.e., “refactorings take time [to be concluded]”. However, to our knowledge, refactorings performed over long time windows have not been deeply studied in the literature.

Therefore, we propose and evaluate a novel concept, called refactoring graphs, to study and reason about refactoring activities over time. In such graphs, the nodes are methods and the edges represent refactoring operations. For example, suppose that a method is renamed. This operation is represented by two nodes, one for the old signature and one for the new one, and one edge connecting them. After this first refactoring, suppose that a new method is extracted from the renamed method. As a result, an edge connecting the renamed method to a new node, representing the extracted method, is also added to the graph. Furthermore, refactoring graphs do not impose time constraints between the represented refactoring operations. In our example, the extract operation, for instance, can be performed months after the rename. Finally, refactoring graphs may also express refactorings performed by different developers. In our example, the rename can be performed by one developer and the extract by another.

We formalize an algorithm to build refactoring graphs and use it to extract graphs from 10 well-known and popular open-source Java-based projects. Our goal is to characterize refactoring subgraphs to better understand how refactorings unfold over time. Thus, after removing refactoring subgraphs coming from a single commit (since our goal is to investigate refactorings over time), we answer five research questions about the following properties:

  • Size (RQ1): Most refactoring graphs have at most four nodes (85%) and three edges (83%). However, we also found a graph with 57 nodes and another with 61 edges.

  • Commits (RQ2): Most refactoring subgraphs are generated from two or three commits (95%).

  • Age (RQ3): The age of the refactoring subgraphs ranges from a few days to weeks or even months. For instance, 67% of the subgraphs are more than one month old.

  • Refactoring composition (RQ4): Most refactoring subgraphs include more than one refactoring type (72%).

  • Developers (RQ5): Most refactoring subgraphs are created by a single developer (60%). However, a relevant share (40%) is created by multiple developers.

Our main contributions are threefold. First, we propose and formalize the notion of refactoring graphs, which can be used to study and reason about refactorings performed over any time window by multiple developers. Second, we reveal several properties of a large sample of 1,150 refactoring graphs extracted from 10 real software projects. Third, we discuss several applications and implications of refactoring graphs to expand current refactoring tools, improve code comprehension, detect refactoring patterns, and support software evolution studies.

Structure: Section II defines our concept of refactoring graphs. Section III describes the design of our study, while Section IV shows the results. Section V shows an example of a large refactoring subgraph. We discuss the key applications and implications in Section VI. Section VII states threats to validity and Section VIII presents related work. Finally, we conclude the paper in Section IX.

II Refactoring Graphs

Fig. 1: Refactoring subgraph produced by only one developer

A refactoring graph is a set of disconnected subgraphs $G = \{G_1, G_2, \ldots, G_n\}$. Each $G_i = (V_i, E_i)$ is called a refactoring subgraph, with a set of vertices $V_i$ and a set of directed edges $E_i$. In this way, the history of a software system includes a set of refactoring subgraphs. In refactoring (sub)graphs, the vertices are the full signatures of methods. For instance, we label each method with its package, class, and method signature. Finally, each edge indicates the refactoring type (e.g., move method) and also includes metadata about the operation (e.g., author name and date).
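As an illustration only, the definition above could be encoded in Java (16+, using records) roughly as follows. The type names (`MethodNode`, `RefactoringEdge`, `RefactoringSubgraph`), the label format `package.Class.method(ParamTypes)`, and the metadata fields are hypothetical choices for this sketch, not the representation used in the study.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Set;

// Hypothetical vertex: a method identified by its full signature.
record MethodNode(String packageName, String className, String methodName, List<String> paramTypes) {
    // Label such as "p.C.m(int,String)"; this exact format is an assumption.
    String label() {
        return packageName + "." + className + "." + methodName
                + "(" + String.join(",", paramTypes) + ")";
    }
}

// Hypothetical directed edge: a refactoring from one method to another,
// carrying the refactoring type plus metadata about the operation.
record RefactoringEdge(MethodNode from, MethodNode to, String type,
                       String author, String commit, LocalDate date) {}

// One refactoring subgraph G_i = (V_i, E_i).
record RefactoringSubgraph(Set<MethodNode> vertices, Set<RefactoringEdge> edges) {}
```

For example, the rename-then-extract scenario from the Introduction would yield three `MethodNode` values and two `RefactoringEdge` values, one labeled as a rename and one as an extract.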

Figure 1 shows an example of a refactoring graph. A developer extracted three methods from a single method. The edges refer to the refactoring operations. It is worth noting that a refactoring graph can include refactorings performed by multiple developers. For instance, Figure 2 illustrates a second example, where a developer extracted two methods from a single method. Then, a second developer renamed one of the extracted methods. After that, a code reviewer might have suggested keeping the original name. Thus, the developer undid the latest refactoring, renaming the method back to its original name. In this case, the graph contains refactorings performed by two authors. Besides, there is a cycle when the developer reverts the method to its original name.

Fig. 2: Refactoring subgraph over time
Project | Stars | Forks | Commits | Contributors | Java Files | Latest Version | Description
Elasticsearch | 44,489 | 14,930 | 48,313 | 1,273 | 11,770 | 7.3.2 | Search engine for cloud systems
RxJava | 40,622 | 6,825 | 5,581 | 237 | 1,666 | 3.0.0-RC3 | Library for asynchronous communications
Square Okhttp | 34,484 | 7,521 | 4,273 | 189 | 167 | 4.2.0 | HTTP client
Square Retrofit | 33,801 | 6,254 | 1,756 | 129 | 241 | 2.6.2 | HTTP client
Spring Framework | 32,582 | 21,226 | 19,752 | 396 | 7,203 | 5.2.0 | Framework for web applications
Apache Dubbo | 29,353 | 19,256 | 3,639 | 249 | 1,743 | 2.7.3 | RPC framework
MPAndroidChart | 28,647 | 7,424 | 2,018 | 66 | 220 | 3.1.0 | Library to create charts
Glide | 27,289 | 5,025 | 2,416 | 102 | 647 | 4.10.0 | Library to load images
Lottie Android | 26,952 | 4,278 | 1,139 | 76 | 198 | 3.0.7 | Library to parse animations
Facebook Fresco | 15,870 | 3,595 | 2,158 | 170 | 985 | 2.0.0 | Library to display images
TABLE I: Selected Java projects
Fig. 3: Example of refactoring subgraphs

As presented in Figure 3, we center our study on eight refactorings at the method level. Rename and move are the most trivial operations since they only change the method's signature. Inheritance-based refactorings comprise the movement of one or more methods to supertypes or subtypes (i.e., pull up and push down). For example, a pull up moves methods from subclasses to a superclass. Extract operations generate new methods in the same class (i.e., they create a new node in our subgraphs). It is possible to extract one or multiple methods from a single method. However, as also illustrated in Figure 3, it is also possible to extract a method from multiple methods. In this case, the extracted code was duplicated in each source method. Inline method is the dual operation, involving the removal of trivial methods and the replacement of their calls by the method body. As in the case of extract, we can inline a method into multiple methods. Finally, we consider a refactoring called extract and move that extracts a method to a distinct class.

III Study Design

III-A Selecting Java Projects

We analyze the characteristics and frequency of refactoring subgraphs in popular software systems. We select 10 popular Java projects in terms of stars on GitHub, since stars are a key metric to reveal the popularity of repositories [7, 45]. We also confine our analysis to projects with more than 1K commits and more than 100 Java files to avoid young and small systems. Table I describes the selected projects, including basic information such as the number of stars, forks, commits, contributors, and Java files, the latest version, and a description. These projects cover distinct domains, including web development systems and media processing libraries. The most popular project is Elasticsearch (44,489 stars). The number of forks ranges from 3,595 (Facebook Fresco) to 21,226 (Spring Framework). The number of commits ranges from 1,139 (Lottie Android) to 48,313 (Elasticsearch), while the number of contributors varies from 66 (MPAndroidChart) to 1,273 (Elasticsearch). Square Okhttp is the smallest system (167 files), and Elasticsearch is the largest one (11,770 files).
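The selection criteria can be expressed as a small filter. This is a hedged sketch: `ProjectInfo` and `ProjectSelection` are hypothetical names, and it assumes the star, commit, and file counts have already been collected; it only applies the thresholds described above and keeps the most-starred projects.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical project summary; field names are illustrative.
record ProjectInfo(String name, int stars, int commits, int javaFiles) {}

final class ProjectSelection {
    private ProjectSelection() {}

    // Keeps projects with more than 1K commits and more than 100 Java files,
    // then returns the `limit` most-starred ones.
    static List<ProjectInfo> select(List<ProjectInfo> candidates, int limit) {
        return candidates.stream()
                .filter(p -> p.commits() > 1_000 && p.javaFiles() > 100)
                .sorted(Comparator.comparingInt(ProjectInfo::stars).reversed())
                .limit(limit)
                .toList();
    }
}
```

With the Table I numbers, all ten projects pass both thresholds; the smallest margins are Lottie Android's 1,139 commits and Square Okhttp's 167 Java files.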

III-B Detecting Refactoring Operations

We use RefDiff [44] to detect the refactoring operations needed to build refactoring graphs. RefDiff identifies refactorings between two versions of a git-based project. In our study, we focus on well-known refactoring operations detected by RefDiff at the method level (i.e., rename, move, extract, inline, pull up, and push down, as presented in Figure 3).

RefDiff works by comparing each commit with its previous version in history. To avoid analyzing commits from temporary branches, we focus on the main branch evolution. Particularly, we use the command git log --first-parent to get the list of commits of each project. Additionally, we remove refactorings in packages with the keywords test(s), example(s), and sample(s), since they are not part of the core system.
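A minimal sketch of this preprocessing, assuming `git` is on the `PATH`: it lists main-branch commits with `git log --first-parent` and drops packages whose name contains one of the excluded keywords. The class name `MainBranchHistory` is hypothetical, and we assume the keywords are matched against whole package segments, case-insensitively; the exact matching rule is not specified above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class MainBranchHistory {
    private MainBranchHistory() {}

    // Package segments treated as non-core code (assumed whole-segment match).
    private static final Set<String> EXCLUDED =
            Set.of("test", "tests", "example", "examples", "sample", "samples");

    // Commit hashes of the main branch, newest first, following only first parents
    // so that commits made on merged temporary branches are skipped.
    static List<String> firstParentCommits(Path repo) throws IOException, InterruptedException {
        Process git = new ProcessBuilder("git", "log", "--first-parent", "--format=%H")
                .directory(repo.toFile())
                .redirectErrorStream(true)
                .start();
        List<String> hashes = new ArrayList<>();
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(git.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = out.readLine()) != null) {
                if (!line.isBlank()) hashes.add(line.strip());
            }
        }
        if (git.waitFor() != 0) {
            throw new IOException("git log failed in " + repo);
        }
        return hashes;
    }

    // True unless some segment of a package name such as "org.foo.tests" is excluded.
    static boolean isCorePackage(String packageName) {
        for (String segment : packageName.split("\\.")) {
            if (EXCLUDED.contains(segment.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }
}
```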

III-C Building Refactoring Graphs

As mentioned earlier, we identify refactoring subgraphs over time in 10 systems. Algorithm 1 presents the steps to build refactoring graphs. The input comprises a list of refactorings R, where each refactoring is a triple (m1, m2, t), e.g., method m1 moved to m2 with t = move. First, the algorithm identifies each refactoring and the two methods involved, m1 and m2 (line 3). Then, it creates a directed edge representing this refactoring (line 5). Since V and E are sets, each element is represented only once. The edges are labeled with the refactoring's name t. The output includes sets of refactoring subgraphs in text format.

Input: R (list of refactorings from a system S)
Output: DG (refactoring graph)
1 begin
2     DG ← ∅, V ← ∅, E ← ∅
3     for (m1, m2, t) ∈ R do
4         V ← V ∪ {m1, m2}
5         E ← E ∪ {(m1, m2, t)}
6     end for
7     return (V, E)
8 end
Algorithm 1 Building refactoring graphs
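Algorithm 1 translates almost line by line into Java. In this sketch, `Refactoring` and `RefactoringGraph` are hypothetical names, methods are plain signature strings, and splitting (V, E) into disconnected refactoring subgraphs is done with a union-find over weakly connected components, which is our assumption about how the subgraphs are separated.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// One input refactoring (m1, m2, t): method m1 relates to m2 through type t.
record Refactoring(String m1, String m2, String t) {}

final class RefactoringGraph {
    final Set<String> vertices = new LinkedHashSet<>();    // V
    final Set<Refactoring> edges = new LinkedHashSet<>();  // E, each edge labeled with t

    // Algorithm 1: V and E are sets, so repeated elements are stored only once.
    static RefactoringGraph build(List<Refactoring> r) {
        RefactoringGraph g = new RefactoringGraph();
        for (Refactoring ref : r) {
            g.vertices.add(ref.m1());
            g.vertices.add(ref.m2());
            g.edges.add(ref);
        }
        return g;
    }

    // Assumed split into refactoring subgraphs: weakly connected components via union-find.
    List<Set<Refactoring>> subgraphs() {
        Map<String, String> parent = new HashMap<>();
        for (String v : vertices) parent.put(v, v);
        for (Refactoring e : edges) union(parent, e.m1(), e.m2());
        Map<String, Set<Refactoring>> byRoot = new LinkedHashMap<>();
        for (Refactoring e : edges) {
            byRoot.computeIfAbsent(find(parent, e.m1()), k -> new LinkedHashSet<>()).add(e);
        }
        return new ArrayList<>(byRoot.values());
    }

    private static String find(Map<String, String> parent, String v) {
        while (!parent.get(v).equals(v)) {
            parent.put(v, parent.get(parent.get(v)));  // path halving
            v = parent.get(v);
        }
        return v;
    }

    private static void union(Map<String, String> parent, String a, String b) {
        String ra = find(parent, a);
        String rb = find(parent, b);
        if (!ra.equals(rb)) parent.put(ra, rb);
    }
}
```

Because `Refactoring` is a record, a refactoring reported twice collapses into a single edge, mirroring the set semantics of V and E.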

Table II presents the frequency of refactoring subgraphs in the analyzed systems. Considering all the projects, we detect a total of 8,926 refactoring subgraphs. Spring Framework has the highest number of subgraphs (3,104), while Square Retrofit has the lowest (169). Overall, 87.1% of the refactoring subgraphs comprise a set of operations performed in a single commit. This ratio varies from 69.2% (Glide) to 93.8% (Apache Dubbo). In contrast, 12.9% capture refactorings performed in two or more commits. In this paper, we assess the 1,150 refactoring subgraphs with two or more commits, because they are the ones that represent refactoring over time.
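Selecting the subgraphs that represent refactoring over time reduces to counting the distinct commits of their edges. The sketch below (hypothetical names `OverTimeFilter`, `isOverTime`, `percent`) also shows how the Table II percentages can be reproduced with one-decimal rounding.

```java
import java.util.List;

final class OverTimeFilter {
    private OverTimeFilter() {}

    // A subgraph captures refactoring over time if its edges span two or more commits.
    static boolean isOverTime(List<String> edgeCommitIds) {
        return edgeCommitIds.stream().distinct().count() >= 2;
    }

    // Percentage rounded to one decimal place, as reported in Table II.
    static double percent(long part, long total) {
        return Math.round(1000.0 * part / total) / 10.0;
    }
}
```

For example, `percent(1150, 8926)` yields 12.9 and `percent(7776, 8926)` yields 87.1, matching the Total row.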

Project | All | Single commit | % | Two or more commits | %
Elasticsearch | 2,073 | 1,934 | 93.3 | 139 | 6.7
RxJava | 1,073 | 975 | 90.9 | 98 | 9.1
Square Okhttp | 635 | 548 | 86.3 | 87 | 13.7
Square Retrofit | 169 | 135 | 79.9 | 34 | 20.1
Spring Framework | 3,104 | 2,604 | 83.9 | 500 | 16.1
Apache Dubbo | 483 | 453 | 93.8 | 30 | 6.2
MPAndroidChart | 454 | 381 | 83.9 | 73 | 16.1
Glide | 425 | 294 | 69.2 | 131 | 30.8
Lottie Android | 196 | 173 | 88.3 | 23 | 11.7
Facebook Fresco | 314 | 279 | 88.9 | 35 | 11.1
Total | 8,926 | 7,776 | 87.1 | 1,150 | 12.9
TABLE II: Frequency of refactoring subgraphs

IV Results

IV-A (RQ1) What Is the Size of Refactoring Subgraphs?

As presented in Figure 4, most refactoring subgraphs have three vertices (639 occurrences, 56%). The other recurrent cases comprise subgraphs with two (15%) or four vertices (14%). Square Okhttp holds the largest subgraph regarding the number of vertices (57), which are mostly related to inline operations. Concerning the number of edges, most subgraphs have two (67%) or three edges (16%), as shown in Figure 5. MPAndroidChart has the largest subgraph in terms of edges. It has 61 edges, most of them representing extract and move operations. Therefore, most subgraphs contain few methods (vertices) and refactoring operations (edges).

Figure 6 shows a real example of a refactoring subgraph from MPAndroidChart, which includes three distinct refactoring operations. In the first commit C1, a developer renamed a method. In the subsequent operation, performed 13 days later, the same developer extracted a new method at commit C2. Two days after the second operation, in commit C3, he made new extractions to another class, creating a subgraph with five vertices and four edges.

Fig. 4: Number of vertices by refactoring subgraph
Fig. 5: Number of edges by refactoring subgraph
Fig. 6: Example of a refactoring subgraph from MPAndroidChart

Summary: Most refactoring subgraphs are small. Among the 1,150 subgraphs, most have two to four vertices (85%) and two or three edges (83%).

IV-B (RQ2) How Many Commits Are in Refactoring Subgraphs?

In this second question, we investigate the number of commits per subgraph. As presented in Figure 7, most cases include subgraphs with two (81%) or three commits (14%). The largest subgraph in terms of commits is again from Square Okhttp (18 commits).

Fig. 7: Number of commits by refactoring subgraph

Figure 8 shows an example from Elasticsearch. In commit C1, a developer moved two methods from one class to another. After approximately three months, in commit C2, a second developer extracted duplicated code from three methods into a new method. Two of these source methods are the ones moved earlier. As a consequence, these two commits create a refactoring subgraph with six vertices and five edges.

Fig. 8: Example of a refactoring subgraph from Elasticsearch

Summary: Most refactoring subgraphs are created in two commits (81%) or in three commits (14%).

IV-C (RQ3) What Is the Age of Refactoring Subgraphs?

To assess age, we compute the number of days between the most recent and the oldest commit in a refactoring subgraph. Figure 9 presents the results: the age of refactoring subgraphs varies among the projects. Considering the median of the distributions, the youngest subgraphs are found in Lottie Android and RxJava, with 3 and 3.4 days, respectively. On the other side, the oldest subgraphs are found in Glide (489.8 days), Fresco (192 days), and Spring Framework (127.9 days). The other systems have median subgraph ages between 76.7 (Retrofit) and 102.5 days (Dubbo). Regarding the maturity of the target systems, the youngest project is Lottie Android (3 years), while the oldest one is Elasticsearch (9 years). We ran Spearman's test to assess the correlation between the systems' age and the median age of their refactoring subgraphs. The correlation coefficient (ρ) is 0.067, showing a very weak correlation. In other words, there are subgraphs with different ages in both old and young systems. However, this result should be interpreted with caution, since the test covers only 10 systems and a coefficient this small is not statistically significant.
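The age metric and the per-project medians could be computed as below. This sketch assumes commit timestamps are available as `java.time.Instant` values and reports fractional days, which would be consistent with the non-integer medians above; the class `SubgraphAge` is hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import java.util.List;

final class SubgraphAge {
    private SubgraphAge() {}

    // Age in (fractional) days between the oldest and the most recent commit of a subgraph.
    static double ageInDays(List<Instant> commitTimes) {
        Instant oldest = Collections.min(commitTimes);
        Instant newest = Collections.max(commitTimes);
        return Duration.between(oldest, newest).getSeconds() / 86_400.0;
    }

    // Median of per-subgraph ages, used to compare projects.
    static double median(List<Double> ages) {
        List<Double> sorted = ages.stream().sorted().toList();
        int n = sorted.size();
        return n % 2 == 1 ? sorted.get(n / 2)
                          : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }
}
```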

Fig. 9: Age of the refactoring subgraphs

Figure 10 shows an example of a subgraph describing refactorings performed within a few days in Spring Framework. In commit C1, a developer renamed a method. After six days, the same developer reverted the operation in commit C2, renaming the method back to its original name. As a consequence, these modifications created a subgraph with two vertices and two edges.

Fig. 10: Example of a refactoring subgraph from Spring Framework

Summary: The age of the subgraphs is diverse: while some are only a few days old, the majority are weeks or even months old. For example, 67% of the refactoring subgraphs are more than one month old.

IV-D (RQ4) Which Refactorings Compose the Refactoring Subgraphs?

First, we present the most common refactoring operations in our sample of 1,150 refactoring subgraphs (Table III). The most frequent are rename method (21%), extract and move method (19%), and extract method (17%). By contrast, we detected only 83 occurrences of move and rename operations. There are also few inheritance-based refactorings, i.e., pull up (330 occurrences) and push down (142 occurrences).

Refactoring | Occurrences | %
Rename | 757 | 21
Extract and move | 685 | 19
Extract | 635 | 17
Move | 579 | 16
Inline | 474 | 13
Pull up | 330 | 9
Push down | 142 | 4
Move and rename | 83 | 2
All | 3,685 | 100
TABLE III: Frequency of refactoring operations

Next, we categorize the subgraphs into two groups. The homogeneous group includes subgraphs with a single refactoring type. In contrast, the heterogeneous group comprises subgraphs with at least two distinct refactoring types. As presented in Table IV, overall, around 28% of the subgraphs are homogeneous, while 72% are heterogeneous. The results per system follow a similar tendency: most projects have more heterogeneous subgraphs than homogeneous ones; the sole exception is RxJava (57% vs. 43%). In addition, as presented in Figure 11, heterogeneous subgraphs most often include two distinct refactoring types (84%); in contrast, 12% have three and only 4% have four or more distinct refactoring types.
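Classifying a subgraph as homogeneous or heterogeneous only requires the set of refactoring types on its edges. A minimal sketch, assuming edge labels are plain strings such as `"extract"` (the class name `Composition` is hypothetical):

```java
import java.util.List;

final class Composition {
    private Composition() {}

    // Number of distinct refactoring types among a subgraph's edge labels.
    static long distinctTypes(List<String> edgeTypes) {
        return edgeTypes.stream().distinct().count();
    }

    // Homogeneous: a single refactoring type; heterogeneous: two or more.
    static boolean isHomogeneous(List<String> edgeTypes) {
        return distinctTypes(edgeTypes) == 1;
    }
}
```

The four-extract subgraph from Facebook Fresco discussed below would be homogeneous, while a subgraph mixing rename and extract would be heterogeneous with two distinct types.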

Project | Homogeneous | % | Heterogeneous | %
Elasticsearch | 43 | 30.9 | 96 | 69.1
RxJava | 56 | 57.1 | 42 | 42.9
Square Okhttp | 22 | 25.3 | 65 | 74.7
Square Retrofit | 12 | 35.3 | 22 | 64.7
Spring Framework | 138 | 27.6 | 362 | 72.4
Apache Dubbo | 6 | 20.0 | 24 | 80.0
MPAndroidChart | 16 | 21.9 | 57 | 78.1
Glide | 19 | 14.5 | 112 | 85.5
Lottie Android | 5 | 21.7 | 18 | 78.3
Facebook Fresco | 6 | 17.1 | 29 | 82.9
All | 323 | 28.1 | 827 | 71.9
TABLE IV: Homogeneous vs. heterogeneous refactoring subgraphs
Fig. 11: Number of distinct refactoring operations in heterogeneous subgraphs

Figure 12 shows an example of a homogeneous subgraph from Facebook Fresco. In this case, the subgraph represents four extract operations performed over time. First, in commit C1, a developer extracted a new method from two existing methods. The next operations happened years later, when a second developer performed two new extract operations in commits C2 and C3.

Fig. 12: Example of a homogeneous refactoring subgraph from Facebook Fresco

Summary: Most refactoring subgraphs are heterogeneous (71.9%), i.e., they include more than one refactoring type.

IV-E (RQ5) Are the Refactoring Subgraphs Created by the Same or Multiple Developers?

As the last research question, we separate the refactoring subgraphs into two groups. The first group includes subgraphs with refactoring operations performed by a single developer. The second group holds subgraphs with operations by multiple developers. As presented in Table V, most subgraphs have a single author (60.3%). As reported in RQ2, the number of commits per subgraph is also small. Thus, we ran Spearman's test to evaluate the correlation between the number of developers and the number of commits of each refactoring subgraph. The correlation coefficient (ρ) is 0.244, with a p-value < 0.001, indicating a weak positive correlation between these metrics. That is, subgraphs with more commits tend to involve slightly more developers.
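For reference, a minimal sketch of the two ingredients of this analysis: counting developers by distinct email, and Spearman's ρ computed as the Pearson correlation of ranks, with tied values receiving their average rank (relevant here, since commit and developer counts are small integers with many ties). The class `DeveloperStats` is hypothetical, the case-insensitive email comparison is our assumption, and the sketch does not compute p-values, for which a statistics library would normally be used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class DeveloperStats {
    private DeveloperStats() {}

    // Developers identified by commit email; lower-casing is an assumption of this sketch.
    static long developerCount(List<String> authorEmails) {
        return authorEmails.stream()
                .map(e -> e.toLowerCase(Locale.ROOT))
                .distinct()
                .count();
    }

    // Spearman's rho as the Pearson correlation of average ranks.
    // Returns NaN if either sample is constant (zero rank variance).
    static double spearman(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two paired observations");
        }
        return pearson(ranks(x), ranks(y));
    }

    // 1-based ranks; tied values share the average of the positions they occupy.
    private static double[] ranks(double[] v) {
        Integer[] order = new Integer[v.length];
        for (int i = 0; i < v.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(v[a], v[b]));
        double[] rank = new double[v.length];
        int i = 0;
        while (i < v.length) {
            int j = i;
            while (j + 1 < v.length && v[order[j + 1]] == v[order[i]]) j++;
            double avg = (i + j) / 2.0 + 1.0;
            for (int k = i; k <= j; k++) rank[order[k]] = avg;
            i = j + 1;
        }
        return rank;
    }

    private static double pearson(double[] a, double[] b) {
        double ma = Arrays.stream(a).average().orElse(0.0);
        double mb = Arrays.stream(b).average().orElse(0.0);
        double cov = 0.0, va = 0.0, vb = 0.0;
        for (int i = 0; i < a.length; i++) {
            cov += (a[i] - ma) * (b[i] - mb);
            va += (a[i] - ma) * (a[i] - ma);
            vb += (b[i] - mb) * (b[i] - mb);
        }
        return cov / Math.sqrt(va * vb);
    }
}
```

For untied data the result matches the textbook formula 1 − 6Σd²/(n(n² − 1)); for instance, ranks (1, 2, 3, 4) against (1, 3, 2, 4) give ρ = 0.8.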

Project | Single dev. | % | Multiple devs. | %
Elasticsearch | 32 | 23.0 | 107 | 77.0
RxJava | 88 | 89.8 | 10 | 10.2
Square Okhttp | 32 | 36.8 | 55 | 63.2
Square Retrofit | 14 | 41.2 | 20 | 58.8
Spring Framework | 303 | 60.6 | 197 | 39.4
Apache Dubbo | 17 | 56.7 | 13 | 43.3
MPAndroidChart | 70 | 95.9 | 3 | 4.1
Glide | 116 | 88.5 | 15 | 11.5
Lottie Android | 11 | 47.8 | 12 | 52.2
Facebook Fresco | 10 | 28.6 | 25 | 71.4
All | 693 | 60.3 | 457 | 39.7
TABLE V: Developers of refactoring graphs
Fig. 13: Example of a large refactoring subgraph from Square Okhttp

Figure 14 presents an example of a refactoring subgraph from Square Okhttp. First, in commit C1, a developer D1 renamed three methods of a class, basically removing a common prefix from their names. After 10 months, a second developer D2 removed duplicated code from these methods by extracting it into a new method (commit C2). Then, after seven months, D2 moved this method to a new class in commit C3. As a result, these two developers were responsible for a refactoring subgraph with eight vertices and seven edges.

Fig. 14: Example of a refactoring subgraph created by multiple developers from Square Okhttp

Summary: Most refactoring subgraphs are created by a single developer (60%). Only 40% have multiple developers.

V Large Subgraph Example

In this section, we present and discuss an example of a large refactoring subgraph. As we reported in Section IV, most refactoring subgraphs are small, in terms of number of vertices, edges, and commits. For this reason, we only presented small examples when discussing our RQ results. However, we also found graphs describing major refactorings over time, whose presentation we postponed to this section.

Figure 13 shows an example from Square Okhttp. We chose this example because it encompasses different refactoring operations performed over time and it is one of the largest subgraphs in our dataset. This graph has 37 vertices, four commits, and three refactoring types (move, push down, and extract and move). It was built by multiple developers over six months. As we can observe, the graph nicely describes an example of code duplication removal. First, a developer performed nine push down refactorings to move a method from a superclass to a subclass. Then, a second developer performed 21 extract method operations to move the duplicated code into a single method, which has the following code:

public int readInt() throws IOException {
  require(4, Deadline.NONE);
  return buffer.readInt();
}

Besides that, there are three other extract method operations: (i) readShort() from a single method (this node has a single incoming edge), (ii) readByteString() from four methods, and (iii) decode() from a single method. These new methods are shown at the bottom of Figure 13.

VI Discussion and Implications

VI-A Detecting Refactoring over Time

Several tools and techniques have been proposed in the literature to detect refactoring operations, for instance, Refactoring Crawler [11], RefFinder [26], Refactoring Miner [50, 43], and, more recently, RefDiff [44] and RMiner [51]. These approaches have in common that they only detect atomic refactorings, i.e., operations that happen in a single commit and are performed by a single developer. In contrast, our approach, refactoring graphs, focuses on the detection of refactoring over time, i.e., operations spanning multiple commits and possibly performed by multiple developers. Moreover, unlike batch refactoring [37, 6, 8], our approach is constrained neither by the number of developers nor by a time window. Indeed, we found refactoring subgraphs with ages ranging from weeks to months and created by multiple developers. Therefore, we contribute to the refactoring literature with a novel approach to detect and explore refactoring operations from a broader perspective, complementing existing tools and techniques.

VI-B Refactoring Comprehension and Improvement

When performing code review, developers often adopt diff tools to better understand code changes and decide whether they will be accepted or not. In this process, developers may also look for defects and code improvement opportunities [3]. However, if the reviewed change is large and complex, this task becomes challenging [3]. To alleviate this issue, refactoring-aware code review tools have been proposed [21, 16, 17] to better understand changes mixed with refactoring. Refactoring graphs can help with this issue by providing navigability at the method level. That is, a code reviewer may navigate back through a method's history to reason about how a similar change was performed. For example, in Figure 14, a code reviewer may check whether all methods were properly renamed in the past, before accepting commit C3. Thus, refactoring graphs can be integrated into code review tools to better support code understanding and improvement.

VI-C Detecting Refactoring Patterns and Smells

Frequent refactoring subgraphs may indicate common refactoring patterns over time. In contrast, infrequent refactoring subgraphs that are variations of a pattern may suggest the presence of “refactoring smells” that deserve to be fixed. For example, suppose the refactoring subgraph shown in Figure 2 is frequent: a developer extracted two methods from a single method; then, one of the extracted methods was renamed; finally, it was renamed back to its original name. In this case, if we find a single refactoring subgraph that does not include the last renaming, this may suggest that the developer forgot to undo the rename in that case. In this sense, refactoring subgraphs can be used to spot smells that are only visible because refactoring subgraphs provide the big picture of the refactoring. Indeed, this is a topic that we aim to assess in depth in further research, possibly with the support of graph mining techniques [53, 23, 30]. Thus, refactoring graphs can foster the detection of refactoring anomalies over time and drive a future research agenda on refactoring patterns.
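As a deliberately naive starting point, and not a graph mining technique, subgraphs could be grouped by the multiset of their refactoring types, so that a frequent shape and its rare variants stand out. The sketch below is hypothetical: `PatternCounter` ignores graph topology entirely, which proper frequent-subgraph mining would take into account.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class PatternCounter {
    private PatternCounter() {}

    // Naive "shape" of a subgraph: its edge types, sorted and joined (topology ignored).
    static String shapeKey(List<String> edgeTypes) {
        return edgeTypes.stream().sorted().collect(Collectors.joining("+"));
    }

    // Number of subgraphs per shape; rare shapes are candidate variations of frequent ones.
    static Map<String, Long> frequencies(List<List<String>> subgraphs) {
        return subgraphs.stream()
                .map(PatternCounter::shapeKey)
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }
}
```

In the Figure 2 scenario, the full pattern yields the key `extract+extract+rename+rename`, while a variant lacking the final rename yields `extract+extract+rename`, which would surface as a rare shape.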

VI-D Understanding and Assessing Software Evolution

During software evolution, developers often perform refactoring operations. Consequently, the link between methods may be lost [22]. For example, if a method is renamed and a new method is later extracted from it, it becomes quite hard to trace the extracted method back to the original one, and vice versa. This has several implications for software evolution research, particularly for studies that assess multiple code versions, such as code authorship detection [2, 39, 36, 46, 20], visual support for code evolution [18, 19], and bug-introducing change detection [29, 54, 40, 10, 41], to name a few. In practice, these studies often rely on tools provided by Git and SVN, such as git blame and svn blame, which show what revision and author last modified each line of a file. However, this process is sensitive to refactoring operations [2, 22]. As Git and SVN tools cannot track fine-grained refactoring operations, particularly at the method level, these approaches may miss relevant data. For instance, in the aforementioned example, it would not be possible to detect that the extracted method originated in the original method. Consequently, we would not be able to find the real creator of the extracted method nor the developer who introduced a bug in it. With refactoring graphs, we are able to resolve method names over time; thus, software evolution studies can benefit, as more precise tools can be built on top of them.
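Resolving where a method came from amounts to walking the refactoring edges backwards. The following is a minimal sketch with hypothetical names (`OriginTracer`, and method signatures such as `p.A.foo()`) using breadth-first search; the visited set also keeps rename-and-revert cycles like the one in Figure 2 from looping forever.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class OriginTracer {
    // Reverse adjacency: target method -> methods it was derived from.
    private final Map<String, List<String>> sources = new HashMap<>();

    // Registers a refactoring edge from -> to (e.g., a rename or an extract).
    void addEdge(String from, String to) {
        sources.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
    }

    // All methods that `method` transitively originates from, nearest first (BFS).
    Set<String> origins(String method) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(method));
        while (!queue.isEmpty()) {
            for (String s : sources.getOrDefault(queue.poll(), List.of())) {
                if (!s.equals(method) && seen.add(s)) queue.add(s);
            }
        }
        return seen;
    }
}
```

For a method renamed from `p.A.foo()` to `p.A.bar()` and then used as the source of an extracted `p.A.baz()`, `origins("p.A.baz()")` returns `p.A.bar()` followed by `p.A.foo()`.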

VII Threats to Validity

Generalization of the results. We analyzed 1,150 refactoring subgraphs from 10 popular open-source Java systems. Therefore, our dataset is built over credible, real-world software systems. Despite these observations, our findings, as usual in empirical software engineering, may not be directly generalized to other systems, particularly commercial and closed-source ones, and those implemented in languages other than Java. Besides that, we focus our study on eight refactorings at the method level. Thus, considering other refactoring types could affect the size of subgraphs. We plan to extend this research to cover software systems implemented in other programming languages and refactorings at the class level.

Adoption of RefDiff. We adopted RefDiff to detect refactoring operations because it is the sole refactoring detection tool that is multi-language, working for Java, JavaScript, and C. It is also extensible to other programming languages. Thus, as we plan to extend this research to programming languages other than Java, RefDiff was the proper solution. In addition to being multi-language, RefDiff's accuracy is quite high. RefDiff's authors provide two evaluations of their tool [44]. In the first evaluation, it achieved an overall F-measure of 96.8% (precision: 100%; recall: 93.9%). In the second evaluation, RefDiff's authors analyzed 102 real refactoring instances. In this case, it achieved an overall F-measure of 89.3% (precision: 85.4%; recall: 93.6%). Recently, Tsantalis et al. [51] proposed the refactoring detection tool RMiner. When considering all refactoring operations, RMiner has an F-measure of 92% (precision: 98%; recall: 87%), improving on RefDiff's overall accuracy. However, RMiner works only for Java projects.
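The F-measure values above are the harmonic mean of precision P and recall R. As a worked check, the second RefDiff evaluation and the RMiner figures give:

```latex
F = \frac{2PR}{P + R}, \qquad
F_{\text{RefDiff, 2nd}} = \frac{2 \times 0.854 \times 0.936}{0.854 + 0.936}
  \approx \frac{1.599}{1.790} \approx 0.893, \qquad
F_{\text{RMiner}} = \frac{2 \times 0.98 \times 0.87}{0.98 + 0.87}
  \approx \frac{1.705}{1.85} \approx 0.92
```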

Building refactoring graphs. When creating the refactoring graphs, we cleaned up our data (i.e., vertices and edges) to keep only meaningful subgraphs. For instance, we removed constructors (vertices) from our analysis because they mostly contain initialization settings and do not have behavior like conventional methods. We also removed some very specific cases of refactoring (edges) in which RefDiff reported false positives involving inner classes or the same method. However, these cases are not likely to affect our results because they represent only a small fraction (3.5%) of the refactoring operations. Finally, the refactoring subgraphs can include unintentional operations (e.g., commits reverted by automatic deployment systems). To mitigate this threat, we focus our study on the main branch evolution to avoid experimental or unstable versions.

Detection of developers. In RQ5, we investigate the number of developers per refactoring subgraph. We used the author email recorded in the git log to distinguish developers. Thus, our results may count the same developer more than once when they committed with different email addresses. Since such aliases can only inflate the number of developers, this threat does not weaken our finding that most subgraphs are created by a single developer.
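A sketch of the developer count, assuming each subgraph is given as the list of author emails of its commits. The `aliases` map is a hypothetical input (the text does not describe any alias merging); it illustrates that resolving several addresses of one person can only lower the per-subgraph count, and hence raise the share of single-developer subgraphs. Lower-casing addresses is also an assumption of this sketch.

```python
def count_developers(author_emails, aliases=None):
    """Counts distinct developers by email; `aliases` (hypothetical) maps extra addresses to a canonical one."""
    aliases = aliases or {}
    canonical = set()
    for email in author_emails:
        # Assumption: addresses are compared case-insensitively after trimming whitespace.
        key = email.strip().lower()
        canonical.add(aliases.get(key, key))
    return len(canonical)


def single_developer_share(subgraphs, aliases=None):
    """Fraction of subgraphs whose commits were authored by exactly one developer."""
    counts = [count_developers(emails, aliases) for emails in subgraphs]
    return sum(1 for c in counts if c == 1) / len(counts)


# Author emails of the commits in three hypothetical subgraphs.
subgraphs = [
    ["alice@example.com", "alice@example.com"],
    ["alice@example.com", "Alice@users.noreply.github.com"],
    ["bob@example.com", "carol@example.com"],
]
aliases = {"alice@users.noreply.github.com": "alice@example.com"}

print(single_developer_share(subgraphs))           # raw emails
print(single_developer_share(subgraphs, aliases))  # after merging a known alias
```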

VIII Related Work

Refactoring is a common practice during software evolution and maintenance. Developers constantly refactor source code for different purposes [43, 52]. For this reason, several studies concentrate on this research field [37, 6, 33, 31, 51, 11, 44, 28, 25, 47, 5, 4, 12, 42, 49, 1, 32, 9]. Among the empirical studies, some focus on sets of related refactorings. Specifically, these studies analyze batch refactorings [37, 6, 14, 48, 13, 8]. Murphy-Hill et al. [37] analyzed four datasets from different sources, all of them including metadata about the usage of the Eclipse IDE. For instance, the dataset named Everyone contains Eclipse refactoring commands used by developers. Based on these datasets, the authors discuss the usage and configuration of refactoring tools, the frequency of refactoring operations, and commit messages. They also investigate sets of refactoring operations executed within 60 seconds of one another, which they name batches. The authors state that some refactoring types are more common in batches, such as rename, introduce parameter, and encapsulate field. Besides that, about 47% of refactorings performed using a refactoring tool happen in batches. However, since batches span only a short period, the study does not investigate refactoring operations that occur at different moments over time.
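One possible reading of the batch definition above, sketched in Python: sort the refactoring timestamps and chain each event to the previous one when they are at most 60 seconds apart. Treating a batch as a group of at least two events, and chaining consecutive gaps rather than requiring every pair of events to be within 60 seconds, are assumptions of this sketch, not details taken from [37].

```python
from datetime import datetime, timedelta


def group_into_batches(timestamps, window=timedelta(seconds=60)):
    """Chains refactoring events whose consecutive timestamps are at most `window` apart."""
    groups = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][-1] <= window:
            groups[-1].append(ts)
        else:
            groups.append([ts])
    return groups


def batched_fraction(timestamps, window=timedelta(seconds=60)):
    """Fraction of events that belong to a batch, i.e., a group with two or more events."""
    groups = group_into_batches(timestamps, window)
    in_batches = sum(len(g) for g in groups if len(g) > 1)
    return in_batches / len(timestamps)


t0 = datetime(2009, 1, 1, 10, 0, 0)
events = [t0, t0 + timedelta(seconds=30), t0 + timedelta(seconds=80), t0 + timedelta(minutes=10)]
batches = group_into_batches(events)
print([len(b) for b in batches], batched_fraction(events))
```

Even under the chaining reading, a batch cannot stretch across long idle periods, which is why batches of this kind do not capture refactorings spread over days or months.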

In another context, Bibiano et al. [6] point out that sets of related refactorings can resolve problems caused by code smells. The authors studied 54 GitHub projects and three closed-source systems. First, they used the RMiner tool [51] to detect 13 well-known refactoring types, resulting in 24,893 operations. Then, the authors applied a heuristic to compute batch refactorings, i.e., sets of related refactorings [8]. The heuristic has two main requirements to retrieve a batch refactoring: (i) there are more than two refactoring operations in a single entity and (ii) the operations are from a single developer. This resulted in 4,607 batch refactorings. Next, the authors used another tool and scripts to identify more than 41K code smell occurrences in these systems. Finally, they computed the effect of batch refactorings on the removal of code smells. The main results show that most batches have only one commit (93%) and two refactoring types. The authors also state that batches have a negative or neutral effect on code smells (81%). However, they focus on code smells and on operations performed by a single developer. In our study, the subgraphs involve refactorings over time (i.e., more than one commit), including subgraphs by multiple developers and different code elements.
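The batch heuristic summarized above can be illustrated with a simplified sketch that groups operations by entity and developer and keeps groups with more than two operations. The tuple layout and sample data are hypothetical, and this is not the implementation from [6] or [8]. The usage lines report how many commits each batch spans, which is the dimension where batches differ from refactoring subgraphs.

```python
from collections import defaultdict


def batch_refactorings(operations, min_operations=3):
    """Groups operations by (entity, developer) and keeps groups with at least `min_operations`.

    operations: iterable of (entity, developer, refactoring_type, commit) tuples.
    The default of 3 encodes "more than two refactoring operations in a single entity".
    """
    groups = defaultdict(list)
    for entity, developer, kind, commit in operations:
        groups[(entity, developer)].append((kind, commit))
    return {key: ops for key, ops in groups.items() if len(ops) >= min_operations}


operations = [
    ("Foo.java", "ana", "EXTRACT_METHOD", "c1"),
    ("Foo.java", "ana", "RENAME_METHOD", "c1"),
    ("Foo.java", "ana", "MOVE_METHOD", "c1"),
    ("Foo.java", "ben", "RENAME_METHOD", "c2"),
    ("Bar.java", "ana", "EXTRACT_METHOD", "c3"),
    ("Bar.java", "ana", "INLINE_METHOD", "c4"),
]
for (entity, developer), ops in batch_refactorings(operations).items():
    commits = {commit for _, commit in ops}
    kinds = {kind for kind, _ in ops}
    print(entity, developer, f"{len(ops)} ops, {len(commits)} commit(s), {len(kinds)} type(s)")
```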

Other studies also discuss the impact of batches on eliminating code smells, proposing approaches to reuse or suggest sets of related refactoring operations [48, 13, 24]. However, they do not focus on sequences of refactoring operations over time. Fowler [15] mentions a similar concept called big refactorings. He points out that some refactorings are atomic, i.e., they are finished in a few minutes. By contrast, big refactorings can take months or years. We reinforce this observation: the age of the refactoring subgraphs is diverse, ranging from days to weeks or even months.

Hora et al. [22] analyze untracked changes during software development. The authors show that refactorings invalidate several tracking strategies used to evaluate system evolution. As in our study, they represent evolutionary changes as graphs. In this case, each node refers to a class or a method, and the edges indicate tracked changes (i.e., entities that keep their names after a modification) and untracked changes (i.e., entities that change their names after a refactoring). In other words, a graph represents traceable changes as well as changes that split an entity’s history. The results show that up to 21% of the changes at the method level and up to 15% at the class level are untraceable. By contrast, our goal is to investigate refactorings performed over long time windows; we do not focus on tracking source code modifications.
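A minimal sketch of the tracked/untracked distinction described above, assuming each change is recorded as a pair of entity names before and after the modification (a hypothetical format, not the one used in [22]). A change counts as tracked when the name is unchanged, so a name-based tracker can follow it; the share of untracked changes is the kind of proportion reported in that study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A modification of a class or method between two versions (hypothetical record)."""
    old_name: str
    new_name: str

    @property
    def tracked(self) -> bool:
        # A name-based tracker can follow the entity only if its name is unchanged.
        return self.old_name == self.new_name


def untracked_share(changes):
    """Fraction of changes whose entity was renamed or moved, splitting its history."""
    if not changes:
        return 0.0
    return sum(not c.tracked for c in changes) / len(changes)


history = [
    Change("Parser.parse()", "Parser.parse()"),
    Change("Parser.read()", "Parser.readAll()"),  # rename: untracked
    Change("Lexer.next()", "Lexer.next()"),
    Change("Lexer.peek()", "Lexer.peek()"),
]
print(f"untracked: {untracked_share(history):.0%}")
```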

Meananeatra [35] also represents changes during software evolution as graphs. However, the study concentrates on refactoring sequences to remove long methods. The author proposes an approach based on two main criteria to detect an optimal set of refactorings. An optimal refactoring sequence is assessed with four metrics: number of removed bad smells, size of the refactoring sequence, number of affected code elements, and maintainability (i.e., analyzability, changeability, stability, and testability). The technique represents candidate refactoring sequences as graphs. A graph contains a root node representing the original method version with smells, and each new node denotes a new method version after a refactoring operation. As in our study, the edges refer to refactorings. By contrast, the nodes represent versions of the same method before and after the changes. Each path in the graph is a candidate refactoring sequence, which may or may not meet the selection criteria. Thus, the study does not focus on real refactorings over time; instead, the graph model represents the steps to decompose a long method.
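In this graph model, the candidate sequences correspond to root-to-leaf paths. The sketch below enumerates them with a depth-first traversal, assuming the graph is acyclic and stored as a dictionary from a method version to its (refactoring, next version) pairs; the version labels are hypothetical. Picking the shortest sequence is only a placeholder, since the original approach selects sequences using the four metrics listed above.

```python
def candidate_sequences(graph, root):
    """Enumerates root-to-leaf paths; each path is one candidate refactoring sequence.

    graph maps a method version to a list of (refactoring, next_version) pairs
    and is assumed to be acyclic.
    """
    sequences = []

    def walk(version, applied):
        successors = graph.get(version, [])
        if not successors:
            sequences.append(applied)
            return
        for refactoring, next_version in successors:
            walk(next_version, applied + [refactoring])

    walk(root, [])
    return sequences


# v0 is the original long method; v1..v3 are hypothetical refactored versions.
graph = {
    "v0": [("Extract Method", "v1"), ("Decompose Conditional", "v2")],
    "v1": [("Replace Temp with Query", "v3")],
}
sequences = candidate_sequences(graph, "v0")
# Placeholder selection criterion: the shortest sequence.
best = min(sequences, key=len)
print(sequences, best)
```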

IX Conclusion

In this paper, we proposed refactoring graphs, a novel approach to assess refactoring operations over time. We analyzed 10 popular Java systems, from which 1,150 refactoring subgraphs were extracted. We then investigated five research questions to evaluate the following properties of refactoring graphs: size, commits, age, composition, and developers. We summarize our findings as follows:

  • The majority of the refactoring subgraphs are small (four nodes and three edges). However, there are also outliers with dozens of nodes and edges.

  • Most refactoring subgraphs have up to three commits (95%).

  • Refactoring subgraphs span from a few days to months.

  • Refactoring graphs are often heterogeneous, that is, they are composed of several types of refactoring.

  • Refactoring graphs are mostly created by a single developer (60%).
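The five properties summarized above can be computed for a single subgraph roughly as follows. The `Edge` record is hypothetical, age is assumed to be the number of days between the earliest and latest refactoring, and developers are counted by distinct author email; the study's exact definitions may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Edge:
    """One refactoring in a subgraph (hypothetical record format)."""
    source: str
    target: str
    kind: str
    commit: str
    day: date
    author_email: str


def characterize(edges):
    """Size, commits, age, composition, and developers of one refactoring subgraph."""
    nodes = {e.source for e in edges} | {e.target for e in edges}
    days = [e.day for e in edges]
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "commits": len({e.commit for e in edges}),
        # Age assumed here as the days between the earliest and latest refactoring.
        "age_days": (max(days) - min(days)).days,
        "heterogeneous": len({e.kind for e in edges}) > 1,
        "developers": len({e.author_email for e in edges}),
    }


subgraph = [
    Edge("A.f()", "A.g()", "RENAME", "c1", date(2020, 1, 1), "dev@example.com"),
    Edge("A.g()", "B.g()", "MOVE", "c2", date(2020, 1, 15), "dev@example.com"),
    Edge("B.g()", "B.h()", "RENAME", "c2", date(2020, 1, 15), "dev@example.com"),
]
print(characterize(subgraph))
```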

Based on our findings, we provided further discussion of the implications of our study. Particularly, (i) we discuss our contributions regarding refactoring tools as a novel approach to explore refactoring operations from a broader perspective; (ii) we argue that refactoring graphs can be integrated into code review tools to better support code comprehension; (iii) we claim that refactoring graphs can play a role in detecting refactoring patterns and anomalies that can only be spotted over time; and (iv) we state the importance of refactoring graphs to resolve method names and support software evolution studies.

Further studies can consider other popular programming languages and ecosystems; refactoring graphs at class and package level, as well as other method-level refactoring types; and novel approaches to complement existing tools and techniques that focus on atomic refactorings.


This research is supported by grants from FAPEMIG, CNPq, and CAPES.


  • [1] E. L. G. Alves, M. Song, and M. Kim (2014) RefDistiller: a refactoring aware code review tool for inspecting manual refactoring edits. In 22nd International Symposium on Foundations of Software Engineering (FSE), pp. 751–754.
  • [2] G. Avelino, L. Passos, A. Hora, and M. T. Valente (2016) A novel approach for estimating truck factors. In 24th International Conference on Program Comprehension (ICPC), pp. 1–10.
  • [3] A. Bacchelli and C. Bird (2013) Expectations, outcomes, and challenges of modern code review. In 35th International Conference on Software Engineering (ICSE), pp. 712–721.
  • [4] G. Bavota, B. De Carluccio, A. De Lucia, M. Di Penta, R. Oliveto, and O. Strollo (2012) When does a refactoring induce bugs? an empirical study. In 12th International Working Conference on Source Code Analysis and Manipulation (SCAM), pp. 104–113.
  • [5] G. Bavota, A. De Lucia, M. Di Penta, R. Oliveto, and F. Palomba (2015) An experimental investigation on the innate relationship between quality and refactoring. Journal of Systems and Software 107 (C), pp. 1–14.
  • [6] A. C. Bibiano, E. Fernandes, D. Oliveira, A. Garcia, M. Kalinowski, B. Fonseca, R. Oliveira, A. Oliveira, and D. Cedrim (2019) A quantitative study on characteristics and effect of batch refactoring on code smells. In 13th International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 1–11.
  • [7] H. Borges, A. Hora, and M. T. Valente (2016) Understanding the factors that impact the popularity of GitHub repositories. In 32nd International Conference on Software Maintenance and Evolution (ICSME), pp. 334–344.
  • [8] D. Cedrim (2018) Understanding and improving batch refactoring in software systems. Ph.D. Thesis, PUC-Rio.
  • [9] O. Chaparro, G. Bavota, A. Marcus, and M. Di Penta (2014) On the impact of refactoring operations on code quality metrics. In 30th International Conference on Software Maintenance and Evolution (ICSME), pp. 456–460.
  • [10] T. Chen, M. Nagappan, E. Shihab, and A. E. Hassan (2014) An empirical study of dormant bugs. In 11th Working Conference on Mining Software Repositories (MSR).
  • [11] D. Dig, C. Comertoglu, D. Marinov, and R. Johnson (2006) Automated detection of refactorings in evolving components. In 20th European Conference on Object-Oriented Programming (ECOOP), pp. 404–428.
  • [12] D. Dig and R. Johnson (2005) How do APIs evolve? a story of refactoring. In 22nd International Conference on Software Maintenance (ICSM), pp. 83–107.
  • [13] E. Fernandes, A. Uchôa, A. C. Bibiano, and A. Garcia (2019) On the alternatives for composing batch refactoring. In 3rd International Workshop on Refactoring (IWOR), pp. 9–12.
  • [14] E. Fernandes (2019) Stuck in the middle: removing obstacles to new program features through batch refactoring. In 41st International Conference on Software Engineering: Companion Proceedings (ICSE), pp. 206–209.
  • [15] M. Fowler (1999) Refactoring: improving the design of existing code. Addison-Wesley.
  • [16] X. Ge, S. Sarkar, and E. Murphy-Hill (2014) Towards refactoring-aware code review. In 7th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), pp. 99–102.
  • [17] X. Ge, S. Sarkar, J. Witschey, and E. Murphy-Hill (2017) Refactoring-aware code review. In Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 71–79.
  • [18] V. U. Gómez, S. Ducasse, and T. D’Hondt (2010) Visually supporting source code changes integration: the Torch dashboard. In 17th Working Conference on Reverse Engineering (WCRE).
  • [19] V. U. Gómez, S. Ducasse, and T. D’Hondt (2015) Visually characterizing source code changes. Science of Computer Programming 98 (P3), pp. 376–393.
  • [20] L. Hattori and M. Lanza (2009) Mining the history of synchronous changes to refine code ownership. In 6th International Working Conference on Mining Software Repositories (MSR).
  • [21] S. Hayashi, S. Thangthumachit, and M. Saeki (2013) Rediffs: refactoring-aware difference viewer for Java. In 20th Working Conference on Reverse Engineering (WCRE), pp. 487–488.
  • [22] A. Hora, D. Silva, R. Robbes, and M. T. Valente (2018) Assessing the threat of untracked changes in software evolution. In 40th International Conference on Software Engineering (ICSE), pp. 1102–1113.
  • [23] A. Inokuchi, T. Washio, and H. Motoda (2000) An apriori-based algorithm for mining frequent substructures from graph data. In 4th Principles and Practice of Knowledge Discovery in Databases (PKDD), pp. 13–23.
  • [24] H. C. Jiau, L. W. Mar, and J. C. Chen (2013) OBEY: optimal batched refactoring plan execution for class responsibility redistribution. Transactions on Software Engineering 39 (9), pp. 1245–1263.
  • [25] J. Kim, D. Batory, D. Dig, and M. Azanza (2016) Improving refactoring speed by 10x. In 38th International Conference on Software Engineering (ICSE), pp. 1145–1156.
  • [26] M. Kim, M. Gee, A. Loh, and N. Rachatasumrit (2010) Ref-finder: a refactoring reconstruction tool based on logic query templates. In 18th International Symposium on Foundations of Software Engineering (FSE), pp. 371–372.
  • [27] M. Kim, T. Zimmermann, and N. Nagappan (2012) A field study of refactoring challenges and benefits. In 20th International Symposium on the Foundations of Software Engineering (FSE), pp. 50:1–50:11.
  • [28] M. Kim, T. Zimmermann, and N. Nagappan (2014) An empirical study of refactoring challenges and benefits at Microsoft. Transactions on Software Engineering 40 (7), pp. 633–649.
  • [29] S. Kim, T. Zimmermann, K. Pan, and E. J. Whitehead, Jr. (2006) Automatic identification of bug-introducing changes. In 21st International Conference on Automated Software Engineering (ASE).
  • [30] M. Kuramochi and G. Karypis (2001) Frequent subgraph discovery. In 1st International Conference on Data Mining (ICDM), pp. 313–320.
  • [31] B. Lin, C. Nagy, G. Bavota, and M. Lanza (2019) On the impact of refactoring operations on code naturalness. In 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 594–598.
  • [32] Y. Lin, X. Peng, Y. Cai, D. Dig, D. Zheng, and W. Zhao (2016) Interactive and guided architectural refactoring with search-based recommendation. In 24th International Symposium on Foundations of Software Engineering (FSE), pp. 535–546.
  • [33] M. Mahmoudi, S. Nadi, and N. Tsantalis (2019) Are refactorings to blame? an empirical study of refactorings in merge conflicts. In 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 151–162.
  • [34] D. Mazinanian, A. Ketkar, N. Tsantalis, and D. Dig (2017) Understanding the use of lambda expressions in Java. Programming Languages 1 (85), pp. 85:1–85:31.
  • [35] P. Meananeatra (2012) Identifying refactoring sequences for improving software maintainability. In 27th International Conference on Automated Software Engineering (ASE), pp. 406–409.
  • [36] A. Meneely and O. Williams (2012) Interactive churn metrics: socio-technical variants of code churn. Software Engineering Notes 37 (6).
  • [37] E. Murphy-Hill, C. Parnin, and A. P. Black (2009) How we refactor, and how we know it. In 31st International Conference on Software Engineering (ICSE), pp. 287–297.
  • [38] S. Negara, N. Chen, M. Vakilian, R. E. Johnson, and D. Dig (2013) A comparative study of manual and automated refactorings. In 27th European Conference on Object-Oriented Programming (ECOOP), pp. 552–576.
  • [39] F. Rahman and P. Devanbu (2011) Ownership, experience and defects: a fine-grained study of authorship. In 33rd International Conference on Software Engineering (ICSE).
  • [40] F. Rahman, D. Posnett, A. Hindle, E. Barr, and P. Devanbu (2011) BugCache for inspections: hit or miss? In 19th International Symposium on the Foundations of Software Engineering (FSE).
  • [41] B. Ray, V. Hellendoorn, S. Godhane, Z. Tu, A. Bacchelli, and P. Devanbu (2016) On the naturalness of buggy code. In 38th International Conference on Software Engineering (ICSE).
  • [42] B. Shen, W. Zhang, H. Zhao, G. Liang, Z. Jin, and Q. Wang (2019) IntelliMerge: a refactoring-aware software merging technique. Programming Languages 3 (170), pp. 170:1–170:28.
  • [43] D. Silva, N. Tsantalis, and M. T. Valente (2016) Why we refactor? Confessions of GitHub contributors. In 24th International Symposium on the Foundations of Software Engineering (FSE), pp. 858–870.
  • [44] D. Silva and M. T. Valente (2017) RefDiff: detecting refactorings in version histories. In 14th International Conference on Mining Software Repositories (MSR), pp. 1–11.
  • [45] H. Silva and M. T. Valente (2018) What’s in a GitHub star? Understanding repository starring practices in a social coding platform. Journal of Systems and Software 146, pp. 112–129.
  • [46] D. Spinellis (2017) A repository of Unix history and evolution. Empirical Software Engineering 22 (3), pp. 1372–1404.
  • [47] G. Szőke, C. Nagy, R. Ferenc, and T. Gyimóthy (2016) Designing and developing automated refactoring transformations: an experience report. In 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), pp. 693–697.
  • [48] D. Tenorio, A. C. Bibiano, and A. Garcia (2019) On the customization of batch refactoring. In 3rd International Workshop on Refactoring (IWOR), pp. 13–16.
  • [49] R. Terra, M. T. Valente, S. Miranda, and V. Sales (2018) JMove: a novel heuristic and tool to detect move method refactoring opportunities. Journal of Systems and Software 138, pp. 19–36.
  • [50] N. Tsantalis, V. Guana, E. Stroulia, and A. Hindle (2013) A multidimensional empirical study on refactoring activity. In 23rd Conference of the Center for Advanced Studies on Collaborative Research (CASCON), pp. 132–146.
  • [51] N. Tsantalis, M. Mansouri, L. M. Eshkevari, D. Mazinanian, and D. Dig (2018) Accurate and efficient refactoring detection in commit history. In 40th International Conference on Software Engineering (ICSE), pp. 483–494.
  • [52] Y. Wang (2009) What motivate software engineers to refactor source code? evidences from professional developers. In International Conference on Software Maintenance (ICSM), pp. 413–416.
  • [53] X. Yan and J. Han (2002) gSpan: graph-based substructure pattern mining. In 2nd International Conference on Data Mining (ICDM), pp. 721–724.
  • [54] T. Zimmermann, S. Kim, A. Zeller, and E. J. Whitehead, Jr. (2006) Mining version archives for co-changed lines. In 3rd International Workshop on Mining Software Repositories (MSR).