Different code review techniques have been proposed in the past and widely adopted by open-source and commercial projects. Code reviews involve the manual inspection of the code by different developers and help companies to reduce the number of defects and improve the quality of software Ackerman et al. (1984), Ackerman et al. (1989).
Nowadays, code reviews are generally no longer conducted as they were in the past, when developers organized review meetings to inspect the code line by line Fagan (1976).
Industry and researchers agree that code inspection helps to reduce the number of defects, but that in some cases, the effort required to perform code inspections hinders their adoption in practice Shull and Seaman (2008). However, the emergence of new tools has enabled companies to adopt different code review practices. In particular, several companies, including Facebook Feitelson et al. (2013), Google Potvin and Levenberg (2016), and Microsoft Bacchelli and Bird (2013), perform code reviews by means of tools such as Gerrit (https://www.gerritcodereview.com) or by means of the pull request mechanism provided by GitHub (https://help.github.com/en/articles/about-pull-requests) Rigby et al. (2012).
In the context of this paper, we focus on pull requests. Pull requests provide developers a convenient way of contributing to projects, and many popular projects, including both open-source and commercial ones, are using pull requests as a way of reviewing the contributions of different developers.
Researchers have focused their attention on pull request mechanisms, investigating different aspects, including the review process Gousios et al. (2014), Gousios et al. (2015), v. d. Veen et al. (2015), the influence of code reviews on continuous integration builds Zampetti et al. (2017), how pull requests are assigned to different reviewers Yu et al. (2014), and under which conditions they are accepted Gousios et al. (2014), Rahman and Roy (2014), Soares et al. (2015), Kononenko et al. (2018). Only a few works have investigated whether developers consider quality aspects in order to accept pull requests Gousios et al. (2014), Gousios et al. (2015). Different works report that the reputation of the developer who submitted the pull request is one of the most important acceptance factors Gousios et al. (2015), Calefato et al. (2017).
However, to the best of our knowledge, no studies have investigated whether the quality of the code submitted in a pull request has an impact on the acceptance of this pull request. As code reviews are a fundamental aspect of pull requests, we strongly expect that pull requests containing low-quality code should generally not be accepted.
In order to understand whether code quality is one of the acceptance drivers of pull requests, we designed and conducted a case study involving 28 well-known Java projects to analyze the quality of more than 36K pull requests. We analyzed the quality of pull requests using PMD (https://pmd.github.io), one of the four tools used most frequently for software analysis Lenarduzzi et al. (2020), Beller et al. (2016). PMD evaluates the code quality against a standard rule set available for the major languages, allowing the detection of different quality aspects generally considered harmful, including code smells Beck (1999) such as “long methods”, “large class”, “duplicated code”; anti-patterns Brown et al. (1998b) such as “high coupling”; design issues such as “god class” Lanza et al. (2005); and various coding style violations (https://pmd.github.io/latest/pmd_rules_java.html). Whenever a rule is violated, PMD raises an issue that is counted as part of the Technical Debt Cunningham (1992). In the remainder of this paper, we will refer to all the issues raised by PMD as “TD items” (Technical Debt items).
Previous work confirmed that the presence of several code smells and anti-patterns, including those collected by PMD, significantly increases the risk of faults on the one hand and maintenance effort on the other hand Khomh et al. (2009a), Olbrich et al. (2009), D’Ambros et al. (2010), Fontana Arcelli and Spinelli (2011).
Unexpectedly, our results show that the presence of TD items of all types does not influence the acceptance or rejection of a pull request at all. We reached this result by analyzing the data not only with basic statistical techniques, but also with eight machine learning algorithms (Logistic Regression, Decision Tree, Bagging, Random Forest, Extremely Randomized Trees, AdaBoost, Gradient Boosting, XGBoost), covering 36,986 pull requests and over 4.6 million TD items present in the pull requests.
Structure of the paper. Section 2 describes the basic concepts underlying this work, while Section 3 presents some related work done by researchers in recent years. In Section 4, we describe the design of our case study, defining the research questions, metrics, and hypotheses, and describing the study context, including the data collection and data analysis protocol. In Section 5, we present the achieved results and discuss them in Section 6. Section 7 identifies the threats to the validity of our study, and in Section 8, we draw conclusions and give an outlook on possible future work.
In this Section, we will first introduce code quality aspects and PMD, the tool we used to analyze the code quality of the pull requests. Then we will describe the pull request mechanism and finally provide a brief introduction and motivation for the usage of the machine learning techniques we applied.
2.1 Code Quality and PMD
Different tools can be used to evaluate code quality. PMD is one of the most frequently used static code analysis tools for Java on the market, along with Checkstyle, FindBugs, and SonarQube Lenarduzzi et al. (2020).
PMD is an open-source tool that aims to identify issues that can lead to technical debt accumulating during development. The specified source files are analyzed and the code is checked with the help of predefined rule sets. PMD provides a standard rule set for major languages, which the user can customize if needed. The default Java rule set encompasses all available Java rules in the PMD project and is used throughout this study.
Issues found by PMD have five priority values (P). Rule priority guidelines for default and custom-made rules can be found in the PMD project documentation (https://pmd.github.io/latest/pmd_rules_java.html).
P1: Change absolutely required. Behavior is critically broken/buggy.
P2: Change highly recommended. Behavior is quite likely to be broken/buggy.
P3: Change recommended. Behavior is confusing, perhaps buggy, and/or against standards/best practices.
P4: Change optional. Behavior is not likely to be buggy, but more just flies in the face of standards/style/good taste.
P5: Change highly optional. Nice to have, such as a consistent naming policy for package/class/fields…
These priorities are used in this study to help determine whether more severe issues affect the rate of acceptance in pull requests.
Among these tools, PMD is the only one that does not require compiling the code to be analyzed. This is why, as the aim of our work was to analyze only the code of pull requests instead of the whole project code, we decided to adopt it. PMD defines more than 300 rules for Java, classified in eight categories (best practices, coding style, design, documentation, error prone, multithreading, performance, security). Several rules have also been confirmed as harmful by different empirical studies. In Table 1 we highlight a subset of rules and the related empirical studies that confirmed their harmfulness. The complete set of rules is available in the official PMD documentation (https://pmd.github.io/latest/pmd_rules_java.html).
|PMD Rule||Defined By||Impacted Characteristic|
|Avoid Using Hard-Coded IP||Brown et al. (1998a)||Maintainability Brown et al. (1998a)|
| ||Chidamber and Kemerer (1994)||Maintainability Dallal and Abdin (2018)|
|Base Class Should be Abstract||Brown et al. (1998a)||Maintainability Khomh et al. (2009a)|
|Coupling Between Objects||Chidamber and Kemerer (1994)||Maintainability Dallal and Abdin (2018)|
|Cyclomatic Complexity||McCabe (1976)||Maintainability Dallal and Abdin (2018)|
|Data Class||Fowler Beck (1999)||Maintainability Li and Shatnawi (2007); Faultiness Sjøberg et al. (2013), Yamashita (2014)|
|Excessive Class Length||Fowler (Large Class) Beck (1999)||Change Proneness Palomba et al. (2018), Khomh et al. (2009b)|
|Excessive Method Length||Fowler (Large Method) Beck (1999)||Change Proneness Jaafar et al. (2016), Khomh et al. (2009b); Fault Proneness Palomba et al. (2018)|
|Excessive Parameter List||Fowler (Long Parameter List) Beck (1999)||Change Proneness Jaafar et al. (2016)|
|God Class||Marinescu and Lanza Lanza et al. (2005)||Change Proneness Olbrich et al. (2010), Schumacher et al. (2010), Zazworka et al. (2011); Comprehensibility Du Bois et al. (2006); Faultiness Olbrich et al. (2010), Zazworka et al. (2011)|
|Law of Demeter||Fowler (Inappropriate Intimacy) Beck (1999)||Change Proneness Palomba et al. (2018)|
|Loose Package Coupling||Chidamber and Kemerer (1994)||Maintainability Dallal and Abdin (2018)|
|Comment Size||Fowler (Comments) Beck (1999)||Faultiness Aman et al. (2014), Aman (2012)|
2.2 Git and Pull Requests
Git (https://git-scm.com/) is a distributed version control system that enables users to collaborate on a coding project by offering a robust set of features to track changes to the code. Features include “committing” a change to a local repository, “pushing” that piece of code to a remote server for others to see and use, “pulling” other developers’ change sets onto the user’s workstation, and merging the changes into their own version of the code base. Changes can be organized into branches, which are used in conjunction with pull requests. Git provides the user a “diff” between two branches, which compares the branches and provides an easy method to analyze what kind of additions the pull request will bring to the project if accepted and merged into the master branch of the project.
Pull requests are a code reviewing mechanism that is compatible with Git and is provided by GitHub (https://github.com/). The goal is for code changes to be reviewed before they are inserted into the mainline branch. A developer can take these changes and push them to a remote repository on GitHub. Before merging or rebasing a new feature in, project maintainers in GitHub can review, accept, or reject a change based on the diff of the “master” code branch and the branch of the incoming change. Reviewers can comment and vote on the change in the GitHub web user interface. If the pull request is approved, it can be included in the master branch. A rejected pull request can be abandoned by closing it or the creator can further refine it based on the comments given and submit it again for review.
2.3 Machine Learning Techniques
In this section, we will describe the machine learning classifiers adopted in this work. We used eight different classifiers: a generalized linear model (Logistic Regression), a tree-based classifier (Decision Tree), and six ensemble classifiers (Bagging, Random Forest, ExtraTrees, AdaBoost, GradientBoost, and XGBoost).
In the next sub-sections, we will briefly introduce the eight adopted classifiers and give the rationale for choosing them for this study.
Logistic Regression Cox (1958) is one of the most frequently used algorithms in Machine Learning. In logistic regression, a collection of measurements (the counts of a particular issue) and their binary classification (pull request acceptance) can be turned into a function that outputs the probability of an input being classified as 1, or in our case, the probability of it being accepted.
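As a minimal sketch of this idea, the following Python snippet shows how a fitted logistic model could turn per-rule issue counts into an acceptance probability. The function names, weights, and bias are hypothetical values chosen for illustration; they are not coefficients obtained in this study.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def acceptance_probability(issue_counts, weights, bias):
    """Probability of acceptance given the issue counts of one pull request."""
    # Linear combination of the issue counts, squashed by the sigmoid.
    z = bias + sum(w * x for w, x in zip(weights, issue_counts))
    return sigmoid(z)


# Hypothetical model with two rules; the weights are illustrative, not fitted.
p = acceptance_probability([3, 0], weights=[-0.2, 0.5], bias=0.6)
print(round(p, 3))
```

A negative weight means that more occurrences of the corresponding issue lower the predicted probability of acceptance, while a weight close to zero means the issue has little influence on the prediction.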
Decision Tree Breiman et al. (1984) is a model that takes learning data and constructs a tree-like graph of decisions that can be used to classify new input. The learning data is split into subsets based on how the split from the chosen variable improves the accuracy of the tree at the time. The decisions connecting the subsets of data form a flowchart-like structure that the model can use to tell the user how it would classify the input and how certain the prediction is perceived to be.
We considered two methods for determining how to split the learning data: Gini impurity and information gain. Gini impurity is the probability of misclassifying a randomly chosen element of the subset if it were labeled randomly according to the distribution of classes within the subset. Information gain measures how much a candidate split reduces the entropy (uncertainty) of the class labels. Gini impurity was chosen because of its popularity and its resource efficiency.
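The two splitting criteria can be illustrated with a short Python sketch based on their standard textbook definitions. The helper names and the toy labels (1 for accepted, 0 for rejected) are assumptions made for this example only.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = [1, 1, 0, 0]  # toy labels: 1 = accepted, 0 = rejected
print(gini(parent))                              # impurity before splitting
print(information_gain(parent, [1, 1], [0, 0]))  # gain of a perfect split
```

Gini impurity avoids the logarithm needed for entropy, which is one reason it is often regarded as the cheaper of the two criteria.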
Decision Tree as a classifier was chosen because it is easy to implement and human-readable; also, decision trees can handle noisy data well because subsets without significance can be ignored by the algorithm that builds the tree. The classifier can be susceptible to overfitting, where the model becomes too specific to the data used to train it and provides poor results when used with new input data. Overfitting can become a problem when trying to apply the model to a more generalized dataset.
Random Forest Breiman (2001) is an ensemble classifier, which tries to reduce the risk of overfitting a decision tree by constructing a collection of decision trees from random subsets in the data. The resulting collection of decision trees is smaller in depth, has a reduced degree of correlation between the subset’s attributes, and thus has a lower risk of overfitting.
When given input data to label, the model utilizes all the generated trees, feeds the input data into all of them, and uses the average of the individual labels of the trees as the final label given to the input.
Extremely Randomized Trees Geurts et al. (2006) builds upon the Random Forest introduced above by taking the same principle of splitting the data into random subsets and building a collection of decision trees from these. In order to further randomize the decision trees, the attributes by which the splitting of the subsets is done are also randomized, resulting in a more computationally efficient model than Random Forest while still alleviating the negative effects of overfitting.
Bagging Breiman (1996) is an ensemble classification technique that tries to reduce the effects of overfitting a model by creating multiple smaller training sets from the initial set; in our study, it creates multiple decision trees from these sets. The sets are created by sampling the initial set uniformly and with replacement, which means that individual data points can appear in multiple training sets. The resulting trees can be used in labeling new input through a voting process by the trees.
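The sampling and voting steps described above can be sketched as follows. This is an illustration of the general bagging principle, assuming bootstrap samples of the same size as the initial set; the helper names are hypothetical, and the tree training step itself is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(votes):
    """Return the label predicted by most of the individual trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
data = list(range(10))   # stand-in for the rows of the training set
samples = [bootstrap_sample(data, rng) for _ in range(3)]

# One tree would be trained per sample; here we only combine three
# hypothetical tree outputs for a single new pull request.
label = majority_vote([1, 1, 0])
```

Because sampling is done with replacement, a given row can occur several times in one sample and not at all in another, which is what makes the resulting trees differ from each other.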
AdaBoost Freund and Schapire (1997) is a classifier based on the concept of boosting. The implementation of the algorithm in this study uses a collection of decision trees, but new trees are created with the intent of correctly labeling instances of data that were misclassified by previous trees. For each round of training, a weight is assigned to each sample in the data. After the round, all misclassified samples are given higher priority in the subsequent rounds. When the number of trees reaches a predetermined limit or the accuracy cannot be improved further, the model is finished. When predicting the label of a new sample with the finished model, the final label is calculated from the weighted decisions of all the constructed trees. As AdaBoost is based on decision trees, it can be resistant to overfitting and be more useful with generalized data. However, AdaBoost is susceptible to noisy data and outliers.
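The re-weighting step can be illustrated with one round of the classical two-class AdaBoost update. This is a sketch of the textbook formulation, which may differ in detail from the library implementation used in the study; the function name and the toy weights are assumptions.

```python
import math


def adaboost_round(weights, correct):
    """One boosting round: return (alpha, normalized new sample weights).

    `correct[i]` tells whether the current tree classified sample i correctly.
    Assumes the weighted error is strictly between 0 and 1.
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    alpha = 0.5 * math.log((1.0 - err) / err)  # vote weight of this tree
    # Misclassified samples are scaled up, correct ones are scaled down.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Four samples with equal weights; only the last one is misclassified.
alpha, new_weights = adaboost_round([0.25] * 4, [True, True, True, False])
```

After this round, the single misclassified sample carries half of the total weight, so the next tree is pushed to label it correctly.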
Gradient Boost Friedman (2001) is similar to the other boosting methods. It uses a collection of weaker classifiers, which are created sequentially according to an algorithm. In the case of Gradient Boost as used in this study, the determining factor in building the new decision trees is the use of a loss function. The algorithm tries to minimize the loss function and, similarly to AdaBoost, stops when the model has been fully optimized or the number of trees reaches the predetermined limit.
XGBoost Chen and Guestrin (2016) is a scalable implementation of Gradient Boost. The use of XGBoost can provide performance improvements in constructing a model, which might be an important factor when analyzing a large set of data.
3 Related Work
In this Section, we report on the most relevant works on pull requests.
3.1 Pull Request Process
Pull requests have been studied from different points of view, such as pull-based development Gousios et al. (2014), Gousios et al. (2015), v. d. Veen et al. (2015), usage of real online resources Zampetti et al. (2017), pull request reviewer assignment Yu et al. (2014), and the acceptance process Gousios et al. (2014), Rahman and Roy (2014), Soares et al. (2015), Kononenko et al. (2018). Another issue regarding pull requests that has been investigated is latency. Yu et al. (2015) define latency as a complex issue related to many independent variables such as the number of comments and the size of a pull request.
Zampetti et al. (2017) investigated how, why, and when developers refer to online resources in their pull requests. They focused on the context and real usage of online resources and how these resources have evolved over time. Moreover, they investigated the browsing purpose of online resources in pull request systems. Instead of investigating commit messages, they evaluated only the pull request descriptions, since generally the documentation of a change aims at reviewing and possibly accepting the pull request Gousios et al. (2014).
Yu et al. (2014) worked on pull request reviewer assignment, aiming to automate the organization of reviews in GitHub and thereby reduce wasted effort. They proposed a reviewer recommender that predicts highly relevant reviewers of incoming pull requests based on the textual semantics of each pull request and the social relations of the developers. They found several factors that influence pull request latency, such as size, project age, and team size.
This approach reached a precision rate of 74% for top-1 recommendations, and a recall rate of 71% for top-10 recommendations. However, the authors did not consider the aspect of code quality. The results are confirmed also by Soares et al. (2015).
Recent studies investigated the factors that influence the acceptance and rejection of a pull request.
There is no difference in the treatment of pull requests coming from the core team and from the community. Merging decisions are generally postponed based on technical factors Hellendoorn et al. (2015), Rigby and Storey (2011). Pull requests that pass the build phase are generally merged more frequently Zampetti et al. (2019).
Integrators decide to accept a contribution after analyzing source code quality, code style, documentation, granularity, and adherence to project conventions Gousios et al. (2014). The programming language of a pull request has a significant influence on acceptance Rahman and Roy (2014). Higher acceptance was mostly found for the Scala, C, C#, and R programming languages. Developer-related factors also play a role in the acceptance process, such as the number and experience level of developers Rahman et al. (2016) and the reputation of the developer who submitted the pull request Calefato et al. (2017). Moreover, the social connection between the pull request submitter and the project manager matters for acceptance when a core team member is evaluating the pull request Tsay et al. (2014).
Rejection of pull requests can increase when technical problems are not properly solved and when the number of forks increases Rahman et al. (2016). Other important rejection factors are inexperience with pull requests, the complexity of contributions, the locality of the artifacts modified, and the project’s contribution policy Soares et al. (2015). From the integrator’s perspective, there are social challenges that need to be addressed, for example, how to motivate contributors to keep working on the project and how to explain the reasons for rejection without discouraging them. From the contributor’s perspective, it is important to reduce response time, maintain awareness, and improve communication Gousios et al. (2014).
3.2 Software Quality of Pull Requests
Gousios et al. (2014) investigated the pull-based development process focusing on the factors that affect the efficiency of the process and contribute to the acceptance of a pull request, and the related acceptance time. They analyzed the GHTorrent corpus and another 291 projects. The results showed that the number of pull requests increases over time. However, the proportion of repositories using them is relatively stable. They also identified common driving factors that affect the lifetime of pull requests and the merging process. Based on their study, code reviews did not seem to increase the probability of acceptance, since 84% of the reviewed pull requests were merged.
Gousios et al. (2015) also conducted a survey aimed at characterizing the key factors considered in the decision-making process of pull request acceptance. Quality was revealed as one of the top priorities for developers. The most important acceptance factors they identified are: targeted area importance, test cases, and code quality. However, the respondents specified quality differently from their respective perception, as conformance, good available documentation, and contributor reputation.
Kononenko et al. (2018) investigated the pull request acceptance process in a commercial project addressing the quality of pull request reviews from the point of view of developers’ perception. They applied data mining techniques on the project’s GitHub repository in order to understand the merge nature and then conducted a manual inspection of the pull requests. They also investigated the factors that influence the merge time and outcome of pull requests such as pull request size and the number of people involved in the discussion of each pull request. Developers’ experience and affiliation were two significant factors in both models. Moreover, they report that developers generally associate the quality of a pull request with the quality of its description, its complexity, and its revertability. However, they did not evaluate the reason for a pull request being rejected.
These studies investigated the software quality of pull requests focusing on the trustworthiness of developers’ experience and affiliation Kononenko et al. (2018). Moreover, these studies did not measure the quality of pull requests against a set of rules, but based on their acceptance rate and developers’ perception. Our work complements these works by analyzing the code quality of pull requests in popular open-source projects and how the quality, specifically issues in the source code, affects the chance of a pull request being accepted when it is reviewed by a project maintainer. We measured code quality against a set of rules provided by PMD, one of the most frequently used open-source software tools for analyzing source code.
4 Case Study Design
We designed our empirical study as a case study based on the guidelines defined by Runeson and Höst (2009). In this Section, we describe the case study design, including the goal and the research questions, the study context, the data collection, and the data analysis procedure.
4.1 Goal and Research Questions
The goal of this work is to investigate the role of code quality in pull request acceptance.
Accordingly, to meet our expectations, we formulated the goal as follows, using the Goal/Question/Metric (GQM) template Basili et al. (1994):
|Object||the acceptance of pull requests|
|Quality||with respect to their code quality|
|Viewpoint||from the point of view of developers|
|Context||in the context of Java projects|
Based on the defined goal, we derived the following Research Questions (RQs):
|RQ1||What is the distribution of TD items violated by the pull requests in the analyzed software systems?|
|RQ2||Does code quality affect pull request acceptance?|
|RQ3||Does code quality affect pull request acceptance considering different types and levels of severity of TD items?|
RQ1 aims at assessing the distribution of TD items violated by pull requests in the analyzed software systems. We also took into account the distribution of TD items with respect to their priority level as assigned by PMD (P1-P5). These results will also help us to better understand the context of our study.
RQ2 aims at finding out whether the project maintainers in open-source Java projects consider quality issues in the pull request source code when they are reviewing it. If code quality issues affect the acceptance of pull requests, the question is what kinds of TD items generally lead to the rejection of a pull request.
RQ3 aims at finding out if a severe code quality issue is more likely to result in the project maintainer rejecting the pull request. This will allow us to see whether project maintainers should pay more attention to specific issues in the code and make code reviews more efficient.
The projects for this study were selected using ”criterion sampling” Patton (2002). The criteria for selecting projects were as follows:
Uses Java as its primary programming language
Older than two years
Had active development in last year
Code is hosted on GitHub
Uses pull requests as a means of contributing to the code base
Has more than 100 closed pull requests
Moreover, we tried to maximize diversity and representativeness considering a comparable number of projects with respect to project age, size, and domain, as recommended by Nagappan et al. (2013).
We selected 28 projects according to these criteria. The majority, 22 projects, were selected from the Apache Software Foundation repository (http://apache.org). The repository proved to be an excellent source of projects that meet the criteria described above. This repository includes some of the most widely used software solutions, considered industrial and mature, due to the strict review and inclusion process required by the ASF. Moreover, the included projects have to keep on reviewing their code and follow a strict quality process (https://incubator.apache.org/policy/process.html).
The remaining six projects were selected with the help of the Trending Java repositories list that GitHub provides (https://github.com/trending/java). GitHub provides a valuable source of data for the study of code reviews Kalliamvakou et al. (2016). In the selection, we manually selected popular Java projects using the criteria mentioned before.
In Table 2, we report the list of the 28 projects that were analyzed along with the number of pull requests (”#PR”), the time frame of the analysis, and the size of each project (”#LOC”).
|Project Owner/Name||#PR||Time Frame||#LOC|
4.3 Data Collection
We first extracted all pull requests from each of the selected projects using the GitHub REST API v3 (https://developer.github.com/v3/).
For each pull request, we fetched the code from the pull request’s branch and analyzed the code using PMD. The default Java rule set for PMD was used for the static analysis. We filtered the detected TD items to include only those introduced by the pull request, excluding items already present in the main branch. The filtering was done with the aid of a diff file provided by the GitHub API, which compares the pull request branch against the master branch.
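The filtering step can be sketched as follows. This is an illustrative reconstruction rather than the scripts used in the study: it assumes a git-style unified diff (with `diff --git` headers) and a hypothetical issue representation with `file`, `line`, and `rule` fields, and it keeps only the issues located on lines added by the pull request.

```python
import re

# Hunk header of a unified diff, e.g. "@@ -1,3 +1,4 @@"; group 1 is the
# first line number of the hunk in the new version of the file.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(unified_diff):
    """Return {file path: set of line numbers added in the new version}."""
    added = {}
    path, new_line = None, None
    for line in unified_diff.splitlines():
        match = HUNK_HEADER.match(line)
        if line.startswith("diff --git"):
            path, new_line = None, None      # a new file section starts
        elif match:
            new_line = int(match.group(1))   # a new hunk starts
        elif new_line is None:
            if line.startswith("+++ "):      # header naming the new file
                target = line[4:]
                path = target[2:] if target.startswith("b/") else target
                added.setdefault(path, set())
        elif line.startswith("+"):
            added[path].add(new_line)        # line introduced by the PR
            new_line += 1
        elif line.startswith(" "):
            new_line += 1                    # unchanged context line
        # Lines starting with "-" or "\" do not exist in the new file.
    return added


def introduced_issues(issues, unified_diff):
    """Keep only the issues reported on lines added by the pull request."""
    added = added_lines(unified_diff)
    return [i for i in issues if i["line"] in added.get(i["file"], set())]


sample_diff = "\n".join([
    "diff --git a/Foo.java b/Foo.java",
    "--- a/Foo.java",
    "+++ b/Foo.java",
    "@@ -1,3 +1,4 @@",
    " class Foo {",
    "+    int unused;",
    "     void run() {}",
    " }",
])
issues = [  # hypothetical issue records, for illustration only
    {"file": "Foo.java", "line": 2, "rule": "UnusedPrivateField"},
    {"file": "Foo.java", "line": 3, "rule": "UncommentedEmptyMethodBody"},
]
kept = introduced_issues(issues, sample_diff)
```

Matching on added lines is a deliberate simplification: an issue reported on an untouched line is attributed to the main branch even if the change that triggers it is located elsewhere in the pull request.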
We identified whether a pull request was accepted or not by checking whether the pull request had been marked as merged into the master branch or whether the pull request had been closed by an event that committed the changes to the master branch. Other ways of handling pull requests within a project were not considered.
4.4 Data Analysis
The result of the data collection process was a csv file reporting the dependent variable (pull request accepted or not) and the independent variables (number of TD items introduced in each pull request). Table 3 provides an example of the data structure we adopted in the remainder of this work.
|Dependent Variable||Independent Variables|
|Project ID||PR ID||Accepted PR||Rule1||…||Rule n|
To answer RQ1, we first calculated the total number of pull requests and the number of TD items present in each project. Moreover, we calculated the number of accepted and rejected pull requests. For each TD item, we calculated the number of occurrences, the number of pull requests, and the number of projects where it was found. Moreover, we calculated descriptive statistics (average, maximum, minimum, and standard deviation) for each TD item.
In order to understand if TD items affect pull request acceptance (RQ2), we first determined whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories. First, we computed the χ² test. Then, to overcome the limitations of the individual techniques, we selected eight Machine Learning techniques and compared their accuracy. The description of the different techniques and the rationale for selecting each of them are reported in Section 2.
The χ² test could be enough to answer our RQs. However, in order to support possible follow-ups of this work that consider other factors, such as LOC, as independent variables, Machine Learning techniques can provide much more accurate results.
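The comparison of observed and expected frequencies can be illustrated with Pearson's chi-squared statistic on a contingency table. The sketch below assumes a 2x2 table (TD item present or absent versus pull request accepted or rejected) with made-up counts, and it applies no continuity correction.

```python
def chi_square(observed):
    """Pearson's chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Frequency expected if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (obs - expected) ** 2 / expected
    return statistic


# Hypothetical counts:  accepted  rejected
table = [[10, 20],      # TD item present
         [20, 10]]      # TD item absent
stat = chi_square(table)
print(round(stat, 3))
```

For a 2x2 table there is one degree of freedom, so a statistic above 3.841 indicates a significant association at the 0.05 level.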
We examined whether considering the priority value of an issue affects the accuracy metrics of the prediction models (RQ3). We used the same techniques as before but grouped all the TD items in each project into groups according to their priorities. The analysis was run separately for each project and each priority level (28 projects * 5 priority level groups) and the results were compared to the ones we obtained for RQ2. To further analyze the effect of issue priority, we combined the TD items of each priority level into one data set and created models based on all available items with one priority.
Once a model was trained, we checked how accurate its predictions about pull request acceptance were (Accuracy Comparison). To determine the accuracy of a model, 5-fold cross-validation was used. The data set was randomly split into five parts. A model was trained five times, each time using four parts for training and the remaining part for testing the model. We calculated accuracy measures (Precision, Recall, Matthews Correlation Coefficient, and F-Measure) for each model (see Table 4) and then combined the accuracy metrics from each fold to produce an estimate of how well the model would perform.
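The splitting scheme can be sketched in a few lines of Python. The function name and the fixed seed are assumptions for this illustration; the study's actual tooling is not shown here.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)         # random split of the data
    folds = [indices[i::k] for i in range(k)]    # k parts of near-equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5))
for train, test in splits:
    # A model would be fitted on `train` and evaluated on `test` here.
    print(len(train), len(test))
```

Every sample is used for testing exactly once, so the five per-fold scores can be averaged into a single performance estimate.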
We started by calculating the commonly used metrics precision, recall, and F-measure, the latter being the harmonic mean of the former two. Precision and recall are metrics that focus on the true positives produced by the model. Powers (2008) argues that these metrics can be biased and suggests that a contingency matrix should be used to calculate additional metrics to help understand how negative predictions affect the accuracy of the constructed model. Using the contingency matrix, we calculated the model’s Matthews Correlation Coefficient (MCC), which Powers (2008) suggests as the best way to reduce the information provided by the matrix into a single value describing the model’s accuracy.
TP: True Positive; TN: True Negative; FP: False Positive; FN: False Negative
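Using the abbreviations above, the four accuracy measures can be computed from the contingency matrix as in the following sketch, which uses the standard definitions of these measures. The counts in the example are made up, and the function assumes that no denominator is zero.

```python
import math


def accuracy_measures(tp, tn, fp, fn):
    """Precision, recall, F-measure, and MCC from contingency matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    # MCC also takes the true negatives into account; it ranges from -1 to 1.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "mcc": mcc}


# Hypothetical counts for one cross-validation fold.
measures = accuracy_measures(tp=40, tn=30, fp=10, fn=20)
```

An MCC of 0 corresponds to predictions no better than chance, which makes it a useful complement to precision and recall when the two classes are of similar size, as is the case for accepted and rejected pull requests.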
For each classifier to easily gauge the overall accuracy of the machine learning algorithm in a model Bradley (1997), we calculated the Area Under The Receiver Operating Characteristic (AUC). For the AUC measurement, we calculated Receiver Operating Characteristics (ROC) and used these to find out the AUC ratio of the classifier, which is the probability of the classifier ranking a randomly chosen positive higher than a randomly chosen negative one.
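The ranking interpretation of the AUC can be made concrete with a small sketch that compares every positive instance with every negative one. It computes the same quantity as the area under the ROC curve, but with a quadratic number of comparisons, so it is meant only as an illustration; the scores and labels are made up.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted acceptance probabilities and true outcomes.
value = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(value)
```

A value of 0.5 means that the classifier ranks accepted and rejected pull requests no better than random guessing, while 1.0 means a perfect ranking.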
In order to allow our study to be replicated, we have published the complete raw data in the replication package (https://figshare.com/s/d47b6f238b5c92430dd7).
RQ1. What is the distribution of TD items violated by the pull requests in the analyzed software systems?
For this study, we analyzed 36,344 pull requests violating 253 TD items, which occurred more than 4.7 million times (Table 5) in the 28 analyzed projects. We found that 19,293 pull requests (53.08%) were accepted and 17,051 pull requests (46.92%) were rejected. Eleven projects contained the vast majority of the pull requests (80%) and TD items (74%). The distribution of the TD items differs greatly among the pull requests. For example, the projects Cassandra and Phoenix contain a relatively large number of TD items compared to the number of pull requests, while Groovy, Guacamole, and Maven have a relatively small number of TD items.
Taking into account the priority level of each rule, the vast majority of TD items (77.86%) are classified with priority level 3, while the remaining ones (22.14%) are equally distributed among levels 1, 2, and 4. None of the projects we analyzed had any issues rated as priority level 5.
Table 6 reports the number of TD items (”#TD item”) and their number of occurrences (”#occurrences”) grouped by priority level (”Priority”).
Looking at the TD items that could play a role in pull request acceptance or rejection, 243 of the 253 TD items (96%) are present in both cases, while the remaining 10 are found only in cases of rejection (Table 6).
Focusing on TD items that have a ”double role”, we analyzed the distribution in each case. We discovered that 88 TD items have a diffusion rate of more than 60% in the case of acceptance and 127 have a diffusion rate of more than 60% in the case of rejection. The remaining 38 are equally distributed.
Table 8 and Table 9 present preliminary information related to the twenty most recurrent TD items. We report descriptive statistics by means of Average (”Avg.”), Maximum (”Max”), Minimum (”Min”), and Standard Deviation (”Std. dev.”). Moreover, we include the priority of each TD item (”Priority”), the sum of issue rows of that rule type found in the issues master table (”# Total occurrences”), and the number of projects in which the specific TD item has been violated (”#Project”).
The complete list is available in the replication package (Section 4.5).
|Project Name||#PR||#TD Items||% Acc.||% Rej.|
|Priority||#TD Items||#occurrences||% PR Acc.||% PR Rej.|
RQ2. Does code quality affect pull request acceptance?
To answer this question, we trained machine learning models for each project using all possible pull requests at the time and using all the different classifiers introduced in Section 2. A pull request was used if it contained Java that could be analyzed with PMD. There are some projects in this study that are multilingual, so filtering of the analyzable pull requests was done out of necessity.
Once we had all the models trained, we tested them and calculated the accuracy measures described in Table 4 for each model. We then averaged each of the metrics from the classifiers for the different techniques. The results are presented in Table 7. The averaging provided us with an estimate of how accurately we could predict whether maintainers accepted the pull request based on the number of different TD items it has. The results of this analysis are presented in Table 10. For reasons of space, we report only the most frequent 20 TD items. The table also contains the number of distinct PMD rules that the issues of the project contained. The rule count can be interpreted as the number of different types of issues found.
|Average between 5-fold validation models|
|Accuracy Measure||L. R.||D. T.||Bagg.||R. F.||E. T.||A. B.||G. B.||XG.B.|
|TD Item||Avg||Max||Min||Std. dev.|
|Rule ID||Prior.||#prj.||#occur.||Importance (%)|
|TD items||No TD items|
As depicted in Figure 1, with the AUC of almost all models hovering around 50% for every prediction method, overall code quality does not appear to be a factor in determining whether a pull request is accepted or rejected.
There were some projects that showed some moderate success, but these can be dismissed as outliers.
These results might suggest that machine learning is not the most suitable technique. However, the test on the contingency matrix (0.12) (Table 11) also confirms the above results: the presence of TD items does not affect pull request acceptance (that is, TD items and pull request acceptance are mutually independent).
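The text does not name the test applied to the contingency matrix. Assuming a Pearson chi-square test of independence on a 2x2 table (pull requests with or without TD items, accepted or rejected), the computation could look as follows. The function name and the counts in the usage note are hypothetical and are not taken from Table 11.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the 2x2 table [[a, b], [c, d]].

    Returns the statistic and its p-value for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain observations")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom, the chi-square survival function
    # reduces to the complementary error function.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

For example, `chi_square_2x2(20, 10, 10, 20)` gives a statistic of about 6.67. A p-value above the chosen significance level would be consistent with the two variables being independent.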
RQ3. Does code quality affect pull request acceptance considering different types and levels of severity of TD items?
To answer this research question, we introduced PMD priority values assigned to each TD item. By taking these priorities into consideration, we grouped all issues by their priority value and trained the models using data composed of only issues of a certain priority level.
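The grouping step can be pictured as splitting the violation counts of each pull request by the priority of the violated rule, so that one feature set per priority level is obtained. The sketch below is a hypothetical illustration: the function name, the input layout (one `(pull_request_id, rule_id)` pair per violation), and the rule names in the usage note are assumptions, not the study's data format.

```python
def group_counts_by_priority(violations, rule_priority):
    """Return {priority: {pull_request_id: {rule_id: occurrence_count}}}."""
    grouped = {}
    for pr_id, rule_id in violations:
        priority = rule_priority[rule_id]
        # One nested dictionary of rule counts per priority and pull request.
        per_pr = grouped.setdefault(priority, {}).setdefault(pr_id, {})
        per_pr[rule_id] = per_pr.get(rule_id, 0) + 1
    return grouped
```

With the hypothetical input `[("pr1", "RuleA"), ("pr1", "RuleA"), ("pr2", "RuleB")]` and priorities `{"RuleA": 3, "RuleB": 1}`, the priority-3 group contains only `pr1` with two occurrences of `RuleA`, and a model for that level would be trained on that group alone.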
Once we had run the training and tested the models with the data grouped by issue priority, we calculated the accuracy metrics mentioned above. These results enabled us to determine whether the prevalence of higher-priority issues affects the accuracy of the models. The effect of a feature on model accuracy, that is, its importance, is determined using the drop-column importance mechanism (https://explained.ai/rf-importance/). After training our baseline model with P features, we trained P new models, each with one feature removed, and compared the tested accuracy of each new model against that of the baseline model. Should a feature affect the accuracy of the model, the model trained with that feature dropped from the data set would have a lower accuracy score than the baseline model. The more the accuracy of the model drops with a feature removed, the more important that feature is to the model when classifying pull requests as accepted or rejected. In Table 10 we show the importance of the 20 most common quality rules when comparing the baseline model accuracy with a model that has the specific quality rule dropped from the feature set.
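The drop-column procedure described above can be sketched as follows. The sketch is an illustration under stated assumptions rather than the study's implementation: `drop_column_importance` and `centroid_accuracy` are hypothetical names, and the toy nearest-centroid model, scored on the same data it was fit on, merely stands in for the classifiers and the tested accuracy used in the paper.

```python
def centroid_accuracy(X, y):
    """Toy stand-in model: nearest class centroid, scored on its own training data."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(row):
        # Ties are resolved in favour of the smallest label.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
        )

    return sum(predict(row) == target for row, target in zip(X, y)) / len(y)


def drop_column_importance(X, y, fit_and_score):
    """Return the baseline score and, per feature, the score lost when it is dropped."""
    baseline = fit_and_score(X, y)
    importances = []
    for col in range(len(X[0])):
        reduced = [row[:col] + row[col + 1:] for row in X]
        importances.append(baseline - fit_and_score(reduced, y))
    return baseline, importances
```

In this toy setting, a feature that separates the two classes receives a positive importance, while a constant feature receives an importance of zero because removing it leaves the score unchanged.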
Grouping by different priority levels did not provide any improvement of the results in terms of accuracy.
In this Section, we will discuss the results obtained according to the RQs and present possible practical implications from our research.
The analysis of the pull requests in 28 well-known Java projects shows that code quality, calculated by means of PMD rules, is not a driver for the acceptance or the rejection of pull requests. PMD recommends manually customizing the rule set, selecting the rules that developers should consider in order to maintain a certain level of quality, instead of using the out-of-the-box rule set. However, since we analyzed all the rules detected by PMD, no rule would be helpful and any customization would be useless in terms of being able to predict the software quality in code submitted to a pull request. The result cannot be generalized to all open-source and commercial projects, as we expect that some projects could enforce quality checks to accept pull requests. Some tools, such as SonarQube (one of the main PMD competitors), recently launched a new feature to allow developers to check the TD Issues before submitting the pull requests. Even if maintainers are not sensitive to the quality of the code to be integrated in their projects, at least based on the rules detected by PMD, the adoption of pull request quality analysis tools such as SonarQube or the usage of PMD before submitting a pull request will increase the quality of their code, increasing the overall software maintainability and decreasing the fault proneness that could be increased from the injection of some TD items (see Table 1).
The results complement those obtained by Soares et al. Soares et al. (2015) and Calefato et al. Calefato et al. (2017), namely, that the reputation of the developer might be more important than the quality of the code developed. The main implication for practitioners, and especially for those maintaining open-source projects, is the realization that they should pay more attention to software quality. Pull requests are a very powerful instrument, which could provide great benefits if they were used for code reviews as well. Researchers should also investigate whether other quality aspects might influence the acceptance of pull requests.
7 Threats to Validity
In this Section, we will introduce the threats to validity and the different tactics we adopted to mitigate them.
Construct Validity. This threat concerns the relationship between theory and observation due to possible measurement errors. Above all, we relied on PMD, one of the most used software quality analysis tools for Java. However, although PMD is widely used in industry, we did not find any evidence or empirical study assessing its detection accuracy. Therefore, we cannot exclude the presence of false positives and false negatives in the detected TD items. We extracted the code submitted in pull requests by means of the GitHub API. However, we identified whether a pull request was accepted or not by checking whether the pull request had been marked as merged into the master branch or whether the pull request had been closed by an event that committed the changes to the master branch. Other ways of handling pull requests within a project were not considered and, therefore, we are aware that there is a limited possibility that some maintainers integrated the pull request code into their projects manually, without marking the pull request as accepted.
Internal Validity. This threat concerns internal factors related to the study that might have affected the results. In order to evaluate the code quality of pull requests, we applied the rules provided by PMD, which is one of the most widely used static code analysis tools for Java on the market, also considering the different severity levels of each rule provided by PMD. We are aware that the presence or the absence of a PMD issue cannot be a perfect predictor of software quality, and other rules or metrics detected by other tools could have led to different results.
External Validity. This threat concerns the generalizability of the results. We selected 28 projects. 21 of them were from the Apache Software Foundation, which incubates only certain systems that follow specific and strict quality rules. The remaining six projects were selected with the help of the trending Java repositories list provided by GitHub. In the selection, we preferred projects that are considered ready for production environments and are using pull requests as a way of taking in contributions. Our case study was not based only on one application domain. This was avoided since we aimed to find general mathematical models for the prediction of the number of bugs in a system. Choosing only one domain or a very small number of application domains could have been an indication of the non-generality of our study, as only prediction models from the selected application domain would have been chosen. The selected projects stem from a very large set of application domains, ranging from external libraries, frameworks, and web utilities to large computational infrastructures. The application domain was not an important criterion for the selection of the projects to be analyzed, but at any rate we tried to balance the selection and pick systems from as many contexts as possible. However, we are aware that other projects could have enforced different quality standards and could use different quality checks before accepting pull requests. Furthermore, we are considering only open-source projects, and we cannot speculate on industrial projects, as different companies could have different internal practices. Moreover, we considered only Java projects. The replication of this work on different languages and different projects may lead to different results.
Conclusion Validity. This threat concerns the relationship between the treatment and the outcome. In our case, this threat could be represented by the analysis method applied in our study. We reported the results considering descriptive statistics. Moreover, instead of using only Logistic Regression, we compared the prediction power of different classifiers to reduce the bias of the low prediction power that one single classifier could have. We do not exclude the possibility that other statistical or machine learning approaches such as Deep Learning or others might have yielded similar or even better accuracy than our modeling approach. However, considering the extremely low importance of each TD Issue and its statistical significance, we do not expect to find big differences when applying other types of classifiers.
Previous works reported 84% of pull requests to be accepted based on the trustworthiness of the developers Gousios et al. (2015), Calefato et al. (2017). However, pull requests are one of the most common code review mechanisms, and we believe that open-source maintainers are also considering the code quality when accepting or rejecting pull requests.
In order to verify this statement, we analyzed the code quality of pull requests by means of PMD, one of the most widely used static code analysis tools, which can detect different types of quality flaws in the code (TD Issues), including design flaws, code smells, security vulnerabilities, potential bugs, and many other issues. We considered PMD as it is able to detect a good number of TD Issues of different types that have been empirically considered harmful by several works. Examples of these TD Issues are God Class, High Cyclomatic Complexity, Large Class and Inappropriate Intimacy.
We applied basic statistical techniques, as well as eight machine learning classifiers, to understand whether it is possible to predict if a pull request will be accepted or not based on the presence of a set of TD Issues in the pull request code. Of the 36,344 pull requests we analyzed in 28 well-known Java projects, nearly half had been accepted and the other half rejected. 243 of the 253 TD items were present in both cases.
Unexpectedly, the presence of TD items of all types in the pull request code does not influence the acceptance or rejection of pull requests at all; therefore, the quality of the code submitted in a pull request does not influence its acceptance. The same results are verified in all 28 projects independently. Moreover, merging all the data into a single large data set also confirmed the results.
Our results complement the conclusions derived by Gousios et al. Gousios et al. (2015) and Calefato et al. Calefato et al. (2017), who report that the reputation of the developer submitting the pull request is one of the most important acceptance factors.
As future work, we plan to investigate whether there are other types of qualities that might affect the acceptance of pull requests, considering TD Issues and metrics detected by other tools and analyzing different projects written in different languages. We will also investigate how to raise awareness in the open-source community that code quality should also be considered when accepting pull requests.
Moreover, we will investigate how harmful developers perceive PMD rules to be, in order to qualitatively assess these violations. Another important factor that needs to be considered is the developers’ personality as a possible influence on the acceptance of pull requests Calefato et al. (2019).
- Software inspections: an effective verification process. IEEE Software 6 (3), pp. 31–36. Cited by: §1.
- Software inspections and the industrial production of software. In Proc. Of a Symposium on Software Validation: Inspection-testing-verification-alternatives, pp. 13–40. Cited by: §1.
- Empirical analysis of fault-proneness in methods by focusing on their comment lines. In 2014 21st Asia-Pacific Software Engineering Conference, Vol. 2, pp. 51–56. Cited by: Table 1.
- An empirical analysis on fault-proneness of well-commented modules. In 2012 Fourth International Workshop on Empirical Software Engineering in Practice, pp. 3–9. Cited by: Table 1.
- Expectations, outcomes, and challenges of modern code review. In Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pp. 712–721. Cited by: §1.
- The goal question metric approach. Encyclopedia of Software Engineering. Cited by: §4.1.
- Refactoring: improving the design of existing code. Addison-Wesley Longman Publishing Co., Inc.. Cited by: §1, Table 1.
- Analyzing the state of static analysis: a large-scale evaluation in open source software. In 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Vol. 1, pp. 470–481. Cited by: §1.
- The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognition 30 (7), pp. 1145 – 1159. Cited by: §4.4.
- Classification and regression trees. The Wadsworth and Brooks-Cole statistics-probability series, Taylor and Francis. Cited by: §2.3.
- Bagging predictors. Machine Learning 24 (2), pp. 123–140. Cited by: §2.3.
- Random forests. Machine Learning 45 (1), pp. 5–32. Cited by: §2.3.
- AntiPatterns: refactoring software, architectures, and projects in crisis. 1st edition, New York, NY, USA. Cited by: Table 1.
- AntiPatterns: refactoring software, architectures, and projects in crisis: refactoring software, architecture and projects in crisis. John Wiley and Sons. Cited by: §1.
- A preliminary analysis on the effects of propensity to trust in distributed software development. In 2017 IEEE 12th International Conference on Global Software Engineering (ICGSE), pp. 56–60. Cited by: §1, §3.1, §6, §8, §8.
- A large-scale, in-depth analysis of developers’ personalities in the apache ecosystem. Information and Software Technology 114, pp. 1 – 20. Cited by: §8.
- XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. Cited by: §2.3.
- A metrics suite for object oriented design. IEEE Trans. Softw. Eng. 20 (6), pp. 476–493. Cited by: Table 1.
- The regression analysis of binary sequences. Journal of the Royal Statistical Society. Series B (Methodological) 20 (2), pp. 215–242. Cited by: §2.3.
- The wycash portfolio management system. OOPSLA ’92. Cited by: §1.
- On the impact of design flaws on software defects. In 2010 10th International Conference on Quality Software, pp. 23–31. Cited by: §1.
- Empirical evaluation of the impact of object-oriented code refactoring on quality attributes: a systematic literature review. IEEE Transactions on Software Engineering 44 (1), pp. 44–69. Cited by: Table 1.
- Does god class decomposition affect comprehensibility?. pp. 346–355. Cited by: Table 1.
- Design and code inspections to reduce errors in program development. IBM Systems Journal 15 (3), pp. 182–211. Cited by: §1.
- Development and deployment at facebook. IEEE Internet Computing 17 (4), pp. 8–17. Cited by: §1.
- Impact of refactoring on quality code evaluation. In Proceedings of the 4th Workshop on Refactoring Tools, WRT ’11, pp. 37–40. Cited by: §1.
- A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 (1), pp. 119 – 139. Cited by: §2.3.
- Greedy function approximation: a gradient boosting machine.. Ann. Statist. 29 (5), pp. 1189–1232. Cited by: §2.3.
- Extremely randomized trees. Machine Learning 63 (1), pp. 3–42. Cited by: §2.3.
- Work practices and challenges in pull-based development: the integrator’s perspective. In 37th IEEE International Conference on Software Engineering, Vol. 1, pp. 358–368. Cited by: §1, §3.1, §3.2, §3.2, §8, §8.
- An exploratory study of the pull-based software development model. In 36th International Conference on Software Engineering, ICSE 2014, pp. 345–355. Cited by: §1, §3.1, §3.1, §3.1, §3.1, §3.2, §3.2.
- Will they like this? evaluating code contributions with language models. In 12th Working Conference on Mining Software Repositories, pp. 157–167. Cited by: §3.1.
- Guidelines for conducting and reporting case study research in software engineering. Empirical Softw. Engg. 14 (2), pp. 131–164. Cited by: §4.
- Evaluating the impact of design pattern and anti-pattern dependencies on changes and faults. Empirical Softw. Engg. 21 (3), pp. 896–931. Cited by: Table 1.
- An in-depth study of the promises and perils of mining github. Empirical Software Engineering 21 (5), pp. 2035–2071. Cited by: §4.2.
- An exploratory study of the impact of code smells on software change-proneness. In 2009 16th Working Conference on Reverse Engineering, pp. 75–84. Cited by: §1, Table 1.
- An exploratory study of the impact of code smells on software change-proneness. In 2009 16th Working Conference on Reverse Engineering, pp. 75–84. Cited by: Table 1.
- Studying pull request merges: a case study of shopify’s active merchant. In 40th International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP ’18, pp. 124–133. Cited by: §1, §3.1, §3.2, §3.2.
- Object-oriented metrics in practice. Springer-Verlag, Berlin, Heidelberg. Cited by: §1, Table 1.
- A survey on code analysis tools for software maintenance prediction. In 6th International Conference in Software Engineering for Defence Applications, pp. 165–175. Cited by: §1, §2.1.
- An empirical study of the bad smells and class error probability in the post-release object-oriented system evolution. J. Syst. Softw. 80 (7), pp. 1120–1128. Cited by: Table 1.
- A complexity measure. IEEE Trans. Softw. Eng. 2 (4), pp. 308–320. Cited by: Table 1.
- Diversity in software engineering research. ESEC/FSE 2013, pp. 466–476. Cited by: §4.2.
- The evolution and impact of code smells: a case study of two open source systems. In 2009 3rd International Symposium on Empirical Software Engineering and Measurement, pp. 390–400. Cited by: §1.
- Are all code smells harmful? a study of god classes and brain classes in the evolution of three open source systems. In 2010 IEEE International Conference on Software Maintenance, pp. 1–10. Cited by: Table 1.
- On the diffuseness and the impact on maintainability of code smells: a large scale empirical investigation. Empirical Softw. Engg. 23 (3), pp. 1188–1221. Cited by: Table 1.
- Qualitative Evaluation and Research Methods. Sage, Newbury Park. Cited by: §4.2.
- Why google stores billions of lines of code in a single repository. Commun. ACM 59 (7), pp. 78–87. Cited by: §1.
- Evaluation: from precision, recall and f-factor to roc, informedness, markedness & correlation. Mach. Learn. Technol. 2. Cited by: §4.4.
- CORRECT: code reviewer recommendation in github based on cross-project and technology experience. In 38th International Conference on Software Engineering Companion (ICSE-C), pp. 222–231. Cited by: §3.1, §3.1.
- An insight into the pull requests of github. In 11th Working Conference on Mining Software Repositories, MSR 2014, pp. 364–367. Cited by: §1, §3.1, §3.1.
- Understanding broadcast based peer review on open source software projects. In 33rd International Conference on Software Engineering (ICSE), pp. 541–550. Cited by: §3.1.
- Contemporary peer review in action: lessons from open source development. IEEE Software 29 (6), pp. 56–61. Cited by: §1.
- Building empirical support for automated code smell detection. In Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM ’10, pp. 8:1–8:10. Cited by: Table 1.
- Inspecting the history of inspections: an example of evidence-based technology diffusion. IEEE Software 25 (1), pp. 88–90. Cited by: §1.
- Quantifying the effect of code smells on maintenance effort. IEEE Transactions on Software Engineering 39 (8), pp. 1144–1156. Cited by: Table 1.
- Rejection factors of pull requests filed by core team developers in software projects with high acceptance rates. In 14th International Conference on Machine Learning and Applications (ICMLA), pp. 960–965. Cited by: §1, §3.1, §3.1, §3.1, §6.
- Influence of social and technical factors for evaluating contribution in github. In 36th International Conference on Software Engineering, ICSE 2014, pp. 356–366. Cited by: §3.1.
- Automatically prioritizing pull requests. In 12th Working Conference on Mining Software Repositories, pp. 357–361. Cited by: §1, §3.1.
- Assessing the capability of code smells to explain maintenance problems: an empirical study combining quantitative and qualitative data. Empirical Softw. Engg. 19 (4), pp. 1111–1143. Cited by: Table 1.
- Wait for it: determinants of pull request evaluation latency on github. In 12th Working Conference on Mining Software Repositories, pp. 367–371. Cited by: §3.1.
- Reviewer recommender of pull-requests in github. In IEEE International Conference on Software Maintenance and Evolution, pp. 609–612. Cited by: §1, §3.1, §3.1.
- How developers document pull requests with external references. In 25th International Conference on Program Comprehension (ICPC), Vol. 00, pp. 23–33. Cited by: §1, §3.1, §3.1.
- A study on the interplay between pull request review and continuous integration builds. pp. 38–48. Cited by: §3.1.
- Investigating the impact of design debt on software quality. In Proceedings of the 2nd Workshop on Managing Technical Debt, MTD ’11, pp. 17–23. Cited by: Table 1.