Collaborative coding environments like GitHub (https://github.com/) make it easier to contribute to software projects. However, these environments also make evaluating contributions a challenging task for project managers and code reviewers, since developers with wide ranges of experience and commitment work simultaneously on the same projects. To make matters worse, during software development, developers can make contributions that lead to the introduction of bugs.
Although collaborative coding environments have made it more difficult to evaluate developers’ contributions, they provide a rich source of data. This data can be explored to extract information on technical (such as developers’ experience) and social (such as interactions among developers) factors related to developers and their contributions. Understanding the relation between those factors and the introduction of bugs may be useful for project managers and code reviewers. For example, consider the case in which a developer whose profile exhibits bug-related factors submits a pull request to a project. Code reviewers might then want to double-check this contribution, potentially avoiding the introduction of a bug.
Previous studies [1, 4, 2, 5, 6, 7, 8, 9, 10] have used these factors to perform different analyses on developers’ contributions. For instance, studies [1, 4, 2] indicate that both technical and social factors impact the acceptance rate of contributions on GitHub. However, these studies did not analyze the influence of these factors on the introduction of bugs. Other studies [5, 6, 7, 8, 9, 10] evaluated this influence in a very limited way by considering only proprietary projects [5, 8, 10] or a reduced number of factors as well as characteristics to represent them [8, 9].
To provide a deeper understanding of the relation between technical and social factors and the introduction of bugs, we present a broader study involving five technical and two social factors. In particular, we analyze factors related to the developers’ experience [6, 5, 11, 12], developers’ habit of following well-known technical contribution norms, ownership [11, 8, 13], nature of developers’ changes, bugginess of developers’ commits, communication with the community of a project, and the project establishment. First, we investigate how buggy commits (i.e., commits that introduce a bug) differ from clean commits (i.e., commits that did not introduce any bug) in terms of these factors. Then, we analyze how strong the difference between buggy and clean commits is. Finally, we evaluate the effect of each factor on commit bugginess (i.e., the likelihood of a commit introducing a bug) when considering the presence of multiple factors.
To perform our study, we collect data from eight open-source projects hosted on GitHub. In particular, we collect 6,832 bug reports and compute 19 metrics. We use these metrics to characterize the factors analyzed. Our study led to four main findings: (i) both technical and social factors are able to discriminate between buggy and clean commits; (ii) the developers’ habit of not following technical contribution norms is associated with an increase in commit bugginess; (iii) unexpectedly, the presence of tests in commits is related to an increase in commit bugginess; and, finally, (iv) the developer’s experience presents a contradictory relation with the introduction of bugs. However, the joint analysis of code complexity and developer’s experience may explain this contradiction. These findings shed light on how to improve state-of-the-art techniques that may assist project managers and code reviewers during the inspection of bugs.
The remainder of this paper is structured as follows. Section II presents the design of the empirical study, while its results are presented in Section III. Section IV discusses our findings and compares them with the literature. Section V presents the threats to validity. Section VI discusses the related work on commit bugginess. Finally, Section VII concludes this paper.
II Study Design
Open-source environments like GitHub enable developers with different technical capabilities and social interactions to contribute actively and simultaneously to the same software project. Developers may also perform a variety of activities, for instance: push commits, open/close pull requests and issues, and discuss contributions. Although developers can collaborate on different projects, their technical capabilities and social interactions may be determining factors for software quality. For example, a developer who has never communicated with others involved in a project may not have enough knowledge about it and, therefore, may inadvertently introduce bugs when performing a commit. In this context, our study investigates the relation between technical and social factors and the introduction of bugs in open-source software projects. To do so, we define three research questions:
RQ1: Do bug-introducing commits differ from clean commits in terms of technical and social factors?
Understanding which factors are more related to buggy or clean commits may help code reviewers avoid the introduction of bugs during software development. For example, if the number of modified files in buggy commits is greater than in clean commits, code reviewers may want to double-check commits with a high number of modified files. Hence, RQ1 aims at investigating whether there is a statistically significant difference between buggy and clean commits in terms of technical and social factors.
RQ2: How strong is the difference between buggy and clean commits?
In the previous research question, we aim at understanding whether buggy commits differ from clean commits in terms of technical and social factors. Although these factors may indicate a difference between buggy and clean commits, the strength of this difference may be negligible. As a consequence, these factors may not be useful to characterize buggy commits. Hence, we define RQ2 to evaluate how strong the difference between buggy and clean commits is. Our understanding is that the larger this difference, the better technical and social factors may characterize buggy commits.
RQ3: When we consider multiple factors, what is the effect of each one on commit bugginess?
In the previous research questions, we analyze which factors are able to discriminate between buggy and clean commits as well as how strong this difference is. However, during software development, the combined influence of different factors may lead to commit bugginess [13, 12, 7, 16, 5, 9, 6]. For example, the commit size and developer’s experience may simultaneously contribute to commit bugginess. Hence, this research question aims at investigating the effect of each factor on commit bugginess in the presence of the other factors.
II-A Technical and Social Factors
To answer our research questions, we analyze five technical and two social factors. The former factors are related to technical contributions: developer’s experience in a project (F1), developer’s habit of following well-known technical contribution norms (F2), ownership of a developer’s code (F3), nature of a developer’s commits (F4), and bugginess of a developer’s commits (F5). The latter factors are related to the interactions among developers in open-source projects. Particularly, we focus on comments made by developers (F6) and the project establishment (F7). We selected these factors since they were previously analyzed in studies involving investigations on open-source environments [1, 4, 17, 6, 3, 16]. Each studied factor and its motivation are detailed below.
II-A1 Developers’ Experience (F1)
Although previous studies have assessed the influence of developer’s experience on the likelihood of their commits being buggy [6, 5, 11, 12], they reached contradictory conclusions about this relation. While Eyolfson et al. and Rahman & Devanbu show that experienced developers are less likely to introduce bugs, Mockus and Tufano et al. indicate that more experienced developers are more likely to introduce bugs. These contradictory findings lead us to revisit some of their analyses. Moreover, these studies assessed the influence of developers’ experience by considering only their number of previous commits or the number of days in which they have been associated with a project. Those metrics alone may not be sufficient to characterize the experience factor. For example, the developers’ experience may increase as they participate in the code review process of a project [18, 13]. Therefore, we also use data about the code review process to characterize the developers’ experience. Hence, we analyze the F1 factor to provide a deeper understanding of the relationship between developer’s experience and commit bugginess.
II-A2 Developer’s Habit to Follow Technical Contribution Norms (F2)
Studies [17, 1] indicate that, to improve software quality, project managers and code reviewers prefer to receive contributions (pull requests) that follow certain norms, such as the inclusion of tests and commits with fewer changed files and higher legibility. Moreover, other studies [3, 5, 16] assessed quality measures associated with technical contribution norms (e.g., commit size and complexity) and their influence on commit bugginess. However, none of these studies assesses the relation between the habit of following technical contribution norms and commit bugginess. For example, if a developer’s commits include tests or consist of small pieces of code, do these commits tend to introduce fewer bugs? Hence, we analyze the F2 factor to investigate the relation between following technical contribution norms and commit bugginess.
II-A3 Ownership Level of Developer’s Commits (F3)
Prior work [11, 8, 13] has studied the relation between code ownership, i.e., how much a developer is responsible for a piece of source code, and commit bugginess. Such studies focus only on the ownership of a particular source code entity, e.g., a file, at the instant that a commit is pushed. There is no work analyzing the relationship between a developer’s ownership and commit bugginess. For instance, is a developer that works mostly on their own code less (or more) likely to introduce bugs? Hence, we define the F3 factor to evaluate the relation between the developer’s ownership and commit bugginess.
II-A4 Nature of Developer’s Commits (F4)
Previous studies on commit bugginess [16, 10, 19] state that bug-fix commits are more likely to introduce a new bug in the software. This finding indicates that the nature of a commit may be a relevant indicator of commit bugginess. In addition, studies [14, 20, 21] provide commit classification strategies able to recognize the nature of a code change. Based on these strategies, we define the F4 factor to evaluate the relation between the nature of developer’s commits and their bugginess.
II-A5 Developer’s Commits Bugginess (F5)
The bugginess (i.e., the likelihood of a commit introducing a bug) of a developer’s commits has been commonly used as the outcome measure in previous studies [6, 5, 11, 16]. Eyolfson et al. used the percentage of a developer’s commits that are buggy to determine how buggy that developer’s commits are. However, these studies did not evaluate whether a developer’s commit bugginess (i.e., the buggy percentage) may influence the introduction of new bugs. For instance, is a developer whose commits are mostly buggy more likely to introduce bugs in the future? Thus, we study the F5 factor to investigate the relation between the bugginess of a developer’s commits and the introduction of new bugs.
II-A6 Communication with the Community of a Project (F6)
A previous study analyzed the relation between the communication among developers and the introduction of bugs. The results suggest that bug-introducing committers discuss significantly less than other committers. In that study, the authors considered only the interaction among developers in the bug-tracking system (Bugzilla, https://www.bugzilla.org) of two projects. However, open-source environments [1, 17] support communication among developers in different ways. For instance, GitHub supports discussions about: (i) feature implementations and bug fixes through pull requests; (ii) bug reports or feature requests through issues; or (iii) changes made in a specific commit. Our intuition is that those different ways of communicating in a project community should also be considered when we investigate the effects on commit bugginess. Hence, we define the F6 factor to analyze the relation between commit bugginess and the amount of a developer’s communication with the community of a project hosted on GitHub.
II-A7 Project Establishment (F7)
Open-source projects are constantly evolving, attracting new developers and followers, who, eventually, will demand an increase in the stability of the project. As a project becomes more stable, other important projects may become dependent on that project. Thus, in light of these concerns, code reviewers of more established projects may be much more conservative and careful when accepting new contributions. Hence, we define the F7 factor, which aims at investigating whether the establishment of a project affects commit bugginess.
In the next sections, we provide a detailed description of the methodology used to answer our research questions in terms of the seven factors studied.
To characterize the factors discussed in the previous section, for all commits of a project, we compute different metrics related to each factor, as follows:
Developer’s Experience (F1): We use five metrics to characterize the developer’s experience in a project. Our intuition is that the higher the value of these metrics, the more a developer understands the project and its source code, and, therefore, the more experienced he is. To compute the values of each metric, we consider the interval between the developer’s first commit to a project and the instants in which the commits authored by him were pushed. These metrics are detailed below:
Experience (EXP): this metric measures experience as the number of commits authored by a developer;
Recent Experience (REXP): to measure the recent experience, we consider the developer’s experience (EXP) weighted by the age of his previous changes, following a definition from prior work. By using REXP, we give a higher weight to more recent changes. As a consequence, we can attribute more experience to developers who have contributed recently;
File Experience (FEXP): it measures the developer’s experience in the files modified in a commit authored by him. Particularly, for each file modified in a commit, we define the developer’s experience as the number of previous commits authored by him on this file. Then, we measure this metric as the sum of the experience in each file;
Code Review Experience (EXPRev): this metric indicates the developers’ experience regarding their activities involving code review in GitHub projects. Particularly, we measure these activities in terms of the number of: (i) open and closed issues; (ii) open and closed pull requests; and (iii) comments on issues and pull requests;
Recent Code Review Experience (REXPRev): this metric attributes more experience to developers that performed code review activities recently. To do that, we analyze the code review experience (EXPRev) weighted by the age of previous review activities performed by the developer.
Even though such metrics may not be sufficient to fully characterize experience, they represent different aspects of the activities performed by developers when they contribute to GitHub projects.
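As a rough illustration, EXP and FEXP could be derived from a chronologically ordered commit history along the lines of the Python sketch below. The `Commit` record and the `experience_metrics` function are hypothetical names introduced here, and the sketch assumes one reading of the definitions: both metrics count only the commits a developer authored before the commit being measured.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Commit:
    """Hypothetical, minimal view of a commit used only for this sketch."""
    sha: str
    author: str
    files: List[str]


def experience_metrics(commits: List[Commit]) -> Dict[str, Tuple[int, int]]:
    """Return {sha: (EXP, FEXP)} for commits given in chronological order.

    Assumption: both metrics count only commits made *before* the current one.
    """
    commits_by_author = defaultdict(int)        # author -> commits so far
    commits_by_author_file = defaultdict(int)   # (author, file) -> commits so far
    result = {}
    for commit in commits:
        files = set(commit.files)               # ignore duplicated paths
        exp = commits_by_author[commit.author]
        # FEXP: sum of the author's previous commits on each modified file
        fexp = sum(commits_by_author_file[(commit.author, f)] for f in files)
        result[commit.sha] = (exp, fexp)
        # Update the counters only after measuring the current commit
        commits_by_author[commit.author] += 1
        for f in files:
            commits_by_author_file[(commit.author, f)] += 1
    return result


history = [
    Commit("c1", "ana", ["A.java"]),
    Commit("c2", "ana", ["A.java", "B.java"]),
    Commit("c3", "bob", ["A.java"]),
    Commit("c4", "ana", ["A.java", "B.java"]),
]
print(experience_metrics(history)["c4"])  # (2, 3)
```

The recency-weighted variants (REXP, REXPRev) would additionally need the weighting scheme of the cited definition, which is not reproduced in this sketch.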
Technical Contribution Norms (F2): We use five metrics to characterize the factor related to the technical contribution norms. Such metrics are based on the norms described in the work of Tsay et al., which analyzed the influence of the presence of tests (and other technical metrics) in commits on the acceptance of pull requests. In our study, we can analyze, for example, whether a higher presence of tests in the developer’s commits indicates a developer who is more compliant with technical contribution norms. We describe these five metrics below:
Tests (%) (TP): this metric measures the percentage of developer’s commits that contain tests. Studies [1, 17] indicate that reviewers prefer contributions containing tests because they are more reliable. Hence, we define the TP metric to measure how reliable developers’ contributions are. To extract this metric, we adopt a previously described procedure, since its authors report a high accuracy. First, we retrieve the files modified in a commit authored by a developer. Then, we check if at least one of these files contains the word “test” in its pathname (i.e., the complete name indicating the location of a file in a file system). If so, we consider that the commit contains tests;
Modified Files (MF): it represents the number of files modified in a commit;
Median of Modified Files (MMF): it measures the median of modified files among all the commits authored by a developer. This metric was defined to characterize the usual behavior (habit) of developers in terms of the number of files modified in their commits. By calculating the median of modified files, we can investigate, for instance, if developers that constantly modify many files are more (or less) likely to introduce bugs;
Changed Lines (CL): it represents the number of changed lines in a commit. A changed line can be an addition or a deletion in a commit.
Median of Changed Lines (MCL): it represents the median of changed lines among all the commits authored by a developer.
We define the CL and MCL metrics as complementary metrics of MF and MMF, respectively, aiming at characterizing the legibility of developers’ contributions. We computed the MF and CL metrics by considering the instant that the commits were pushed in a project. To extract the TP, MMF, and MCL metrics, we adopted the same interval used to compute the experience metrics.
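The developer-level norms metrics can be sketched in a few lines of Python. The function names and the `(paths, changed_lines)` tuple layout are illustrative choices, and the match on “test” is assumed to be case-insensitive, which the text does not specify.

```python
from statistics import median
from typing import List, Sequence, Tuple

# One commit = (paths of modified files, number of changed lines).
CommitInfo = Tuple[Sequence[str], int]


def contains_tests(paths: Sequence[str]) -> bool:
    """True if at least one modified file has "test" in its pathname.

    Assumption: the match is case-insensitive.
    """
    return any("test" in path.lower() for path in paths)


def norms_metrics(commits: List[CommitInfo]) -> Tuple[float, float, float]:
    """Return (TP, MMF, MCL) over all commits authored by one developer."""
    tp = 100.0 * sum(contains_tests(paths) for paths, _ in commits) / len(commits)
    mmf = median(len(paths) for paths, _ in commits)   # median of MF
    mcl = median(lines for _, lines in commits)        # median of CL
    return tp, mmf, mcl


developer_commits = [
    (["src/main/Foo.java"], 10),
    (["src/test/FooTest.java", "src/main/Foo.java"], 40),
    (["README.md"], 2),
    (["src/main/Bar.java", "src/main/Baz.java", "src/test/BarTest.java"], 100),
]
print(norms_metrics(developer_commits))  # (50.0, 1.5, 25.0)
```

A pathname-based check is deliberately coarse: it would also flag files such as `latest.txt`, which is a known trade-off of keyword heuristics.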
Ownership Level (F3): We use two metrics to characterize the ownership level factor. Our intuition is that the higher the value of these metrics, the higher the ownership level of developer’s commits. We describe these metrics below:
Commit Ownership (CO): this metric measures how much a developer “owns” the files modified in a commit that he authored. For each modified file, we measure the ownership of the developer in that file as a function of the number of developers that previously authored a commit involving the file. Then, we define the ownership level of a developer in a commit as the median of the file ownerships. To compute this metric, we consider the instant that a commit was pushed to a project;
Median Ownership (MO): it represents the median of the commits’ ownerships among all the commits authored by a developer. The process of computing this metric is the same adopted for the developer’s experience metrics;
Nature of Commits (F4): We use four metrics to characterize the nature of developers’ commits. These metrics are based on the commit nature classification described by Hattori & Lanza. The authors define four categories of commits based on a keyword analysis of the textual content of their messages: (Forward Engineering) this category is related to development activities, for example, the implementation of new features; (Reengineering) this category is related to refactorings, redesigns, and other actions that enhance code quality without actually adding new features; (Corrective Engineering) this category handles defects, errors, and bugs in the software; and (Management) this category handles activities that are unrelated to coding, such as documentation and cosmetic changes. We employed this strategy because of its simplicity and good performance. Our intuition is that the higher the value of these metrics, the more a developer is focused on a specific commit nature. The process of computing these metrics was the same as that adopted for the experience metrics. We describe these metrics below:
Forward Engineering (%) (FEP): this metric measures the percentage of commits classified as Forward Engineering;
Reengineering (%) (RP): it measures the percentage of commits classified as Reengineering;
Corrective Engineering (%) (CEP): it measures the percentage of commits classified as Corrective Engineering;
Management (%) (MP): this metric measures the percentage of commits classified as Management.
Commits Bugginess (F5): Inevitably, during the software development process, developers make changes that introduce bugs. In this context, we analyze whether the commits of developers who have previously introduced bugs are more likely to be buggy. To measure this factor, we evaluate the Percentage of Buggy Commits (PBC) previously authored by a developer. Our intuition is that the higher the value of this metric, the more “harmful” the developer’s commits are. The process of computing this metric is the same as that adopted for the developer’s experience metrics.
Communication with the Community (F6): During software development, developers can perform diverse activities on GitHub. For instance, they can communicate with the community by posting comments about different topics in issues and pull requests. Such interactions may represent the involvement of a developer in a project. In this context, we evaluate the Number of Comments (NC) made by a developer in a GitHub repository. Our intuition is that the higher the value of this metric, the more a developer is involved in a project. The process of computing this metric is the same as that adopted for the F1 metrics.
Project Establishment (F7): We measure the establishment of a project as the Age of a project since its first commit, i.e., how long (in days) a project has existed on GitHub. This metric was defined by Tsay et al., and our intuition is that the higher the value of this metric, the more mature, and therefore established, a project is. We compute this metric at the instant that a commit was pushed.
II-C Project Selection
To perform our study, we manually selected eight GitHub Java projects according to the following criteria: (i) the projects must be open-source and their change history must be hosted on GitHub. This way, we ensure full access to the software history; (ii) the projects must use GitHub issues as the default bug-tracking tool. This way, we standardize our bug report analysis; (iii) the projects must be currently active and have been maintained or evolved for a long period of time. The main motivation for this criterion is to ensure that the projects are active and relevant to the GitHub community; and (iv) the projects must have a relevant number of reported bugs and involved developers. This way, we ensure that the projects have enough bug-related data to be investigated.
Table I summarizes the characteristics of the selected projects. Notice that each project has a high number of developers involved, varying from 144 (OkHttp) up to 902 (Elasticsearch). Moreover, the projects have a large number of commits and bugs associated with them. Notice also that all projects have thousands of GitHub stars (https://github.com/trending), which is a measure of community interest in a project. All this data enables us to perform a deep analysis of the relation between commit bugginess and the factors analyzed in our study.
II-D Collecting Bug Reports
The GitHub issues are used to keep track of tasks, enhancements, and bugs related to a project. Furthermore, developers can associate labels with each issue to characterize it. For example, an issue can be opened to fix a bug and a “bug” label can be associated with this issue. After fixing the bug, the issue is closed.
To collect the reports of fixed bugs in the selected projects, we mined the closed issues related to bugs (or defects) existing in each project. In order to identify these issues, we verified the ones containing the “bug” or “defect” labels. As a result of this process, we collected bug reports from the eight projects analyzed.
Furthermore, we conducted a careful manual validation of the collected bug reports to guarantee that they actually report bugs. After the manual validation process, we considered only the GitHub issues that were classified as actual bug reports, and these were the ones investigated in our study.
II-E Locating Bug-introducing Changes
During software development, developers make changes in the source code, either to add new functionality, repair an existing bug, or restructure the code. Inevitably, some of these changes may introduce bugs. We will further refer to these changes as bug-introducing changes.
To locate bug-introducing changes in the selected projects, we implemented the SZZ algorithm, which identifies the commits that introduced a bug. To do so, the SZZ algorithm requires the commits that fixed these bugs.
GitHub provides a functionality to close an issue or pull request using commit messages. For example, prefacing a commit message with the keywords “Fixes”, “Fixed”, “Fix”, “Closes”, “Closed”, or “Close”, followed by an issue number, such as, “Fixes #12345”, will automatically close the issue when the commit is merged into the master branch. This way, when this strategy is used to close a bug issue, we assume the commit that closed the issue as being the bug fix commit.
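A minimal Python sketch of how such closing references might be extracted from commit messages is shown below. The regular expression and the `closed_issue_numbers` helper are illustrative, not taken from the replication package, and they cover only the six keywords listed above.

```python
import re
from typing import List

# Matches the keywords listed in the text ("Fix", "Fixes", "Fixed", "Close",
# "Closes", "Closed") followed by "#<issue number>". GitHub also recognises
# other keywords (e.g. "Resolves"); they are intentionally not covered here.
CLOSING_REF = re.compile(r"\b(?:fix(?:e[sd])?|close[sd]?)\s+#(\d+)", re.IGNORECASE)


def closed_issue_numbers(commit_message: str) -> List[int]:
    """Return the issue numbers that a commit message claims to close."""
    return [int(number) for number in CLOSING_REF.findall(commit_message)]


print(closed_issue_numbers("Fixes #12345"))                  # [12345]
print(closed_issue_numbers("Refactor parser (see #99)"))      # []
```

Each extracted number can then be checked against the set of validated bug reports, so that only commits closing a bug issue are treated as bug-fix commits.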
We employed the SZZ algorithm for each validated bug report from the eight selected projects. As a result, we obtained a set of unique candidate bug-introducing changes. In addition, similarly to the collection of bug reports (see Section II-D), we conducted a careful manual validation on a sample of 250 bug-introducing changes reported by SZZ. The validation was conducted due to the high number of false positives (changes reported as bug-introducing when they actually are not) reported in previous studies [24, 25, 26].
II-F Data Collection
To collect the data used to compute the metrics related to technical and social factors (Section II-B), we use the GitHub API as follows. First, we collect the public identifier (username) on GitHub of the developers that authored at least one commit in the studied projects. Then, we mine the commits, issues, pull requests, and comments performed by the developers involved in the studied projects.
As a result of this process, we obtained data about the developers who authored at least one commit in the repository. Moreover, we collected commits, pull requests, issues, and comments related to the eight projects analyzed. We use the collected commits to compute the metrics of experience (F1), technical contribution norms (F2), ownership (F3), nature (F4), bugginess (F5), and establishment of a project (F7). Moreover, we use the collected issues and pull requests to compute the experience (F1) and communication (F6) metrics.
II-G Data Analysis
To answer RQ1, we use the Wilcoxon Rank Sum Test to verify which metrics are able to discriminate between buggy and clean commits. This test allows us to decide whether two populations (in our study, metrics related to the buggy and clean commits) are identical or not without assuming that the populations follow a normal distribution. To ensure statistical significance, we adopted the customary significance level for this test.
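To make the test concrete, the Python sketch below computes a two-sided p-value for the rank-sum statistic using the normal approximation with a tie correction and no continuity correction. This is a simplified stand-in rather than the exact procedure of any statistical package, and the function name and sample values are invented for illustration.

```python
import math
from typing import Sequence


def rank_sum_p_value(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided p-value of the Wilcoxon rank-sum test.

    Simplifications (assumptions of this sketch): normal approximation,
    tie-corrected variance, no continuity correction.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold tied values with ranks i+1..j; use the mean rank
        mean_rank = (i + 1 + j) / 2.0
        in_x = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += mean_rank * in_x
        ties = j - i
        tie_term += ties ** 3 - ties
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:          # every observation is identical
        return 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2.0))


# Made-up values of one metric (e.g. modified files) in two groups of commits
clean = [1, 2, 3, 4, 5]
buggy = [6, 7, 8, 9, 10]
print(rank_sum_p_value(clean, buggy) < 0.05)  # True
```

For small samples without ties, statistical packages typically compute an exact p-value instead, so results can differ slightly from this approximation.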
Furthermore, to answer RQ2, we used the Cliff’s Delta (d) measure to evaluate how strong the difference between buggy and clean commits is in terms of the metrics analyzed. Similarly to the Wilcoxon Rank Sum test, the Cliff’s Delta (d) is non-parametric; it is an effect size measure. In order to interpret the Cliff’s Delta (d) effect size, we employed a well-known classification. It defines four categories of magnitude: negligible if $|d| < 0.147$, small if $0.147 \le |d| < 0.33$, medium if $0.33 \le |d| < 0.474$, and large if $|d| \ge 0.474$. To compute the Cliff’s Deltas, we used the cliff.delta function from the effsize package for the R statistical language.
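For readers who want to see what the measure computes, here is a small Python sketch of Cliff’s Delta and of a magnitude classification. It is an illustrative re-implementation, not the effsize code, and it assumes the thresholds commonly associated with this classification (0.147, 0.33, and 0.474).

```python
from typing import Sequence


def cliffs_delta(x: Sequence[float], y: Sequence[float]) -> float:
    """Cliff's Delta: P(x > y) - P(x < y) over all pairs (x_i, y_j)."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))


def magnitude(d: float) -> str:
    """Classify |d| using the assumed thresholds 0.147 / 0.33 / 0.474."""
    size = abs(d)
    if size < 0.147:
        return "negligible"
    if size < 0.33:
        return "small"
    if size < 0.474:
        return "medium"
    return "large"


d = cliffs_delta([1, 2, 3, 4], [2, 3, 4, 5])
print(d, magnitude(d))  # -0.4375 medium
```

A positive d means values in the first group tend to be larger than in the second, and a negative d means the opposite. The pairwise loop is quadratic in the sample sizes, which is acceptable for a sketch but slow for large commit histories.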
To answer RQ3, we evaluate the effect of each metric in the presence of the other metrics. To perform this evaluation, we created a multiple logistic regression model for each studied project, where each metric is a predictor and the outcome variable is whether or not a commit introduces a bug in the project. In other words, we create a regression model that predicts the likelihood of commit bugginess.
We report the effect of the metrics on the likelihood of a commit being buggy in terms of odds ratios. Odds ratios are the increase or decrease in the odds of a commit being buggy per “unit” value of a predictor (metric). An odds ratio below 1 indicates a decrease in the odds, while an odds ratio above 1 indicates an increase. Since our metrics are heavily skewed, we apply one transformation to the right-skewed predictors and a different transformation to the left-skewed ones.
To ensure that multicollinearity would not affect our model, we remove the metrics whose pair-wise correlation coefficient is above a cutoff, using the findCorrelation function from the caret package, also for the R statistical language. Moreover, to ensure normality, we normalize the continuous predictors in the model. As a result, the mean of each predictor equals zero and its standard deviation equals one. Finally, to ensure statistical significance of the predictors, we employ the customary significance level for each predictor in the model.
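For reference, the standard textbook form of a multiple logistic regression model and of the odds ratio of a predictor can be written as follows. The symbols $p$ (probability that a commit is buggy), $x_j$ (value of the $j$-th metric after transformation and normalization), $\beta_j$ (fitted coefficient), and $k$ (number of retained metrics) are notation introduced here for illustration.

```latex
\[
  \log\frac{p}{1-p} \;=\; \beta_0 + \sum_{j=1}^{k} \beta_j x_j ,
  \qquad
  \mathrm{OR}_j \;=\; e^{\beta_j}
\]
```

Under this form, increasing $x_j$ by one unit while holding the other predictors fixed multiplies the odds $p/(1-p)$ by $\mathrm{OR}_j$. A negative coefficient therefore yields an odds ratio below 1 and a positive coefficient an odds ratio above 1; with normalized predictors, one unit corresponds to one standard deviation of the transformed metric.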
II-H Replication Package
To make our study as reproducible as possible, we provide a replication package (https://github.com/filipefalcaos/ISSRE-18) that contains the source code used to: collect all the data used in this study (see Section II-F); run the SZZ algorithm (see Section II-E); compute the metrics (see Section II-B); and perform the statistical analyses (see Section II-G). The replication package also includes all the data we collected and used in the analyses.
In this section, we present and discuss our main results in terms of three research questions (Section II).
RQ1: Do bug-introducing commits differ from clean commits in terms of technical and social factors?
Table II presents the results that support RQ1. We use the Wilcoxon Rank Sum test to verify whether there is a statistically significant difference between buggy and clean commits in terms of the metrics related to the technical and social factors analyzed in our study. The first column represents these factors. The second column describes the metrics associated with each factor, and the remaining columns describe the p-values of the Wilcoxon Rank Sum test for each metric in the projects analyzed in our study. The cells in gray present the p-values of the metrics that obtained statistical significance, i.e., a p-value below the adopted significance level.
We observe that the developers’ experience metrics (F1) obtained a statistically significant difference in 33 out of 40 cases analyzed. The EXP and EXPRev metrics each lacked statistical significance in only two projects. The remaining F1 metrics performed even better: each one lacked statistical significance in only one project. The technical contribution norms metrics (F2) obtained a statistically significant difference in a number of cases slightly greater than the F1 metrics. The CL and MF metrics presented statistical significance in all projects analyzed.
Both metrics characterizing the F3 factor presented statistical significance in the vast majority of the projects analyzed. While the CO metric obtained a statistically significant difference in seven out of eight projects, the MO metric did so in six projects. Regarding the F4 factor, its metrics presented statistical significance in 26 out of 32 cases analyzed. Particularly, the FEP metric obtained statistical significance in all projects analyzed. Similarly to the FEP metric, CL (F2), MF (F2), and PBC (F5) also obtained statistical significance in all projects analyzed. Such results indicate that technical factors are able to discriminate between buggy and clean commits. Regarding the social factors, we observe that the NC (F6) metric presented statistical significance in six projects. The Age (F7) metric performed even better, obtaining statistical significance in all projects analyzed.
Summary for RQ1. The analyzed metrics obtained a statistically significant difference in 84% of the cases investigated. In particular, the CL, MF, FEP, PBC, and Age metrics obtained statistical significance in all projects analyzed. Hence, these metrics may be a promising way to distinguish between clean and buggy commits, suggesting that clean and buggy commits differ in terms of both technical and social factors.
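The per-metric comparison behind RQ1 can be illustrated with a minimal sketch of a two-sided Wilcoxon Rank Sum test. It uses the normal approximation without a tie correction, which is a simplification and not necessarily the procedure applied in the study; the function names and the sample values are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_sum_test(buggy, clean):
    """Two-sided rank-sum test (normal approximation, no tie correction)."""
    n1, n2 = len(buggy), len(clean)
    ranks = average_ranks(list(buggy) + list(clean))
    w = sum(ranks[:n1])                       # rank sum of the buggy group
    mean_w = n1 * (n1 + n2 + 1) / 2           # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided p-value
    return z, p

# Hypothetical values of one metric (e.g., changed lines) for two commit groups.
buggy_cl = [120, 300, 250, 410, 180, 275, 330, 290]
clean_cl = [10, 25, 40, 15, 60, 35, 22, 48]
z, p = rank_sum_test(buggy_cl, clean_cl)
print(f"z = {z:.3f}, p = {p:.5f}")
```

In practice, one such test would be run per metric and per project, and the resulting p-value compared against the chosen significance level.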
RQ2: How strong is the difference between buggy and clean commits?
In the previous analysis, we investigated if there is a statistically significant difference between buggy and clean commits in terms of technical and social factors. Now, we use the Cliff’s Delta technique to analyze the magnitude of such difference. Table III supports the analysis related to the RQ2. Similarly to Table II, the first column represents the technical and social factors analyzed. The second column represents the metrics associated with each factor. The remaining columns describe the magnitudes  of the Cliff’s Delta (d) related to each metric in the projects used. We use the symbol to indicate if d was positive, and otherwise. In addition, we use four levels to measure the strength of a magnitude: (small) *; (medium) **; (large) ***; and (negligible) we do not use a symbol to represent this level.
Positive Magnitude. Notice that the MCL (F2), MF (F2), CL (F2), and PBC (F5) metrics presented a positive magnitude in all projects analyzed. Indeed, we observe that the CL metric obtained a large magnitude in all projects analyzed. The MF and PBC metrics obtained a strength slightly lower than CL, reaching magnitudes between small and large. Both the MCL and MMF metrics reached magnitudes that varied from negligible up to medium. While MCL presented a positive magnitude in all analyzed projects, MMF presented one in seven. Finally, the majority of the F1 metrics presented a small or medium positive magnitude in Netty and RxJava.
Negative Magnitude. The Age metric presented negative magnitudes in six projects, reaching a large one in Spring-boot. Similarly, the REXP and REXPRev metrics also presented negative magnitudes in six projects, although with a strength equal to or slightly lower than that of Age in those projects. We also observe that the majority of the F1 metrics presented medium and small negative magnitudes in Spring-boot and Signal-Android, respectively. This is the opposite of the tendency these metrics showed in Netty and RxJava, which indicates that the F1 metrics do not have a consistent tendency.
Summary for RQ2. Results show that buggy commits had significantly higher CL, MF, and PBC than clean commits. On the other hand, in the majority of the projects, buggy commits had significantly lower Age, REXP, and REXPRev than clean commits. Therefore, there are strong differences between clean and buggy commits in terms of both technical and social factors.
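The effect-size analysis used for RQ2 can be sketched as follows. Cliff's Delta is computed directly from its definition over all pairs of observations. The cut-offs used to label a magnitude are an assumption: they are the values commonly attributed to Romano et al. and are not taken from this section. Function names and sample data are hypothetical.

```python
def cliffs_delta(xs, ys):
    """Cliff's Delta: P(x > y) - P(x < y) over all pairs (x in xs, y in ys)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Assumed cut-offs for |d| (commonly attributed to Romano et al.).
THRESHOLDS = [(0.147, "negligible"), (0.33, "small"), (0.474, "medium")]

def magnitude(d):
    """Map |d| to a qualitative level using the assumed cut-offs above."""
    for limit, label in THRESHOLDS:
        if abs(d) < limit:
            return label
    return "large"

# Hypothetical metric values for buggy and clean commits.
buggy = [3, 4, 5]
clean = [1, 2, 3]
d = cliffs_delta(buggy, clean)
sign = "positive" if d > 0 else "negative" if d < 0 else "zero"
print(f"d = {d:.3f} ({sign}, {magnitude(d)})")
```

A positive d means the metric tends to be higher in buggy commits, and a negative d means it tends to be lower. The pairwise loop is quadratic in the sample sizes, so a rank-based formulation would be preferable for large projects.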
RQ3: When we consider multiple factors, what is the effect of each one on commit bugginess?
In the RQ1 and RQ2, we analyzed the metrics individually. Now, we use the odds ratios technique to investigate the effect of each metric on commit bugginess in the presence of other metrics analyzed in our study. Table IV summarizes the effects of the metrics on commit bugginess in each project. The first column represents the technical and social factors. The remaining columns describe the metrics and their respective odds ratios related to the projects. For each project, we consider only the metrics that do not have collinearity among them. Similarly to the RQ1, the cells in gray present the odds ratios of the metrics that obtained statistical significance. In addition, we use the symbol to indicate a risk-increasing effect, and the symbol otherwise.
|F1||FEXP (0.961)||FEXP (0.831)||FEXP (0.95)||FEXP (1.04)||FEXP (1.042)||FEXP (1.131)||FEXP (1.102)||FEXP (1.025)|
|EXPRev (1.804)||REXPRev (0.751)||EXP (1.084)||REXPRev (0.677)|
|F2||TP (1.252)||TP (1.139)||TP (1.475)||TP (1.08)||TP (0.848)||TP (0.865)||TP (0.899)||TP (1.361)|
|MCL (1.111)||MCL (1.22)||CL (3.262)||MCL (1.071)||MCL (1.1)||MCL (1.099)||MCL (1.117)||MCL (1.01)|
|CL (2.939)||CL (6.674)||MMF (0.801)||CL (2.459)||CL (2.942)||CL (1.726)||CL (3.073)||CL (2.472)|
|MMF (1.014)||MMF (0.842)||MF (1.031)||MMF (0.8)||MF (1.161)||MMF (1.079)||MMF (0.996)||MF (1.234)|
|MF (1.01)||MF (1.5)||MF (1.334)||MF (1.828)||MF (1.164)|
|F3||CO (0.885)||CO (0.949)||CO (0.886)||CO (0.869)||CO (0.844)||CO (2.11)||CO (0.677)||CO (1.106)|
|MO (1.019)||MO (1.064)||MO (0.642)||MO (0.403)|
|F4||RP (0.935)||RP (1.037)||RP (0.846)||RP (1.138)||RP (1.017)||RP (1.076)||RP (0.894)||RP (1.094)|
|CEP (1.056)||CEP (0.908)||CEP (0.831)||CEP (0.794)||CEP (1.035)||CEP (0.949)||CEP (0.945)||CEP (1.109)|
|FEP (1.058)||FEP (1.129)||FEP (1.287)||FEP (1.07)||FEP (0.873)||FEP (0.991)||FEP (1.017)||FEP (1.051)|
|MP (0.873)||MP (0.925)||MP (0.779)||MP (1.016)||MP (0.733)||MP (1)||MP (0.657)|
|F5||PBC (1.174)||PBC (0.999)||PBC (1.206)||PBC (1.49)||PBC (1.59)||PBC (2.74)||PBC (1.109)||PBC (0.8)|
|F6||NC (1.204)||NC (1.063)||NC (1.493)|
|F7||Age (0.417)||Age (0.241)||Age (0.986)||Age (0.173)||Age (0.413)||Age (0.296)||Age (1.125)|
Risk-increasing Effect. When we analyze the effect of each metric in the presence of the other ones, we observe that only the F2 factor could obtain at least one metric having a statistically significant effect in all projects analyzed. In particular, the CL (F2) metric obtained a risk-increasing effect in all cases. Moreover, this metric reached the highest effect in seven of the projects, reaching an odds ratio up to in the Spring-boot project. Such fact shows that each unit of the CL metric increases the odds of a commit of these seven projects being buggy by a factor of (RxJava) up to (Spring-boot). This metric could not reach the highest risk-increasing effect only in RxJava, where the PBC metric obtained the highest effect by reaching an odds ratio of . This means that each unit of the PBC metric increases the odds of a commit of the RxJava project being buggy by a factor of . Indeed, this metric has a statistically significant risk-increasing effect in five of the projects analyzed, varying its odds ratio from (Elasticsearch) up to (RxJava).
The MF (F2) metric also presented a statistically significant risk-increasing effect, but only in four projects, with an odds ratio of up to in the RxJava project. Similarly to the MF, the TP metric also presented a statistically significant risk-increasing effect in four projects. However, it obtained a significant risk-decreasing effect in the Presto project. This high number of risk-increasing effects obtained by the TP metric raises questions about the preference of code reviewers for contributions containing tests . In the next section, we discuss some issues that may lead the TP to obtain this tendency.
Risk-decreasing Effect. We observe that the Age (F7) metric reached the highest statistically significant risk-decreasing effect in five out of the eight projects analyzed. It was able to obtain an odds ratio of , which means that each unit of the Age decreases the odds of a commit being buggy by a factor of . Similarly to the Age metric, the CO metric also obtained significant risk-decreasing effects in five projects. However, it did not obtain the highest effect in any of the projects. Moreover, while the MP (F4) metric obtained the highest risk-decreasing effect in the Netty and OkHttp projects, the MO (F3) reached the highest one in the OkHttp. Indeed, these metrics obtained statistical significance only in the cases in which they have a risk-decreasing effect. Still on the CO metric, we observe an exceptional case in RxJava, where this metric reached a statistically significant and high risk-increasing effect. We discuss this case in more detail in the next section.
Summary for RQ3. The CL, PBC, and MF metrics tend to have a risk-increasing effect. On the other hand, the Age, CO, MP, and MO metrics tend to have a risk-decreasing effect. Therefore, the results show that some metrics of both technical and social factors have opposite tendencies when we consider multiple metrics.
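To make the reading of the odds ratios in Table IV concrete, the following sketch shows how an odds ratio changes the probability of a commit being buggy when a metric increases by one unit and the other metrics are held fixed. It assumes the odds ratios come from a logistic regression model, in which case an odds ratio equals the exponential of the fitted coefficient. The baseline probability of 0.10 is hypothetical; the two odds ratios are values that appear in Table IV.

```python
import math

def apply_odds_ratio(baseline_prob, odds_ratio, units=1):
    """Probability after `units` one-unit increases of a metric, other metrics fixed."""
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * odds_ratio ** units
    return new_odds / (1 + new_odds)

def effect_direction(odds_ratio):
    """Classify an odds ratio relative to the neutral value 1."""
    if odds_ratio > 1:
        return "risk-increasing"
    if odds_ratio < 1:
        return "risk-decreasing"
    return "neutral"

def odds_ratio_from_coefficient(beta):
    """In logistic regression, the odds ratio of a predictor is exp(beta)."""
    return math.exp(beta)

baseline = 0.10  # hypothetical probability that a commit is buggy
for name, ratio in [("CL", 6.674), ("Age", 0.241)]:
    after = apply_odds_ratio(baseline, ratio)
    print(f"{name}: OR = {ratio} ({effect_direction(ratio)}), "
          f"probability {baseline:.2f} -> {after:.3f}")
```

Note that an odds ratio multiplies the odds, not the probability, so the same ratio produces different probability changes at different baselines.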
In this section, we discuss our main findings and exceptional cases.
The contradictory effects of experience metrics. Regarding the developers’ experience, we observe that four metrics reached a medium and small negative magnitude in the Spring-boot and Signal-Android projects, respectively. On the other hand, the same metrics reached a medium or small positive magnitude in the Netty and RxJava. Such contradiction reinforces the results discussed by previous studies [6, 5, 12, 11]. Eyolfson et al.  provide evidence from two open-source projects that more experienced developers are less likely to introduce bugs. On the other hand, prior work [5, 12] explains that more experienced developers are more likely to introduce bugs due to the complexity of their tasks. Thereby, to understand why the effects of the experience factor are contradictory, we further investigate the relation of this factor with the complexity of the changes performed by the developers. In particular, we investigate whether more experienced developers perform more complex changes. To do so, we use the Spearman rank  technique to evaluate if there is a correlation between the developers’ experience metrics and the complexity of their commits (i.e., median of changed lines in the previous commits).
We perform our investigation in the Spring-boot and RxJava projects since five experience metrics (EXP, FEXP, REXP, EXPRev, and REXPRev) presented opposite effects in these projects. We use the classification defined in  to determine the strength of the correlations between these five metrics and commit complexity. In Spring-boot, the five experience metrics presented only negative correlations ranging from (small) up to (large). This result suggests that more experienced developers usually perform less complex changes in Spring-boot, which may indicate why more experienced developers introduce fewer bugs in this project. On the other hand, in RxJava the same five experience metrics presented positive correlations ranging from (small) up to (medium). This result indicates that more experienced developers usually perform more complex changes in RxJava, which may indicate why more experienced developers introduce more bugs in this project. Even though these results shed light on why experience metrics presented contradictory effects on commit bugginess, more analysis is still necessary before drawing firm conclusions.
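The correlation analysis described above can be sketched with a small, dependency-free Spearman rank correlation: both variables are converted to ranks (ties receive average ranks) and the Pearson correlation of the ranks is returned. The function names and the sample values for experience and change complexity are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical data: developer experience vs. median changed lines per commit.
experience = [5, 12, 30, 44, 60, 85]
median_changed_lines = [80, 65, 70, 40, 35, 20]
rho = spearman(experience, median_changed_lines)
print(f"rho = {rho:.3f}")
```

A negative rho in this setting would correspond to more experienced developers making less complex changes, and a positive rho to the opposite.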
The risk-increasing effect of the TP metric. Previous work [1, 17] shows that code reviewers prefer contributions that contain tests aiming at improving the software quality. However, results from RQ3 (see Table IV) show that the percentage of a developer’s commits that contain tests (TP) has a risk-increasing effect on commit bugginess. This finding seems to contradict the assumption that testing practices improve software quality. Thus, to better understand the reason behind this tendency, we further investigate the relation of the TP metric with the complexity of the changes performed by developers. Our intuition is that this metric may have a risk-increasing effect if the developers who constantly write tests in their commits also perform more complex changes. Similarly to the previous discussion, we use the Spearman rank correlation to evaluate the relation between the TP metric and the complexity of their commits (i.e., median of changed lines in the previous commits).
We perform our investigation by using the Netty, Spring-boot, OkHttp, and Elasticsearch projects, since the TP metric has a risk-increasing effect in these projects (see Table IV). Spring-boot, OkHttp, and Elasticsearch presented positive correlations ranging from (small) up to (medium). However, the Netty project presented a medium negative correlation of . Hence, despite the different behaviour in one project, we find evidence (that should be further explored in future work) indicating that the TP metric is a risk-increasing factor due to the complexity of developer’s commits.
Commit Ownership and RxJava. The results of the RQ3 show that the F3 metrics have a tendency to risk-decreasing effects on commit bugginess, as previously discussed in [11, 13, 8]. However, the CO metric presents a risk-increasing effect only in the RxJava project, reaching an increase in the odds of a commit being buggy by a factor of . This effect may be explained by a singularity in the RxJava project. In this project, only two developers were responsible for of the bug-introducing changes reported by SZZ. Indeed, we observe that these developers are the most active ones in the project. Moreover, while the median of the CO metric for the remaining developers involved in the project is , these two most active developers presented values equal to and . This shows that the developers responsible for the vast majority of bug-introducing changes in RxJava work mostly on their own code, which may explain why the CO metric is a risk-increasing factor in the RxJava project. This result suggests that the tendency of risk-decreasing effects obtained by the CO metric, in the vast majority of the projects, may not hold for projects where a few developers are responsible for most of the bug-introducing commits and also work mostly on their own code.
Reengineering and Bugs. A previous work of Bavota et al.  states that while some kinds of refactorings are unlikely to introduce bugs, other refactoring operations (e.g., Pull-up Method or Inline Temp) tend to introduce bugs very often. In our study, we also analyze the impact of refactoring operations, i.e., Reengineering commits (see Section II-B), on commit bugginess. Results show that the percentage of a developer’s reengineering commits presents a risk-decreasing effect in Elasticsearch and Netty. This result suggests that the more focused a developer is on refactoring operations, the less likely their commits are to introduce bugs in these two projects. Although our investigation does not deal with the specific kinds of refactorings studied in Bavota et al. , our finding suggests that general refactoring operations decrease the likelihood of a commit being buggy.
The risk-decreasing effect of project establishment. Results of the RQ1 indicated that the Age (F7) metric can be used to discriminate between clean and buggy commits in the vast majority of the analyzed projects. Moreover, the RQ2 results showed that this metric is negatively associated with commit bugginess. These results suggest that the more established a project is, the less likely that the commits from this project are buggy. In a previous work on pull request acceptance , authors showed that the code reviewers involved in established projects are more careful when evaluating new contributions. Our findings support this conclusion, since a more careful evaluation of contributions may be directly related to a lower commit bugginess.
V Threats to Validity
This section presents the threats to validity by following the criteria defined in Wohlin et al. .
Construct Validity. The set of technical and social factors analyzed in our study may not fully represent the reasons that may lead developers to introduce bugs in open-source projects. To mitigate this threat, we selected factors that were analyzed by previous studies involving investigations on open-source environments [1, 4]. We considered the perceptions of code reviewers [17, 23] to define the metrics related to technical contribution norms. Nonetheless, we cannot guarantee that the community of the analyzed projects agrees with such norms.
Prior work  found that of the bug reports from five open-source projects analyzed in our study were misclassified, i.e., a feature is requested instead of a bug reported. We mitigated this threat by performing a manual validation in all the bug reports collected (see Section II-D). Another threat is related to correctly identifying the commits that fixed bugs. To mitigate this threat, we used a GitHub functionality to identify bug-fix commits, as described in Section II-E.
Internal Validity. We rely on the SZZ approach to locate the introduction points of the analyzed bugs. Although the SZZ has been widely used to locate bug-introducing changes , it presents high false positive and false negative rates. To mitigate this threat, we also performed a manual validation on a sample of 250 bug-introducing changes reported by SZZ (see Section II-E). However, the false negatives, i.e., bug-introducing changes not detected, were not included in the manual validation because of the high effort needed to validate such cases.
Conclusion Validity. Regarding the validity of our findings, the metrics used in this study did not follow a normal distribution due to high skewness. To mitigate this, we used non-parametric methods, such as the Wilcoxon Rank Sum Test  and the Cliff’s Delta . Moreover, since multicollinearity of predictors may heavily affect the results of a multiple regression model , we removed from our models the predictors with pair-wise correlations above (see Section II-G). In addition, these statistical procedures have been widely used in software engineering research [12, 16, 5].
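The removal of collinear predictors mentioned above can be sketched as a greedy pairwise filter: a metric is kept only if its absolute correlation with every metric already kept does not exceed a threshold. This is a simplified illustration and not necessarily the procedure used in the study. The use of Pearson correlation, the visiting order, the threshold of 0.7, and the sample values are all assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def drop_collinear(columns, threshold):
    """Greedily keep metrics whose |r| with every kept metric is <= threshold.

    `columns` maps a metric name to its list of values; metrics are visited
    in insertion order, so earlier metrics take precedence over later ones.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical metric values for six commits.
metrics = {
    "EXP": [1, 2, 3, 4, 5, 6],
    "REXP": [2, 4, 6, 8, 10, 13],   # almost a multiple of EXP, so collinear
    "CL": [5, 1, 4, 2, 6, 3],
}
ASSUMED_THRESHOLD = 0.7  # placeholder; the study's cut-off is not restated here
print(drop_collinear(metrics, ASSUMED_THRESHOLD))
```

Because the filter depends on the visiting order, a different ordering of the metrics can retain a different member of a correlated pair, which is one reason the set of metrics in the models may differ between projects.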
External Validity. Regarding the generality of our findings, we selected only projects in which the primary language is Java. Although we have analyzed eight projects with different sizes, developers, and domains, our results might not hold for other projects, mainly the ones in which the primary language is not Java. This may be due to the fact that each project has specific characteristics and different communities.
VI Related Work
Some previous studies focus on the relation between quality measures and commit bugginess. Śliwerski et al.  present an approach to automatically locate fix-inducing commits (SZZ). They found that buggy commits are roughly three times larger than other commits. Correlations between developer characteristics (commit frequency and experience) and commit bugginess were previously investigated by Eyolfson et al. . The authors found that developers who commit to a repository on a daily basis write less buggy commits, while developers who commit as their day-job are more likely to produce bugs. Also, Eyolfson et al.  suggest the existence of a correlation between developer experience and commit bugginess.
Rahman & Devanbu analyzed four open-source projects and found that high levels of ownership are associated with a lower bug introduction rate. Moreover, the authors found that specialized experience is consistently associated with buggy code, while general experience is not. Similar findings on the ownership factor were presented in the work of Bird et al. . Thongtanunam et al.  show that there is a relationship between ownership and code review. In addition, the proportion of reviewers without expertise shares a strong relation with commit bugginess. Tufano et al.  presented an empirical study on developer-related factors. Their results show that commit coherence, developer experience, and past interfering changes are associated with commit bugginess.
Mockus  investigated organizational factors (e.g., size of the organization, time between releases) related to the presence of defects in the software. The author found that recent departures from an organization and distributed development are related to commit bugginess. Bernardi et al.  studied the influence of developer communication on commit bugginess, finding that developers who introduce bugs have a higher social importance and communicate less between themselves.
Those studies [5, 6, 7, 8, 13, 12, 16] evaluated the relation between the factors discussed above and commit bugginess in a very limited way by considering only proprietary projects [5, 8], projects that do not adopt modern code review practices [12, 16] or a reduced number of factors as well as characteristics to represent them [8, 13, 6]. Our study differs from prior work by providing a more extensive and complete study on the relation between technical, social factors and the introduction of bugs.
This paper investigated the relation between different technical, social factors and the likelihood of developers to introduce bugs. We analyzed a total of bug reports (manually validated) and bug-introducing changes from eight open-source Java projects hosted on GitHub. To understand which factors may be related to the introduction of bugs, we analyzed seven different technical and social factors. First, we investigated how buggy commits differ from clean commits in terms of these factors. Then, we evaluated how strong is the difference between buggy and clean commits. Finally, we evaluated the effect of each factor on commit bugginess when considering the presence of multiple factors.
Our findings show that: (i) both technical and social factors are able to distinguish between buggy and clean commits; (ii) there is an association between an increase in commit bugginess and the developer’s habit of not following technical contribution norms and a high number of previous bugs authored by them; (iii) commits from developers who work mostly on their own code or are focused on management activities are less likely to introduce bugs; and, finally, (iv) a well-established project is less likely to have new bugs. We believe that these findings benefit project managers and code reviewers, since they may want to carefully verify contributions from developers that present factors related to commit bugginess.
As future work, we intend to expand this investigation to more projects of different programming languages and domains. We also intend to assess the importance of contributions outside an analyzed project, to better understand developers’ experience and interactions in a social environment as complex as GitHub. Moreover, we intend to expand our work on commit bugginess and social factors to analyze a wider range of social interactions.
-  J. Tsay, L. Dabbish, and J. Herbsleb, “Influence of social and technical factors for evaluating contribution in github,” 2014.
-  G. Gousios, M.-A. Storey, and A. Bacchelli, “Work practices and challenges in pull-based development: the contributor’s perspective,” in Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 2016, pp. 285–296.
-  J. Śliwerski, T. Zimmermann, and A. Zeller, “When do changes induce fixes?” in ACM sigsoft software engineering notes, vol. 30, no. 4. ACM, 2005, pp. 1–5.
-  G. Gousios, M. Pinzger, and A. v. Deursen, “An exploratory study of the pull-based software development model,” in Proceedings of the 36th International Conference on Software Engineering. ACM, 2014, pp. 345–355.
-  A. Mockus, “Organizational volatility and its effects on software defects,” in Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering. ACM, 2010, pp. 117–126.
-  J. Eyolfson, L. Tan, and P. Lam, “Do time of day and developer experience affect commit bugginess?” in Proceedings of the 8th Working Conference on Mining Software Repositories. ACM, 2011, pp. 153–162.
-  M. L. Bernardi, G. Canfora, G. A. Di Lucca, M. Di Penta, and D. Distante, “Do developers introduce bugs when they do not communicate? the case of eclipse and mozilla,” in Software Maintenance and Reengineering (CSMR), 2012 16th European Conference on. IEEE, 2012, pp. 139–148.
-  C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu, “Don’t touch my code!: examining the effects of ownership on software quality,” in Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering. ACM, 2011, pp. 4–14.
-  D. Posnett, R. D’Souza, P. Devanbu, and V. Filkov, “Dual ecological measures of focus in software development,” in Software Engineering (ICSE), 2013 35th International Conference on. IEEE, 2013, pp. 452–461.
-  P. J. Guo, T. Zimmermann, N. Nagappan, and B. Murphy, “Characterizing and predicting which bugs get fixed: an empirical study of microsoft windows,” in Software Engineering, 2010 ACM/IEEE 32nd International Conference on, vol. 1. IEEE, 2010, pp. 495–504.
-  F. Rahman and P. Devanbu, “Ownership, experience and defects: a fine-grained study of authorship,” in Proceedings of the 33rd International Conference on Software Engineering. ACM, 2011, pp. 491–500.
-  M. Tufano, G. Bavota, D. Poshyvanyk, M. Di Penta, R. Oliveto, and A. De Lucia, “An empirical study on developer-related factors characterizing fix-inducing commits,” Journal of Software: Evolution and Process, vol. 29, no. 1, 2017.
-  P. Thongtanunam, S. McIntosh, A. E. Hassan, and H. Iida, “Revisiting code ownership and its relationship with software quality in the scope of modern code review,” in Proceedings of the 38th international conference on software engineering. ACM, 2016, pp. 1039–1050.
-  L. P. Hattori and M. Lanza, “On the nature of commits,” in Proceedings of the 23rd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 2008, pp. III–63.
-  G. M. Sullivan and R. Feinn, “Using effect size—or why the p value is not enough,” Journal of graduate medical education, vol. 4, no. 3, pp. 279–282, 2012.
-  Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, and N. Ubayashi, “A large-scale empirical study of just-in-time quality assurance,” IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 757–773, 2013.
-  L. Dabbish, C. Stuart, J. Tsay, and J. Herbsleb, “Social coding in github: transparency and collaboration in an open software repository,” in Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work. ACM, 2012, pp. 1277–1286.
-  A. Bacchelli and C. Bird, “Expectations, outcomes, and challenges of modern code review,” in Proceedings of the 2013 international conference on software engineering. IEEE Press, 2013, pp. 712–721.
-  R. Purushothaman and D. E. Perry, “Toward understanding the rhetoric of small source code changes,” IEEE Transactions on Software Engineering, vol. 31, no. 6, pp. 511–526, 2005.
-  A. Mockus and L. G. Votta, “Identifying reasons for software changes using historic databases.” in icsm, 2000, pp. 120–130.
-  N. Dragan, M. L. Collard, M. Hammad, and J. I. Maletic, “Using stereotypes to help characterize commits,” in Software Maintenance (ICSM), 2011 27th IEEE International Conference on. IEEE, 2011, pp. 520–523.
-  K. Nakakoji, Y. Yamamoto, Y. Nishinaka, K. Kishida, and Y. Ye, “Evolution patterns of open-source software systems and communities,” in Proceedings of the international workshop on Principles of software evolution. ACM, 2002, pp. 76–85.
-  J. Marlow, L. Dabbish, and J. Herbsleb, “Impression formation in online peer production: activity traces and personal profiles in github,” in Proceedings of the 2013 conference on Computer supported cooperative work. ACM, 2013, pp. 117–128.
-  S. Kim, T. Zimmermann, K. Pan, E. James Jr et al., “Automatic identification of bug-introducing changes,” in Automated Software Engineering, 2006. ASE’06. 21st IEEE/ACM International Conference on. IEEE, 2006, pp. 81–90.
-  C. Williams and J. Spacco, “Szz revisited: verifying when changes induce fixes,” in Proceedings of the 2008 workshop on Defects in large software systems. ACM, 2008, pp. 32–36.
-  D. A. da Costa, S. McIntosh, W. Shang, U. Kulesza, R. Coelho, and A. E. Hassan, “A framework for evaluating the results of the szz approach for identifying bug-introducing changes,” IEEE Transactions on Software Engineering, vol. 43, no. 7, pp. 641–657, 2017.
-  E. Whitley and J. Ball, “Statistics review 6: Nonparametric methods,” Critical care, vol. 6, no. 6, p. 509, 2002.
-  R. J. Grissom and J. J. Kim, Effect sizes for research: A broad practical approach. Lawrence Erlbaum Associates Publishers, 2005.
-  J. Romano, J. D. Kromrey, J. Coraggio, J. Skowronek, and L. Devine, “Exploring methods for evaluating group differences on the nsse and other surveys: Are the t-test and cohen’s d indices the most appropriate choices.” Citeseer, 2006.
-  M. Torchiano, effsize: Efficient Effect Size Computation, 2017, r package version 0.7.1. [Online]. Available: https://CRAN.R-project.org/package=effsize
-  R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2018. [Online]. Available: https://www.R-project.org/
-  C. F. Dormann, J. Elith, S. Bacher, C. Buchmann, G. Carl, G. Carré, J. R. G. Marquéz, B. Gruber, B. Lafourcade, P. J. Leitão et al., “Collinearity: a review of methods to deal with it and a simulation study evaluating their performance,” Ecography, vol. 36, no. 1, pp. 27–46, 2013.
-  M. Kuhn, “Building predictive models in r using the caret package,” Journal of Statistical Software, Articles, vol. 28, no. 5, pp. 1–26, 2008. [Online]. Available: https://www.jstatsoft.org/v028/i05
-  J. H. McDonald, Handbook of biological statistics, 2009, vol. 2.
-  J. Cohen, “Statistical power analysis for the behavioral sciences. 2nd,” 1988.
-  G. Bavota, B. De Carluccio, A. De Lucia, M. Di Penta, R. Oliveto, and O. Strollo, “When does a refactoring induce bugs? an empirical study,” in Source Code Analysis and Manipulation (SCAM), 2012 IEEE 12th International Working Conference on. IEEE, 2012, pp. 104–113.
-  C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in software engineering. Springer Science & Business Media, 2012.
-  K. Herzig, S. Just, and A. Zeller, “It’s not a bug, it’s a feature: how misclassification impacts bug prediction,” in Proceedings of the 2013 international conference on software engineering. IEEE Press, 2013, pp. 392–401.