Code review is the activity performed by software teams to check code quality, with the purpose of identifying issues and shortcomings (Bacchelli and Bird, 2013). Nowadays, reviews are mostly performed in an iterative, informal, change- and tool-based fashion, also known as Modern Code Review (MCR) (Cohen, 2010). Both open-source and industry software teams employ MCR to check code changes before being integrated in their codebases (Rigby and Bird, 2013). Past research has provided evidence that MCR is associated with improved key software quality aspects, such as maintainability (Morales et al., 2015) and security (di Biase et al., 2016), as well as with less defects (McIntosh et al., 2016).
Reviewing a source code change is a cognitively demanding process and researchers provided evidence that understanding the code change under review is among the most challenging tasks for reviewers (Bacchelli and Bird, 2013). In this light, past studies have argued that code changes that—at the same time—address multiple, possibly unrelated concerns (also known as noisy (Murphy-Hill et al., 2012) or tangled changes (Herzig and Zeller, 2013)) can hinder the review process (Herzig and Zeller, 2013; Kirinuki et al., 2014), by increasing the cognitive load for reviewers. Indeed, it is reasonable to think that grasping the rationale behind a change that spans multiple concepts in a system requires more effort than the same patch committed separately. Moreover, the noise could put a reviewer on a wrong track, thus leading to missing defects (false negatives) or to raising unfounded issues in sound code (false positives in this paper).
Qualitative studies reported that professional developers perceive tangled code changes as problematic and asked for tools to automatically decompose them (Tao et al., 2012; Barnett et al., 2015). Accordingly, change untangling mechanisms have been proposed (e.g., (Tao and Kim, 2015; Dias et al., 2015; Barnett et al., 2015)).
Although such tools are expectedly useful, the effects of change-decomposition on review is an open research problem. Tao and Kim presented the earliest and most significant results in this area (Tao and Kim, 2015), showing that changedecomposition allows practitioners to achieve their tasks better in a similar amount of time.
In this paper, we continue on this research line and focus on evaluating the effects of change-decomposition on code review. We aim at answering questions, such as: Is change-decomposition beneficial for understanding the rationale of the change? Does it have an impact on the number/types of issues raised? Are there differences in time to review? Are there variations with respect to defect lifetime?
To this end, we designed a controlled experiment focusing on pull requests, a widespread approach to submit and review changes (Gousios et al., 2015). Our work investigates whether the results from Tao and Kim (Tao and Kim, 2015) can be replicated, as well as extends the knowledge on the topic. With a Java system as a subject, we asked 28 software developers among professionals and graduate students to review a refactoring and a new feature (according to professional developers (Tao et al., 2012), these are the most difficult to review when tangled). We measure how the partitioning vs. non-partitioning of the changes impacts defects found, false positive issues, suggested improvements, time to review, and understanding the change rationale. We also perform qualitative observations on how subjects conduct the review and address defects or raise false positives, in the two scenarios.
This paper makes the following contributions:
the design of an asynchronous controlled experiment to assess the benefits of change decomposition in code review using pull requests, available for replication (di Biase et al., 2018);
empirical evidence that change decomposition in the pull-based review environment leads to fewer false positives.
The paper proceeds as follows: Section 2 illustrates the related work; Section 3 describes our research objectives; the design of our experiment is described in Section 4; limits of the experiment are discussed in Section 5; results are presented in Section 6; Section 7 reports the discussion based on the results; finally, Section 8 summarizes our study.
2. Related Work
Several studies explore tangled changes and concern separation in code reviews. Tao et al. investigate the role of understanding code changes during the software development process, exploring practitioners’ needs (Tao et al., 2012). Their study outlined that grasping the rationale when dealing with the process of code review is indispensable. Moreover, to understand a composite change, it is useful to break it into smaller ones each concerning a single issue. Rigby et al. empirically studied the peer review process for six large, mature OSS projects, showing that small change size is essential to the more fine-grained style of peer review (Rigby et al., 2014). Kirinuki et al. provide evidence about problems with the presence of multiple concepts in a single code change (Kirinuki et al., 2014). They show that these are unsuitable for merging code from different branches, and that tangled changes are different to review because practitioners have to seek the changes for the specified task in the commit.
Regarding empirical controlled experiments on the topic of code reviews, the most relevant work is by Uwano et al. (Uwano et al., 2006). They use an eye-tracker to characterize the performance of subjects reviewing source code. Their experimentation environment enabled them to identify a pattern called scan, consisting of the reviewer reading the entire code before investigating the details of each line. In addition, their qualitative analysis found that participants who did not spend enough time during the scan took more time to find defects. Uwano’s experiment was replicated by Sharif et al. (Sharif et al., 2012). Their results indicate that the longer participants spent in the scan, the quicker they were able to find the defect. Conversely, review performance decreases when participants do not spend sufficient time on the scan, because they find irrelevant lines.
Even if MCR is now a mainstream process, adopted in both open source and industrial projects, we found only two studies on change partitioning and its benefits for code review. The work by Barnett et al. (Barnett et al., 2015) analyzed the usefulness of an automatic technique for decomposing changesets. They found a positive association between change decomposition and the level of understanding of the changesets. According to their results, this would help time to review as the different contexts are separated. Tao and Kim (Tao and Kim, 2015)
proposed a heuristic-based approach to decompose changeset with multiple concepts. They conducted a user study with students investigating whether their untangling approach affected the time and the correctness in performing review-related tasks. Results were promising: Participants completed the tasks better with untangled changes in a similar amount of time. In spite of the innovative techniques they proposed to untangle code changes and in these promising results, the evaluation of effects of change-decomposition was preliminary.
In contrast, our research focuses on setting up and running an experiment to empirically assess the benefits of changedecomposition for the process of code-review, rather than evaluating the performances of an approach.
3. Motivation and research objectives
3.1. Experiment definition and context
Our analysis of the literature showed that there is only preliminary empirical evidence on how code review decomposition affects its outcomes, relatively to change-understanding, time to completion, effectiveness (i.e., number of defects found), false positives (issues mistakenly identified as defect by the reviewer), and suggested improvements. This motivates us in setting-up a controlled experiment, exploiting the popular pull-based development model, to assess the conjecture that a proper separation of concerns in code review is beneficial to the efficiency and effectiveness of the review.
Pull requests features asynchronous, tool-based activities in the bigger scope of pull-based software development (Gousios et al., 2014). The pull-based software process features a distributed environment where changes to a system are proposed through patch submissions that are pulled and merged locally, rather than being directly pushed to a central repository.
Pull requests are the way contributors submit changes for review in GitHub. Change acceptance has to be granted by other team members called integrators (Gousios et al., 2015). They have the crucial role of managing and integrating contributions and are responsible for inspecting the changes for functional and non-functional requirements. 80% of integrators use pull requests as the means to review changes proposed to a system.
In the context of distributed software development and change integration, GitHub is one of the most popular code hosting sites with support for pull-based development. GitHub pull requests contain a branch from which changes are compared by an automatic discovery of commits to be merged. Changes are then reviewed online. If further changes are requested, the pull request can be updated with new commits to address the comments. The inspection can be repeated and, when the patchset fits the requirements, the pull request can be merged.
3.2. Research questions
The motivation behind modern code review is to find defects and improve code quality (Bacchelli and Bird, 2013). We are interested in checking if reviewers are able to address defects (referenced in this paper as effectiveness). Furthermore, we focus on comments pointing out wrongly reported defects and suggested improvements (which are non-critical non-functional issues). Suggested improvements highlight reviewer participation (McIntosh et al., 2014) and these comments are considered very useful (Bosu et al., 2015).
Our first research question is:
Based on the first research question, we formulate the following nullhypotheses for (statistical) testing:
|Tangled pull requests do not reduce:|
|H0e||the effectiveness of the reviewers during peer-review|
|H0f||the false positives detected by the reviewers during peer-review|
|H0c||the suggested improvements written by the reviewers during peer-review|
Given the structure and the settings of our experimentation, we can also measure the time spent on review activity and defect lifetime. Thus, our next research question is:
For the second research question, we formulate the following nullhypotheses:
|Tangled pull requests do not reduce:|
|H0t1||time to review|
Further details on how we measure time and define defect lifetime are described in Section 4.7.
In our study, we aim to measure whether changedecomposition has an effect on understanding the rationale of the change under review. Understanding the rationale is the most important information need when analyzing a change, according to professional software developers (Tao et al., 2012). As such, the question we set to answer is:
For our third research question, we test the following nullhypotheses:
|Tangled pull requests do not reduce:|
|H0u||change-understanding of reviewers during peer-review|
|when compared to untangled pull requests|
Finally, we qualitatively investigate how participants individually perform the review to understand how they address defects or potentially raise false positives. Our last research question is then:
4. Experimental Design and Method
In this section, we detail how we designed the experiment and the research method that we followed.
4.1. Object system chosen for the experiment
The system that will be considered by the reviews in the experiment is JPacman, an open-source Java system available on GitHub111https://github.com/SERG-Delft/jpacman-framework that emulates a popular arcade game used at the Delft University of Technology to teach software testing. The system has about 3,000 SLOC and was selected because a more complex and larger project would require participants to grasp the rationale of a more elaborate system. In addition, the training phase required for the experiment would imply hours of effort, increasing the consequent fatigue that participant might experience. In the end, the experiment targets assessing differences in review partitioning and is tailored for a process rather than a product.
4.2. Recruiting of the subject participants
The study was conducted with 28 participants recruited by means of convenience sampling (Wohlin et al., 2012) among experienced and professional software developers, PhD, and MSc students. We involved as many different roles as possible to have a larger sample for our study and increase its external validity. Using a questionnaire, we asked development experience, language-specific skills, and review experience as number of reviews per week. We also included a question that asks if a participant knows the source code of the game. Table 1 reports the results of the questionnaire, which are used to characterize our population and to identify key attributes of each subject participant.
4.3. Monitoring versus realism
Designing an experiment requires carefully managing the trade-off between realism and control. In controlled experiments, researchers strive to have the most realistic scenario possible, compromising as little as possible the control of the settings that are put in place.
Due to the nature of pull-based software development and its peer review with pull requests, we established the experimentation phase to be executed asynchronously. This implies that participants can run the experiment when and where they feel most comfortable, with no explicit constraints for place, time or equipment.
With this choice, we are purposefully giving up some degree of control to increase realism. Having a more strictly controlled experimental environment would not replicate the usual way of running such tasks (that is, asynchronous and informal). Besides, an experiment run synchronously in a laboratory would anyway raise some control challenges: It might be distracting for some participants, or even induce some follow the crowd behavior, thus leading to people rushing to finish their tasks.
To regain some degree of control, participants run all the tasks in a provided virtual machine available in our replication package (di Biase et al., 2018). In addition, the screencast of the experiment is recorded, therefore not leaving space to misaligned results and mitigating issues of incorrect interpretation. Subjects were provided with instructions on how to use the virtual machine, but no time window was set.
|Group||# of subjects||Role||FTE Experience||
|6||2 (33 %)||SW Developer||
|3||1 (33 %)||PhD Student|
|5||3 (60 %)||MSc Student|
|6||2 (33 %)||SW Developer||
|3||1 (33 %)||PhD Student|
|5||3 (60 %)||MSc Student|
4.4. Independent variable, group assignment, and duration
The independent variable of our study is change-decomposition in pull requests. We split our subjects between a control group and a treatment group: The control group receives one pull request containing a single commit with all the changes tied together; the treatment group receives two pull requests with changes separated according to a logical decomposition.
Participants are randomly assigned to either the control group or the treatment using strata based on experience as developers and previous knowledge. Previous research has shown that these factors have an impact on review outcome (Rigby et al., 2012; Bosu et al., 2015): Developers who previously made changes to files to be reviewed had a higher proportion of useful comments.
All subjects are asked to run the experiment in a single session so that external distracting factors can be eliminated as much as possible. If a participant needs a pause, the pause is considered and excluded from the final result as we measure and monitor for periods of inactivity. We seek to reduce the impact of fatigue by limiting the expected time required for the experiment to an average of 60 minutes; this value is closer to the minimum rather than the median for similar experiments (Ko et al., 2015). No learning effect is present as every participant runs the experiment only once.
4.5. Pilot experiments
We ran two pilot experiments to assess the settings. The first subject (a developer with 5 FTE222A full-time employee (FTE) works the equivalent of 40 hours a week. We consider 1 FTE-year when a person has worked the equivalent of 40 hours a week for one year. For example, an individual working two years as a developer for 20 hours a week would have 1 FTE-year experience. years of experience) took too long to complete the training and showed some issues with the virtual machine. Consequently, we restructured the training phase addressing the potential environment issues in the material provided to participants. The second subject (a MSc student with low experience) successfully completed the experiment in 50 minutes with no issues. Both pilot experiments were executed asynchronously in the same way as the actual experiment.
4.6. Tasks of the experiment
The participants were asked to conduct the following four tasks. Further details are available in the online appendix (di Biase et al., 2018).
1 - Preparing the environment. Participants are given precise and detailed instructions on how to set-up the environment for running the experiment. These entail installing the virtual machine, setting up the recording of the screen during the experiment, and troubleshooting common problems, such as network or screen resolution issues.
2 - Training the participants. Before starting with the review phase, we first ensure that the participants have a sufficient familiarity with the system. It is likely that the participants had never seen the codebase before: this situation would limit the realism of the subsequent review task.
To train our participants we ask subjects to implement three different
features in the system:
(1) Change the way the player moves on the board game, using different keys,
(2) check if the game board has null squares (a board is made of
multiple squares) and perform this check when the board is created, and
(3) implement a new enemy in the game, with similar artificial intelligence to
another enemy but different parameters.
implement a new enemy in the game, with similar artificial intelligence to another enemy but different parameters.This learning by doing approach is expected to have higher effectiveness than providing training material to participants (Slavin, 1987). By definition, this approach is a method of instruction where the focus is on the role of feedback in learning. The desired features require change across the system’s codebase. The third feature to be implemented targets the classes and components of the game that will be addressed by the review tasks. The choice of using this feature as the last one is to progressively increment the level of difficulty.
No time window is given to participants with the aim of having a more realistic scenario. As explicitly mentioned in the provided instructions, participants are allowed to use any source for retrieving information about something they do not know. This is permitted as the study does not want to assess skills in implementing some functionality in a programming language. The only limitation is that the participants must use the tools within the virtual machine.
The virtual machine provides the participants with the Eclipse Java IDE. The setup already has the project imported in Eclipse’s workspace. We use a plugin in Eclipse, WatchDog (Beller et al., 2015), to monitor development activity. With this plugin, we measure how much time participants spend reading, typing, or using the IDE. The purpose is to quantify the time to understand code among participants and whether this relates to a different outcome in the following phases. Results for this phase are shown in Figure 1, which contains boxplots depicting the data. It shows that there is no significant difference between the two groups.
3 - Perform code review on proposed change(s). Participants are asked to review two changes made to the system: (1) the implementation of the artificial intelligence for one of the enemies (2) the refactoring of a method in all enemy classes (moving its logic to the parent class). These changes can be inspected in the online appendix (di Biase et al., 2018) and have been chosen to meet the same criteria used by Herzig et al. (Herzig et al., 2016)
when choosing tangled changes. Changes proposed can be classified asrefactoring and enhancement. Previous literature gave insight as to how these two kinds of changes, when tangled together, are the hardest to review (Tao et al., 2012). Although recent research proposed a theory for the optimal ordering of code changes in a review (Baum et al., 2017), we use the default ordering and presentation provided by GitHub, because it is the de-fact standard. Changesets are included in pull requests on private GitHub repositories so that participants perform the review in a real-world environment. Pull requests have identical descriptions for both the control and the treatment, with no additional information except their descriptive title. While research shows that a short description may lead to poor review participation (Thongtanunam et al., 2017), this does not apply to our experiment as there is no interaction among subjects.
Subjects were instructed to understand the change and check its functional correctness. We asked the participants to comment on the pull request(s) if they found any problem in the code, such as any functional error related to correctness and issues with code quality. The changes proposed had three different functional issues that were intentionally injected into the source code. Participants could see the source code in case more context was needed, but only through GitHub’s browser-based UI.
The size of the changeset is around 100 lines of code and it involves seven files. Gousios et al. showed that the number of total lines changed by pull requests is on average less than 500, with a median number of 20 (Gousios et al., 2014). Therefore, the number of lines of the changeset used in this study is between the median and the average.
4 - Post-experiment questionnaire. In the last phase participants are asked to answer the questions shown in Table 4. Questions Q1 to Q4 are about change-understanding, while Q5 to Q12 involve subjects’ opinions about changeset comprehension and its correctness, rationale, understanding, etc. Q5 to Q12 are a summary of interesting aspects that developers need to grasp in a code change, as mentioned in the study of Tao et al. (Tao et al., 2012). The answers must be provided in a Likert scale (Oppenheim, 2000) ranging from ‘Strongly disagree’ (1) to ‘Strongly agree’ (5).
4.7. Outcome measurements
Effectiveness, false positives, suggested improvements. Subjects are asked to comment a pull request in the pull request discussion or in-line comment in a commit belonging to that pull request. The number of comments addressing functional issues is counted as the effectiveness. At the same time, we also measure false positives (i.e., comments in pull request that do not address a real issue in the code) and suggested improvements (i.e., remarks on other non-critical non-functional issues). We identify such comments as the three functional defects were intentionally put in the source code. Comments that do not directly and correctly tackle one of these three issues are classified either as false positives or suggested improvements. They are identified by the first author by looking at the description provided by the subject. A correct identified issue needs to highlight the problem, and optionally provide a short description.
Time. Having the screencast of the whole experiment, as well as using tools that give time measures, we gather the following measurements:
Time for Task 2, in particular:
total time Eclipse is [opened/active]
total time the user is [active/reading/typing];
as collected by WatchDog (Section 4.6).
Total net time for Task 3, defined as from when the subject opens a pull request until when (s)he completes the review, purged of eventual breaks.
Defect lifetime, defined as from when the subject opens a pull request to when (s)he writes a comment. For the case of multiple comments on the same pull request, this is the time between finishing with one issue and addressing the next. A similar measure was previously used by Prechelt et al. (Prechelt and Tichy, 1998).
All the above measures are collected in seconds elapsed.
Change-understanding. In this experiment, changeunderstanding is measured by means of a questionnaire submitted to participants post review activity, as mentioned in Task 4 in Section 4.6. Questions are shown in Table 4 from Q1 to Q4. Its aim is to evaluate differences in change-understanding. A similar technique is used by Binkley et al. who adopt a SAT-style questionnaire (Binkley et al., 2013).
Final Survey. Lastly, participants are asked to give their opinion on statements targeting the perception of correctness, understanding, rationale, logical partitioning of the changeset, difficulty in navigating the changeset in the pull request, comprehensibility, and the structure of the changes. This phase, as well as the previous one, is included in Task 4, corresponding to questions Q5 to Q12 (Table 4). Results will be given on a Likert scale from “Strongly disagree” (1) to “Strongly agree” (5) (Oppenheim, 2000)
, reported as mean, median and standard deviation over the two groups, and tested for statistical significance with the Mann-Whitney U-test.
4.8. Research method for RQ4
For our last research question, we aim to build some initial hypotesis to explain the results from the previous research questions. We seek what actions and patterns lead a reviewer in finding an issue or raising false positive, as well as other comments. The method to map actions to concepts starts by annotating the screencasts retrieved after the conclusion of the experimental phase. Subjects perform a series of actions that define and characterize both the outcome and the execution of the review. The first author inserted notes regarding actions performed by participants to build a knowledge base of steps (i.e., participant opens fileName, participant uses GitHub search box with the keyword, etc.).
Using the methodology for qualitative content analysis delineated by Schreirer (Schreier, 2013), we firstly defined the coding frame. Our purpose is having a characterization of the review activity based on patterns and behaviors. As previous studies already tackled this problem and came up with reliable categories, we used the investigations by Tao et al. (Tao et al., 2012) and Sillito et al. (Sillito et al., 2006) as the base for our frame. We used the concepts from Tao et al. (Tao et al., 2012) regarding Information needs for reasoning and assessing the change and Exploring the context and impact of the change, as well as the Initial focus points and Building on initial focus points steps from Sillito et al. (Sillito et al., 2006).
To code the transcriptions, we use the deductive category application, resembling the data-driven content analysis technique by Mayring (Mayring, 2000). We read the material transcribed, checking whether a concept covers that action transcribed (e.g., participant opens file fileName so that (s)he is looking for context). We group actions covered by the same concept (e.g., a participant opens three files, but always for context purpose) and continue until we build a pattern that lead to a specific outcome (i.e., addressing a defect or a false positive). We split the patterns according to their concept ordering such that those that lead to more defects found or false positive issues are visible. Ultimately, we identify the major differences between the two experimental groups.
5. Threats to validity and limitations
Conclusion’s validity. Threats to conclusion validity can be found in the participants selection. If the group is very heterogeneous, there is a risk that the variation due to individual differences is larger than due to the treatment. Our settings put in place countermeasures to mitigate this issue: we do not ask subjects to quantify objective measures (i.e., time to review), as this has a tendency to reduce the reliability of results. The measure chosen to evaluate dependent variables allowed us to assess results in an objective manner, avoiding to give subjective scores. Choosing more homogeneous groups will, on the other hand, affect the external validity (Campbell and Stanley, 2015). Questionnaires were designed using standard ways and scales (Oppenheim, 2000). Statistical tests used in our study were not performed with the Bonferroni correction, following the advice by Perneger: “Adjustments are, at best, unnecessary and, at worst, deleterious to sound statistical inference” (Perneger, 1998).
Internal validity Regarding the instrumentation, artifacts, and environment used in our experimentation, the development environment provided is a standard Linux installation with Eclipse IDE (v. 4.4.2) and the plugin from the work by Beller et al. (Beller et al., 2015). Concerning our participants, they were drawn from a population sample that volunteered to participate. Software developers belong to three software companies, PhD students belong to three universities and MSc students to different faculties despite being from the same university.
Construct validity Construct threats are related to mono-operation bias and mono-method bias (Wohlin et al., 2012). The settings of our experiment are analyzing only the pull-based development process reviews with pull requests, in the frame of a single Java system. This was necessary as, given the structure of our experiment, we wanted to replicate a real-world scenario for this particular process. This allows increasing its external validity.
External validity The most important threat is related to the interaction of the settings and treatment. In fact, one might argue that the system used is not fully representative of a real-world scenario. Our choice, however, as explained in Section 4.1, aims to reduce the training phase effort required from participants and to encourage the completion of the experiment. Furthermore, we evaluate the effects that external environment factors could have. We accept this threat as we run our experiment in an asynchronous environment, even if we believe that our description of the settings, tools, and methods provides adequate countermeasures in this context.
|Group||# of subjects||Total||Median||Mean||Confidence Interval 95%||p-value|
|Effectiveness (defects found)||Control||14||20||1.0||1.42||0.72||(0, 2.85)||0.6|
|False Positives||Control||14||6||0||0.42||0.5||(-0.54, 1.40)||0.03|
|Suggested Improvements||Control||14||7||0||0.5||0.62||(-1.22, 1.22)||0.4|
|Group||# of subjects||Median||Mean||Conf. Interval 95%||p-value|
|Review net time||
|1st defect lifetime||
|2nd defect lifetime||
RQ1. Effectiveness, false positives, and suggestions
For our first research question, descriptive statistics about results are shown in Table2. It contains data about effectiveness of participants (i.e.
, correct number of issues addressed), false positives, and number of suggested improvements. Given the sample size, we applied a non-parametric test and performed a Mann-Whitney U-test to test for differences between the control and the treatment group. This test, unlike a t-test, does not require the assumption of a normal distribution of the samples. Results of the statistical test are intended to be significant for a significance level of 95%.
Results indicate a significant difference between the control and the treatment group regarding the number of false positives, with a p-value of 0.03. On the contrary, there is no statistically significant difference regarding the number of defects found (effectiveness) and number of suggested improvements.
The example of a false positive is when subject C9 writes: “This doesn’t sound correct to me. Might want to fix the for, as the variable varName is never used”. This is not a defect, as varName is used to check how many times the for-statement has to be executed, despite not being used inside the statement. This is also written in a code comment. Another false positive example is provided by subject C1, who, reading the refactoring proposed by the changeset under review, writes: “The method methodName is used only in Class ClassName, so fix this”. This is not a defect as the same methodName
is used by the other classes in the hierarchy. As such, we can reject only the null hypothesis H0f regarding the false positives, while we cannot provide statistically significant evidence about the other two variables tested in H0e and H0c.
The statistical significance alone for the false positives does not provide a measure to the actual impact of the treatment. To measure the effect size of the factor over the dependent variable we chose the Cliff’s Delta (Cliff, 1993)
, a non-parametric measure for effect size. For data with skewed marginal distribution it is a more robust measure if compared to Cohen standardized effect size(Cohen, 1992). The computed value shows a positive (i.e., tangled pull requests lead to more false positives) effect size (), revealing a medium effect.
Given the presence of suggested improvements in our results, we found that the control group writes in total seven, while the participants in the treatment write nineteen. This difference is interesting, calling for further classification of the suggestions. For the control group, participants wrote respectively three improvements regarding code readability, two concerning functional checks, one regarding understanding of source code and one regarding other code issues. For the treatment group, we classified five suggestions for code readability, eight for functional checks and seven for maintainability. Although subjects have been explicitly given the goal to find and comment exclusively functional issues (Section 4.6), they wrote these suggestions spontaneously The suggested improvements are included in the online appendix (di Biase et al., 2018) along with their classification.
RQ2. Review time and defect lifetime
To answer RQ2, we measured and analyzed the time subjects took to review the pull requests, as well as the amount of time they used to fix each of the issues present. Descriptive statistics about results for our second research question are shown in Table 3. It contains data about the time participants used to review the patch, completed by the measurements of how much they took to fix respectively two of the three issues present in the changeset. All measures are in seconds. We exclude data relatively to the third defect as only one participant detected it. To analyze this data, we used the same statistical means described for the previous research question.
When computing the review net time used by the subjects, results show an insignificant difference, thus we are not able to reject null-hypothesis H0t1. This indicates that the average case of the treatment group takes the same time to deliver the review, despite having two pull requests to deal with instead of one. However, analyzing results regarding the defect lifetime we also see no significant difference and cannot reject H0t2. Data show that the mean time to address the first issue is about 14% faster in the treatment group if compared with the control. This is because subjects have to deal with less code that concerns a single concept, rather than having to extrapolate context information from a tangled change. At the same time the treatment group is taking longer (median) to address the second defect. We believe that this is due to the presence of two pull requests, and as such, the context switch has an overhead effect on that. Subjects had to close a pull-request and then review the next, where they need to gain knowledge on different code changes.
RQ3. Understanding The Change’s Rationale
For our third research question, we seek to measure whether subjects are affected in their understanding the rationale of the change by the dependent variable. Rationale understanding questions are Q1 to Q4 (Table 4) and Figure 2 reports the results. Higher scores for Q1, Q2, and Q4 mean better understanding, whereas for Q3 a lower score signifies a correct understanding. As for the previous research questions, we test our hypothesis with a non-parametrical statistical test. Given the result we cannot reject the null hypothesis H0u of tangled pull requests reducing change-understanding. Participants are in fact able to answer the questions correctly, independent of their experimental group.
After the review, our experimentation also provided a final survey (Q5 to Q12 in Table 4) that participants filled in at the end. Results shown in Figure 2 indicate that subjects judge equally incorrect the changeset (Q5), found no difficulty in understanding the changeset (Q6), agree on having understood the rationale behind the changeset (Q7), did not find the changeset hard to navigate (Q9), and believe that the changeset was comprehensible (Q11). Answers to questions Q9 and Q11 are surprising to us, as we would expect dissimilar results for code navigation and comprehension. In fact, changedecomposition should allow subjects to navigate code easier, as well as improve source comprehension. On the other hand, subjects judge differently when asked if the changeset was partitioned according to a logical separation of concerns (Q8), if the relationships among the changes were well structured (Q10) and if the changes were spanning too many features (Q12). These answers are in line with what we would expect, given the different structure of the code to be reviewed. The answers are different with a statististical significance for Q8, Q10 and Q12.
|Questions on understanding the rationale of the changeset|
|The purpose of this changeset entails …|
|Q1||… changing a method for the enemy AI|
|Q2||… the refactoring of some methods|
|Q3||… changing the game UI panel|
|Q4||… changing some method signature|
|Questions on participant’s perception on the changeset|
|Q5||The changeset was functionally correct|
|Q6||I found no difficulty in understanding the changeset|
|Q7||The rationale of this changeset was perfectly clear|
|Q8||The changeset [showed] a logical separation of concerns|
|Q9||Navigating the changeset was hard|
|Q10||The relations among the changes were well structured|
|Q11||The changeset was comprehensible|
|Q12||Code changes were spanning too many features|
RQ4. Tangled vs. Untangled review patterns
|What is the rationale behind this code change? (Tao et al., 2012)||Rationale|
|Is this change correct? Does it work as expected? (Tao et al., 2012)||Correctness|
|Who references the changed classes/methods/fields? (Tao et al., 2012)||Context|
|How does the caller method adapt to the change of its callees? (Tao et al., 2012)||Caller/Callee|
|Is there a precedent or exemplar for this? (Sillito et al., 2006)||Similar/Precedent|
|ID||1st concept||2nd concept||3rd concept||Defect||FP||Defect||FP|
For our last research question, we seek to identify differences in patterns and features during review, and their association to quantitative results. We derived such patterns from Tao et al. (Tao et al., 2012) and Sillito et al. (Sillito et al., 2006). These two studies are relevant as they investigate the role of understanding code during the software development process. Tao et al. (Tao et al., 2012) lay out a series of information needs derived from state-of-the-art research in Software Engineering, while Sillito et al. (Sillito et al., 2006) focus on questions asked by professional experienced developers while working on implementing a change. The mapping found in the screencasts is shown in Table 5.
Table 6 contains the qualitative characterization, ordered by the sum of defects found. Values in each row correspond to how many times a participant in either group used that pattern to address a defect or point to a false positive.
Results indicate that pattern P1 is the one that led to most issues being addressed in the control group (eight), but at the same time is the most imprecise one (three false positives). We stress out the lack of context-seeking concept. Patterns P1 and P3 have most false positives addressed in the control group. In the treatment group, pattern P2 led to more issues being addressed (five), followed by the previously mentioned P1 (four).
Analyzing the transcribed screencasts, we noted an overall trend of reviewing code changes in the control group, exploring the changeset using less context exploration than in the treatment. Among the participants belonging to the treatment, we witnessed a much more structured way of conducting the review. The overall behavior is that of getting the context of the single change, looking for the files involved, called, or referenced by the changeset, in order to grasp the rationale. All of the subjects except three repeated this step multiple times to explore a chain of method calls, or to seek for more context in that same file opening it in GitHub. We consider this the main reason to explain that untangled pull requests lead to more precise (fewer false positives) results.
7.1. Implications for Researchers
In past studies, researchers found that developers call for tool and research support for decomposing a composite change (Tao et al., 2012). For this reason, we were surprised that our experiment was not able to highlight differences in terms of reviewers’ effectiveness (number of defects found) and reviewers’ understanding of the change rationale, when the subjects were presented with smaller, self-contained changes.
If we exclude latent problems with the experiment design that we did not account for, this result may indicate that reviewers are still able to conduct their work properly, even when presented with tangled changes. However, the results may change in different contexts. For example, the cognitive load for reviewers may be higher with tangled changes, thus the negative effects in terms of effectiveness could be visible when a reviewer has to assess a large number of changes every day, as it happens with integrators of popular projects in GitHub (Gousios et al., 2015). Moreover, the changes we considered are of average size and difficulty, yet results may be impacted by larger changes and/or more complex tasks. Finally, participants were not core developers of the considered software system; it is possible that core developers would be more surprised by tangled changes, find them more convoluted or less “natural,” thus rejecting them (Hellendoorn et al., 2015). We did not investigate these scenarios further, but studies can be designed and carried out to determine whether and how these aspects influence the results of the code review effort.
Given the remarks and comments of professional developers on tangled changes (Tao et al., 2012), we found also surprising that the experiment did not highlight any differences in the net review time between the treatment groups. Barring experimental design issues, this result can be explained by the additional context switch, which does not happen in the tangled pull request (control) because the changes are done in the same files. An alternative explanation could be that the reviewers with the untangled pull requests (treatment) spend more time “wondering around” and pinpointing small issues because they found the important defects quicker; this would be in line with the cognitive bias known as Parkinson’s Law (Parkinson and Osborn, 1957) (all the available time is consumed). However, time to find the first and second defects (3) is the same for both experimental groups thus voiding this hypothesis. Moreover, similarly to us, Tao and Kim also did not find difference with respect to time to completion in their preliminary user study (Tao and Kim, 2015). Further studies should be designed to replicate our experiment and, if results are confirmed, to derive a theory on why there is no reduction in review time.
Our initial hypothesis on why time does not decrease with untangled code changes is that reviewers of untangled changes (control) may be more willing to build a more appropriate context for the change. This behavior seems to be backed up by our qualitative analysis (Section 6), through the context-seeking actions that we witnessed for the treatment group. If our hypothesis is not refused by further research, this could indicate that untangled changes may lead to a more thorough low-level understanding of the codebase. Despite we did not measure this in the current study, it may explain why we see less false positives with untangled changes. Finally, untangled changes may lead to better transfer of code knowledge, one of the positive effects of code review (Bacchelli and Bird, 2013).
7.2. Recommendation for Practitioners
Our experiment is not able to show no negative effects when changes are presented as separate, untangled changesets, despite the fact that reviewers have to deal with two pull requests instead of one, with the subsequent added overhead and a more prominent context switch. With untangled changesets, our experiment highligthed an increased number of suggested improvements, more context-seeking actions (which, it is reasonable to assume, increase the knowledge transfer created by the review), and a lower number of wrongly reported issues.
For the aforementioned reasons, we support the recommendation that change authors prepare self-contained, untangled changeset when they need a review. In fact, untangled changeset are not detrimental to code review (despite the overhead of having more pull-requests to review), but we found evidence of positive effects. We expect the untangling of code changes to be minimal in terms of cognitive effort and time for the author. This practice, in fact, is supported by answers Q8, Q10, Q12 to the questionnaire and by comments written by reviewers in the control group (i.e., “Please make different commit for these two features”, “I would prefer having two pull requests instead of one if you are fixing two issues”).
The goal of the study presented in this paper is to investigate the effects of changedecomposition on modern code review (Cohen, 2010), particularly in the context of the pull-based development model (Gousios et al., 2014).
We involved 28 subjects, who performed a review of pull request(s) pertaining to (1) a refactoring and (2) the addition of a new feature in a Java system. The control group received a single pull request with both changes tangled together, while the treatment group received two pull requests (one per type of change). We compared control and treatment groups in terms of effectiveness (number of defects found), number of false positives (wrongly reported issues), number of suggested improvements, time to complete the review(s), and level of understanding the rationale of the change. Our investigation involved also a qualitative analysis of the review performed by subjects involved in our study.
Our results suggests that untangled changes (treatment group) lead to: (1) fewer reported false positives defects, (2) more suggested improvements for the changeset, (3) same time to review (despite the overhead of two different pull requests), (4) same level of understanding the rationale behind the change, (5) and more context-seeking patterns during review.
Our results support the case that committing changes belonging to different concepts separately should be an adopted best practice in contemporary software engineering. In fact, untangled changes not only reduce the noise for subsequent data analyses (Herzig et al., 2016), but also support the tasks of the developers in charge of reviewing the changes by increasing context-seeking patterns.
The authors would like to thank all participants of the experiment and the pilot. We furthermore thank the fellow researchers who gave critical suggestion to help strengthening the methodology of our study. Bacchelli gratefully acknowledges the support of the Swiss National Science Foundation through the SNF Project No. PP00P2_170529.
- Bacchelli and Bird (2013) A. Bacchelli and C. Bird. 2013. Expectations, Outcomes, and Challenges of Modern Code Review. In Proceedings of the 2013 International Conference on Software Engineering (ICSE ’13). IEEE Press, Piscataway, NJ, USA, 712–721.
- Barnett et al. (2015) M. Barnett, C. Bird, J. Brunet, and S.K. Lahiri. 2015. Helping Developers Help Themselves: Automatic Decomposition of Code Review Changesets. In Proceedings of the 37th International Conference on Software Engineering - Volume 1 (ICSE ’15). IEEE Press, Piscataway, NJ, USA, 134–144.
- Baum et al. (2017) T. Baum, K. Schneider, and A. Bacchelli. 2017. On the Optimal Order of Reading Source Code Changes for Review. In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME). 329–340.
- Beller et al. (2015) M. Beller, G. Gousios, A. Panichella, and A. Zaidman. 2015. When, How, and Why Developers (Do Not) Test in Their IDEs. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). ACM, New York, NY, USA, 179–190.
- Binkley et al. (2013) D. Binkley, M. Davis, D. Lawrie, J.I. Maletic, C. Morrell, and B. Sharif. 2013. The impact of identifier style on effort and comprehension. Empirical Software Engineering 18, 2 (2013), 219–276.
- Bosu et al. (2015) A. Bosu, M. Greiler, and C. Bird. 2015. Characteristics of Useful Code Reviews: An Empirical Study at Microsoft. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories. 146–156.
- Campbell and Stanley (2015) D.T. Campbell and J.C. Stanley. 2015. Experimental and quasi-experimental designs for research. Ravenio Books.
- Cliff (1993) N. Cliff. 1993. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin 114, 3 (1993), 494.
- Cohen (1992) J. Cohen. 1992. Statistical power analysis. Current directions in psychological science 1, 3 (1992), 98–101.
- Cohen (2010) J. Cohen. 2010. Modern Code Review. In Making Software, Andy Oram and Greg Wilson (Eds.). O’Reilly, Chapter 18, 329–338.
- di Biase et al. (2016) M. di Biase, M. Bruntink, and A. Bacchelli. 2016. A Security Perspective on Code Review: The Case of Chromium. In 16th IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2016, Raleigh, NC, USA, October 2-3, 2016. IEEE Press, 21–30.
- di Biase et al. (2018) M. di Biase, M. Bruntink, A. van Deursen, and A. Bacchelli. 2018. The effects of change-decomposition on code review - A Controlled Experiment - Online appendix. https://doi.org/10.4121/uuid:826f7051-35f6-4696-b648-8e56d3ea5931
- Dias et al. (2015) M. Dias, A. Bacchelli, G. Gousios, D. Cassou, and S. Ducasse. 2015. Untangling fine-grained code changes. In Proceedings of the 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER 2015). IEEE Computer Society, 341–350.
- Gousios et al. (2014) G. Gousios, M. Pinzger, and A. van Deursen. 2014. An exploratory study of the pull-based software development model. Proceedings of the 36th International Conference on Software Engineering - ICSE 2014 May 2014 (2014), 345–355.
- Gousios et al. (2015) G. Gousios, A. Zaidman, M. Storey, and A. van Deursen. 2015. Work Practices and Challenges in Pull-based Development: The Integrator’s Perspective. In Proceedings of the 37th International Conference on Software Engineering - Volume 1 (ICSE ’15). IEEE Press, Piscataway, NJ, USA, 358–368.
- Hellendoorn et al. (2015) V. J. Hellendoorn, P. T. Devanbu, and A. Bacchelli. 2015. Will they like this? Evaluating code contributions with language models. In Proceedings of the 12th Working Conference on Mining Software Repositories. IEEE Press, 157–167.
- Herzig et al. (2016) K. Herzig, S. Just, and A. Zeller. 2016. The impact of tangled code changes on defect prediction models. Empirical Software Engineering 21, 2 (2016), 303–336.
- Herzig and Zeller (2013) K. Herzig and A. Zeller. 2013. The impact of tangled code changes. In Mining Software Repositories (MSR), 2013 10th IEEE Working Conference on. IEEE, 121–130.
- Kirinuki et al. (2014) H. Kirinuki, Y. Higo, K. Hotta, and S. Kusumoto. 2014. Hey! Are You Committing Tangled Changes?. In Proceedings of the 22nd International Conference on Program Comprehension (ICPC 2014). ACM, New York, NY, USA, 262–265.
- Ko et al. (2015) A.J. Ko, T.D. LaToza, and M.M. Burnett. 2015. A practical guide to controlled experiments of software engineering tools with human participants. Empirical Software Engineering 20, 1 (2015), 110–141.
- Mayring (2000) P. Mayring. 2000. Qualitative content analysis. Forum: Qualitative Social Research (2000).
- McIntosh et al. (2014) S. McIntosh, Y. Kamei, B. Adams, and A.E. Hassan. 2014. The Impact of Code Review Coverage and Code Review Participation on Software Quality: A Case Study of the Qt, VTK, and ITK Projects. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR 2014). ACM, New York, NY, USA, 192–201.
- McIntosh et al. (2016) S. McIntosh, Y. Kamei, B. Adams, and A. E. Hassan. 2016. An empirical study of the impact of modern code review practices on software quality. Empirical Software Engineering 21, 5 (2016), 2146–2189.
- Morales et al. (2015) R. Morales, S. McIntosh, and F. Khomh. 2015. Do code review practices impact design quality? A case study of the Qt, Vtk, and Itk projects. In Proceedings of the 22nd International Conference on Software Analysis, Evolution and Reengineering (SANER 2015). IEEE, 171–180.
- Murphy-Hill et al. (2012) E. Murphy-Hill, C. Parnin, and A.P. Black. 2012. How we refactor, and how we know it. IEEE Transactions on Software Engineering 38, 1 (2012), 5–18.
- Oppenheim (2000) A.N. Oppenheim. 2000. Questionnaire design, interviewing and attitude measurement. Bloomsbury Publishing.
- Parkinson and Osborn (1957) C. N. Parkinson and R. C. Osborn. 1957. Parkinson’s law, and other studies in administration. Vol. 24. Houghton Mifflin Boston.
- Perneger (1998) T. V. Perneger. 1998. What’s wrong with Bonferroni adjustments. British Medical Journal 316, 7139 (1998), 1236.
- Prechelt and Tichy (1998) L. Prechelt and W.F. Tichy. 1998. A controlled experiment to assess the benefits of procedure argument type checking. IEEE Transactions on Software Engineering 24, 4 (1998), 302–312.
- Rigby et al. (2012) P. Rigby, B. Cleary, F. Painchaud, M. Storey, and D. German. 2012. Contemporary peer review in action: Lessons from open source development. IEEE software 29, 6 (2012), 56–61.
- Rigby et al. (2014) P. Rigby, D.M. German, L. Cowen, and M. Storey. 2014. Peer Review on Open-Source Software Projects. ACM Transactions on Software Engineering and Methodology 23, 4 (2014), 1–33.
- Rigby and Bird (2013) P. C. Rigby and C. Bird. 2013. Convergent Contemporary Software Peer Review Practices. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2013). ACM, 202–212.
- Schreier (2013) M. Schreier. 2013. Qualitative Content Analysis. In The SAGE handbook of qualitative data analysis. SAGE, 170–183.
- Sharif et al. (2012) B. Sharif, M. Falcone, and J.I. Maletic. 2012. An eye-tracking study on the role of scan time in finding source code defects. In Proceedings of the Symposium on Eye Tracking Research and Applications. ACM, 381–384.
- Sillito et al. (2006) J. Sillito, G.C. Murphy, and K. De Volder. 2006. Questions programmers ask during software evolution tasks. In Proceedings of the 14th ACM SIGSOFT international symposium on Foundations of software engineering. ACM, 23–34.
- Slavin (1987) R.E. Slavin. 1987. Mastery learning reconsidered. Review of educational research 57, 2 (1987), 175–213.
- Tao et al. (2012) Y. Tao, Y. Dang, T. Xie, D. Zhang, and S. Kim. 2012. How Do Software Engineers Understand Code Changes?: An Exploratory Study in Industry. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE ’12). ACM, New York, NY, USA, Article 51, 11 pages.
- Tao and Kim (2015) Y. Tao and S. Kim. 2015. Partitioning composite code changes to facilitate code review. In Proceedings of the 12th Working Conference on Mining Software Repositories. IEEE, 180–190.
- Thongtanunam et al. (2017) P. Thongtanunam, S. McIntosh, A. E. Hassan, and H. Iida. 2017. Review participation in modern code review. Empirical Software Engineering 22, 2 (2017), 768–817.
- Uwano et al. (2006) H. Uwano, M. Nakamura, A. Monden, and K. Matsumoto. 2006. Analyzing individual performance of source code review using reviewers’ eye movement. In Proceedings of the 2006 symposium on Eye tracking research & applications. ACM, 133–140.
- Wohlin et al. (2012) C. Wohlin, P. Runeson, M. Höst, M.C. Ohlsson, B. Regnell, and A. Wesslén. 2012. Experimentation in software engineering. Springer Science & Business Media.