Code smells are sub-optimal design decisions [fowler:1999]. Our research community has found empirical evidence of the negative impact of code smells on software maintainability [palomba2017diffuseness, olbrich2009evolution, khomh2012exploratory, Spadini:icsme18], program comprehensibility [abbes2011empirical], and development effort [sjoberg2013quantifying], as well as on the problems that developers encounter when dealing with these smells [tufano2017when, tufano2016empirical, peters2012evaluating, bavota2015experimental, palomba2017scent, palomba2018beyond].
At the same time, our community has found that code smells are generally not a primary concerns for practitioners [YamashitaM12, yamashita2013developers, palomba2014they, taibi2017developers, silva2016we, palomba2017exploratory, kim2014empirical]. Morales et al. reported an exception to this behavior [morales2015code]: They found initial evidence suggesting that developers may be concerned with code smells, but only during code review (a development phase not investigated by previous studies). Analyzing Qt, VTK, and ITK projects, Morales et al. found that the classes that receive better code review treatment are less prone to the occurrence of code smells. Indeed, this finding is in line with previous research that has shown that developers during code reviews are not only worried about finding defects but also about general code improvements and design quality [bacchelli:2013]. Interestingly, the existence of a relation between code review and code smells could indicate an important context in which researchers on code smells can focus their attention (e.g., to have the most impact with their tools on detecting and refactoring smells).
The method used by Morales et al. [morales2015code], however, did not allow the authors to establish a causal relationship between code review and code smells. Despite the authors carefully considered confounding factors with their models (e.g., size, complexity, ownership), it is impossible to exclude that other factors may be the actual cause of the measured relationship.
In this study, we continue on this line of work and propose a study whose goal is to investigate the effect of code review on code smells. We use a novel research method to reach stronger conclusions about the causality of the relationship between code review and code smells: (1) we consider a fine-granularity (i.e., file-level, as opposed to component level) (2) compare the code smells in each file before it enters a code review vs. after it exits from it (as opposed to looking at aggregate values, e.g., on code review coverage), and (3) manually inspect cases in which the severity of code smells decreases in a review.
Overall, we (1) collect data of 21,879 code reviews from seven Java software projects, (2) measure any decrease in severity concerning six selected code smells in the files under review, (3) use statistical analysis to investigate the relationship between any severity decrement to the activity in the corresponding code review, (4) assess whether specific code smells are more often decreasing in severity within a code review, and (5) manually analyze 365 cases in which the severity of a code smell in a file decreased in a code review.
Our results (1) indicate that active and participated code reviews have a higher chance of reducing code smells, (2) fail to confirm a relationship between what developers reported as more critical code smells in previous studies [palomba2014they, taibi2017developers] and code review activity, and (3) reveal that most of the changes in code smells are not due to a direct comment on design issues, rather a side effect of code changes requested by the reviewers on matters unrelated to code smells.
Overall, the main contributions of this work include:
Empirical evidence, based on a large-scale study, that active and participated code reviews have the potential of reducing the impact of code smells on reviewed files;
Empirical evidence on the types of code smells whose severity is most likely to be reduced because of code review, which includes the discussion of contrasting results with respect to previous findings achieved in the field of code smell perception and refactoring;
A classification clarifying the causes that led to code smell severity decrements, whose results highlight that only rarely reviewers actively stimulate the discussion on how to make the design of a class more functional;
An online appendix [appendix], including all data and scripts used to conduct our study.
Ii Background and related work
In this section, we report the background information on the topic of interest and connected related work.
Ii-a Related Work
In the following we contextualize our work with respect to literature on code review and code smells.
Code Review. The research community has extensively investigated how code review activities, particularly participation and coverage of the process, influence source code quality. Abelein et al. [Abelein2015] demonstrated the impact of user participation and involvement on system success. Thongtanunam et al. [Thongtanunam2015, thongtanunam2016revisiting] found that reviewing participation and expertise can decrease defect-proneness, while Rigby et al. [Rigby2014] reported that review participation is the most influential factor influencing code review efficiency. Moreover, studies indicated that patches should be reviewed by at least two developers to maximize fault identification [porter1998comparing, Sauer2000, kemerer2009impact, Rigby2013, kononenko2015investigating, di2016security, Thongtanunam2016, kononenko2016code].
Bavota and Russo found that patches with a low number of reviewers tend to have a higher chance of inducing new defects [Bavota2015]. Furthermore, McIntosh et al. [mcintosh2014impact, Mcintosh2016] measured the extent to which the proportion of reviewed patches and the amount of participation in a module impact software quality, finding that a low percentage has a negative impact on code quality. Nevertheless, these findings were later partially contradicted by Sadowski et al. [sadowski2018modern] in the case of Google. Ram et al. [ram2018makes] further suggested that reviewers should be involved in the inspection of small patches to be more efficient, while Pascarella et al. [pascarella2018information] identified a set of information needs that would help developers while reviewing code.
Mäntylä and Lassenius [mantyla2009types] conducted a study aimed at investigating what types of defects are discovered in code review, finding that, in addition to functional defects, code reviews find many evolvability defects, (i.e., issues related to the design and non-functional quality of the code). Later, Beller et al. [beller2014modern], as well as Bacchelli and Bird [bacchelli:2013], confirmed similar results in different contexts.
Practitioners And Code Smells. Researchers have investigated whether and how developers perceive code smells as an issue. Arcoverde et al. [Arcoverde:WRT2011] studied how practitioners react to the presence of code smells in the source code. Their primary finding indicates that code smells tend to live in the source code for a long time because developers avoid refactoring to prevent unexpected consequences. In an industrial environment, Yamashita and Moonen [yamashita2013developers] reported the results of an empirical study aimed at evaluating the code smell severity perceived by developers of commercial software. The authors found that 32% of the interviewed developers overlook code smells; therefore, the removal is not in their priority. Confirming this trend, Sjøberg et al. [sjoberg2013quantifying] found that code smells pose only secondary design problems, thus justifying how developers tend to ignore their presence. Opposite findings were reported by Palomba et al. [palomba2017scent], who found that the previous observation was not entirely correct when considering a subclass of textual code smells only. Developers tend to diagnose better design problems characterized by textually-detectable code smells rather than those of structurally-detectable ones. These results are aligned with a previous study where Palomba et al. [palomba2014they] surveyed a pool of practitioners belonging to three different open source systems and reported that developers consider code smells associated with complex and lengthy source code harmful for maintainability. At the same time, whether the surveyed developers perceived the other code smells as a problem strongly depended on the severity of the problem.
Ii-B Comparison with Morales et al. [morales2015code]
Morales et al. [morales2015code] are the first who investigated the relation between code review and code smells. They found that software components with limited review coverage and participation are more prone to the occurrence of code smells compared to components whose review process is more active.
Our study has been greatly inspired by Morales et al. [morales2015code]. We aim to determine a more reliable causal relationship between code review and code smells, by using a novel method. In the following we detail the differences between our study and the one by Morales et al. [morales2015code], by considering the scenario in a software system’s evolution depicted in Figure 1.
A scenario of a software system’s evolution. In Figure 1, the horizontal line represents an interval in the evolution of a software system: This interval depicts three commits (), two code reviews ( and ) and a software system’s release (, represented by the vertical red line). First, a developer modifies two files (A and B) that belong to the same component (foo) based on the current state of the main branch and submits the patch (first-patch) for review (). The review includes three reviewers who provide two comments over the course of an unspecified number of iterations. During the review, the patch is changed until the reviewers are eventually satisfied with the final state of the patch (last-patch) and allow the author to merge it into the main branch of the system (commit ). In commit , a developer commits changes to foo.A to the main branch directly, without a review. Finally, a developer takes the current state of the main branch (i.e., as it is after ) and modifies foo.B; then, the changed file (first-patch) is sent to review (); and the final version of the file after the review loop (last-patch) is committed () to the main branch.
Approach by Morales et al. [morales2015code]. Considering the aforementioned scenario, the approach by Morales et al. considers one of the available releases () and takes all the released source code files (in our case, we focus on foo.A and foo.B). Subsequently, the authors compute the code smells for these released files, aggregate the information at component (i.e., package) level, and consider the result as their dependent variable. Furthermore, the authors compute control variables in the form of product (e.g.
, lines of code, at the moment of the release) and process (e.g., ownership and churn, over the history) metrics [rahman2013and]. Finally, Morales et al. compute review metrics regarding coverage (e.g., the proportion of changes that have been reviewed) and participation (e.g., the proportion of changes without discussion) as independent variables. Following this approach, they consider the whole set of changes and commits that affect the evolution of the project between two releases—including the ones where developers did not perform a code review—to compute code review metrics. Although this effect is balanced out by two independent variables (i.e., ‘proportion of reviewed changes’ and ‘proportion of reviewed churn’), this methodology allowed the authors to only obtain a coarse-grained overview of the relationship between code review practices and software design quality. For instance, in the scenario of Figure 1, changes that do not pass through code review (e.g., foo.A in ) may increase/decrease the number of code smells independently from code review activities.
Our proposed alternative. In our study, we propose a novel method to better capture the causal link between code review and code smells. First, we consider classes (i.e., files) instead of components (i.e., directories), and we only consider code changes inspected in a code review. Then, we compute code smell severity before vs. after each code review (i.e., we compute code metrics considering only changes in the first-patch and the last-patch). As depicted in Figure 1, our dependent variable is named isDecreased: it indicates whether the severity of a smell has decreased after the code review and is computed computing the difference between the severity of the smell before and after a code review. Changes like the one leading to are not considered by our approach, because they have no direct relationship with any code review activity.
Our goal is to investigate the causal link between code review and code smells, with the purpose of understanding whether the inspection activities performed by developers reduce the severity of code smells. The perspective is of both researchers and practitioners, who are interested in gaining a deeper understanding of the effects of code review.
Iii-a Research Questions
First, we investigate to what extent the severity of code smells is reduced by the code review process:
Then, we analyze whether some types of code smells exhibit a higher likelihood to be reduced with a code review. Since past studies found that developers perceive certain code smells as more relevant [palomba2014they, taibi2017developers], we investigate whether these smells are removed with a higher likelihood.
Finally, we aim to qualitatively understand why the code smell severity changes with a code review (e.g., do the reviewers explicitly request the author to manage code smells?):
Iii-B Context of the Study
To collect data for our study, we leverage CROP (Code Review Open Platform), i.e., a publicly available platform and dataset developed by Paixao et al. [Paixao2018]. This dataset is ideal for our research goals, because it (1) stores all review details, (2) contains a large amount of data on projects with different size and scope and developed by different communities, and (3) is fully compliant with the European Union General Data Protection Regulation [voigt2017eu] (randomly generated data replace real names and email addresses of developers).
Subject systems. Among the 11 software systems available in CROP, we focus on the seven ones written in Java.111Our research community has more widespread and better validated code smells detection strategies for Java than for other languages (e.g., those defined by Lanza and Marinescu [Lanza:2006]), thus we prefer Java to avoid uncertainties caused by instruments that have not been previously evaluated. The first two columns of Table I report information on the subject software systems, including their names and the time span for which we have data.
Subject code reviews. The dataset available in CROP includes the whole suite of patch-sets with the related discussions for each code review in the considered systems. It also gives easy access to the first-patch and last-patch sets (Figure 1) that represent a two-fold snapshot of the source code before and after the code review. Having this information at disposal is a key advantage for our study, because we can measure the severity of code smells before vs. after each code review, thus we can address our research questions. The last two columns of Table I report the considered code review data.
Subject code smells. We consider six different types of code smells, which are reported in Table II. We selected these code smells because: (1) they have different characteristics (e.g., complexity vs. design compliance), thus allowing us to analyze a wider range of design issues that frequently occur in real software systems [palomba2017diffuseness], and (2) they have already been considered by previous work concerning their relevance for developers [palomba2014they, taibi2017developers, yamashita2013developers], therefore they can be used to assess the role of developers’ perception within code review activities.
|System||Time Span||Reviews||Reviewed Files|
|couchbase-java-client||01-12 to 11-17||916||4,479|
|couchbase-jvm-core||04-14 to 11-17||841||4,217|
|eclipse.linuxtools||06-12 to 11-17||4,129||23,088|
|eclipse.platform.ui||02-13 to 11-17||4,756||52,441|
|egit||10-09 to 11-17||5,336||18,647|
|jgit||09-09 to 11-17||5,382||23,037|
|spymemcached||05-10 to 07-17||519||2,777|
|Complex Class [brown1998antipatterns]||Classes exhibiting a high cyclomatic complexity, being therefore hard to test and understand [abbes2011empirical, palomba2017diffuseness]||Y [palomba2014they, taibi2017developers]|
|Functional Decomposition [fowler:1999]||Classes where object-oriented constructs are poorly used, declaring and implementing few methods [fowler:1999].||N [taibi2017developers]|
|God Class [fowler:1999]||Classes that are poorly cohesive, usually characterized by large size and implementing several functionalities [fowler:1999].||Y [palomba2014they, palomba2017scent]|
|Inappropriate Intimacy [fowler:1999]||Classes having a too high coupling with another class of the system [fowler:1999].||N [palomba2014they, taibi2017developers]|
|Lazy Class [fowler:1999]||Classes generally composed of few lines of code and that have few relationships with other classes of the system [fowler:1999]. According to Fowler [fowler:1999], this represents a problem because the time on maintaining these classes might be not worthy.||N [palomba2014they]|
|Spaghetti Code [fowler:1999]||Classes without a structure that declare long methods without parameters [fowler:1999].||Y [palomba2014they, taibi2017developers]|
Iii-C Detection Of Code Smells And Their Severity
A key measure in our study is the variation in the severity of the code smells in classes before they enter the review process and after they exit from it. In other words, we are interested in deriving relative measures such as the coefficient of variation for the severity between two events rather than obtaining an absolute value [saaty1993relative]. To that end, we first need to detect the instances of the six considered code smells. We use a metric-based code smell identification strategy [pecorelli2019comparing] and, particularly, detection rules inspired to those defined by Lanza and Marinescu [Lanza:2006]: we compute object-oriented metrics [CKMetrics] characterizing the classes that were subject to code review under different angles such as size, complexity, and coupling, then define thresholds that discriminate code artifacts affected (or not) by specific types of code smells.
After the code smell identification phase, we compute the code smell severity variation () by comparing the extent to which the smelly instance exceeds the given threshold before and after a code review, as in the following formula:
Negative values of indicate that the severity of a code smell decreased as a result of the code review process, while positive values mean that it increased.
Thresholds. The detection mechanism applied can output some false positives, i.e.
, classes erroneously classified as code smells: should the number of false positives be high, the results of our study would be compromised. To mitigate this issue, we initially select the detection rules used in previous literature[fernandes2016review]; this also includes the thresholds adopted to discriminate between smelly and non-smelly artifacts [garcia2016improved, ferenc2014source, lopez2005relevance, fontana2015towards]. After selecting the detection rules, we manually analyzed some code smell instances given as output of the detection. Therefore, we realized that the thresholds were too relaxed (resulting in too many false positives). To solve this problem, we applied more strict rules, i.e., we increased the thresholds for the cyclomatic complexity, or LOC, etc. This tuning allowed us to capture slight variations of reducing false positives. Our appendix reports the entire list of our thresholds [appendix].
Iii-D Rq. Methodology
|Product||Complexity||The Weighted Method per Class metric computed on the version of the class before and after the review.||Classes with high complexity are potential candidates to be refactored.|
|Size||The Lines of Code metric computed on the version of the class before and after the review.||Large classes are hard to maintain and can be more prone to be subject of refactoring [bavota2015experimental].|
|PatchSize||Total number of files being subject of review.||Large patches can be more prone to be analyzed with respect to how the involved classes are designed.|
|Process||Churn||Sum of the lines added and removed in the version of the class before and after the review.||Classes having code smells are more change-prone [khomh2012exploratory, palomba2017diffuseness].|
|PatchChurn||Sum of the lines added or removed in all the classes being subject of review.||Large classes are hard to maintain and can be more prone to be subject of refactoring.|
|Participation||Reviewers||Number of reviewers involved in the code review.||Classes having design issues may need more reviewers to be assessed.|
|SelfApproved||Whether the submitted patch is only approved for integration by the original author.||The original author already believes that the code is ready for integration, hence the patch has essentially not been reviewed.|
|ReviewDuration||Time spent to perform the code review.||Classes having design issues may enforce developers to spend more time discussing refactoring options.|
|File coverage||PercentageComments||Proportion of comments on the class under review to total number of comments.||Classes receiving less attention (such as comments) may display design issues.|
|PercentageWords||Proportion of words in the comments on the class to total number of words in all the comments.||Classes receiving less attention (such as number of words) may display design issues.|
|Activity||Comments||Number of comments made by reviewers on the class under review.||Classes having design issues can create more discussion among the reviewers on how to refactor them.|
|Words||Number of words in the comments made by reviewers on the class under review.||Classes having design issues can create deeper discussion among the reviewers on how to refactor them.|
To address RQ, we investigate whether the decrement in the severity of the code smells in the classes submitted for code review vs. the same classes after the code review is influenced by metrics related to engagement and activity in the code review, while controlling for other characteristics. The following subsections report the methodology adopted to define and compute dependent, independent, and control variables as well as the statistical modeling we used. The granularity of our model is class-level: We compute each variable for every class that enters a review and exits from it (e.g., as depicted in the bottom table in Figure 1).
- Dependent Variable.
Our dependent variable is a binary variable‘isDecreased’ that takes the value ‘True’ if the severity of at least one code smell (existing in the file before the review) is decreased after the review and ‘False’ otherwise.
- Control Variables.
Some factors may affect the outcome if not adequately controlled. Since we were inspired by the study of Morales et al. [morales2015code], whenever feasible, we have used control variables as similar as possible to the ones used in the study by Morales et al. [morales2015code]. The first five rows of Table III describe the families of variables we use as control variables in our statistical model, (i.e., ‘Product’ and ‘Process’ metrics [rahman2013and]).
- Independent Variables.
Not all code reviews are done with the same care. Indeed, past research has defined proxy metrics to define engagement and participation in code review and has shown that higher values in these metrics related to a better outcome for a code review [mcintosh2014impact]. In this study, we proceed similarly: we selected as independent variables a set of metrics to capture the engagement as well as activity during code review and we expect that code review with higher corresponding values will lead to a stronger effect on code smell reduction. The second part of the rows in Table III describes the metrics that we consider as independent variables: They belong to the three categories of metrics related to code review activity and engagement, i.e., ‘Participation’, ‘File coverage’, and ‘Activity’. The variables in the ‘Participation’ category are computed at review level (e.g., they are equals among all the files in the same code review), while the variables in ‘File coverage’ and ‘Activity’ are computed at file level.
- Statistical Modeling.
Before building our models, we ensure that our explanatory variables are not highly correlated with one another using Spearman rank correlation tests (). We choose a rank correlation, instead of other types of correlation (e.g.
, Pearson), because it is resilient to data that is not normally distributed. We consider a pair of variables highly correlated when, and only include one from the pair in the model, favoring the simplest.
Once we computed the variables of our study, we run binary logistic regression
binary logistic regression[judge1982introduction] of the dependent variable:
Here represents the explained binary value (‘isDecreased’) for a file in a review ,
represents the log odds of being a code smell severity reduced in a review, while parameters, , , etc. represent the differentials in the log odds of being a code smell severity reduced for a file reviewed in a review with characteristics , , , etc. Moreover, we run multilevel models to take into considerations that the reviewed files are nested into different projects (level 1), as neglecting a multilevel structure may lead to biased coefficients [robinson2009ecological].
On the basis of the statistically significant codes given by the logistic regression model, we address RQ by verifying the significance of our independent variables, i.e., the code review coverage, participation, and activity metrics.
Iii-E Rq. Methodology
In the second research question, we aim at understanding the influence of code review on specific types of code smells. We also investigate whether the types of code smell previously shown to be perceived as design problems by developers [palomba2014they, taibi2017developers] are treated differently. In the context of RQ, we exclude all the classes in which more than one code smell was detected. This is done to account for the observation-bias that might arise when performing analyses on classes affected by more smells at the same time: indeed, the perception of a specific type of smell can be biased if another design flaw co-occurs [abbes2011empirical, yamashita2013developers, palomba2018large].
We first report the number of code smells for each type (e.g., ‘Complex Class’) in our dataset: this helps us initially observe whether any trend in the observations exists or whether we should expect that the type of smell and the reported perception of the developers do not represent relevant factors for the reduction of the severity.
Subsequently, similarly to RQ, we verify if the statistical significance value given by the Chi-square test remains stable when controlling for additional factors. Therefore, we build a new binary regression model using the same dependent variable (i.e., ‘decreased’ or ‘not decreased’) as well as the same explanatory variables shown in Table III
, to which we add a categorical variable representing the type of the smell (e.g., ‘God Class’). Finally, we address RQ by verifying the significance of these variables: should they turn to be significant, it would indicate that the type of smell is related to the code smell severity reduction. We also analyze whether the significant smells (if any) are those reported as more important/perceived by developers when surveyed in previous literature.
Iii-F Rq. Methodology
|System||Inspected Reviews||Inspected Files|
To answer RQ, we conduct a qualitative analysis of reviewers’ discussions. We consider the code reviews in which we observe a variation in code smell severity (as done with RQ, we exclude code smell co-occurrences to avoid biases in our observations). Specifically, we perform an open card sort [nelson1976modified] that involves two of the authors of this paper (a graduate student and a research associate - who have more than seven years of programming experience each). Hereafter, we refer to them as the inspectors. An open card sort is a well-established technique used to extract salient themes and it allows the definition of taxonomies from input data [nelson1976modified]. In our case, first we used it to organize reviewers’ discussions and classify whether the variation in code smell severity is due to (i) a specific request of the reviewers (i.e., a reviewer explicitly asks the author to modify the code to reduce its smelliness) or (ii) a side-effect of the reviewers’ requests (i.e., no smell/design issue was mentioned, thus the smelliness was reduced as a consequence of other types of comments made by the reviewers).
In the cases in which the smelliness is reduced because of an explicit request of the reviewers, we further verify whether this is due to refactoring. In remaining cases, we classify the side-effects leading to a variation of code smell severity, with the aim of understanding if there are particular reviewers’ requests connected to the phenomenon of interest. To perform such an analysis, we extract a random statistically significant sample of 365 cases222This corresponded to analyzing 262 distinct reviews. Some reviews, in fact, contained more than one file with a decreased code smell severity. in which the severity of a code smell in a file was reduced with a review from our dataset. Table IV reports the number of files and reviews inspected, by system. Afterwards, the process is conducted by two authors of this paper. They started by creating the list of 365 cases to analyze. For each case, the list reported the file path and the link to the online code review. Then, each inspector independently analyzed half of the list. After inspecting the first few cases, the two inspectors opened a discussion on some unclear cases to reach a consensus on the names and types of both code smell symptoms and side-effects triggering a variation of the severity. In a few other cases, the other authors of the paper also participated in the discussion to validate the operations done by the two inspectors and suggest possible improvements.
Iii-G Threats to Validity
- Construct validity.
We employed a heuristic-based code smell detector that computes code metrics and combines them to identify instances of the six considered design flaws. Such heuristics are inspired by those defined by Lanza and Marinescu[Lanza:2006]. After selecting the detection rules, we manually analyzed some code smell instances given as output of the detection and we applied more strict rules to reduce the number of false positives, which we deemed as too high. To allow replication of our study, the detector is publicly available in our online appendix [appendix].
As a proxy to measure reviewers’ perception of code smells (RQ), we considered whether a reviewer explicitly referred to a symptom of a smell within a discussion. To reduce the subjectivity of the classification, some cases where jointly evaluated by two inspectors.
Finally, we relied on the reviewers’ discussions to understand the reasons behind the variation of code smell severity. To try to mitigate the risk that developers may discuss design decisions through other channels [guzzi2013communication], we select software systems whose developers are mostly remote [Bavota2015, mcintosh2014impact, Mcintosh2016] and that have a large number of code review data (thus indicating that developers actively use the code review platform).
- Conclusion validity.
To ensure that the selected binary logistic regression models (RQ and RQ) were appropriate for the data taken into account, we performed a number of preliminary checks such as: (1) we ran a multilevel regression model [raudenbush2002hierarchical] to account for the multilevel structure determined by the presence of data from different projects; and (2) we built the models by adding the selected independent variables step-by-step and found that the coefficients remained stable, thus further indicating little to no interference among the variables. Furthermore, in our statistical model, we control for product and process metrics, which have been shown by previous research to be correlated to code smells [khomh2012exploratory, palomba2017diffuseness].
- External validity.
We conducted our study on the code review data belonging to a set of seven systems having different size and scope. Even though this size compares well with other experiments done in code review research [Bavota2015, mcintosh2014impact, Mcintosh2016, morales2015code], a study investigating different projects may lead to different results. Furthermore, we only considered projects written in Java (due to the lack of techniques and tools able to accurately identify code smells in different programming languages [de2018systematic, azeem2019machine]), thus results in other languages may vary.
Iv Analysis of the Results
This section reports the results of the study, discussing each research question independently.
Iv-a The Influence Of Code Review On Code Smells
Our analysis highlights that among 128,691 reviewed files (one file could be reviewed multiple times), 89,562 (70%) were affected by at least one code smell before entering a review. The severity of the smells in these affected files is reduced only for a small fraction (4%) with code reviews. This is in line with previous findings in the field [peters2012evaluating, chatzigeorgiou2010investigating, palomba2017scent], which report how the degree of smelliness of classes tends to increase over time.
Previous research showed that code review participation, percentage of discussion, and churn are good proxies to the goodness of a code review [spadini2018when, Thongtanunam2015, thongtanunam2016revisiting]. As a preliminary study, we compare these code review metrics and the decrease of severity using (i) the Wilcoxon rank sum test [conover] (with confidence level 95%) and (ii) Cohen’s d [cliff]
to estimate the magnitude of the observed difference. We choose the Wilcoxon test since it is a non-parametric test (it does not have any assumption on the underlying data distribution), while we interpret the results of Cohen’s d relying on widely adopted guidelines,i.e., negligible for 0.10, small for 0.10 0.33, medium for 0.33 0.474, and large for 0.474 [cliff].
Figure 2 shows that code reviews with more participation, coverage, and activity are notably different with respect to the distribution of decreased and not decreased smell severity. In other words, the reviews with more reviewers, a higher percentage of comments, and more churn are those in which the severity of smells tends to decrease more often. In all the three cases, the differences observed in the distributions are statistically significant ( 0.001), with a medium effect size when considering the number of reviewers as well as -churn, and a small one in the case of the percentage of comments. Thus, on the basis of the results achieved so far, we see that (i) code smells generally do not decrease in terms of severity even if they are reviewed but (ii) the higher the quality of the code review process, the higher the likelihood to observe a reduction of code smell severity.
|Has the severity of any smell decreased?|
|marginal R = 0.40, conditional R = 0.45|
|Significance codes: ’***’0.001, ’**’0.01, ’*’0.05, ’.’0.1|
We now consider all the collected metrics (depicted in Table III) using a regression model. From Table V we observe that all the explanatory variables, including the ones referring to the code review process, appear to be statistically significant.333Some of the variables in Table III are not in the statistical models: These variables were removed after we found that they were collinear with others. In cases of collinearity, we kept the variable(s) that were simpler to compute and explain, in the interest of having a simpler model. Thus, also controlling for possible confounding factors (such as the number of files reviewed and the size of the file under review), the three considered dimensions of code review represent a relevant factor to explain the reduction of code smell severity. We also evaluated the goodness of fit of our model, i.e., how well the built model actually fits a set of observations. With multilevel logistic regression model, we cannot use the traditional Adjusted R [ohtani2000bootstrapping], thus we used the method proposed by Nakagawa and Schielzeth [nakagawa2013general]: the Marginal R
(the variance explained by the fixed effects) measured 0.40 and theConditional R (the variance explained by the entire model) measured 0.45. These results are in line with the one reported by Morales et al. [morales2015code] when studying the impact of code review on design quality.
Iv-B The Influence Of Review On Smells, By Smell Type
|Has the severity of the smell decreased?|
|marginal R = 0.62, conditional R = 0.64|
|Significance codes: ’***’0.001, ’**’0.01, ’*’0.05, ’.’0.1|
Figure 3 shows the distribution of decreased vs. not decreased severity, by type of smells. The Chi-square statistical test reports that the distributions differ significantly (). The biggest Chi-square contributions come from ‘Inappropriate Intimacy’ and ‘Lazy Class’ classes, while the others contribute almost nothing. Particularly, with an odds ratio [bland2000odds] of 2.5, ‘Lazy Class’ is 2.5 times more likely to decrease after a code review than any other smell.
Table VI reports the results of the logistic regression modeling with the added independent variable ‘SmellType’. We also consider whether each smell was previously reported as perceived (P) or not perceived (NP) by developers in previous studies [palomba2014they, taibi2017developers, palomba2017scent]. Confirming the results of the aforementioned single-value analysis, ‘LazyClass’ is statistically significant. Moreover, ‘God Class’ is also significant, though with a lower value, which is most likely due to the lower overall incidence of this smell in the dataset (as visible in Figure 3). Furthermore, there is no trend on whether the smells that are expected to be “more perceived” by developers are those that are actually decreased more often in code review: In fact, both not perceived smells (i.e., ‘LazyClass’) and perceived ones (i.e., ‘God Class’) decrease with code review.
Iv-C The Causes Of The Code Smell Severity Decrement
The findings of the previous RQs showed that (i) code review participation, coverage, and activity are related to the reduction of code smell severity and (ii) it is the specific type of code smells rather than their actual perception to be relevant when explaining the reduction of severity. In this last research question we aim at shedding lights on (i) the actual motivations leading code smell severity to decrease during a code review and (ii) whether instances of the ‘Lazy Class’, ‘Complex Class’, and ‘God Class’ smells, which appeared to be the most treated ones during code review in RQ, are subject to more refactoring activities than other smells.
Table VII reports the results of our manual analysis conducted on a statistically significant sample of 365 cases in which the severity of a code smell in a file was reduced with a review. Our taxonomy is in line with that proposed by Tufano et al. [tufano2017when], who studied how code smells are removed over software evolution. We describe our findings by category.
|Category||Count||Percentage (%)||Ref. to smells|
- Code Addition.
In this category, a discussion during a code review suggests improving the code by adding new statements. Our manual analysis found 42% of the code review discussions falling into this category. We could observe that code insertion requests are mainly concerned with the management of ‘Lazy Class’ instances. While the recommended operation in these cases is represented by the ‘Inline Class’ refactoring (i.e., the incorporation of the lazy class within an existing one [fowler:1999]), we observed that in some cases reviewers recommend the addition of new functionalities, thus the insertion of new code. For example, this is the case for the class CouchbaseClient of the couchbase-java-client project: The reviewer explicitly requested the insertion of additional logging mechanisms that would have made the class more connected with the other classes of the system.
- Code Replacement.
The code review encourages the replacement of a certain piece of code and, consequently, the code smell (as well as the associated severity) can disappear, increase, or decrease. Typically, code review discussions in this category recommend a rewriting of the code from scratch instead of producing a refactoring of the original one. In our analysis, we observe that, in 32% of the cases, the severity of code smells changes because of code replacements and it is a side effect rather than a direct action of the code review. Usually, reviewers suggest changes that correct defects or improve performance; following these suggestions, the author of the patch produces a new version with the effect of re-organizing the structure of the code and decrease the severity of the code smell.
- Code Removal.
The discussion during the code review leads to the suppression444Note that this category refers to code that is permanently removed. of a piece of source code (e.g., a few statements or a method). This removal affects artifacts connected to the code smell and the smell severity decreases. Code review discussions that belong to this category often lead to a drop in the code smell severity as a side effect rather than a conscious decision. Our manual analysis confirms that the severity decreased due to indirect actions in 55 cases that, generically, refer to the suppression of ‘Blob’ and ‘Complex Class’ code smells. This is in line with past findings [tufano2017when]: sometimes the severity of code smells decreases as a side effect of code removal.
In this category, we refer to code changes that are explicitly replaced by applying one or multiple refactoring operations. The few refactoring actions requested within the considered reviewers’ discussions (i.e., 6%) confirm the results of previous researchers [bavota2015experimental, murphy2012we, silva2016we, tufano2017when, tufano2016empirical]: Developers tend not to apply refactoring operations on classes affected by code smells. Here we see that such refactorings are not even recommended. However, in eight cases reviewers suggest action to reduce the code smell severity.
- Major Restructuring.
In this category, a reviewer recommends a restructuring of the system’s architecture that consequently changes the structure of the code under review. By nature, this category involves radical changes in the source code and, as such, it may implicitly include one of the above categories. However, it differs from the others since in this case we are not able to identify the exact code change leading to the smell removal. We could only see that it is a consequence of a major system’s restructuring. In our manual analysis we found that a major restructuring of the reviewed code is performed in 5% of the cases and, as a side effect, this causes the smell severity to change.
This category clusters the code changes that do not fit into the categories previously defined. In particular, this category contains all remaining code changes that reduce the associated code smells severity, but whose meaning or action is hard to understand. Only two cases from our manual analysis belong to this category.
Finally, the last column of Table VII shows the number of times a code smell is referred to during code review. This number is very small: 19 times out of 365 (5%). This result extends previous literature [Arcoverde:WRT2011, yamashita2013developers]555The authors found that developers avoid refactoring of code smells as it is not one of their top priorities [Arcoverde:WRT2011, yamashita2013developers]. with the case of code review.
Furthermore, during manual analysis, we searched for explicit references to code smells, i.e., “We should refactor this class before it becomes a God class.” However, code smells are often symptoms of irregular trends of CK metrics (e.g., code complexity, LOC, fan in or out, number of parameters) values. Hence, during code review, reviewers might discuss these problems (such as a complex class) without mentioning the code smell related to it (i.e., ‘God Class’). In these cases, we did not consider these comments as “referencing to a code smell” because we wanted to investigate to what extent developers are aware specifically of code smells.
V Discussion and Implications
Our results highlighted some points to be further discussed as well as a set of practical implications for both the research community and tool vendors. In particular:
- Keep the quality of code review high.
We find that code reviews in which participation, coverage, and activity are higher tend to lead to a non-targeted, yet significant reduction in code smell severity. As an outcome, we confirm the importance of the research effort devoted to the definition of methodologies and novel approaches helping practitioners keep the quality of the review process high [sadowski2018modern, thongtanunam2015should]. It is, thus, important when thinking of code review as a set of activities that, in addition to finding defects, include knowledge transfer or team awareness [bacchelli:2013]. In these conditions, the presence of smells in the code might produce improper learning or misleading assumptions. Consequently, our study, in line with Pascarella et al. [pascarella2018information], revealed the need for further mechanisms to stimulate developers and make them aware of the advantages of participating and actively discussing in code review. These observations impact on educational aspects, which researchers in the code review field should further investigate and promote, but also to the definition of effective involvement strategies (e.g., code review gamification [unkelos2015gamifying]), which are still under-investigated by the research community.
- On the developers’ awareness of code smells.
A clear outcome of our research is that the severity of code smells is often left intact or even increased during code review. This result may be a consequence that developers do not perceive code smells, that code smells cannot be properly visualized, or even that developers prefer to ignore these design issues. Moreover, our findings confirmed a problem that the research community already pointed out in the past: developers tend not to refactor or deal with code smells [bavota2015experimental, silva2016we, palomba2017exploratory, kim2014empirical] and, unfortunately, code review does not seem to change the state of things either. We believe that this outcome has implications for a broad set of researchers. First, effective and contextual code smell detectors and static analyzers [pantiuchina2018towards, sae2016context, vassallo2018context], able to pinpoint the presence of design issues within the scope of a commit, should be made available to reviewers; as such, researchers in the field of code smells are called to devise new methods and tools allowing a just-in-time detection that can be applied at code review time. Second, we point out a problem of visualization of code smell-related information in code review: indeed, one possible reason behind the results achieved in our research is that existing code review tools are too focused on the changed lines and do not offer an overview of the classes under analysis. In other words, code review tools are yet unable to signal to reviewers the presence of design flaws. Thus, an implication of our study is that augmenting code review tools with new visualizations that support the detection and correction of code smells could be beneficial. Finally, our findings support the research on how to order code changes within code review [baum2017optimal, spadini2019test], which may be used as a way to prioritize the analysis and correction of code smells.
- Promoting smart discussions in code review.
As both high participation and discussion within the code review process tend to promote the application of changes that have the side-effect of reducing code smell severity, a possible outcome of our research consists in calling for the definition of novel methodologies that directly address the people involved in the code review rather than the process itself. For example, novel machine learning-based methods that recognize the current context of a discussion and propose smart suggestions to lead reviewers discussing of specific design flaws that impact the maintainability of source code might be a potentially useful way to foster code quality within the code review process. Similarly, the recent advances on code documentation[pascarella2017classifying, pascarella2019classifying, sridhara2010towards], showing and classifying the comments that help developers understanding source code, might be put in the context of code review to provide developers with further contextual information to locate technical debts.
In the study presented in this paper, we investigated the relation of code review to code smells by mining more than 21,000 code reviews belonging to seven Java open-source projects and considering six well-known code smells. Our findings pointed out that active and participated code reviews have a significant influence on the reduction of code smell severity. However, such a reduction is typically a side effect of code changes that are not necessarily related to code smells. The results of the study represent the input for our future research agenda, which primarily includes the definition and evaluation of non-intrusive alert systems that notify the presence of code smells at code review-time. Furthermore, we aim at replicating our study considering projects from different domains as well as closed-source projects, which may have different guidelines for code review activities.
A. Bacchelli and F. Palomba gratefully acknowledge the support of the Swiss National Science Foundation through the SNF Projects No. PP00P2_170529 and PZ00P2_186090. This project has received funding from the European Union’s H2020 programme under the Marie Sklodowska-Curie grant agreement No. 642954.