Are automated static analysis tools worth it? An investigation into relative warning density and external software quality

Automated Static Analysis Tools (ASATs) are part of software development best practices. ASATs are able to warn developers about potential problems in the code. On the one hand, ASATs are based on best practices, so there should be a noticeable effect on software quality. On the other hand, ASATs suffer from false positive warnings, which developers have to inspect and then ignore or mark as invalid. In this article, we ask whether ASATs have a measurable impact on external software quality, using the example of PMD for Java. We investigate the relationship between ASAT warnings emitted by PMD and defects per change and per file. Our case study includes data for the history of each file as well as the differences between changed files and the project in which they are contained. We investigate whether files that induce a defect have more static analysis warnings than the rest of the project. Moreover, we investigate the impact of two different sets of ASAT rules. We find that bug inducing files contain fewer static analysis warnings than other files of the project at that point in time. However, this can be explained by the overall decreasing warning density. When compared with all other changes, we find a statistically significant difference in one metric for all rules and in two metrics for a subset of rules. However, the effect size is negligible in all cases, showing that the actual difference in warning density between bug inducing changes and other changes is small at best.


1 Introduction

Automated Static Analysis Tools (ASATs) or linters are programs that perform rule matching of source code via different representations, e.g., Abstract Syntax Trees, call graphs or bytecode, to find potential problems. Rules are predefined by the ASAT and based on common coding mistakes and best practices. If a rule is matched, a warning is generated for the developer, who can then inspect the given file, line number and rule. Common coding best practices involve ASAT use in different contexts (Vassallo et al., 2020), e.g., as part of Continuous Integration (CI), within IDEs, or to support code reviews. Developers also think of these tools as quality improving when used correctly (Christakis and Bird, 2016; Vassallo et al., 2020; Devanbu et al., 2016; Querel and Rigby, 2021). However, due to their rule matching nature, ASATs are prone to false positives, i.e., warnings about code that is not problematic (Johnson et al., 2013). This hinders the adoption and usefulness of these tools, as developers have to inspect every warning that is generated, whether it is a false positive or not. As a result, research into classifying ASAT warnings into true and false positives or actionable warnings is conducted, e.g., by Heckman and Williams (2009), Kim and Ernst (2007) and Koc et al. (2017). Due to these two aspects, ASATs are perceived as quality improving while at the same time requiring manual oversight and corrections.

Due to this manual effort, we want to take a closer look at the impact of ASATs on measurable quality. Previous research regarding the impact on quality can be divided into direct measures which target bug fixing commits, e.g., Vetro et al. (2011), Thung et al. (2012) and Habib and Pradel (2018), and indirect measures which use ASAT warnings as part of predictive models, e.g., Nagappan and Ball (2005), Plosch et al. (2008), Rahman et al. (2014), Querel and Rigby (2018), Lenarduzzi et al. (2020), Trautsch et al. (2020b) and Querel and Rigby (2021). Both approaches usually suffer from data validation issues. ASAT warnings measured directly in bug fixing commits are usually validated via manual inspection of the warnings, either by students (Vetro et al., 2011) or by the researchers themselves (Thung et al., 2012; Habib and Pradel, 2018). Predictive models that include ASAT warnings as features rely on bugs and bug fixing commits, which introduces new data validity problems, e.g., noisy issue types as reported by Antoniol et al. (2008) and Herzig et al. (2013b), links from bug reports to commits, or the SZZ (Śliwerski et al., 2005) implementation (Fan et al., 2019). Recently, Rosa et al. (2021) demonstrated that finding bug inducing information is challenging when using common or even state-of-the-art SZZ approaches. In this article, we mitigate this by using manually validated bug fixing data from a large-scale validation study (Herbold et al., 2021). While this reduces the data available for the study, we believe that this is a worthwhile trade-off to get more signal with less noise from the available data.

There is, however, a missing piece between direct and indirect impact studies. Direct impact studies make sense for security focused ASATs (Flawfinder, RATS) or bug focused ASATs (FindBugs, SpotBugs), but not necessarily for general purpose ASATs (PMD, SonarQube). General purpose ASATs are able to uncover readability related issues, e.g., style or naming issues, in addition to common programming errors or violations of best practices. These issues, while possibly not directly responsible for a bug, may introduce bugs later or in other parts of the file due to the reduced understandability of the code. Indirect studies that rely on ASAT warnings as features for defect prediction may be too indirect to measure the actual relation between ASATs and defects, as this is by necessity done in relation to prediction models using other features of changes. Moreover, this approach ignores problems due to differences in warning density between projects or correlations with size or time.

In order to make general statements, differencing techniques can be used, e.g., differences between bug inducing commits and previous commits as in the work by Lenarduzzi et al. (2020). What is missing, however, is a more general view of the differences between the warning densities of the files and the rest of the project at each time step. Such a view would allow us to explore whether files that contain more static analysis warnings than the rest of the project also induce more bugs.

In prior work (Trautsch et al., 2020a), we investigated trends of ASAT warnings and whether usage of an ASAT has an influence on software quality measured via defect density (Fenton and Bieman, 2014). However, our analysis with regards to bugs via defect density was coarse grained. We address this limitation in this article. We adopt a recently introduced approach for fine-grained just-in-time defect prediction by Pascarella et al. (2019) to perform a more targeted investigation of the impact of static analysis warnings on quality, using introduced bugs as a proxy for software quality instead of the more coarse grained defect density for one year. The approach by Pascarella et al. (2019) provides the advantage of handling files within commits and not only commits, which makes it suitable for studying the impact of static analysis over the history of all files within our study subjects.

In our previous work, we found that static analysis warnings evolve over time and that we cannot simply use the density or the sum of warnings in our case study (Trautsch et al., 2020a). Therefore, we use an approach that produces a current snapshot view of the files we are interested in by measuring the file that induces a bug and, at the same time, all other files of the project. This ensures that we are able to produce time and project independent measurements. The drawback of this approach is that it requires a large computational effort, as we have to run the ASAT under study on every file in every revision of all study subjects. However, the resulting empirical data yields insights for researchers and practitioners.

The research question that we answer in our exploratory study is:

  • Do bug inducing files contain more static analysis warnings than other files?

We apply a modified fine-grained just-in-time defect prediction data collection method to extract software evolution data including bug inducing file changes and static analysis warnings from a general purpose ASAT. We chose PMD (https://pmd.github.io/) as the general purpose ASAT, as it has been available for a long time and provides a good mix of available rules. Using this data and a warning density based metric calculation, we investigate the differences between bug inducing files and the rest of the studied system at the point in time when the bug is introduced. In summary, this article contains the following contributions.

  • A unique and simple approach to measure the impact of ASATs that is independent of differences between projects, size and time.

  • Complete static analysis data for PMD for 23 open source projects for every file in every commit.

  • An investigation into relative warning density differences within bug inducing changes.

The main findings of our exploratory study are:

  • Bug inducing files do not contain a higher warning density than the rest of the project at the time when the bug is introduced.

  • When comparing the warning density of bug inducing changes with all other changes, we measure a higher warning density for two metrics on a subset of PMD warnings that is a popular default, and for one metric when all available rules are used.

The rest of this article is structured as follows. Section 2 lists previous research related to this article and discusses the differences. Section 3 describes the case study setup, methodology, analysis procedure and the results. Section 4 discusses the results of our case study and relates them to the literature. Section 5 lists and discusses threats to validity we identified for our study. Section 6 concludes the article with a short summary and provides a short outlook.

2 Related Work

In this article, we explore a more general view of ASATs and the warning density differences of bug inducing changes. This can be seen as a mix of a direct and indirect impact study. Therefore we describe related work for both direct and indirect impact studies within this section.

The direct impact is often evaluated by exploring whether bugs that are detected in a project are fixed by removing ASAT warnings, i.e., whether the warning really indicated a bug that needed to be fixed later.

Thung et al. (2012) investigated bug fixes of three open source projects and three ASATs: PMD, JLint, and FindBugs. The authors look at how many defects are found fully or partially by the changed lines and how many are missed by the ASATs. Moreover, the authors describe the challenges of this approach: not every line that is changed is really a fix for the bug; therefore, the authors perform a manual investigation on a per-line level to identify the relevant lines. They were able to find all lines responsible for 200 of 439 bugs. In addition, the authors find that PMD and FindBugs perform best; however, their warnings are often very generic.

Habib and Pradel (2018) investigate the capability of ASATs to find real world bugs. The authors used the Defects4J dataset by Just et al. (2014) with an extension (https://github.com/rjust/defects4j/pull/112) to investigate the number of bugs found by three static analysis tools: SpotBugs, Infer and error-prone. The authors show that 27 of 594 bugs are found by at least one of the ASATs.

In contrast to Thung et al. (2012) and Habib and Pradel (2018), we only perform an investigation of PMD. However, due to our usage of SmartSHARK (Trautsch et al., 2017), we are able to investigate 1,723 bugs for which at least three researchers achieved consensus on the lines responsible for the bug. Moreover, as PMD includes many rules related to readability and maintainability, we build on the assumption that, while these rules do not directly indicate a bug, resolving their warnings improves the quality of the code and may prevent future bugs. This extends previous work by taking possible long-term effects of ASAT warnings into account.

Indirect impact is explored by using ASAT warnings as features for predictive models and providing a correlation measure of ASAT warnings to bugs.

Nagappan and Ball (2005) explore the ability of ASAT warnings to predict defect density in modules. The authors found in a case study with Microsoft that static analysis warnings can be used to predict defect density; therefore, they can be used to focus quality assurance efforts on modules that show a high number of static analysis warnings. In contrast to Nagappan and Ball (2005), we are exploring open source projects. Moreover, we explore warning density differences between files and the project they are contained in.

Rahman et al. (2014) compare static analysis and statistical defect prediction. They find that FindBugs is able to outperform statistical defect prediction, while PMD does not. Within our study, we focus on PMD as a general purpose ASAT. Instead of a comparison with statistical defect prediction, we explore whether we can measure a difference in ASAT warnings between bug inducing changes and other changes.

Plosch et al. (2008) explore the correlation between ASAT warnings as features for a predictive model and the dependent variable, i.e., bugs. They found that static analysis warnings may improve the performance of predictive models and that they are correlated with bugs. In contrast to Plosch et al. (2008), we are not building a predictive model. We are exploring whether we can find an effect of static analysis tools without a predictive model with multiple features; instead, we strive to keep the approach as simple as possible.

Querel and Rigby (2018) improve the just-in-time defect prediction tool Commit Guru (Rosen et al., 2015) by adding ASAT warnings to the predictive model. The authors show that just-in-time defect prediction can be improved by adding static analysis warnings. This means that there should be a connection between external quality in the form of bugs and static analysis warnings. In a follow-up study (Querel and Rigby, 2021), the authors found that while there is an effect of ASAT warnings, the effect is likely small. In our study, we explore a different view on the data. We explore warning density differences between bug inducing files and the rest of the project.

Lenarduzzi et al. (2020) investigated SonarQube as an ASAT and whether the reported warnings can be used as features to detect reported bugs. The authors combine direct with indirect impact but are more focused on predictive model performance measures. In contrast to Lenarduzzi et al. (2020), we are mainly interested in the differences in warning density between bug inducing files and the rest of the project. We are also investigating an influence, but in contrast to Lenarduzzi et al., we compare our results for bug inducing changes to all other changes to determine whether what we see is really part of the bug inducing change and not a general trend of all changes.

3 Case Study

The goal of the case study is to find evidence whether the usage of ASATs has a positive impact on the external software quality of our case study subjects. In this section, we explain the approach and the ASAT choice. Moreover, we explain our study subject selection and describe the methodology and analysis procedure. At the end of this section we present the results.

3.1 Static analysis

Static analysis is a programming best practice. ASATs scan source code or byte code and match against a predefined set of rules. When a rule matches, the tool creates a warning for the part of the code that matches the rule.

There are different tools for performing static analysis of source code. For Java these are, e.g., Checkstyle, FindBugs/SpotBugs, PMD, or SonarQube. In this article, we focus on Java as a programming language because it is widely used in different domains and has been in use for a long time. The static analysis tool we use is PMD. There are multiple reasons for this. PMD does not require the code to be compiled first as, e.g., FindBugs does. This is an advantage especially with older code that might not compile anymore due to missing dependencies (Tufano et al., 2017). PMD supports a wide range of warnings of different categories, e.g., naming and brace rules as well as common coding mistakes. This is an advantage over, e.g., Checkstyle, which mostly deals with coding style related rules. This enables PMD to give a better overview of the quality of a given file instead of giving only probable bugs within it. The relation to software quality that we expect of PMD stems directly from its rules. The rules are designed to make the code more readable, less error prone and overall more maintainable.
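
To make the warning extraction step concrete, the following Python sketch counts PMD warnings per file from a PMD XML report. The report layout (file elements containing violation elements) reflects PMD's XML output format; the function name, the report path handling and the optional rule filter are illustrative assumptions, not part of the study's tooling.

```python
# Sketch: derive per-file warning counts from a PMD XML report.
# Assumes PMD's XML report format with <file name="..."> elements that
# contain <violation rule="..."> children.
import xml.etree.ElementTree as ET
from collections import Counter

def _local(tag):
    """Strip a possible XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]

def warnings_per_file(report_path, rule_filter=None):
    """Return a Counter mapping file path -> number of PMD warnings.

    rule_filter: optional set of rule names (e.g., the maven-pmd-plugin
    defaults) to restrict the count to a subset of rules.
    """
    counts = Counter()
    root = ET.parse(report_path).getroot()
    for file_elem in root.iter():
        if _local(file_elem.tag) != "file":
            continue
        path = file_elem.get("name")
        for violation in file_elem:
            if _local(violation.tag) != "violation":
                continue
            if rule_filter is None or violation.get("rule") in rule_filter:
                counts[path] += 1
    return counts
```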

3.2 Just-in-time defect prediction

The idea behind just-in-time defect prediction is to assess the risk of a change to an existing software project (Kamei et al., 2013). Previous changes are extracted from the version control system of the project and, as they are in the past, it is known whether the change induced a bug. This can be observed by subsequent removal or alteration of the change as part of a bug fixing commit. If the change was indeed removed or altered as part of a bug fixing operation, it is traced back to its previous file and change and labeled as bug inducing, i.e., it introduced a bug that needed to be fixed later. In addition to these labels, certain characteristics of the change are extracted as features, e.g., lines added or the experience of the author, to later train a model to predict the labels correctly for the commits. The result of the model is then a label or probability whether the change introduces a bug, i.e., the risk of the change.
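
The tracing step can be sketched with PyDriller, which this line of work builds upon (Spadini et al., 2018). The snippet below is a generic SZZ-style sketch under the PyDriller 2.x API (names may differ across versions); in this article, the bug fixing commits and the relevant fixed lines come from the manually validated data of Herbold et al. (2021) rather than from this heuristic.

```python
# Sketch of SZZ-style tracing with PyDriller (2.x API assumed): for each bug
# fixing commit, blame the changed lines to find the commits that last
# modified them and treat those as candidate bug inducing commits.
from pydriller import Git

def candidate_inducing_commits(repo_path, bug_fixing_hashes):
    """Return a dict mapping bug fixing hash -> set of inducing commit hashes."""
    repo = Git(repo_path)
    result = {}
    for fix_hash in bug_fixing_hashes:
        fix_commit = repo.get_commit(fix_hash)
        # Maps each file modified in the fix to the commits that last
        # touched the lines deleted or changed by the fix.
        last_modified = repo.get_commits_last_modified_lines(fix_commit)
        result[fix_hash] = (
            set().union(*last_modified.values()) if last_modified else set()
        )
    return result
```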

However, ASATs work on a file basis and we also want to investigate longer-term effects of ASATs. This means we need to track a file over its evolution in a software project. To achieve this, we build on previous work by Pascarella et al. (2019), which introduced fine-grained just-in-time defect prediction. In a previous study, we improved the concept by including better labels and static analysis warnings as well as static code metrics as features (Trautsch et al., 2020b). Similar to Pascarella et al. (2019), we build upon PyDriller (Spadini et al., 2018). In this article, we build upon our previous work and include not only counts of static analysis warnings but also relations between the files, e.g., how different the number of static analysis warnings in one file is from the rest of the project. We also include aggregations of warnings with and without a decay over time.

3.3 Study Subjects

Our study subjects consist of 23 Java projects under the umbrella of the Apache Software Foundation (https://www.apache.org) previously collected by Herbold et al. (2020). Table 1 contains the list of our study subjects. We only use projects for which bug fixing data was fully validated on a line-by-line level in a crowd sourcing study (Herbold et al., 2021). Every line in our data was labeled by four researchers. We only consider bug fixing lines for which at least three researchers agree that they fix the considered bug. This naturally restricts the number of available projects but improves the signal-to-noise ratio of the data.

Project #commits #file changes #issues Time frame
ant-ivy 1,647 7,860 296 2005-2017
commons-bcel 850 9,604 27 2001-2017
commons-beanutils 561 2,648 28 2001-2017
commons-codec 810 2,062 21 2003-2017
commons-collections 1,687 11,296 32 2001-2017
commons-compress 1,401 3,566 87 2003-2017
commons-configuration 1,659 4,177 97 2003-2017
commons-dbcp 729 2,211 39 2001-2017
commons-digester 1,131 3,750 11 2001-2017
commons-io 985 2,781 51 2002-2017
commons-jcs 774 7,775 37 2002-2017
commons-lang 3,028 6,312 109 2002-2017
commons-math 4,135 21,440 190 2003-2017
commons-net 1,076 4,666 96 2002-2017
commons-scxml 469 1,774 39 2005-2017
commons-validator 557 1,324 37 2002-2017
commons-vfs 1,098 7,209 67 2002-2017
giraph 819 7,715 109 2010-2017
gora 464 2,256 38 2010-2017
opennlp 1,166 9,679 82 2010-2017
parquet-mr 1,053 5,957 69 2012-2017
santuario-java 1,177 8,503 41 2001-2017
wss4j 1,711 12,218 120 2004-2017
Sum 28,987 146,783 1,723
Table 1: Study subjects in our case study

We now give a short overview of the potential problems and how we mitigate them. When we look at external quality, we want to extract data about defects. However, there are several additional restrictions we want to apply. First, we want to extract defects from the Issue Tracking System (ITS) of the project and link them to commits in the Version Control System (VCS) to determine bug fixing changes. Several data validity considerations need to be taken into account here. The ITS usually has a kind of label or type to distinguish bugs from other issues, e.g., feature requests. However, research shows that this label is often incorrect, e.g., Antoniol et al. (2008), Herzig et al. (2013a) and Herbold et al. (2020). Moreover, with this kind of software evolution research, we are interested in bugs existing in the software and not bugs which occur because of external factors, e.g., new environments or dependency upgrades. Therefore, we only consider intrinsic bugs (Rodriguez-Pérez et al., 2020).

The next step is the linking between the issue from the ITS and the commit from the VCS. This is achieved via a mention of the issue in the commit message, e.g., fixes JIRA-123. While this seems straightforward, there are certain cases where this can be problematic, the simplest one being a typo in the project key, e.g., JRIA-123.
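
A minimal sketch of such issue linking is shown below. The regular expression for Jira-style keys and the example key are illustrative assumptions; a real pipeline additionally validates each detected key against the project's ITS and handles typos.

```python
# Minimal sketch: detect Jira-style issue keys (e.g., IVY-123) in commit
# messages to link commits to issues. Real pipelines validate the keys
# against the ITS and handle edge cases such as typos in the project key.
import re

ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

def linked_issue_keys(commit_message):
    """Return all issue keys mentioned in a commit message."""
    return ISSUE_KEY.findall(commit_message)

# linked_issue_keys("IVY-123: fix NPE in resolver") -> ['IVY-123']
```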

Moreover, not all changes within bug fixing commits contribute to bug fixes. Unrelated changes can be tangled with the bug fix. We are only interested in the lines of the changes that contribute to the bug fix; restricting all data to these changes further reduces noise in the data. This is probably the hardest aspect to validate manually.

This was achieved in a prior publication (Herbold et al., 2020), which served as the basis for the publication whose data we use in this article (Herbold et al., 2021). In Herbold et al. (2021), a detailed untangling is performed by four different researchers for each change that fixes a bug that meets our criteria.

3.4 Replication Kit

We provide all data and scripts as part of a replication kit (https://github.com/atrautsch/emse2021a_replication).

3.5 Methodology

To answer our research question, we extract information about the history of our study subjects, including bugs and the evolution of static analysis warnings. While the bulk of the data is based on Herbold et al. (2021), we include several additions necessary for answering our research question.

To maximize the relevant information within our data, we include as much information from the project source code repository as possible. After extracting the bug inducing changes, we build a commit graph of all commits of the project and then find the current main branch, usually master. After that, we find all orphan commits, i.e., all commits without parents. Then we discard all orphans that do not have a path to the last commit on the main branch; this discards separate paths in the graph, e.g., gh-pages (https://docs.github.com/en/pages) for documentation. As we also want to capture data on release branches which are never merged back into the main branch, we add all other branches that have a path to one of our remaining orphan commits. The end result is a connected graph which we traverse via a modified breadth first search. We take the date of the commit into account while we traverse the graph.
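
The following sketch illustrates this graph construction with networkx and PyDriller. It simplifies the branch handling and uses a plain date-ordered traversal as a stand-in for the modified breadth first search, so it should be read as an outline under these assumptions rather than the exact procedure.

```python
# Sketch of the commit graph construction: build a parent -> child graph,
# keep only orphan commits that can reach the head of the main branch, and
# traverse the reachable commits in date order (a simplified stand-in for
# the modified breadth first search described in the article).
import networkx as nx
from pydriller import Repository

def ordered_commits(repo_path, head_hash):
    graph = nx.DiGraph()
    dates = {}
    for commit in Repository(repo_path).traverse_commits():
        graph.add_node(commit.hash)
        dates[commit.hash] = commit.committer_date
        for parent in commit.parents:
            graph.add_edge(parent, commit.hash)
    # Orphan commits have no parents, i.e., no incoming edges.
    orphans = [n for n in graph.nodes if graph.in_degree(n) == 0]
    # Discard orphans without a path to the main branch head; this drops
    # unrelated histories such as documentation-only branches.
    kept = [o for o in orphans if nx.has_path(graph, o, head_hash)]
    reachable = set(kept).union(*(nx.descendants(graph, o) for o in kept))
    # Date-ordered traversal over commits we actually visited.
    return sorted((c for c in reachable if c in dates), key=dates.get)
```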

The traversal is an improved version of previous work (Trautsch et al., 2020b). In addition to the previously described noise reduction via manual labeling, we additionally restrict all files to production code. One of the results of Herbold et al. (2021) is that non-production code is often tangled with bug fixing changes. Therefore, we only add production files to our final data, analogous to Trautsch et al. (2020a). This also helps us to provide a clearer picture of warning density based features, as production code may have a different evolution of warning density than, e.g., test or example code.

In our previous study (Trautsch et al., 2020a), we found that static analysis warnings are correlated with Logical Lines of Code (LLOC). This is not surprising, as we observe large portions of our study subjects' code history. Large files that are added and removed have an impact on the number of static analysis warnings. While we do not want to discard this information, we also want to avoid the problem of large changes overshadowing information in our data. Therefore, as in our previous study, we use warning density as a base metric, analogous to prior studies, e.g., Aloraini et al. (2019) and Penta et al. (2009).

Warning density ($WD$) is the ratio of the number of warnings and the size of the analyzed part of the code:

$WD = \frac{\#warnings}{LLOC}$ (1)

Product size is measured in LLOC. If we measure the warning density of a system, $WD_s$, we sum warnings and LLOC over all files of the system. If we measure the warning density of a file, $WD_f$, we restrict the number of warnings and the LLOC to that file.

While this measure provides a size independent metric, we also need to take differences between projects into account. Warning density can be different between projects and even more so for different points in time for each project. To be able to use all available data we account for these differences by using differences in warning density between the files of interest and the rest of the project under study (the system) at the specific point in time.

We calculate the warning density difference between the file $f$ and the system $s$ at time $t$:

$\Delta WD_{fs}(f, t) = WD_f(f, t) - WD_s(t)$ (2)

If the file at time $t$ contains fewer static analysis warnings per LLOC than the system at time $t$, the value is negative; if it contains more, it is positive. We can use this metric to investigate bug inducing commits and determine whether the files responsible for bugs contain fewer or more static analysis warnings per LLOC than the system they belong to.

While this yields information corrected for size, project differences, and time of the change, we also want to incorporate the history of each file. Therefore, we also sum this difference in warning density over all changes to the file. We assume that recent changes are more important than old changes. Therefore, we introduce a decay in our warning density derived features:

$\Delta WD_{fs}^{dec}(f, t_n) = \sum_{i=1}^{n} d(i, n) \cdot \left( WD_f(f, t_i) - WD_s(t_i) \right)$ (3)

where $t_1, \ldots, t_n$ are the changes to the file $f$ and $d(i, n)$ is a decay weight that decreases with the age of the change. For the decayed file system warning density delta we compute the decayed, cumulative sum of the difference between the warning density of the file ($WD_f$) and the warning density of the system ($WD_s$). The rationale is that if a file is constantly better, with regard to static analysis warnings, than the mean of the rest of the system, this should have a positive effect. As the static analysis rules are diverse, this can be improved readability, maintainability or robustness due to additional null checks. Within our study, we explore whether this has a measurable effect on bugginess, i.e., the lower this value is, the less often the file should be part of bug inducing commits.
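
The metrics from equations (1) to (3) can be summarized in a few lines of Python. This is a minimal sketch; in particular, the concrete decay weight (one divided by the age of the change in number of changes) is an illustrative assumption, since the definition above only requires that recent changes weigh more than older ones.

```python
# Sketch of the warning density metrics from equations (1) to (3).

def warning_density(num_warnings, lloc):
    """Equation (1): warnings per logical line of code."""
    return num_warnings / lloc if lloc > 0 else 0.0

def wd_delta(file_warnings, file_lloc, system_warnings, system_lloc):
    """Equation (2): file warning density minus system warning density."""
    return warning_density(file_warnings, file_lloc) - warning_density(
        system_warnings, system_lloc
    )

def decayed_wd_delta(deltas):
    """Equation (3): decayed cumulative sum over a file's change history.

    deltas: wd_delta values for the file, ordered from oldest to newest.
    The decay weight 1 / (age in changes) is an illustrative choice; the
    article only requires that recent changes weigh more than old ones.
    """
    n = len(deltas)
    return sum(delta / (n - i) for i, delta in enumerate(deltas))
```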

Instead of using all warnings for the warning density, we can also restrict the warnings to a smaller set to see if this has an effect. While we do not want to choose multiple subsets to avoid false positive findings, we have to investigate whether our approach of using all available warnings waters down the ability to indicate files which may contain bugs. To this end, we also investigate the warning density consisting only of PMD warnings that are enabled by default by the maven-pmd-plugin (https://maven.apache.org/plugins/maven-pmd-plugin/), which we denote as default rules. This restricts the basis of the warning density calculation to a subset of 49 warnings that are generally considered relevant, in comparison to the total number of 314 warnings. Their use as default warnings serves to restrict this subset to generally accepted important warnings.

To answer our research question, we compare the warning density of each bug inducing file against the project at the time before and after the bug inducing change. If the difference is positive, the file had a higher warning density than the rest of the project; if it is negative, the file had a lower warning density. We plot the difference in warning density in a box plot for all bug inducing files to provide an overview of all our data.

As this is influenced by a continuously improving warning density, we also measure the differences between bug inducing file changes and all other file changes. We first perform a normality test and find that the data is not normal in all cases. Thus, we apply a Mann-Whitney U test (Mann and Whitney, 1947) with the null hypothesis $H_0$ that there is no difference between both populations and the alternative hypothesis $H_1$ that bug inducing files have a different warning density. We set a significance level of 0.05. Additionally, we perform a Bonferroni (Abdi, 2007) correction for 8 normality tests and 4 Mann-Whitney U tests. Therefore, we reject $H_0$ at $p < 0.0042$. If the difference is statistically significant, we calculate the effect size with Cliff's $\delta$ (Cliff, 1993).
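
A sketch of this statistical comparison in Python is shown below. The grouping of the data and the number of tests follow the description above; deriving Cliff's $\delta$ from the U statistic assumes a recent SciPy version in which mannwhitneyu returns the U statistic of the first sample.

```python
# Sketch of the statistical comparison: Mann-Whitney U test with a
# Bonferroni corrected significance level and Cliff's delta as effect size.
from scipy.stats import mannwhitneyu

def compare_warning_density(bug_inducing, other, num_tests=12, alpha=0.05):
    """Compare warning density values of bug inducing and other changes."""
    u_stat, p_value = mannwhitneyu(bug_inducing, other, alternative="two-sided")
    # Bonferroni correction: 8 normality tests + 4 Mann-Whitney U tests.
    corrected_alpha = alpha / num_tests  # 0.05 / 12 ~ 0.0042
    significant = p_value < corrected_alpha
    # Cliff's delta follows directly from the U statistic of the first group
    # (assumes SciPy >= 1.7, where mannwhitneyu returns U of the first sample).
    n1, n2 = len(bug_inducing), len(other)
    cliffs_delta = 2.0 * u_stat / (n1 * n2) - 1.0
    return p_value, significant, cliffs_delta
```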

3.6 Results

We now present the results of our study and the answer to our research question whether bug inducing files contain more static analysis warnings than other files. For this, we divide the results into three parts. First, we look at the warning density via $\Delta WD_{fs}$ at the time before and after a bug is induced and via $\Delta WD_{fs}^{dec}$ after a bug is induced (the history before the change is already part of the formula). Second, we look at the differences between our study subjects and at the prior number of changes for bug inducing file changes. Third, we compare bug inducing file changes with all other changes and determine whether they are different.

3.6.1 Differences of warning density before and after the bug inducing change

Figure 1 shows the difference in warning density between each bug inducing file and the rest of the system at the point in time before and after inducing the bug.

Figure 1: Box plot of $\Delta WD_{fs}$ for all bug inducing files before and after the bug inducing change and of $\Delta WD_{fs}^{dec}$ for all bug inducing files after the bug inducing change, median value in parentheses. Fliers are omitted.
Figure 2: Box plot of $\Delta WD_{fs}$ and $\Delta WD_{fs}^{dec}$ for only default warnings of all bug inducing files before and after the bug inducing change, median value in parentheses. Fliers are omitted.

Surprisingly, we see a negative median difference in warning density for $\Delta WD_{fs}$. This means that the warning density of the files in which bugs are induced is lower than that of the rest of the project. The difference between the values before and after shows that the code before the change had a lower warning density than after the bug inducing change. This means that code that on average contains more static analysis warnings was introduced as part of the bug inducing change.

We are also interested in whether the history of preceding differences in warning density makes a difference. Instead of using the warning density difference at the point in time of the bug inducing change, we use a decayed sum of the warning density differences leading up to the considered bug inducing change.

Figure 1 shows a negative median for $\Delta WD_{fs}^{dec}$ as well. The accumulated warning density differences between the file and the rest of the project are therefore also negative. Figure 2 shows $\Delta WD_{fs}$ and $\Delta WD_{fs}^{dec}$ for bug inducing changes restricted to default rules. We can see that the warning density for default rules only is much lower due to the lower number of warnings that are considered. We can also see that the same negative median is visible when we restrict the set of ASAT rules to the default rules. Overall, bug inducing changes have a lower warning density than the other files of the project at the time the bug was induced. However, as we will see later, this is an effect of the overall decreasing warning density of our study subjects.

3.6.2 Differences between projects and number of changes

Figure 3: Box plots of $\Delta WD_{fs}$ separately for all study subjects. The number of bug inducing file changes and the median value are given in parentheses. Fliers are omitted.

Instead of looking at all files combined, we can also look at each project on its own. We provide this data in Figure 3 for $\Delta WD_{fs}$. However, we note that the number of bug inducing files is low in some projects. Such projects may be influenced by few changes with extreme values. Hence, the results of single projects should be interpreted with caution. Instead, we consider trends visible in the data. While we can combine all our data due to our chosen method of metric calculation, we still want to provide an overview of the per project values. Figure 3 also demonstrates the difference between projects. For example, the median for commons-codec is positive, i.e., files which induce bugs contain more warnings. The opposite is the case for, e.g., commons-digester, where the median is negative.

Overall, Figure 3 shows that the median is negative for 16 of 23 projects. This means that bug inducing changes have a lower warning density than the rest of the project for most study subjects. A possible explanation could be that files with a lower warning density are changed more often and are therefore also the files that induce bugs. If we look at the number of changes a file has in Figure 4, we can see that bug inducing files have slightly more changes. However, the sample sizes of both groups are vastly different.

Figure 4: Number of changes for bug inducing files and other files. Fliers are omitted.

3.6.3 Comparison with all other changes

We now take a look at how the warning density metrics of bug inducing changes differ from those of all other changes. We notice that the median is below zero in all cases. This is due to the effect that warning density usually decreases over time (Trautsch et al., 2020a). Therefore, we compare bug inducing changes directly with all other changes.

Figure 5 shows $\Delta WD_{fs}$ for bug inducing and other changes for both all rules and only the default rules. We can see that bug inducing changes have a slightly higher warning density than other changes. If we apply only the default rules, the warning density of bug inducing changes is also slightly higher.

Figure 5: Box plots of $\Delta WD_{fs}$ before the bug inducing change for all rules and default rules only, for bug inducing and other file changes, median value in parentheses. Fliers are omitted.

Figure 6 shows the same comparison for $\Delta WD_{fs}^{dec}$. The difference for all rules is very small, although the median for bug inducing changes is slightly higher. For default rules, we can see that bug inducing changes also have a slightly higher warning density than other changes.

Figure 6: Box plots of $\Delta WD_{fs}^{dec}$ for all rules and default rules only, for bug inducing and other file changes, median value in parentheses. Fliers are omitted.

Table 2 shows the results of the statistical tests for differences between the values for Figure 5 and Figure 6.

We can see that for all rules there is a statistically significant difference for $\Delta WD_{fs}$. This shows that bug inducing file changes have a higher warning density than other changes.

Overall, we see a significant difference with a negligible effect size for $\Delta WD_{fs}$ with all rules and for both $\Delta WD_{fs}$ and $\Delta WD_{fs}^{dec}$ with the default rules between bug inducing and other changes. The data shows that in these cases the bug inducing file changes have a higher warning density than other changes.

WD Metric Median other Median bug inducing P-value Effect size
$\Delta WD_{fs}$ (all rules) -0.0440 -0.0300 <0.0001 0.05 (negligible)
$\Delta WD_{fs}$ (default rules) -0.0098 -0.0072 <0.0001 0.10 (negligible)
$\Delta WD_{fs}^{dec}$ (all rules) -0.0948 -0.0661 0.0247 -
$\Delta WD_{fs}^{dec}$ (default rules) -0.0228 -0.0170 <0.0001 0.07 (negligible)
Table 2: Median values, Mann-Whitney U test p-values and effect sizes for all warning density metrics.

Together with Figure 5, Figure 6 and Table 2, we can conclude that bug inducing file changes contain more static analysis warnings than other file changes. Restricting the rules to the default set increases the effect size slightly. However, the effect sizes are still negligible in all cases.

3.6.4 Summary

In summary, we have the following results for our research question.

       RQ Summary: Do bug inducing files contain more static analysis warnings than other files? We find that bug inducing files contain fewer static analysis warnings than other files at bug inducing time. However, this is not because these files have a higher quality, but rather because the warning density decreases over time, i.e., most files that are changed have fewer warnings than the rest of the project. When we compare the differences between bug inducing and other changes, we find that it depends on the applied rules. If we apply all rules, we find a statistically significant difference in $\Delta WD_{fs}$. If we apply only the default rules, we find a statistically significant difference in $\Delta WD_{fs}$ and $\Delta WD_{fs}^{dec}$. This indicates that bug inducing files have slightly more static analysis warnings than other files, although the effect size is negligible in all cases.

4 Discussion

We found that the bug inducing change itself increased the warning density of the code in comparison to the rest of the project, as shown in Figure 1. This means that the actual change in warning density is as we expected, i.e., the change that induces the bug increases the warning density in comparison to the rest of the project. This is an indication that warning density related metrics can be of use in just-in-time defect prediction scenarios, i.e., change based scenarios, as also shown by Querel and Rigby (2021) and in our previous work (Trautsch et al., 2020b). However, the effect is negligible in our data. This was also the case for the predictive models by Querel and Rigby (2021). Thus, any gain in prediction models due to general static analysis warnings is likely very small.

However, when we look at the median difference between bug inducing files and the rest of the project at that point in time, we see that bug inducing files contain fewer static analysis warnings. This counter-intuitive result can be fully explained by the overall decreasing warning density over time we found in our previous study (Trautsch et al., 2020a). This finding is highly relevant for researchers, because it shows the importance of accounting for time as a confounding factor in the evaluation of the effectiveness of methods. Without careful consideration of the change over time, we would now try to explain why bug inducing files have fewer warnings, and other researchers may build on this faulty conclusion. Therefore, this part of our results should also be a cautionary tale for other researchers that investigate the effectiveness of tools and methods: if the complete project is used as a baseline, it should always be considered when source code was actually worked on. If parts of the source code have been stable for a long time, they are not suitable for a comparison with recently changed code without accounting for general changes, e.g., in coding or testing practices, over time.

However, we did find that code with more PMD warnings leads to more bugs when changed. When looking into the differences between bug inducing file changes and all other file changes, we find significant differences in 3 of 4 cases. While the effect size is negligible in all cases, using only the default rules yields a higher effect size. These rules were hand-picked by the Maven developers, arguably because of their importance for internal quality. For practitioners, this finding is of particular importance: carefully selecting ASAT warnings from a large set of candidates not only reduces the number of alerts, it also helps to reduce general issues that are associated with bugs.

This also has implications for researchers when including warning density based metrics into predictive models. Our data shows that the model might be improved by choosing an appropriate subset of the possible warnings of an ASAT. Using all warnings without considering their potential relation to defects is not a good strategy. Our data also shows that a good starting point might be a commonly used default, e.g., for PMD the maven-pmd-plugin default rules.

5 Threats to validity

In this section, we discuss the threats to validity we identified for our work. To structure this section we discuss four basic types of validity separately, as suggested by Wohlin et al. (2000).

5.1 Construct validity

A threat to the relation between theory and observation may arise in our study from the measurement of warning density. We restrict the data to production code to mitigate effects test code has on warning density, as test code is often much simpler than production code.

5.2 Internal validity

A general threat to internal validity would be a selection of static analysis warnings. We mitigate this by measuring the warning density for all warnings and for only default warnings as a common subset for Java projects. Due to the nature of our approach, we mitigate differences between projects regarding the handling of warnings as well as the impact of size.

5.3 External validity

Due to the usage of manually validated data in our study, our study subjects are restricted to those for which we have this kind of data. This is a threat to the generalizability of our findings, e.g., to all Java projects or to all open source projects. Still, as we argue in (Herbold et al., 2021), our data should be representative for mature Java open source projects.

Moreover, we observe only one static analysis tool (PMD). While this may also restrict the generalizability of our study, we believe that due to the large range of rules of this ASAT our results should generalize to ASATs that are broad in scope. ASATs with a different focus, e.g., on coding style (Checkstyle) or on directly finding bugs (FindBugs, SpotBugs), may lead to different results.

5.4 Conclusion validity

We report lower warning density for bug inducing files in comparison to the rest of the project at that point in time. While this reflects the difference in warning density between the file and the project, it can be influenced by constantly decreasing warning density. We mitigate this by also including a comparison between bug inducing changes and all other changes.

6 Conclusion

In this article, we provide evidence for a common assumption in software engineering, i.e., that static analysis tools provide a net benefit to software quality even though they suffer from problems with false positives. We use an improved state-of-the-art approach for fine-grained just-in-time defect prediction to establish a link between files within commits that induce bugs and measure warning density related features which we aggregate over the evolution of our study subjects. This approach runs on data which allows us to remove several noise factors: wrong issue types, wrong issue links to commits and tangled bug fixes. The analysis approach allows us to merge the available data, as it mitigates differences between projects, sizes and, to some extent, the evolution of warnings over time.

We find that bugs are induced in files which have a comparably low warning density, i.e., fewer static analysis warnings than the files of the rest of the project at the time the bug was induced. However, this difference can be explained by the fact that the warning density decreases over time. When we compare the bug inducing changes with all other changes, we do find a significantly higher warning density when using all PMD rules in one of two metrics. However, the effect size is negligible. When we use a small rule set that restricts the 314 PMD warnings to the 49 warnings hand-picked by the Maven developers as default warnings, we find that bug inducing changes have a significantly larger warning density in both metrics, again with a negligible effect size. However, the effect size increases for the default rule set. Assuming that the smaller rule set was crafted with the intent to single out the most important rules for quality, this indicates that there is indeed a (weak) relationship between general purpose ASATs and bugs.

This is also direct evidence for a common best practice in the use of static analysis tools: Appropriate rules for ASATs should be chosen for the project. This not only reduces the number of alarms, which is important for the acceptance by developers, but also has a better relationship with the external quality of the software measured through bugs.

Declarations

This work was partly funded by the German Research Foundation (DFG) through the project DEFECTS, grant 402774445.
The authors have no competing interests to declare that are relevant to the content of this article.

References

  • H. Abdi (2007) Bonferroni and Sidak corrections for multiple comparisons. In Encyclopedia of Measurement and Statistics, pp. 103–107. Cited by: §3.5.
  • B. Aloraini, M. Nagappan, D. M. German, S. Hayashi, and Y. Higo (2019) An empirical study of security warnings from static application security testing tools. Journal of Systems and Software 158, pp. 110427. External Links: ISSN 0164-1212, Document, Link Cited by: §3.5.
  • G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, and Y. Guéhéneuc (2008) Is it a bug or an enhancement? a text-based approach to classify change requests. In Proceedings of the 2008 Conference of the Center for Advanced Studies on Collaborative Research: Meeting of Minds, CASCON ’08, New York, NY, USA. External Links: ISBN 9781450378826, Document Cited by: §1, §3.3.
  • M. Christakis and C. Bird (2016) What developers want and need from program analysis: an empirical study. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, New York, NY, USA, pp. 332–343. External Links: ISBN 978-1-4503-3845-5, Link, Document Cited by: §1.
  • N. Cliff (1993) Dominance statistics: ordinal analyses to answer ordinal questions. Psychological Bulletin. Cited by: §3.5.
  • P. Devanbu, T. Zimmermann, and C. Bird (2016) Belief evidence in empirical software engineering. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), Vol. , pp. 108–119. External Links: Document, ISSN Cited by: §1.
  • Y. Fan, X. Xia, D. Alencar da Costa, D. Lo, A. E. Hassan, and S. Li (2019) The impact of changes mislabeled by szz on just-in-time defect prediction. IEEE Transactions on Software Engineering (), pp. 1–1. External Links: Document Cited by: §1.
  • N. Fenton and J. Bieman (2014) Software metrics: a rigorous and practical approach, third edition. 3rd edition, CRC Press, Inc., Boca Raton, FL, USA. External Links: ISBN 1439838224, 9781439838228 Cited by: §1.
  • A. Habib and M. Pradel (2018) How many of all bugs do we find? a study of static bug detectors. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, New York, NY, USA, pp. 317–328. External Links: ISBN 978-1-4503-5937-5, Link, Document Cited by: §1, §2, §2.
  • S. Heckman and L. Williams (2009) A model building process for identifying actionable static analysis alerts. In 2009 International Conference on Software Testing Verification and Validation, Vol. , pp. 161–170. External Links: Document, ISSN Cited by: §1.
  • S. Herbold, A. Trautsch, B. Ledel, A. Aghamohammadi, T. A. Ghaleb, K. K. Chahal, T. Bossenmaier, B. Nagaria, P. Makedonski, M. N. Ahmadabadi, K. Szabados, H. Spieker, M. Madeja, N. Hoy, V. Lenarduzzi, S. Wang, G. Rodríguez-Pérez, R. Colomo-Palacios, R. Verdecchia, P. Singh, Y. Qin, D. Chakroborti, W. Davis, V. Walunj, H. Wu, D. Marcilio, O. Alam, A. Aldaeej, I. Amit, B. Turhan, S. Eismann, A. Wickert, I. Malavolta, M. Sulir, F. Fard, A. Z. Henley, S. Kourtzanidis, E. Tuzun, C. Treude, S. M. Shamasbi, I. Pashchenko, M. Wyrich, J. Davis, A. Serebrenik, E. Albrecht, E. U. Aktas, D. Strüber, and J. Erbel (2021) Large-scale manual validation of bug fixing commits: a fine-grained analysis of tangling. Note: Article was recently accepted at Empirical Software Engineering External Links: 2011.06244, Link Cited by: §1, §3.3, §3.3, §3.5, §3.5, §5.3.
  • S. Herbold, A. Trautsch, and F. Trautsch (2020) Issues with szz: an empirical assessment of the state of practice of defect prediction data collection. Note: Article was recently accepted at Empirical Software Engineering External Links: Link Cited by: §3.3, §3.3, §3.3.
  • K. Herzig, S. Just, and A. Zeller (2013a) It’s not a bug, it’s a feature: how misclassification impacts bug prediction. In Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pp. 392–401. External Links: ISBN 9781467330763 Cited by: §3.3.
  • K. Herzig, S. Just, and A. Zeller (2013b) It’s not a bug, it’s a feature: how misclassification impacts bug prediction. In Proceedings of the International Conference on Software Engineering, ICSE ’13, Piscataway, NJ, USA, pp. 392–401. External Links: ISBN 978-1-4673-3076-3, Link Cited by: §1.
  • B. Johnson, Y. Song, E. Murphy-Hill, and R. Bowdidge (2013) Why don't software developers use static analysis tools to find bugs?. In Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, Piscataway, NJ, USA, pp. 672–681. External Links: ISBN 978-1-4673-3076-3, Link Cited by: §1.
  • R. Just, D. Jalali, and M. D. Ernst (2014) Defects4J: a database of existing faults to enable controlled testing studies for java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, ISSTA 2014, New York, NY, USA, pp. 437–440. External Links: ISBN 9781450326452, Document Cited by: §2.
  • Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, and N. Ubayashi (2013) A large-scale empirical study of just-in-time quality assurance. IEEE Transactions on Software Engineering 39 (6), pp. 757–773. External Links: Document, ISSN 2326-3881 Cited by: §3.2.
  • S. Kim and M. D. Ernst (2007) Which warnings should i fix first?. In Proceedings of the the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ESEC-FSE ’07, New York, NY, USA, pp. 45–54. External Links: ISBN 978-1-59593-811-4, Document Cited by: §1.
  • U. Koc, P. Saadatpanah, J. S. Foster, and A. A. Porter (2017) Learning a classifier for false positive error reports emitted by static code analysis tools. In Proceedings of the 1st ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 35–42. External Links: ISBN 978-1-4503-5071-6, Document Cited by: §1.
  • V. Lenarduzzi, F. Lomio, H. Huttunen, and D. Taibi (2020) Are sonarqube rules inducing bugs?. 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 501–511. Cited by: §1, §1, §2.
  • H. B. Mann and D. R. Whitney (1947) On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 18 (1), pp. 50–60. Cited by: §3.5.
  • N. Nagappan and T. Ball (2005) Static analysis tools as early indicators of pre-release defect density. In Proceedings of the 27th International Conference on Software Engineering, ICSE ’05, New York, NY, USA, pp. 580–586. External Links: ISBN 1-58113-963-2, Link, Document Cited by: §1, §2.
  • L. Pascarella, F. Palomba, and A. Bacchelli (2019) Fine-grained just-in-time defect prediction. Journal of Systems and Software 150, pp. 22 – 36. External Links: ISSN 0164-1212, Document, Link Cited by: §1, §3.2.
  • M. D. Penta, L. Cerulo, and L. Aversano (2009) The life and death of statically detected vulnerabilities: an empirical study. Information and Software Technology 51 (10), pp. 1469 – 1484. Note: Source Code Analysis and Manipulation, SCAM 2008 External Links: ISSN 0950-5849, Document, Link Cited by: §3.5.
  • R. Plosch, H. Gruber, A. Hentschel, G. Pomberger, and S. Schiffer (2008) On the relation between external software quality and static code analysis. In 2008 32nd Annual IEEE Software Engineering Workshop, Vol. , pp. 169–174. External Links: Document, ISSN 1550-6215 Cited by: §1, §2.
  • L. Querel and P. C. Rigby (2018) WarningsGuru: integrating statistical bug models with static analysis to provide timely and specific bug warnings. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2018, New York, NY, USA, pp. 892–895. External Links: ISBN 9781450355735, Document Cited by: §1, §2.
  • L. Querel and P. C. Rigby (2021) Warning-introducing commits vs bug-introducing commits: A tool, statistical models, and a preliminary user study. In 29th IEEE/ACM International Conference on Program Comprehension, ICPC 2021, Madrid, Spain, May 20-21, 2021, pp. 433–443. External Links: Document Cited by: §1, §1, §2, §4.
  • F. Rahman, S. Khatri, E. T. Barr, and P. Devanbu (2014) Comparing static bug finders and statistical prediction. In Proceedings of the 36th International Conference on Software Engineering, ICSE 2014, New York, NY, USA, pp. 424–434. External Links: ISBN 978-1-4503-2756-5, Link, Document Cited by: §1, §2.
  • G. Rodriguez-Pérez, M. Nagappan, and G. Robles (2020) Watch out for extrinsic bugs! a case study of their impact in just-in-time bug prediction models on the openstack project. IEEE Transactions on Software Engineering (), pp. 1–1. External Links: Document Cited by: §3.3.
  • G. Rosa, L. Pascarella, S. Scalabrino, R. Tufano, G. Bavota, M. Lanza, and R. Oliveto (2021) Evaluating SZZ implementations through a developer-informed oracle. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021, pp. 436–447. External Links: Document Cited by: §1.
  • C. Rosen, B. Grawi, and E. Shihab (2015) Commit guru: analytics and risk prediction of software commits. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, New York, NY, USA, pp. 966–969. External Links: ISBN 9781450336758, Document Cited by: §2.
  • J. Śliwerski, T. Zimmermann, and A. Zeller (2005) When do changes induce fixes?. SIGSOFT Softw. Eng. Notes 30 (4), pp. 1–5. External Links: ISSN 0163-5948, Document Cited by: §1.
  • D. Spadini, M. Aniche, and A. Bacchelli (2018) PyDriller: Python framework for mining software repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering - ESEC/FSE 2018, New York, New York, USA, pp. 908–911. External Links: Document, ISBN 9781450355735, Link Cited by: §3.2.
  • F. Thung, Lucia, D. Lo, L. Jiang, F. Rahman, and P. T. Devanbu (2012) To what extent could we detect field defects? an empirical study of false negatives in static bug finding tools. In 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, Vol. , pp. 50–59. External Links: Document, ISSN Cited by: §1, §2, §2.
  • A. Trautsch, S. Herbold, and J. Grabowski (2020a) A longitudinal study of static analysis warning evolution and the effects of PMD on software quality in Apache open source projects. Empirical Software Engineering. External Links: Document Cited by: §1, §1, §3.5, §3.5, §3.6.3, §4.
  • A. Trautsch, S. Herbold, and J. Grabowski (2020b) Static source code metrics and static analysis warnings for fine-grained just-in-time defect prediction. In 36th International Conference on Software Maintenance and Evolution (ICSME 2020), Cited by: §1, §3.2, §3.5, §4.
  • F. Trautsch, S. Herbold, P. Makedonski, and J. Grabowski (2017) Addressing problems with replicability and validity of repository mining studies through a smart data platform. Empirical Software Engineering. External Links: Document Cited by: §2.
  • M. Tufano, F. Palomba, G. Bavota, M. D. Penta, R. Oliveto, A. D. Lucia, and D. Poshyvanyk (2017) There and back again: can you compile that snapshot?. Journal of Software: Evolution and Process 29 (4). External Links: Link Cited by: §3.1.
  • C. Vassallo, S. Panichella, F. Palomba, S. Proksch, H. C. Gall, and A. Zaidman (2020) How developers engage with static analysis tools in different contexts. Empirical Software Engineering 25. External Links: ISSN 1573-7616, Document Cited by: §1.
  • A. Vetro, M. Morisio, and M. Torchiano (2011) An empirical validation of findbugs issues related to defects. In 15th Annual Conference on Evaluation Assessment in Software Engineering (EASE 2011), Vol. , pp. 144–153. External Links: Document, ISSN Cited by: §1.
  • C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén (2000) Experimentation in software engineering: an introduction. Kluwer Academic Publishers, Norwell, MA, USA. External Links: ISBN 0-7923-8682-5 Cited by: §5.