How to Find Actionable Static Analysis Warnings

Automatically generated static code warnings suffer from a large number of false alarms. Hence, developers only take action on a small percentage of those warnings. To better predict which static code warnings should not be ignored, we suggest that analysts need to look deeper into their algorithms to find choices that better suit the particulars of their specific problem. Specifically, we show here that effective predictors of such warnings can be created by methods that locally adjust the decision boundary (between actionable warnings and others). These methods yield a new high-water mark for recognizing actionable static code warnings. For eight open-source Java projects (CASSANDRA, JMETER, COMMONS, LUCENE-SOLR, ANT, TOMCAT, DERBY) we achieve perfect test results on 4/8 datasets and, overall, a median AUC (area under the curve of true positive rate against false positive rate) of 92%.





1 Introduction

Static analysis (SA) tools report errors in source code, without needing to execute that code. This makes them very popular in industry. For example, the FindBugs tool [ayewah2010google] of Figure 1 has been downloaded over a million times. Unfortunately, due to the imprecision of static analysis and the different contexts where bugs appear, SA tools often suffer from a large number of false alarms that are deemed to be not actionable [tomassi2021real]. Hence, developers never act on most of their warnings [heckman2008establishing, heckman2009model, kim2007warnings]. Previous research work shows that 35% to 91% of SA warnings reported as bugs by SA tools are routinely ignored by developers [heckman2009model, heckman2008establishing, kim07].

Those false alarms produced by SA tools are a significant barrier to the wide-scale adoption of these SA tools [johnson2013don, ChristakisB16]. Accordingly, in 2018 [wang2018there], 2020 [yang2021learning] and 2021 [yang2021understanding], Wang et al. and Yang et al. proposed data miners that found the subset of static code warnings that developers found “actionable” (i.e. those that motivate developers to change the code). But in 2022, Kang et al. [kang2022detecting] showed that, of the 31,000+ records used by Wang et al. and Yang et al., they could only generate 768 error-free records, which meant all the prior Wang et al. and Yang et al. results need to be revisited.

When Kang et al. tried to build predictors from the 768 good records, they found that their best-performing predictors were not effective (e.g., very low median AUCs of 41%; for details, see Table II). Hence the following remains an open research question:

RQ1: For detecting actionable static code warnings, what data mining methods should we recommend?

This paper conjectures that prior work failed to find good predictors because of a locality problem. In the learners used in that prior work, the decision boundary between actionable warnings and others was determined by a single global policy. More specifically, we conjecture that:

For complex data, global treatments perform worse than localized treatments which adjust different parts of the landscape in different ways.

To test this, we use local treatments to adjust the decision boundary in different ways in different parts of the data.

  1. Boundary engineering: adjust the decision boundary near our data points;

  2. Label engineering: control outliers in a local region by using just a small fraction of those local labels;

  3. Instance engineering: address class imbalance in local regions of the data;

  4. These treatments are combined with parameter engineering to control how we build models.

We call this combination of treatments GHOST2 (GHOST2 extends GHOST [yedida2021value] which just used one of these treatments). When researchers propose an intricate combination of ideas, it is prudent to ask several questions:

RQ2: Does GHOST2’s combination of instance, label, boundary and parameter engineering reduce the complexity of the decision boundary?

Later in this paper, we will show evidence that our proposed methods simplify the “error landscape” of a data set (a concept which we will discuss in detail in §4).

RQ3: Does GHOST2’s use of instance, label, boundary and parameter treatments improve predictions?

Using data from Kang et al. (768 records from eight open-source Java projects), we show that GHOST2 was able to generate excellent predictors for actionable static code warnings.

Figure 1: Example of a static analysis warning, generated via the FindBugs tool [ayewah2010google].

RQ4: Are all parts of GHOST2 necessary; i.e. would something simpler also achieve the overall goal?

To answer RQ4, this paper reports an ablation study that removes one treatment at a time from our four recommended treatments. For the purposes of recognizing and avoiding static code analysis false alarms, it will be shown that ignoring any part of our proposed solution leads to worse predictors. Hence, while we do not know if changes to our design might lead to better predictors, the ablation study does show that removing anything from that design makes matters worse.

This work has six key contributions:

  1. As a way to address, in part, the methodological problems raised by Kang et al., GHOST2 makes its conclusions using a small percentage of the raw data (10%). That is, to address the issues of corrupt data found by Kang et al., we say “use less data” and, for the data that is used, “reflect more on that data”.

  2. A case study of successful open collaboration by software analytics researchers. This paper is joint work between the Yang et al. and Kang et al. teams from the United States and Singapore. By recognizing a shared problem, then sharing data and tools, in eight weeks these two teams produced a new state-of-the-art result that improves on all of the past papers by these two teams (within this research arena). This illustrates the value of open and collaborative science, where groups with different initial findings come together to help each other in improving the state-of-the-art for the benefit of science and industry.

  3. Motivation for changing the way we train software analytics newcomers. It may not be enough to just reflect on the different properties of off-the-shelf learners. Analysts may need to be skilled in boundary, label, parameter and instance engineering.

  4. GHOST2’s design, implementation, and evaluation.

  5. A new high-water mark in software analytics for learning actionable static code warnings.

  6. A reproduction package that other researchers can use to repeat/refute/improve on our results.

The rest of this paper is structured as follows. The next section offers some background notes. §3 discusses the locality problem for complex data sets and §4 offers details on our treatments. §5 describes our experimental methods after which, in §6, we show that GHOST2 outperforms (by a large margin) prior results from Kang et al. We discuss threats to validity for our study in §7, before a discussion in §8 and concluding in §9.

Before all that, we digress to stress the following:

  • A learned model must be tested on the kinds of data expected in practice.

  • Hence, any treatments to the data (e.g. via instance, label, boundary engineering) are restricted to the training data, and do not affect the test data.

2 Background

2.1 Static Code Analysis

Automatic static analysis (SA) tools, such as FindBugs (see Figure 1), are tools for detecting bugs in source code, without having to execute that code. As they can find real bugs at low cost [thung2012extent, habib2018many], they have been adopted in open source projects and in industry [ayewah2010google, sadowski2018lessons, beller2016analyzing, zampetti2017open, panichella2015would, vassallo2020developers]. However, as they do not guarantee that all warnings are real bugs, these tools produce false alarms. The large number of false alarms produced is a barrier to adoption [johnson2013don, ChristakisB16]; it is easy to imagine how developers will be frustrated by using tools that require them to inspect numerous false alarms before finding a real bug. While false alarms include spurious warnings caused by the over-approximation of possible program behaviors during program analysis, false alarms also refer to warnings that developers do not act on. For example, developers may not think that the warning represents a bug (e.g. due to “style” warnings that developers perceive to be of little benefit) or may not wish to modify obsolete code.

The problem of addressing false alarms from static analysis tools has been widely studied. There have been many recent attempts to address the problem. Some researchers have proposed new SA tools that use more sophisticated, but costly, static analysis techniques (e.g. Infer [calcagno2015moving], NullAway [banerjee2019nullaway]). Despite their improvements, these tools still produce many false alarms [tomassi2021real]. Other attempts to prune false alarms include the use of test case generation to validate the presence of a bug at the source code location indicated by the warning [kallingal2021validating]. As generating test cases is expensive, these techniques may face issues when scaling up to larger projects, limiting their practicality.

2.2 Early Results: Wang et al., 2018

By framing the problem as a binary classification problem, machine learning techniques can identify actionable warnings (allowing us to prune false alarms) [hanam2014finding, heckman2008establishing, liang2010automatic, ruthruff2008predicting, wang2018there, yang2021learning, yang2021understanding]. These techniques use features extracted from code analysis and metrics computed over the code and warning’s history in the project. Figure 2 illustrates this process. A static analyzer is run on a training revision and the warnings produced are labelled. When applied to the latest revision, only warnings classified as actionable warnings by the machine learner are presented to the developers.

Figure 2: To detect actionable warnings, a learner is trained on warnings from a training revision. Each warning is annotated with a label. When deployed on the latest revision, only warnings classified as actionable warnings by the machine learner are presented to the developers.

To assess proposed machine learners, datasets of warnings produced by FindBugs have been created. As the ground-truth label of each warning is not known, a heuristic was applied to infer them. This heuristic compares the warnings reported at a particular revision of the project against a revision set in the future. If a warning is no longer present, but the file is still present, then the heuristic determines that the warning was fixed. As such, the warning is actionable. Otherwise, if the warning is still present, then the warning is a false alarm.
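The labelling heuristic just described can be sketched in a few lines. This is an illustration only, not the tooling used in prior work: the `SAWarning` record, its fields, and the function name are hypothetical, and the treatment of warnings whose file has been deleted (left unlabelled here) is our assumption, since the description above does not cover that case.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SAWarning:
    # Hypothetical identity of a warning; real tools need a matching scheme
    # that survives code edits between revisions.
    file: str
    kind: str
    location: str

def label_warnings(training_warnings, future_warnings, future_files):
    """Label each training-revision warning by looking at a future revision."""
    still_reported = set(future_warnings)
    labels = {}
    for w in training_warnings:
        if w in still_reported:
            labels[w] = "false alarm"   # warning persists: nobody acted on it
        elif w.file in future_files:
            labels[w] = "actionable"    # warning gone, file remains: assumed fixed
        else:
            labels[w] = None            # file deleted: left unlabelled (assumption)
    return labels
```

As discussed in §2.4, labels produced this way proved unreliable, which is why such a heuristic should be treated as a first pass rather than ground truth.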

Wang et al. [wang2018there] ran a systematic literature review to collect and analyze 100+ features proposed in the literature, categorizing them into 8 categories. To remove ineffective features, they performed a greedy backward selection algorithm. From the features, they identified a set of features that offered effective performance.

2.3 Further Result: Yang et al., 2021

Yang et al. [yang2021learning] further analyzed the features using the data collected by Wang et al. [wang2018there]. They found that all machine learning techniques were effective and performed similarly to one another. Their analysis revealed that the intrinsic dimensionality of the problem was low; the features used in the experiments were more verbose than the actual attributes required for classifying actionable warnings. This motivates the use of simpler machine learners over more complex learners. From their analysis, SVMs were recommended for use in this problem, as they were both effective and can be trained at a low cost. In contrast, deep learners were effective but more costly to train.

For each project in their experiments, one revision (training revision) was selected for extracting warnings for training the learner, and another revision (testing revision), set chronologically in the future of the training revision, was selected for extracting warnings for evaluating the learner. This simulates a realistic usage scenario of the tool, where the learner is trained using past data before developers apply it to another revision of the source code.

2.4 Issues in Prior Results: Kang et al., 2022

Subsequently, Kang et al. [kang2022detecting] replicated the Yang et al. [yang2021learning] study to find subtle methodological issues in the Wang et al. data [wang2018there] which led to overoptimistic results.

Firstly, Kang et al. found data leakage where the information regarding the warning in the future, used to determine the ground-truth labels, leaked into several features. Five features (warning context in method, file, for warning type, defect likelihood, discretization of defect likelihood) measure the ratio of actionable warnings within a subset of warnings (e.g. warnings in a method, file, of a warning type). To determine if a warning is actionable, the ground-truth label was used to compute these features, leading to data leakage. Kang et al. reimplemented the features such that they are computed using only historical information, without reference to the ground truth determined from the future state of the projects. As only the features were reimplemented, the total number of training and testing instances remained unchanged.

Secondly, they found many warnings appearing in both the training and testing dataset. As some warnings remain in the project at the time of both the training and testing dataset, the model has access to the ground-truth label for the warning at training time. Kang et al. addressed this issue by removing warnings that were already present during the training revision from the testing dataset, ensuring that the learner does not see the same warning in both datasets. After removing these warnings, the number of warnings in the testing revision decreased from 15,695 to 2,615.

Evaluation Metric    Description
Precision            TP / (TP + FP)
AUC                  Area under the receiver operating characteristics curve (the true positive rate against the false positive rate)
False alarm rate     FP / (FP + TN)
Recall               TP / (TP + FN)
Table I: Evaluation metrics based on TP (true positives); TN (true negatives); FN (false negatives) and FP (false positives).
Dataset        Precision   AUC    False alarm rate   Recall
cassandra      0.67        0.33   0.25               0.67
commons        0.67        0.52   0.57               0.62
lucene-solr    0.56        0.70   0.36               0.71
median         0.52        0.41   0.19               0.32
jmeter         0.50        0.36   0.14               0.17
tomcat         0.52        0.41   0.19               0.32
derby          0.20        0.64   0.12               0.08
ant            0.00        0.00   0.00               0.00
Table II: The Kang et al. predictors did not perform well on the repaired data. In this table, lower false alarms are better while higher precisions, AUC, and recall are better.

Next, Kang et al. analyzed the warning oracle, based on the heuristic comparing warnings at one revision to another revision in the future, used to automatically produce labels for the warnings in the dataset. After manual labelling of the actionable warnings, Kang et al. found that only 47% of warnings automatically labelled actionable were considered by the human annotators to be actionable. This indicates that the heuristic employed as the warning oracle is not sufficiently reliable for automatically labelling the dataset.

Kang et al. manually labelled 1,357 warnings. After filtering out duplicates and uncertain labels, a dataset of 768 warnings remained. On this dataset, Kang et al. again applied off-the-shelf SVM models, assessing them with the evaluation metrics listed in Table I.
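For readers who want to recompute such scores, the following sketch derives the four metrics from predictions. It is not the evaluation script of any of the cited studies: the function names are ours, labels are assumed to be 1 (actionable) and 0 (not actionable), undefined ratios are reported as 0.0 by assumption, and AUC is computed through its rank interpretation (the chance that a randomly chosen actionable warning is scored above a randomly chosen non-actionable one).

```python
def confusion(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, where 1 means actionable."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def ratio(num, den):
    # Assumption: report 0.0 when the denominator is zero.
    return num / den if den else 0.0

def precision(tp, fp, tn, fn):
    return ratio(tp, tp + fp)

def recall(tp, fp, tn, fn):
    return ratio(tp, tp + fn)

def false_alarm_rate(tp, fp, tn, fn):
    return ratio(fp, fp + tn)

def auc(y_true, scores):
    """Area under the ROC curve via pairwise ranking; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return 0.0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that AUC needs the classifier's continuous scores, whereas the other three metrics need only its final yes/no decisions.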

Kang et al. used the learners recommended by prior work; i.e. radial basis function SVMs. The results of the SVM are shown in Table II. Those results are hardly impressive:

  • Median precisions barely more than 50%;

  • Very low median AUCs of 41%;

  • Extremely low median recalls of 32%.

That is to say, while Kang et al. were certainly correct in their criticisms of the data used in prior work, based on their paper, how to generate good predictors for static code false alarms remains an open issue.

3 Rethinking the Problem

This section suggests that detecting actionable static code warnings is a “bumpy” problem (defined below) and that such problems cannot be understood by learners that use simplistic boundaries between classes.

The core task of any classification problem is the creation of a hyperspace boundary that lets us isolate what is most desired or most interesting. Different learners build their boundaries in different ways.

Boundaries can be changed by adjusting the parameters that control the learner. For example, in Kang et al.’s radial basis functions, the regularization parameter $C$ is used to set the tolerance of the model to (some) misclassifications. By adjusting $C$, an analyst can change the generalization error; i.e. the error when the model is applied to as-yet-unseen test data.

Figure 3 shows how changes to $C$ can alter the decision boundary between some red examples and blue examples. Note that each setting of $C$ changes the accuracy of the predictor; i.e. for good predictions, it is important to fit the shape of the decision boundary to the shape of the data.

(Technical aside: while this example was based on SVM technology, the same line of argument applies to any other classifier; i.e. changing the control parameters of the learner also changes the hyperspace boundary found by that learner and, hence, the predictive prowess of that learner.)

We have tried applying hyperparameter optimization to $C$ in a failed attempt to improve that performance (see the C1 results of Table VII). From that failed experiment, we conclude that however well such adjustments work for radial basis functions, they do not work well enough to fix the unimpressive predictive performances of Table II.

Why do radial basis SVMs fail in this domain? Our conjecture is that the hyperspace boundary dividing the static code examples (into false positives and others) is so “bumpy” that the kinds of shape changes seen in Figure 3 can never adequately model those examples. (“Bumpy” data contain complexities such as many local minima, saddle points, very flat regions, and/or widely varying curvatures. For example, see Figure 4.)

Figure 3: The $C$ parameter of a radial basis function SVM alters the shape of the hyperspace boundary. Acc is accuracy, i.e. the ratio of true positives plus true negatives to all predictions made by an SVM across that boundary. Example from [kumar20].

To test that conjecture, we first checked for “bumpiness” using a technique from Li et al. [li2018visualizing]. That technique visualizes the “error landscape” (i.e. how fast small changes in the independent variables alter the error estimation). For our TOMCAT data, Li et al.’s methods resulted in Figure 4. There, we see a “bumpy” landscape with multiple local minima.

Having confirmed that our data is “bumpy”, our second step was to look for ways to reduce that bumpiness. Initially, we attempted to use neural nets since that kind of learner is meant to be able to handle complex hyperspace boundaries [WittenFH11]. As discussed in §6, that attempt failed even after trying several different architectures such as feedforward networks, CNN, and CodeBERT [rumelhart1986learning, habib2018many, vaswani2017attention] (with and without tuning learner control parameters).

Since standard neural net technology failed, we tried several manipulation techniques for the training process, described in the next section.

Figure 4: Error landscape in the TOMCAT data before applying the methods of this paper. In the plot, the higher the value on the vertical axis, the greater the loss value. Later in this paper, we will show this plot again, after it has been smoothed via the methods of §4 (see Figure 5 and Table IX).

4 Treatments

This section discusses a framework that holds operators for treating the data in order to adjust the decision boundary (in different ways for different parts of the data). For the purposes of illustration and experimentation, we offer operational examples for each part of the framework:

  • SMOTE for instance engineering;

  • SMOOTH for label engineering;

  • GHOST for boundary engineering;

  • DODGE for parameter engineering.

Before presenting those parts we note here that the framework is more than just those four treatments. As SE research matures, we foresee that our framework will become a workbench within which researchers replace some/all of these treatments with more advanced options.

That said, we have some evidence that SMOTE, SMOOTH, GHOST, DODGE are useful:

  • The ablation study of §5.2 shows that removing any one of these treatments leads to worse performance.

  • All these treatments are very fast: sub-linear time for SMOTE and SMOOTH, linear time for GHOST, and DODGE is known to be orders of magnitude faster than other hyperparameter optimizers [agrawal2019dodge].

4.1 Instance Engineering (via SMOTEing)

To remove the “bumpiness” in data like Figure 4, we need to pull and push the decision boundaries between different classes into a smoother shape. But also, unlike the simplistic tuning available in radial SVMs, we want that process to perform differently in different parts of the data.

One way to adjust the decision boundary in different parts of the data is to add (or delete) artificial examples around each example $x$. This builds a little “hill” (or valley) in the local region. As a result, in that local region, it becomes more (or less) certain that predictions reach the same conclusion as $x$. In effect, adding examples pushes the decision boundary away (or, in the case of deletions, pulls it closer). SMOTE [chawla2002smote] is one instance engineering technique that:

  • Finds the five nearest neighbors to $x$ with the same label;

  • Selects one of them, $x'$, at random;

  • Creates a new example, with the same label as $x$, at some random point between $x$ and $x'$.
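The three steps above can be sketched as follows. This is a minimal illustration of the interpolation idea and not the implementation used in our experiments; the function names and the `seed` argument are ours, Euclidean distance is assumed, and the sketch expects at least two examples of the class being oversampled.

```python
import math
import random

def smote_one(x, same_label_points, k=5, rng=random):
    """Create one synthetic example near x from examples sharing x's label."""
    others = [p for p in same_label_points if p is not x]
    # Step 1: the k nearest neighbors of x with the same label.
    neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
    # Step 2: select one of them at random.
    chosen = rng.choice(neighbors)
    # Step 3: a random point on the line segment between x and that neighbor.
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(x, chosen)]

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority-class examples."""
    rng = random.Random(seed)
    return [smote_one(rng.choice(minority), minority, k, rng)
            for _ in range(n_new)]
```

Because every synthetic point lies between two existing points of the same class, the new examples thicken regions the class already occupies rather than extending into new territory.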

4.2 Label Engineering (via SMOOTHing)

SMOTE has seen much success in recent SE papers as a way to improve prediction efficacy [agrawal2018better]. But this technique makes a linearity assumption that all the data around $x$ is correctly labelled (in our case, as examples of actionable or unactionable static code warnings). This may not be true. A recent survey [cordeiro2020survey] and recent SE researchers [frugal, debtfree, 9064604, jitterbug] note that noisy labels can occur when human annotators are present [mcnicol2005primer] or when those humans have divergent opinions about the labels [barkan2021reduce, ma2019blind]. Although our labels were re-checked by the authors of Kang et al. [kang2022detecting], our ablation study (below) reports that it is best to apply some mitigation method for poorly labelled examples. For example, in this work we applied the following SMOOTHing operator, where data is assigned labels using multiple near neighbors. This has the effect of removing outliers in the data. Our SMOOTH operator works as follows:

  • Given $n$ training samples (and therefore, $n$ labels), we keep a small random subset and discard the rest.

  • Next, we use a KD-tree to recursively sub-divide the remaining data into leaf clusters of nearest neighbors. Within each leaf, all examples are assigned a label that is the mode of the labels in that leaf.
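A rough sketch of those two steps is shown below. It should be read as an illustration under stated assumptions and not as our released code: the `keep` fraction and `leaf_size` are hypothetical parameters chosen for the example, and the KD-tree is approximated by recursive median splits that cycle through the feature axes.

```python
import random
from collections import Counter

def kd_leaves(rows, leaf_size, depth=0):
    """Recursively split (features, label) rows at the median of one axis."""
    if len(rows) <= leaf_size:
        return [rows]
    axis = depth % len(rows[0][0])
    rows = sorted(rows, key=lambda r: r[0][axis])
    mid = len(rows) // 2
    return (kd_leaves(rows[:mid], leaf_size, depth + 1)
            + kd_leaves(rows[mid:], leaf_size, depth + 1))

def smooth(X, y, keep=0.1, leaf_size=4, seed=0):
    """Keep a random fraction of the data, then relabel each leaf by its mode."""
    rng = random.Random(seed)
    n_keep = max(1, round(keep * len(X)))
    kept = rng.sample(range(len(X)), n_keep)
    rows = [(X[i], y[i]) for i in kept]
    new_X, new_y = [], []
    for leaf in kd_leaves(rows, leaf_size):
        mode = Counter(label for _, label in leaf).most_common(1)[0][0]
        for features, _ in leaf:
            new_X.append(features)
            new_y.append(mode)   # outlier labels inside the leaf are overwritten
    return new_X, new_y
```

As with all treatments in this paper, such an operator would be applied to the training data only.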

One interesting and beneficial side-effect of SMOOTHing is that we make conclusions on our test data using just 10% of the training data. By reducing the labelling required to make conclusions, SMOOTHing offers a way to help future studies avoid the problems reported by Kang et al. [kang2022detecting]:

  • One of the major findings of the Kang et al. study was that earlier work [yang2021learning] had mislabelled much of its data. From that study, we assert that it is important for analysts to spend more time checking their labels. We note that there are many other ways to reduce the labels required for supervised learning.

  • SMOOTHing reduces the effort required for that checking process (by a factor of ten).

As an aside, we note that SMOOTHing belongs to a class of algorithms called semi-supervised learning [frugal, debtfree] that try to make conclusions using as few labels as possible. The literature on semi-supervised learning is voluminous [berthelot2019mixmatch, fairssl, kingma2014semi, zhai2019s4l, zhu2005semi] and so, in theory, there could be many other better ways to perform label engineering. This would be a productive area for future research. But for now, the ablation study (reported below) shows that SMOOTHing is useful (since removing it degrades predictive performance).

4.3 Boundary Engineering (via GHOSTing)

As defined above, instance and label engineering do not reflect on the quality of data in the local region.

To counter that, this study employs a boundary method called “GHOSTing”, recently developed and applied to software defect prediction in prior work [yedida2021value]. Boundary engineering is different to label and instance engineering since it adjusts the frequency of different classes in the local region (while the above typically end up repeating the same label for a particular locality). Hence, in that region, it changes the decision boundary.

GHOSTing addresses class imbalance issues in the data. When an example with one label is surrounded by too many examples of another label, then the signal associated with that example can be drowned out by its neighbors. To fix this, for a two-class dataset in which one class is the minority, GHOSTing oversamples the minority class by adding concentric boxes of points around each minority sample. The number of concentric boxes is directly related to the class imbalance: the higher the imbalance, the more boxes are added. Specifically, the number of boxes added is computed from the fraction of samples in the minority class. While the trivial effect of this is to oversample the minority class (indeed, as pointed out in [yedida2021value], this reverses the class imbalance), we note that the algorithm effectively builds a wall of points around minority samples. This pushes the decision boundary away from the training samples, which is preferred since a test sample that is close to a training sample then has a lesser chance of being misclassified due to the decision boundary lying between them.
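To make the “wall of points” idea concrete, the sketch below places synthetic points on the surfaces of axis-aligned boxes of growing size around each minority sample. It is only an illustration of concentric-box oversampling and should not be read as the GHOST implementation: the number of boxes is passed in directly rather than derived from the class imbalance, and `step`, `points_per_box` and all function names are hypothetical.

```python
import random

def box_surface_point(center, half_width, rng):
    """Random point on the surface of an axis-aligned box around center."""
    offsets = [rng.uniform(-half_width, half_width) for _ in center]
    axis = rng.randrange(len(center))
    # Snap one coordinate to a face so the point lies on the box surface.
    offsets[axis] = rng.choice([-half_width, half_width])
    return [c + o for c, o in zip(center, offsets)]

def add_concentric_boxes(minority, n_boxes, step, points_per_box=4, seed=0):
    """Surround every minority sample with n_boxes boxes of synthetic points."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        for b in range(1, n_boxes + 1):
            for _ in range(points_per_box):
                synthetic.append(box_surface_point(x, b * step, rng))
    return synthetic
```

All synthetic points would carry the minority label, so a learner trained on the augmented data is discouraged from drawing its boundary close to any minority sample.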

Our pre-experimental intuition was that boundary engineering would replace the need to use instance engineering. However, as shown by our ablation study, for recognizing actionable static code warnings, we needed both tools. On reflection, we realized both may be necessary since while (a) boundary engineering can help make local adjustments to the decision boundary, it can (b) only work in regions where samples exist; instance engineering can help fill in gaps in sparser regions of the dataset.

Learner                Hyper-parameter           Range
Feedforward network    #layers
                       #units per layer
Logistic regression    Penalty
Random forest          Criterion                 gini, entropy
Decision tree          Criterion                 gini, entropy
                       Splitter                  best, random
SVM                    Kernel                    sigmoid, rbf, polynomial
CNN                    #convolutional blocks     [1, 4]
                       #convolutional filters    {4, 8, 16, 32, 64}
                       Dropout probability       (0.05, 0.5)
                       Kernel size               {16, 32, 64}
Table III: List of hyper-parameters tuned in our study. CodeBERT is not shown in this table since, as mentioned in the text, this analysis lacked the resources required to tune such a large model.

4.4 Parameter Engineering (via DODGEing)

We noted above that different learners generate different hyperspace boundaries (e.g. decision tree learners generate straight-line borders while SVMs with radial basis functions generate circular borders). Further, once a learner is selected, then as seen in Figure 4, it is possible to further adjust a border by altering the control parameters of that learner (e.g. see Figure 3). We call this adjustment parameter engineering.

Parameter engineering is like a scientist probing some phenomenon. After the data is divided into training and some separate test cases, parameter engineering algorithms conduct experiments on the training data looking for parameter settings that improve the performance of a model learned and assessed on the training data. Once some conclusions are reached about what parameters are best, then these are applied to the test data. Importantly, the parameter engineering should only use the training data for its investigations (since otherwise, that would be a threat to the external validity of the conclusions).

Parameter engineering executes within the space of control parameters of selected learners. These learners have the internal parameter space shown in Table III. We selected this range of learners using the following rationale:

  • In order to compare our new results to prior work by Kang et al.  [kang2022detecting], we use the Kang et al. SVMs with the radial basis kernel and balanced class weights.

    Feedforward networks

    These are artificial neural networks, comprising an acyclic graph of nodes that process input and produce an output. These dates back to the 1980s, and the parameters of these models are learned via backpropagation

    [rumelhart1986learning]. These networks have

    parameters. For these networks, we used the ReLU (rectified linear activation) function (

    ). This is a piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero.
    A convolutional neural net (CNN) is a structured neural net where the first several layers are sparsely connected in order to process information (usually visual). CNN is an example of an deep learner and are much larger than feedforward networks (these may span parameters). Optimizing an CNN is a very complex task (so many parameters) so following advice from the literature [ioffe2015batch, srivastava2014dropout], we used the following architecture. Our CNNs had multiple “convolutional blocks” defined as follows:
    1. ReLU activation

    2. Conv (with “same” padding)

    3. Batch norm [ioffe2015batch]

    4. Dropout [srivastava2014dropout]

    We note that this style of building convolutional networks, by building multiple “convolutional blocks” is very popular in the CNN literature [krizhevsky2012imagenet, lecun1989backpropagation]. Our specific design of the convolutional blocks was based on a highly voted answer on Stack Overflow 333 Note that with that architecture there is still room to adjust the ordering of the blocks– which is what we adjust when we tune our CNNs.
    CodeBERT [feng2020codebert] is a transformer-based model that been pre-trained model using millions of examples from contemporary programming languages such as Python, Java, JavaScript, PHP, Ruby, and Go. Such transformer models are those based on the “self-attention” mechanism proposed by vaswani2017attention. CodeBERT is even large than CNN and can contain parameters. One advantage of such large models is that can learn intricacies that are missed by smaller models.
    Table IV: Neural net architectures used in this study.
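As a minimal sketch of the ReLU activation described in Table IV, the piecewise rule can be written in a few lines of Python. The function name `relu` and the sample inputs are ours, chosen for illustration; they are not taken from the study's code.

```python
def relu(x):
    """Rectified linear activation: pass positive inputs through, clamp the rest to zero."""
    return x if x > 0 else 0.0


# Apply the activation element-wise to a small vector of (made-up) pre-activations.
outputs = [relu(v) for v in [-2.0, -0.5, 0.0, 0.5, 2.0]]
```

Negative and zero inputs map to zero, while positive inputs are returned unchanged.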
  • In order to compare our work to Kang et al.  [yang2021learning], we used a range of traditional learners (logistic regression, random forests, and single decision tree learners);

  • Also, we explored the various neural net algorithms shown in Table IV since these algorithms have a reputation of being able to handle complex decision boundaries [WittenFH11]. In his textbook on Empirical Methods for AI, Cohen [cohen1995empirical] advises that supposedly more complex solutions should be compared to a range of alternatives, including very simple methods. Accordingly, for neural nets, we used (a) feedforward networks from the 1980s; (b) the CNN deep learner used in much of contemporary SE analytics; and (c) the state-of-the-art CodeBERT model.

There are many algorithms currently available for automatically tuning these learning control parameters. As recommended by a prior study [agrawal2021simpler], we use Agrawal et al.’s DODGE algorithm [agrawal2019dodge]. DODGE is based on early work by Deb et al. in 2005 that proposed an “$\epsilon$-domination rule” [Deb05]; i.e.

If one setting to an optimizer yields results within $\epsilon$ of another, then declare that region as “tabu” and search elsewhere.

A surprising result from Agrawal et al.’s research was that $\epsilon$ can be very large. Agrawal et al. noted that if learners were run 10 times, each time using 90% of the training data (selected at random), then they often exhibited a standard deviation of 0.05 (or more) in their performance scores. Since performance differences below some threshold are statistically indistinguishable, Agrawal reasoned that $\epsilon$ could be correspondingly large. This is an important point. Suppose we are trying to optimize for two goals (e.g. recall and false alarm). Since those measures have the range zero to one, a large $\epsilon$ divides the output space of those two goals into just a small number of regions. Hence, in theory, DODGE could find good optimizations after just a few dozen random samples of the space of possible configurations.

When this theoretical prediction was checked experimentally on SE data, Agrawal [agrawal2021simpler] found that DODGE defeated traditional single-point cross-over genetic algorithms as well as state-of-the-art optimizers (e.g. Bergstra and Bengio’s HYPEROPT algorithm [Bergstra12]; at the time of this writing, April 2022, the paper proposing HYPEROPT has 7,557 citations in Google Scholar). Accordingly, this study used DODGE for its parameter engineering.
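To make the tabu idea concrete, the following is a simplified sketch of an epsilon-based search, not the published DODGE implementation. It assumes a small set of discrete options, an `evaluate` function returning a tuple of goals where larger is better, and an illustrative `epsilon` of 0.2; all of these names and values are our assumptions. The helper `count_regions` shows how few cells a coarse epsilon leaves in a two-goal output space.

```python
import math
import random


def count_regions(epsilon, n_goals=2):
    """Count the epsilon-sized cells tiling a [0, 1]^n_goals output space."""
    return math.ceil(1.0 / epsilon) ** n_goals


def epsilon_tabu_search(evaluate, options, epsilon, budget, seed=1):
    """Sample options, down-weighting any whose results fall within
    epsilon (on every goal) of a result that has already been seen."""
    rng = random.Random(seed)
    weights = {option: 0 for option in options}
    seen = []     # goal vectors observed so far
    best = None   # (goal vector, option) with the largest goal sum
    for _ in range(budget):
        top = max(weights.values())
        choice = rng.choice([o for o in options if weights[o] == top])
        goals = evaluate(choice)
        redundant = any(
            all(abs(a - b) < epsilon for a, b in zip(goals, old))
            for old in seen
        )
        # Redundant results mark that option as "tabu"; novel ones are rewarded.
        weights[choice] += -1 if redundant else 1
        seen.append(goals)
        if best is None or sum(goals) > sum(best[0]):
            best = (goals, choice)
    return best


# Hypothetical toy problem: each option maps to (recall, 1 - false alarm).
toy = {"a": (0.9, 0.8), "b": (0.5, 0.5), "c": (0.85, 0.75)}
best_goals, best_option = epsilon_tabu_search(toy.get, list(toy), epsilon=0.2, budget=10)
```

In this toy run, options "a" and "c" land in the same epsilon-sized region, so whichever is seen second is treated as redundant and the search moves on.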

Our pre-experimental intuition was that DODGEing would be fast enough to tune even the largest neural net model. This turned out not to be the case. The resources required to adjust the CodeBERT model are so large that, for this study, we had to use the “off-the-shelf” CodeBERT.

5 Experimental Methods

5.1 Data

This paper tested the efficacy of instance, label, boundary and parameter engineering using the revised and repaired data from the Kang et al. paper [kang2022detecting].

Recall that Kang et al. manually labelled warnings from the same projects studied by Yang et al. [yang2021learning] to assess the level of agreement between human annotators and the heuristic. The manual labelling was performed by two human annotators. When the annotators disagreed on the label of a warning, they discussed the disagreement to reach a consensus. While they achieved a high level of agreement (a Cohen’s Kappa above 0.8), manual labelling is costly, requiring human analysis of both the source code and the commit history of the code. That said, this labelling is essential since it removed closed warnings which are not actionable (e.g., the warnings may have been removed for reasons unrelated to the Findbugs warning).

Two other filters employed by Kang et al. were:

  • Unconfirmed actionable warnings were removed;

  • False alarms were randomly sampled to ensure a balance of labels (40% of the data were actionable) consistent with the rest of the experiments.

One of the complaints of the Kang et al. paper [kang2022detecting] against earlier work [yang2021learning] was that, for data that comes with some time stamp, it is inappropriate to use future data to predict past labels. To avoid that problem, in this study, we sorted the Kang et al. data by time stamps, then used 80% of the past data to predict the remaining 20% future labels.
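The time-ordered split described above can be sketched as follows. This is a minimal illustration, assuming each warning is a record with a `timestamp` field; the field names and the toy data are hypothetical and not taken from the Kang et al. data set.

```python
def time_ordered_split(records, train_fraction=0.8, key=lambda r: r["timestamp"]):
    """Sort records by time, then train on the oldest part and test on the newest."""
    ordered = sorted(records, key=key)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical warnings with integer timestamps (field names are illustrative).
warnings = [{"id": i, "timestamp": t} for i, t in enumerate([5, 1, 9, 3, 7, 2, 8, 4, 6, 10])]
train, test = time_ordered_split(warnings)
```

Because the records are sorted before cutting, every training example predates every test example, so no future data leaks into the model.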

The Kang et al. data comes from eight projects and we analyzed each project’s data separately. The 80:20 train:test splits resulted in the train:test sets shown in Table V (exception: for MAVEN, we split 50:50, since there are only 4 samples in total).

Pre-experimentally, we were concerned that learning from the smaller data sets of Table V would complicate our ability to make any conclusions from this data. That is, we needed to know:

RQ5: Are larger training sets necessary (for the task of recognizing actionable static code warnings)?

This turned out not to be a critical issue. As shown below, the performance patterns in our experiments were stable across all six of the smaller data sets used in this study.

Technical aside: In other papers, we have run repeated trials with multiple 80:20 splits for training:test data. This was not done here since some of our data sets are so small (see the first few rows of Table V) that any reduction in the training set size might disadvantage the learning process. Hence, the external validity claims of this paper come from patterns seen in eight different software projects.

Project      # train  # labels  imbalance%  # test
maven              2         1          33       1
cassandra          9         4          38       4
jmeter            10         4          43       4
commons           12         5          59       5
lucene-solr       19         5          38       6
ant               22         6          36       7
tomcat           134        13          41      37
derby            346        20          37      92
total            554        58           -     156
Table V: Summary of the data distribution

Treatment  Learner  % Labels  Description
A1         F        10        Our recommended method
A2         F        10        A1 without instance engineering (no SMOTE)
A3         F        10        A1 without hyper-parameter engineering (no DODGE)
A4         F        10        A1 without boundary engineering (no GHOST)
A5         F        100       A1 without label engineering (no SMOOTH). From TSE’21 [yedida2021value]
A6         T        100       A1 without label engineering, replacing feedforward with traditional learners
A7         T        10        A1 replacing feedforward with traditional learners
B1         T        10        A1 without boundary engineering, replacing feedforward with traditional learners
B2         C        10        A1 without boundary engineering, replacing feedforward with CNN
C1         T        100       A1 without boundary engineering or label engineering, replacing feedforward with traditional learners
C2         C        100       A1 without boundary engineering or label engineering, replacing feedforward with CNN
D1         T        100       Setup used by the Yang et al. [yang2021learning] and Kang et al. [kang2022detecting] studies
CodeBERT   B        100       CodeBERT without modifications

Table VI: Design of our ablation study. In the learner choice column, F = feedforward networks, T = traditional learners, C = CNN, B = CodeBERT.

5.2 Experimental Rig

This study explores:

  • Four pre-processors (boundary, label, parameter, instance) that could be mixed in $2^4 = 16$ ways;

  • Six traditional learners: logistic regression, decision trees, random forests, SVMs (with 3 basis functions);

  • Three neural net architectures: CNN, CodeBERT, feedforward networks.

To clarify the reporting of these treatments, we made the following decisions. Firstly, when reporting the results of the traditional learners, we just show the results of the one that beat the other traditional learners (which, in our case, was typically random forest or logistic regression).

Secondly, we do not apply pre-processing or parameter engineering on CodeBERT. This decision was required, for pragmatic reasons. Due to the computational cost of training that model, we could only run off-the-shelf CodeBERT.

Thirdly, rather than explore all 16 combinations of use/avoid different pre-processing, we ran the ablation study recommended in Cohen’s Empirical Methods for AI textbook [cohen1995empirical]. Ablation studies let us assess a combination of $N$ parts in time $O(N)$, not $O(2^N)$. Such ablation studies work as follows:

  • Commit to a preferred approach, with $N$ parts;

  • If removing any part degrades performance, then conclude that all parts are useful.
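The saving offered by an ablation design can be illustrated with a short sketch that contrasts the full factorial design with a leave-one-part-out design. The function and variable names are ours; note that Table VI also varies the learner, so it lists more treatments than this leave-one-out core.

```python
from itertools import product

PARTS = ["boundary", "label", "parameter", "instance"]
N_LEARNERS = 6 + 3  # six traditional learners plus three neural architectures


def all_combinations(parts):
    """Every use/avoid choice over the parts: 2^N treatments."""
    return [
        tuple(p for p, used in zip(parts, flags) if used)
        for flags in product([True, False], repeat=len(parts))
    ]


def ablation_treatments(parts):
    """The preferred approach plus one treatment per removed part: N + 1 treatments."""
    full = tuple(parts)
    return [full] + [tuple(p for p in parts if p != dropped) for dropped in parts]


factorial = all_combinations(PARTS)      # 16 pre-processing combinations
ablation = ablation_treatments(PARTS)    # 5 treatments: all parts, then each part removed
total_factorial = len(factorial) * N_LEARNERS
```

With four parts, the factorial design needs 16 pre-processing combinations (144 treatments once crossed with nine learners), while the ablation core needs only five.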

With these decisions, instead of having to report on 144 treatments, we need only show the 13 treatments in the ablation study of Table VI. In that table, for treatments that use any of boundary or label or parameter or instance engineering, we apply those treatments in the order recommended by the original GHOST paper [yedida2021value]. That paper found that it could improve recall by 30% (or more) by multiple rounds of SMOTE + GHOST. As per that advice, A1 executes our pre-processors in the order:

This paper does not explore the effect of different orderings; rather, our core idea is that the different engineering techniques work together to produce strong results. We leave the exploration of the effect of ordering to future work.

All the treatments labelled “A” (A1, A2, A3, A4, A5) in Table VI use the order shown above, perhaps (as part of the ablation study) skipping over one or more of the steps. We acknowledge that there are many possible ways to order the applications of our treatments, which is a matter we leave for future work. For the moment, the ordering shown above seems useful (evidence: see next section).

maven cassandra jmeter commons lucene-solr ant tomcat derby median
PRECISION (better results are larger)
A1 1 1 1 1 0.8 1 0.79 0.72 1
A2 0 0.25 1 0.67 1 1 0.68 0.73 0.71
A3 0.5 0.25 0.33 0.2 0.25 0.33 0.33 0.4 0.33
A4 1 0.75 1 1 0.75 1 1 0.75 1
A5 1 1 1 1 0.8 1 0.72 0.84 1
A6 1 1 0.5 0.33 0.67 1 0.85 0.89 0.87
A7 1 0.5 0.5 1 1 0 0.55 0.42 0.53
B1 (DODGE) 1 1 0 0.5 0 0 0.47 0.59 0.49
B2 (CNN) 0.5 0 0 0.6 0 0.29 0.51 0.61 0.4
C1 (DODGE) 1 1 1 0.33 0.67 0.67 0.67 0.81 0.74
C2 (CNN) 0.5 0.5 0.5 0.6 0.83 0.43 0.4 0.73 0.5
D1 0 0.5 0 0.6 0 0 0.39 0.39 0.2
CodeBERT 0.5 1 0.8 0.63 0.6 0 0.41 0.25 0.55
AUC: TP vs. TN (better results are larger)
A1 1 1 0.83 1 0.75 1 0.68 0.57 0.92
A2 0 0.5 0.75 0.83 0.63 0.75 0.6 0.7 0.67
A3 0.5 0.5 0.67 0.5 0.38 0.55 0.51 0.51 0.51
A4 1 0.5 1 0.75 0.63 0.8 0.54 0.59 0.69
A5 1 0.67 1 0.88 0.75 0.9 0.67 0.78 0.83
A6 1 1 0.83 0.75 0.88 1 0.85 0.76 0.87
A7 1 0.83 0.83 1 0.75 0.5 0.59 0.62 0.79
B1 (DODGE) 1 1 0.5 0.88 0.5 0.5 0.58 0.62 0.6
B2 (CNN) 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.62 0.5
C1 (DODGE) 1 1 1 0.75 0.88 0.9 0.8 0.76 0.89
C2 (CNN) 0.5 0.5 0.17 0.5 0.5 0.5 0.63 0.82 0.5
D1 0.5 0.17 0.5 0.5 0 0.38 0.48 0.47 0.48
CodeBERT 0.5 0.56 0.68 0.53 0.63 0.48 0.44 0.63 0.54
FALSE ALARM RATE (better results are smaller)
A1 0 0 0 0 0.5 0 0.29 0.79 0
A2 0 1 0 0.33 0 0 0.4 0.38 0.17
A3 1 1 0.67 1 0.75 0.4 0.71 0.05 0.73
A4 0 1 0 0 0.5 0 0 0.48 0
A5 0 0 0 0 0.5 0 0.57 0.41 0
A6 0 0 0.33 0.5 0.25 0 0.09 0.03 0.06
A7 0 0.33 0.33 0 0 0 0.17 0.44 0.09
B1 (DODGE) 0 0 0 0.25 0 0 0.35 0.11 0
B2 (CNN) 1 0 0 1 0 1 1 0.25 0.63
C1 (DODGE) 0 0 0 0.5 0.25 0.2 0.26 0.06 0.13
C2 (CNN) 1 1 1 1 1 1 0.46 0.17 1
D1 0 1 0 1 1 0.25 0.77 0.67 0.72
CodeBERT 1 0 0.2 1 0.25 0 0.28 0.17 0.23
RECALL (better results are larger)
A1 1 1 0.67 1 1 1 0.65 0.94 1
A2 0.5 1 0.5 1 0.25 0.5 0.59 0.77 0.54
A3 1 1 1 1 0.5 0.5 0.75 0.07 0.88
A4 1 1 1 0.5 0.75 0.6 0.09 0.67 0.71
A5 1 0.33 1 0.75 1 0.8 0.91 0.97 0.94
A6 1 1 1 1 1 1 0.79 0.55 1
A7 1 1 1 1 0.5 0 0.36 0.69 0.85
B1 (DODGE) 1 1 0 1 0 0 0.5 0.34 0.42
B2 (CNN) 1 0 0 1 0 1 1 0.49 0.75
C1 (DODGE) 1 1 1 1 1 1 0.86 0.59 1
C2 (CNN) 1 1 0.33 1 1 1 0.73 0.82 1
D1 0 0.33 0 1 0 0 0.73 0.61 0.17
CodeBERT 1 0.33 0.67 1 0.5 0 0.26 0.25 0.42
Table VII: Our results across eight datasets on four metrics.
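For readers who want to reproduce the confusion-matrix measures reported in Table VII, the sketch below uses the standard definitions of precision, recall and false alarm rate; we assume these match the definitions used in this study. The counts are hypothetical, and returning zero on an empty denominator is our own convention. AUC is omitted since it requires ranked prediction scores rather than counts.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Precision, recall and false alarm rate from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "false_alarm": false_alarm}


# Hypothetical counts for a test set of 10 warnings (4 actionable, 6 not).
scores = confusion_metrics(tp=3, fp=1, tn=5, fn=1)
```

On very small test sets, such as the one-sample MAVEN test set, a single prediction moves these measures between 0 and 1, which is why many cells in Table VII are exactly 0 or 1.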

As to the specifics of the other treatments:

  • Treatment A5 is the treatment from the TSE’21 paper that proposed GHOSTing [yedida2021value].

  • Treatment D1 contains the treatments applied in prior papers by Yang et al. [yang2021learning] and Kang et al. [kang2022detecting].

  • Anytime we applied parameter engineering, this meant that some automatic algorithm (DODGE) selected the control parameters for the learners (otherwise, we just used the default off-the-shelf settings).

  • Anytime we applied label engineering, we only used 10% of the labels in the training data.

  • The last line, showing CodeBERT, has no pre-processing or tuning. As said above, CodeBERT is so complex that we must run it “off-the-shelf”.

6 Results

                               From       From right-hand-side
                               Table II   of Table VII           Improvement
higher is better  precision    50         100                    50
                  AUC          41         90                     59
                  recall       19         100                    89
lower is better   false alarm  32         0                      32
Table VIII: Median performance improvements seen after applying all the treatments A1 (defined in §4); i.e. all of instance, label, boundary and parameter engineering.

The results of the Table VI treatments are shown in Table VII (and another brief summary is offered in Table VIII). These results are somewhat extensive so, by way of an overview, we offer the following summary tool. The cells shown in pink are those that are worse than the A1 results (and A1 is our recommended GHOST2 method). Looking over those pink cells we can see that across our data sets and across our different measures, our recommended method (A1) does as well as (or better than) anything else.

(Technical aside: looking at these pink cells, it could be said that A5 comes close to A1, but A5 loses a little on recall. Nevertheless, we have strong reasons for recommending A1 over A5 since, recalling Table VI, A5 requires labels for 100% of the data. On the other hand A1, which uses label engineering, achieves its results using 10% of the labels. This is important since, as said in our introduction, one way to address, in part, the methodological problems raised by Kang et al. is for GHOST2 to make its conclusions using a small percentage of the raw data (10%). That is, to address the issues of corrupt data found by Kang et al., we say “use less data” and, for the data that is used, “reflect more on that data”.)

Using these results, we can answer our research questions as follows.

RQ1: For detecting actionable static code warnings, what data mining methods should we recommend?

Regarding feedforward networks versus, say, traditional learners (decision trees, random forests, logistic regression and SVMs), the traditional learners all performed worse than the feedforward networks used in treatment A1 (evidence: compare treatments A1 with A7 which use feedforward or traditional learners, respectively; there are four perfect AUCs for feedforward networks in A1, i.e. AUC=100%, but only two for the A7 results).
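The A1 versus A7 comparison can be checked directly from the AUC rows of Table VII. The sketch below copies those two rows and recomputes the median and the number of perfect scores; the helper name `summarize` is ours.

```python
from statistics import median

# AUC rows for A1 and A7, copied from Table VII (columns: maven ... derby).
auc = {
    "A1": [1, 1, 0.83, 1, 0.75, 1, 0.68, 0.57],
    "A7": [1, 0.83, 0.83, 1, 0.75, 0.5, 0.59, 0.62],
}


def summarize(row):
    """Return (median AUC, number of data sets with a perfect AUC of 1)."""
    return median(row), sum(1 for value in row if value == 1)


summary = {name: summarize(row) for name, row in auc.items()}
```

The recomputed values agree with the median column of Table VII (about 0.92 for A1 and 0.79 for A7) and with the counts of four and two perfect AUCs quoted above.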

As to why the 1980s style feedforward networks worked better than newer neural net technology, we note that feedforward networks run so fast that it is easier to extensively tune them. Perhaps (a) faster learning plus (b) more tuning might lead to better results than the non-linear modeling of an off-the-shelf learner. This could be an interesting avenue for future work.

As to the value of boundary, label, instance and parameter engineering, in the ablation study, removing any of these led to worse results. For example, with boundary engineering, A1 (that uses boundary engineering) generates more perfect scores (e.g. AUC=100%) than A4 (that does not use it). Also, for recall, A1 performed as well as or better than A4 in 6/8 data sets. Similarly, A4 suffers from a drop in AUC score across the board.

As for label engineering, from A1 to A5, specializing our data to just 10% of the labels (in A1) yields nearly the same precisions, and nearly all the same AUC results, as using 100% of the data (in A5). Moreover, the AUC score for A1 is perfect in 4/8 cases, while A5 rarely achieves this.

As to instance engineering, without it the precision can crash to zero (compare A1 to A2, particularly the smaller data sets) while often leading to lower recalls. The smaller datasets also see a decrease in AUC for A2.

Measured in terms of false alarm, these results strongly recommend parameter engineering. Without parameter engineering, some of those treatments could find too many static code warnings and hence suffer from excessive false alarms (evidence: see the A3 false alarm results in nearly every data set). A1 (which used all the treatments of §4) had lower false alarm rates than anything else (evidence: we rarely see the dark blue A1 spike in the false alarm results). The only exception to the observation that “parameter engineering leads to lower false alarm results” is seen in the DERBY data set. That data set turns out to be particularly tricky in that, nearly always, modeling methods that achieved low false alarm rates on that data set also had to be satisfied with much lower recalls.

Figure 5: Error landscape in the TOMCAT after applying the treatments of §4. To understand the simplifications achieved via our methods, the reader might find it insightful to compare this figure against Figure 4.

One final point is that these results do not recommend the use of certain widely used neural network technologies such as CNN or CodeBERT for finding actionable static code warnings. CNN-based treatments (B2 and C2) suffer from low precision and AUC scores (see Table VII). Similarly, as shown in Table VII, CodeBERT often suffers from low precision and poor false alarms and (in the case of CodeBERT) some very low recalls indeed.

In summary:

Answer 1: To recognize actionable static code warnings, apply all the treatments of §4. Also, spend most effort tuning faster feedforward neural nets rather than trusting (a) traditional learners or (b) more recent “bleeding edge” neural net methods.

RQ2: Does GHOST2’s combination of instance, label, boundary and parameter engineering reduce the complexity of the decision boundary?

Previously, this paper argued that the poor performance seen in prior work was due to the complexity of the data (specifically, the bumpy shape seen in Figure 4). Our treatments of §4 were designed to simplify that landscape. Did we succeed?

Figure 5 shows the landscape in TOMCAT after the treatments of §4 were applied. By comparing this figure with Figure 4, we can see that our treatments achieved the desired goal of removing the “bumps”.

Dataset % change
maven 158.87
cassandra 73.09
jmeter 55.53
tomcat 36.34
derby 31.35
commons 29.61
ant 24.78
lucene-solr 16.46
median 33.85
Table IX: Percent changes in li2018visualizing’s smoothness metric, seen after applying the methods of this paper.

As to the other data sets, li2018visualizing propose a “smoothness” equation to measure a data set’s “bumps”. Table IX shows the percentage change in that smoothness measure seen after applying the methods of this paper. All these changes are positive, indicating that the resulting landscapes are much smoother. For an intuition of what these numbers mean, the TOMCAT change of 36.35% results in Figure 4 changing to Figure 5.

Hence we say:

Answer 2: Label, parameter, instance and boundary engineering can simplify the internal structure of training data.

RQ3: Does GHOST2’s combination of instance, label, boundary and parameter engineering improve predictive performance?

Table VIII shows the performance improvements after smoothing out our training data from (e.g.) Figure 4 to Figure 5. On 4/8 datasets, we achieve perfect scores. Moreover, we showed through an ablation study that each of the components of GHOST2 is necessary. For example, row A3 in Table VII is another piece of evidence that hyper-parameter optimization is necessary. The feedforward networks of our approach outperformed more complex learners (CNNs and CodeBERT); we refer the reader to rows B2, C2, and CodeBERT in Table VII. On the other hand, going too simple with traditional learners leads to A7, which suffers from poor precision scores. Given those large improvements, we say:

Answer 3: Detectors of actionable static code warnings work much better when learned from smoothed training data.

RQ4: Are all parts of GHOST2 necessary; i.e. would something simpler also achieve the overall goal?

We presented an ablation study that showed that each part of GHOST2 was necessary. Among the 13 treatments that we tested, GHOST2 was the only one that consistently scored highly in precision, AUC, and recall, while also generally having low false alarm rates. The crux of our ablation study was that each component of GHOST2 works with the others to produce a strong performer.

Based on the above ablation study results, we say:

Answer 4: Ignoring any part of instance, label, boundary or parameter engineering leads to worse results than using all parts (at least for the purpose of recognizing actionable static code warnings).

RQ5: Are larger training sets necessary (for the task of recognizing actionable static code warnings)?

In the above discussion, when we presented Table V, it was noted that several of the train/tests used in this study were very small. At that time, we expressed a concern that, possibly, our data sets explored were too small for effective learning.

This turned out not to be the case. Recall that in Table VII, the data sets were sorted left-to-right from smallest to largest training set size. There is no pattern there that smaller data sets perform worse than large ones. In fact, quite the opposite: the smaller data sets were always associated with better performance than those seen on the right-hand side. Hence we say:

Answer 5: The methods of this paper are effective, even for very small data sets.

This is a surprising result since one of the truisms of data mining is “the more data the better”. Large data sets are often cited as the key to success for data mining applications. For example, in his famous talk, “The Unreasonable Effectiveness of Data”, Google’s former Chief Scientist Peter Norvig argues that “billions of trivial data points can lead to understanding” [norvig11] (a claim he supports with numerous examples from vision research).

7 Threats to Validity

As with any empirical study, biases can affect the final results. Therefore, any conclusions made from this work must be considered with the following issues in mind:

1. Sampling bias threatens any classification experiment; i.e., what matters there may not be true here. For example, the data sets used here come from prior work and, possibly, if we explored other data sets we might reach other conclusions. On the other hand, repeatability is an important part of science so we argue that our decision to use the Kang et al. data is appropriate and respectful to both that prior work and the scientific method.

2. Learner bias: Machine learning is a large and active field and any single study can only use a small subset of the known algorithms. Our choice of “local learning” tools was explained in §4. That said, it is important to repeat the comments made there that our SMOTEing, SMOOTHing, GHOSTing and DODGEing operators are but one set of choices within a larger framework of possible approaches to instance, label, boundary, and parameter engineering (respectively). As SE research matures, we foresee that our framework will become a workbench within which researchers replace some/all of these treatments with better options. That said, in defence of the current options, we note that our ablation study showed that removing any of them can lead to worse results.

3. Parameter bias: Learners are controlled by parameters and the resulting performance can change dramatically if those parameters are changed. Accordingly, in this paper, our recommended methods (from §4) include parameter engineering methods to find good parameter settings for our different data sets.

4. Evaluation bias: This paper uses four evaluation criteria (precision, AUC, false alarm rate, and recall) and it is certainly true that, by other measures, our results might not be seen to work as well. In defence of our current selection, we note that we use these measures since they let us compare our new results to prior work (which reported results using the same measures).

Also, to repeat a remark made previously, another evaluation bias was how we separated data into train/test. In other papers, we have run repeated trials with multiple 80:20 splits for training:test data. This was not done here since some of our data sets are so small (see the first few rows of Table V) that any reduction in the training set size might disadvantage the learning process. Hence, the external validity claims of this paper come from patterns seen in eight different software projects.

8 Discussion

This discussion section steps back from the above to make some more general points.

We suggest that this paper should lead to a new way of training newcomers in software analytics:

  • Our results show that there is much value in decades-old learning technology (feedforward networks). Hence, we say that when we train newcomers to the field of software analytics, we should certainly train them in the latest techniques (deep learning, CodeBERT, etc).

  • That said, we should also ensure that they know of prior work since (as shown above), sometimes those older methods still have currency. For example, if some learner is faster to run, then it is easier to tune. Hence, as shown above, it can be possible for old techniques to do better than new ones, just by tuning.

For future work, it would be useful to check in what other SE domains simpler, faster learners (plus some tuning) out-perform more complex learning methods.

That said, we offer the following cautionary note about tuning. Hyper-parameter optimization (HPO, which we have called “parameter engineering” in this paper) has received much recent attention in the SE literature [agrawal2019dodge, yedida2021value, agrawal2021simpler]. We have shown here that reliance on just HPO can be foolhardy since better results can be obtained by the judicious use of HPO combined with more nuanced approaches that actually reflect the particulars of the current problem (e.g. our treatments that adjusted different parts of the data in different ways). As to how much to study the internals of a learner, we showed above that there are many choices deep within a learner that can greatly improve predictive performance. Hence we say that it is very important to know the internals of a learner and how to adjust them. In our opinion, all too often, software engineers use AI tools as “black boxes” with little understanding of their internal structure.

Our results also cast doubt on some of the truisms of our field. For example:

  • There is much recent work on big data research in SE, the premise being that “the more data, the better”. We certainly do not dispute that but our results do show that it is possible to achieve good results with very small data sets.

  • There is much work in software analytics suggesting that deep learning is a superior method for analyzing data [yedida2021value, wang2018deep, li2017cclearner, white2015deep]. Yet when we tried that here, we found that a decades-old neural net architecture (feed-forward networks, discussed in Table IV) significantly out-performed deep learners.

For newcomers to the field of software analytics, truisms might be useful. But better results might be obtained when teams of data scientists combine to suggest multiple techniques, some of which ignore supposedly tried-and-true truisms.

9 Conclusion

Static analysis tools often suffer from a large number of false alarms that are deemed to be unactionable [tomassi2021real]. Hence, developers often ignore many of their warnings. Prior work by Yang et al. [yang2021learning] attempted to build predictors for actionable warnings but, as shown by Kang et al. [kang2022detecting], that study used poorly labelled data.

This paper extends the Kang et al. result as follows. Table II shows that building models for this domain is a challenging task. The discussion section of §3 conjectured that, for the purposes of detecting actionable static code warnings, standard data miners cannot handle the complexities of the decision boundary. More specifically, we argued that:

For complex data, global treatments perform worse than localized treatments which adjust different parts of the landscape in different ways.

§4 proposed four such localized treatments, which we called instance, parameter, label and boundary engineering.

These treatments were tested on the data generated by Kang et al. (which in turn, was generated by fixing the prior missteps of Yang et al.). On experimentation, it was shown that the combination of all our treatments (in the “A1” results of Table VII) performed much better than the prior results seen in Table II. As to why these treatments performed so well, the analysis of Table IX showed that instance, parameter, label and boundary engineering did in fact remove complex shapes in our decision boundaries. As to the relative merits of instance versus parameter versus label versus boundary engineering, an ablation study showed that using all these treatments produces better predictions than alternative treatments that ignored any part.

Finally, we comment here on the value of different teams working together. The specific result reported in this paper is about how to recognize and avoid static code analysis false alarms. That said, there is a more general takeaway. Science is meant to be about a community critiquing and improving each other’s ideas. We offer here a successful example of such a community interaction where teams from Singapore and the US successfully worked together. Initially, in a 2022 paper [kang2022detecting], the Singapore team identified issues with the data that resulted in substantially lower performance of the previously-reported best predictor of actionable warnings [wang2018there, yang2021learning, yang2021understanding]. Subsequently, in this paper, both teams combined to produce new results that clarified and improved the old work. That teamwork led us to try methods which, according to the truisms of our field, should not have worked. The teamwork that generated this paper should be routine, and not some rare exceptional case.


This work was partially supported by an NSF Grant #1908762.