An Expert System for Learning Software Engineering Knowledge (with Case Studies in Understanding Static Code Warning)

by   Xueqi Yang, et al.

Knowledge-based systems reason over some knowledge base. Hence, an important issue for such systems is how to acquire the knowledge needed for their inference. This paper assesses active learning methods for acquiring knowledge for "static code warnings". Static code analysis is a widely used method for detecting bugs and security vulnerabilities in software systems. As software becomes more complex, analysis tools also report lists of increasingly complex warnings that developers need to address on a daily basis. Such static code analysis tools are often over-cautious; i.e., they often offer many warnings about spurious issues. Previous research work shows that about 35% to 91% of warnings reported as bugs by SA tools are actually unactionable (i.e., warnings that would not be acted on by developers because they are falsely suggested as bugs). Experienced developers know which errors are important and which can be safely ignored. How can we capture that experience? This paper reports on an incremental AI tool that watches humans reading false alarm reports. Using an incremental support vector machine mechanism, this AI tool can quickly learn to distinguish spurious false alarms from more serious matters that deserve further attention. In this work, nine open source projects are employed to evaluate our proposed model on the features extracted by previous researchers and identify the actionable warnings in priority order given by our algorithm. We observe that our model can identify over 90% of actionable warnings while allowing humans to ignore 70 to 80% of the warnings.




1 Introduction

The knowledge acquisition problem is a longstanding and challenging bottleneck in artificial intelligence, especially in efforts such as the Semantic Web project. Traditional knowledge engineering methodologies handcraft the knowledge prior to testing it on some domain hoekstra2010knowledge . Such handcrafted knowledge is expensive to collect. Also, building competent systems can require extensive manual crafting, which leads to a long gap between crafting and testing knowledge.

Figure 1: Example of a static code analysis warning, generated via the FindBugs tool.

In this paper, we address these problems with a self-adaptive incrementally active learning approach that uses a human-in-the-loop process. Our case study is learning how to distinguish spurious from serious static warnings. Static code analysis is a common method for detecting bugs and security vulnerabilities in software systems.

The wide range of commercial applications of static analysis demonstrates the industrial perception that these tools have very large economic value. For example, the FindBugs static code analysis tool (shown in Figure 1) has been downloaded over a million times so far. However, due to high rates of unactionable alerts (i.e., warnings that would not be acted on by developers because they are falsely suggested as bugs by SA tools), the utility of such static code analysis tools is questionable. Previous research work shows that about 35% to 91% of warnings reported as bugs by SA tools are actually unactionable kim2007warnings ; heckman2008establishing ; heckman2011systematic .

Experienced developers know which errors are important and which can be safely ignored. Our active learning methods incrementally acquire and validate that knowledge. By continuously and incrementally constructing and updating the model, our approach can help SE developers identify more actionable static warnings at very low inspection cost, and it provides an efficient way to mine software projects early in their life cycle.

This paper evaluates this approach using four research questions:

RQ1. What is the baseline rate for bad static warnings?

While this is more a systems question than a research question, it is a necessary precondition to our work since it documents the problem we are trying to address. For this question, we report results from FindBugs. These results will serve as the baseline for the rest of our work.

RQ2. What is the previous state-of-the-art method to tackle the prevalence of actionable warnings in SA tools?

Wang et al. wang2018there conducted a systematic evaluation of all the publicly available features (116 features in total) that discuss static code warnings. That work offered a "golden set of features"; i.e., 23 features that Wang et al. wang2018there argued were most useful for extracting serious bug reports out of FindBugs. Our experiments combine three supervised learning models from the literature with these 23 features.

RQ3. Does incrementally active learning reduce the cost of identifying actionable static warnings?

We will show that incrementally active learning dramatically reduces the cost of identifying actionable warnings (and obtains performance almost as good as supervised learning).

RQ4. How many samples should be retrieved to identify all the actionable static warnings?

In this case study, incrementally active learning can identify over 90% of actionable warnings by learning from about 20% to 30% of data. Hence, we recommend this system to developers who wish to reduce the time they waste chasing spurious errors.

1.1 Organization of this Paper

The remainder of this paper is organized as follows. Research background and related work are introduced in Section 2. In Section 3, we describe our methodology in detail. Our experimental setup is introduced in Section 4. In Section 5, we answer the proposed research questions. Threats to validity and future work are discussed in Section 6, and we draw conclusions in Section 7.

1.2 Contributions of this Paper

In the literature, active learning methods have been extensively discussed, like finding relevant papers in literature review yu2019fast2 , security vulnerability prediction yu2018cost , crowd sourced testing wang2016local , place-aware application development murukannaiah2015platys , classification of software behavior bowring2004active , and multi-objective optimization krall2015gale . The unique contribution of this work is the novel application of these methods to the problem of resolving problems with static code warnings. To the best of our knowledge, no prior work has tried to tame spurious static code warnings by treating these as an incremental knowledge acquisition problem.

2 Related Work

2.1 Reasoning About Source Code

The software development community has produced numerous static code analysis tools, such as FindBugs, PMD, and Checkstyle, that generate various warnings to help developers identify potential code problems. Tools such as FindBugs leverage static analysis (SA) techniques to inspect a program's code for occurrences of bug patterns (i.e., code idioms that are often errors) without actually executing the program or considering an exact input. The bugs detected by FindBugs are grouped into a pattern list (e.g., performance, style, correctness and so forth), and each bug is reported with a priority from 1 to 20 to measure its severity; these priorities are finally grouped into four scales: scariest, scary, troubling, and of concern ayewah2008using .

Some SA tools learn to identify new bugs using historical data from past problems. This is not ideal since it means that whenever there are changes to tasks, languages, platforms, and perhaps even developers, the old warnings might go out of date and new ones have to be learned. Complex software systems increasingly rely on static warning identification wijayasekara2012mining . Identifying static warnings at every stage of the software life cycle is essential, especially for projects in an early development stage murtaza2016mining .

Arnold et al. arnold2009security suggest that every project, early in its own lifecycle, should build its own static warning system. Such advice is hard to follow since it means a tedious, time-consuming and expensive retraining process at the start of each new project. To say that another way, Arnold et al.'s advice suffers from the knowledge acquisition bottleneck problem.

2.2 Static Warning Identification

Static warning identification aims at identifying common coding problems early in the development process via SA tools and at distinguishing actionable warnings from unactionable ones heckman2011systematic ; hovemeyer2004finding ; yan2017revisiting .

Previous studies have shown that false positives in static alerts are one of the most important barriers that keep developers from using static analysis tools thung2015extent ; avgustinov2015tracking ; johnson2013don . To address this issue, many techniques have been introduced to identify actionable warnings or alerts. Various models have been used in these studies, including graph theory boogerd2008assessing ; bhattacharya2012graph , machine learning wang2016automatically ; shivaji2009reducing , etc. However, most of the studies are plagued by a common issue: choosing the appropriate warning characteristics from the abundant feature artifacts proposed by SA studies so far.

Ranking schemes are one way to improve static analysis tools kremenek2004correlation . Allier et al. allier2012framework proposed a framework to compare 6 warning ranking algorithms and identified the best algorithms to rank warnings. Similarly, Shen et al. shen2011efindbugs employed a ranking technique to rank the true error reports on top so as to reduce false positive warnings. Other work also prioritizes warnings by selecting different categories of impact factors liang2010automatic or by analyzing software history kim2007prioritizing .

Recent work has shown that this problem can be addressed by using machine learning techniques to identify whether a detected warning is actionable or not, e.g., finding alerts with similar code patterns and building prediction models to classify new alerts hanam2014finding . Heckman and Williams did a systematic literature review revealing that most of these works focus on exploring a reasonable characteristic set, such as alert characteristics (AC) and code characteristics (CC), to distinguish actionable and unactionable warnings more accurately heckman2011systematic ; hanam2014finding ; heckman2009model . One of the most comprehensive studies explores 15 machine learning algorithms and 51 warning characteristics derived from static analysis tools and achieves good performance with high recall (83-99%) heckman2009model . However, in practice, information about bug warning patterns is hard to obtain, especially for some trivial checkers in SA tools. Also, these tools suffer from conflation issues where similar warnings are given different names in different studies.

Wang et al. wang2018there recently conducted a systematic literature review to collect all publicly available features (116 in total) for SA analysis and implemented a Java-based tool for feature extraction. All the values of these collected features are extracted from warning reports generated by FindBugs on 60 revisions of 12 projects. Six machine learning classifiers were employed to automatically identify actionable static warnings. 23 common features were identified as the best and most useful feature combination for static warning identification, since the best performance was always obtained when using these 23 golden features, better than using the total feature set or other subset strategies. To the best of our knowledge, this is the most exhaustive research about SA characteristics yet published.

2.3 Active Learning

Labeled data is required by supervised machine learners. Without such data, these algorithms cannot learn predictors.

Obtaining good labeled data can sometimes be time consuming and expensive. In the case of this paper, we are concerned with learning how to label static code warnings (spurious or serious). For another example, training a good document classifier might require hundreds of thousands of samples. Usually, these examples do not come with labels, and therefore expert knowledge (e.g., recognizing a handwritten digit) is required to determine the "right" label.

Active learning settles2009active is a machine learning approach that enables the learner to actively choose which examples to label from amongst the currently unlabeled instances. This approach trains on a little bit of labeled data, and then asks for labels for the unlabeled examples that are most "interesting" (e.g., those whose labels are most uncertain). This process greatly reduces the amount of labeled data required to train a model while still achieving good predictive performance.

Active learning has been applied successfully in several SE research areas, such as finding relevant papers in literature review yu2019fast2 , security vulnerability prediction yu2018cost , crowd sourced testing wang2016local , place-aware application development murukannaiah2015platys , classification of software behavior bowring2004active , and multi-objective optimization krall2015gale . Overall, there are three different categories of active learning:

  • Membership query synthesis. In this scenario, a learner is able to generate synthetic data for labeling, which might not be applicable to all cases.

  • Stream-based selective sampling. Each sample is considered separately, and is either queried for a label or rejected. There are no assumptions about the data distribution, and therefore this scenario adapts to change.

  • Pool-based sampling. Samples are chosen from a pool of unlabeled data for the purpose of labeling. The learner is usually initially trained on a fully labeled fraction of the data to generate a preliminary model, which is subsequently used to identify which sample would be most beneficial to add to the training set in the next iteration of the active learning loop. The pool-based sampling scenario is the best known one, and it is also the one applied in our work.

Previous work has shown successful adoption of active learning in several research areas. Yu yu2019fast2 proposed a tool called FAST2 to assist researchers in finding the relevant papers to read. FAST2 works by 1) leveraging external domain knowledge (e.g., keyword search) to guide the initial selection of papers; 2) using an estimator of the number of remaining papers to decide when to stop; 3) applying an error correction algorithm to correct human mislabeling.

HARMLESS yu2018cost is a software vulnerability reduction tool that integrates human effort and a vulnerability prediction model into an active learning environment. HARMLESS is able to find vulnerabilities with the least amount of inspected code and guides humans to stop the inspection at a desired recall.

Wang et al. wang2016local applied active learning to identify the test reports that reveal a "true fault" from a large number of test reports in crowdsourced testing of GUI applications. Within that framework, they proposed a classification technique that labels a fraction of the most informative samples with user knowledge, and trained a classifier based on local neighbourhood.

To the best of our knowledge, this work is the first study to utilize incrementally active learning to reduce unnecessary inspection of static warnings based on the most effective feature attributes. While Wang et al. is the closest work to this paper, we differ very much from their work.

  • In that study, their raw data was screen-snaps of erroneous conditions within a GUI. Also, they spend much effort tuning a feedback mechanism specialized for their images.

  • In our work, our raw data is all textual (the text of a static code warning). We found that a different method, based on active learning, worked best for such textual data.

3 Methodology

3.1 Overview

In this work, we propose to apply an incrementally active learning framework to identify static warnings. This is derived from active learning, which has been shown to perform well on the total recall problem in several areas, e.g., electronic discovery, evidence-based medicine, primary study selection, test case prioritization and so forth. As illustrated in Figure 2, we hope to achieve higher recall with lower effort when inspecting warnings generated by SA tools.

Figure 2: Learning Curve of Different Learners.

3.2 Evaluation Metrics

Figure 2 is an Alberg diagram showing the learning curve of different learners. In this figure, the x-axis and y-axis respectively represent the percentage of warnings retrieved or labeled by learners (i.e., cost) and the percentage of actionable warnings retrieved out of all actionable ones (i.e., total recall). An optimal learner will achieve higher total recall than others when a specific cost threshold is given; e.g., at the cost of 20% effort shown in Figure 2, the best performance is obtained by the optimal learner, followed by the proposed learner, the random learner and the worst learner. This learning curve is a performance measurement at different cost threshold settings.

3.3 Active Learning Model Operators

We apply several operators to address the total recall problem, as listed in Table 1. Specific details about each operator are as follows:

Operator | Description
Machine Learning Classifier | Widely-used classification technique.
Presumptive non-relevant examples | Alleviate the sampling bias of non-relevant examples.
Aggressive Undersampling | Data-balancing technique.
Query strategy | Uncertainty sampling and certainty sampling in active learning.

Table 1: Operators of Active Learning.


We employ three machine learning classifiers as the embedded active learning model: linear SVM with weighting, Random Forest, and Decision Tree with default parameters, as these classifiers are widely explored in the software engineering area and are also reported in Wang et al.'s paper. All of the classifiers are modules from scikit-learn pedregosa2011scikit , a Python package for machine learning.

Presumptive non-relevant examples, proposed by Cormack et al. cormack2015autonomy , is a technique to alleviate the sampling bias of negative samples in an unbalanced dataset. To be specific, before each training process, the model samples randomly from the unlabeled pool and assumes that the sampled instances are labeled as negative in training, due to the prevalence of negative samples.
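As a minimal sketch of the idea above (not the authors' implementation), the following assembles one round's training labels from the human-provided labels plus randomly drawn unlabeled warnings that are presumed negative. The function name, the `n_presumed` knob, and the warning ids are hypothetical.

```python
import random

def add_presumptive_negatives(labeled, unlabeled_ids, n_presumed, seed=0):
    """Return training labels: human labels plus presumed-negative samples.

    labeled: dict mapping warning id -> 0/1 label given by the human oracle.
    unlabeled_ids: ids of warnings that have not been inspected yet.
    n_presumed: how many unlabeled warnings to presume unactionable
        (an assumed knob, not a value taken from the paper).
    """
    rng = random.Random(seed)
    k = min(n_presumed, len(unlabeled_ids))
    presumed = rng.sample(sorted(unlabeled_ids), k)
    training = dict(labeled)
    # Presumed labels are used for this training round only; they are not
    # written back into `labeled`, so the human labels stay untouched.
    training.update({wid: 0 for wid in presumed})
    return training

labeled = {"w1": 1, "w2": 0}
unlabeled = {"w3", "w4", "w5", "w6"}
training = add_presumptive_negatives(labeled, unlabeled, n_presumed=2)
```

Keeping the presumed labels out of `labeled` means a warning that was presumed negative can still be queried and correctly labeled later.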

Aggressive undersampling wallace2009meta is a sampling method that copes with an unbalanced dataset by throwing away majority negative training points close to the decision plane of the SVM and aggressively accessing minority positive points until the ratio of these two categories is balanced. It is an effective approach to remove the bias caused by unbalanced data. Wallace et al. wallace2010semi suggest applying this technique after the initial stage of incremental active learning, once the established model becomes stable.
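The following is an illustrative sketch of aggressive undersampling under simplifying assumptions: it takes signed distances to a previously fitted SVM hyperplane as a plain list, keeps every positive sample, and keeps only as many negatives as there are positives, dropping the negatives nearest the hyperplane first. The function name and the toy values are hypothetical.

```python
def aggressive_undersample(labels, decision_values):
    """Return sorted indices to keep: all positives plus as many negatives.

    labels: list of 0/1 labels (1 = actionable).
    decision_values: signed distance of each row to the SVM hyperplane,
        e.g. from a previously fitted model (assumed to be given).
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Negatives close to the hyperplane have a small absolute distance.
    # Sorting by descending absolute distance puts them last, so slicing
    # the front of the list discards them first.
    neg_far_first = sorted(neg, key=lambda i: abs(decision_values[i]), reverse=True)
    kept_neg = neg_far_first[:len(pos)]
    return sorted(pos + kept_neg)

labels = [1, 0, 0, 0, 0, 1]
decision_values = [0.9, -0.1, -2.0, -0.4, -1.5, 0.3]
keep = aggressive_undersample(labels, decision_values)
print(keep)  # [0, 2, 4, 5]
```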

The query strategy is the approach used to determine which data instance in the unlabeled pool to query for labeling next. We adopt two of the most commonly used strategies, uncertainty sampling settles2009active and certainty sampling miwa2014reducing .

Uncertainty sampling settles2009active is the simplest and most commonly used query strategy in active learning, where the unlabeled samples closest to the decision plane of the SVM, or those whose predicted label is least certain according to a classifier, are sampled for query. Wallace et al. wallace2010semi recommended the uncertainty sampling method for biomedical literature review and efficiently reduced the cost of manually screening literature.

Certainty sampling miwa2014reducing is a kind of greedy algorithm that maximizes the utility of the incremental learning model by prioritizing the samples which are most likely to be actionable warnings. Contrary to uncertainty sampling, the certainty sampling method gives priority to the instances which are far away from the decision plane of the SVM or have the highest probability score predicted by the classifier. It speeds up the retrieval process and plays the major role in stopping earlier.
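To make the contrast between the two query strategies concrete, here is a small sketch that assumes each unlabeled warning already has a signed decision value from a fitted SVM (positive meaning "actionable"). The function names and the toy scores are hypothetical.

```python
def uncertainty_query(candidate_ids, decision_values):
    """Pick the unlabeled warning closest to the decision boundary."""
    return min(candidate_ids, key=lambda i: abs(decision_values[i]))

def certainty_query(candidate_ids, decision_values):
    """Pick the unlabeled warning scored most strongly as actionable."""
    return max(candidate_ids, key=lambda i: decision_values[i])

# Toy signed distances to the hyperplane for four unlabeled warnings.
decision_values = {"a": -1.2, "b": 0.05, "c": 2.3, "d": -0.4}
pool = list(decision_values)

uncertain_pick = uncertainty_query(pool, decision_values)  # "b"
certain_pick = certainty_query(pool, decision_values)      # "c"
```

Uncertainty sampling spends labels where the model is least sure, which helps stabilize it; certainty sampling spends labels where a hit is most likely, which raises recall quickly once the model is reliable.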

Figure 3: Procedure of Incrementally Active Learning.

3.4 Active Learning Procedures

Figure 3 presents the procedure of incrementally active learning. Each step is described in detail as follows:

  1. Initial Sampling.

    We propose two initial sampling strategies, depending on whether historical information is available or not.

    For software projects early in their life cycle, without sufficient historical revisions in the version control system, random sampling without replacement is used in the initial stage, when the labeled warning pool is empty.

    For software projects with previous version information, we utilize version N-1 to pre-train a model that guides the initial sampling on version N. This practice can reduce the cost of manually excluding unactionable warnings, given the prevalence of false positives in SA datasets.

  2. Human or oracle labeling.

    After a warning message is selected by initial sampling or by the query strategy, manual inspection is required to identify whether the retrieved warning is actually actionable or not. In our simulation, the ground truth serves as a human oracle and returns a label once a warning presupposed to be unlabeled is queried by the active learning model.

    In static analysis, inspecting and tagging the warning being queried is considered the main overhead of this process. As shown in Table 5, this overhead is denoted as Cost and is what software developers strive to reduce.

  3. Model Training and updating.

    After a newly arrived warning is labeled by the human oracle, this data sample is added to the training data, and the model is retrained and updated.

  4. Query Strategy.

    Uncertainty sampling is used when the number of actionable samples retrieved and labeled by the model is under a specific threshold. This query strategy mainly applies when target data samples are rare in the training set and a stable model must be built quickly yu2018improving .

    Finally, after the number of labeled actionable warnings exceeds the threshold, certainty sampling is employed to aggressively search for true positives and greedily reduce the cost of warning inspection.
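The four steps above can be sketched as a single simulation loop. This is a simplified illustration under stated assumptions, not the authors' code: it uses scikit-learn's `SVC` with a linear kernel and balanced class weights as the embedded learner, synthetic two-dimensional data in place of real warning features, random initial sampling only, and a hypothetical `threshold` of 5 for switching query strategies. Presumptive non-relevant examples and aggressive undersampling are left out to keep the loop short.

```python
import numpy as np
from sklearn.svm import SVC

def incremental_active_learning(X, y_oracle, threshold=5, budget=None, seed=0):
    """Simulate the loop; return indices in the order they were inspected.

    y_oracle stands in for the human and is read only for queried rows.
    threshold: number of actionable warnings found before switching from
        uncertainty to certainty sampling (assumed value).
    """
    rng = np.random.default_rng(seed)
    n = len(y_oracle)
    budget = n if budget is None else budget
    labeled = []                      # inspected indices, in query order
    unlabeled = list(range(n))
    while unlabeled and len(labeled) < budget:
        idx = np.array(labeled, dtype=int)
        y_lab = y_oracle[idx]
        n_pos = int(y_lab.sum())
        if n_pos == 0 or n_pos == len(labeled):
            # Step 1: random sampling until both classes have been seen.
            pick = int(rng.choice(unlabeled))
        else:
            # Step 3: retrain on everything labeled so far.
            clf = SVC(kernel="linear", class_weight="balanced")
            clf.fit(X[idx], y_lab)
            scores = clf.decision_function(X[unlabeled])
            if n_pos < threshold:
                j = int(np.argmin(np.abs(scores)))   # Step 4: uncertainty
            else:
                j = int(np.argmax(scores))           # Step 4: certainty
            pick = unlabeled[j]
        # Step 2: the oracle labels the picked warning.
        unlabeled.remove(pick)
        labeled.append(pick)
    return labeled

data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(0.0, 1.0, (100, 2)),
               data_rng.normal(3.0, 1.0, (15, 2))])
y = np.array([0] * 100 + [1] * 15)
order = incremental_active_learning(X, y, threshold=5, budget=40)
found = int(y[order].sum())   # actionable warnings found within the budget
```

Retraining from scratch at every step is the simplest way to express "incremental" updating here; a production version would likely batch queries or use a learner with partial updates.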

Revision interval | Domain
3 months | Search engine
3 months | Server
3 months | Database
3 months | Driver
3 months | Big data management
6 months | Performance management
6 months | Build management
6 months | Java utility
6 months | Project management
Table 2: Summary of Projects Surveyed.

4 Experiment

4.1 Static Warning Dataset

The nine datasets used in this work are collected from previous research. Wang et al. wang2018there used a systematic literature review to collect all publicly available features (116 in total) for SA analysis. All the values of this collected feature set were extracted from warnings reported by FindBugs on 60 successive revisions of 12 projects. By collecting performance statistics from three supervised learning classifiers on the 12 datasets, a golden feature set (23 features) was found. We utilize this best feature combination as the warning characteristics in our research.

On closer inspection of this data, we found three projects with obvious data inconsistency issues (such as data features that do not match the data labels). Hence, our study used the remaining nine projects.

Table 2 summarizes the projects surveyed in our paper. For each project, 5 versions are collected, starting from an initial revision time and separated by a specific revision interval. We train the model on version 4 and test on version 5.

The independent variables are software metrics as shown in Table 3. In our study, the dependent variable is actionable or unactionable. These labels were generated via the method proposed by previous research heckman2008establishing ; hanam2014finding ; liang2010automatic . That is, if a specific warning is closed in a later revision, after the revision interval at which the project was collected, it is labeled as actionable. A warning that still exists after the later revision interval is labeled as unactionable. The remaining minority of warnings, which are deleted after the later interval, are removed and ignored in our study.
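The labeling rule just described can be written down as a small mapping. This is a sketch only; the status strings ("open", "close", "delete") are hypothetical names chosen to mirror the column headings of Table 4, not identifiers from the original tooling.

```python
def label_from_status(status):
    """Map a warning's status at the later revision to a training label.

    Returns 1 (actionable), 0 (unactionable), or None when the warning
    should be dropped from the dataset.
    """
    if status == "close":
        return 1      # closed in a later revision: actionable
    if status == "open":
        return 0      # still present later: unactionable
    if status == "delete":
        return None   # deleted: removed and ignored
    raise ValueError(f"unknown status: {status!r}")

statuses = ["open", "close", "delete", "open"]
labels = [lab for lab in map(label_from_status, statuses) if lab is not None]
print(labels)  # [0, 1, 0]
```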

Table 4 shows the number of warnings and distribution of each warning type (as reported by FindBugs) in nine software projects. Note that our data is highly imbalanced with ratio of target samples from 3 to 34 percent.

Category | Features
File characteristics | file type; file name; package name;
Warning characteristics | warning pattern, type, priority, rank; warnings in method, file, package;
Code characteristics | method, file, package size; comment length; comment-code ratio; method, file depth; method callers, callees; methods in file, package; classes in file, package; indentation; complexity;
File history | latest file, package modification; file, package staleness; file age; file creation; deletion revision; developers;
Code history | revised percentage of LOC in file in past 3 months; revised percentage of LOC in file in last 25 revisions; revised percentage of LOC in package in past 3 months; revised percentage of LOC in package in last 25 revisions; revised percentage of LOC in project in past 3 months; revised percentage of LOC in project in last 25 revisions;
Warning history | warning modifications; warning open revision; warning lifetime by revision, by time;
Code analysis | call name, class, parameter signature, return type; new type, new concrete type; operator; field access class, field; catch; field name, type, visibility, static/final; method visibility, abstract / interface / array class;
Warning combination | size content for warning type; size context in method, file, package; warning context in method, file, package; warning context for warning type; fix, non-fix change removal rate; defect likelihood for warning pattern; variance of likelihood; defect likelihood for warning type; discretization of defect likelihood;

Table 3: Categories of Selected Features.


Project Open/Unactionable Close/Actionable Delete
ant 1061 54 0
commons 744 42 0
tomcat 1115 326 0
jmeter 468 145 7
cass 2245 356 64
phoenix 2046 343 13
mvn 790 28 44
lucene 2257 1168 440
derby 2386 121 0
Table 4: Number of Samples on Version 5.

4.2 Evaluation Metrics

Table 5 describes all the variables involved in our study. We evaluated the active learning results in terms of total recall and cost, which are defined as follows:

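Variable | Description
Reported warnings | Set of warnings reported by static analysis tools
Actionable warnings | Set of actionable warnings or target samples
Labeled warnings | Set of warnings that have currently been retrieved or labeled
Labeled actionable warnings | Set of warnings that have currently been labeled and reveal actionable warnings
Total Recall | number of labeled actionable warnings / number of actionable warnings
Cost | number of labeled warnings / number of reported warnings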

Table 5: Description of Variables in Incrementally Active Learning.

Total recall is the ratio between the number of samples that are labeled and reveal actionable warnings and the total number of real actionable warning samples. The best total recall value is 1, which means that all of the target samples (or actionable warnings in our case) have been retrieved and labeled as actionable.

Cost is the ratio between the set of warnings that have currently been retrieved or labeled and the set of warnings reported by the static warning analysis tools. The value of cost lies between the ratio of actionable warnings in the dataset and 1. The lower bound means the active learning algorithm prioritizes all target samples without uselessly labeling any unactionable warning samples. This is a theoretical optimal value (which, in practice, may be unreachable). The upper bound means the active learning algorithm successfully retrieves all the real warning samples, but at the cost of labeling them all (which is meaningless because randomly labeling samples will achieve the same goal).
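As a small worked example of the two metrics, the sketch below computes total recall and cost from plain Python sets of warning ids. The function names, the set names, and the toy ids are hypothetical and only serve to illustrate the ratios described above.

```python
def total_recall(labeled, actionable):
    """Fraction of all actionable warnings that have been labeled so far."""
    return len(labeled & actionable) / len(actionable)

def cost(labeled, reported):
    """Fraction of all reported warnings that have been labeled so far."""
    return len(labeled) / len(reported)

reported = set(range(10))      # ten warnings reported by the SA tool
actionable = {1, 4, 7}         # ground-truth actionable warnings
labeled = {0, 1, 4, 5}         # warnings inspected so far

recall_now = total_recall(labeled, actionable)   # 2 of 3 found
cost_now = cost(labeled, reported)               # 4 of 10 inspected
# Lowest possible cost at full recall: inspect only the actionable ones.
cost_lower_bound = len(actionable) / len(reported)
```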

AUC measures the area under the Receiver Operating Characteristic (ROC) curve witten2016data ; heckman2011systematic and reflects the percentage of actionable warnings against the percentage of unactionable ones, so as to report the overall discrimination of a classifier wang2018there . This is a widely adopted measurement in software engineering, especially for imbalanced data liang2010automatic .

Input : version n-1, previous version for training
        version n, current version for prediction
        C, common set of features shared by five releases
Output : Total Recall, total recall for version n
         cost, samples retrieved by percent
// Keep reviewing until stopping rule satisfied
while stopping rule not satisfied do
    // Start training or not
    if … then
        // Query next
    else
        // Random Sampling
    end if
    // Simulate review
end while
return Total Recall, cost;
Function Train(…)
    // Classifier: Linear-SVM, decision tree, random forest
    clf ← …
    return clf;
Function PredictProb(…, …)
    // predict Probability
    return prob, …;
Function Retrieve(prob, …)
    // retrieve by descending-sorted probability
    while … do
        // Sort label by descending order
        // Retrieve
        while … do
            if … then
            end if
        end while
        cost.append(len(…) / …)
    end while
    return …, cost;
Algorithm 1 Pseudo Code for Supervised Learning.
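One way to read the Train, PredictProb and Retrieve functions of Algorithm 1 is sketched below. This is an interpretation under assumptions rather than the authors' code: it uses scikit-learn's `RandomForestClassifier` with default parameters as the classifier, synthetic data in place of two consecutive project versions, and hypothetical function and variable names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train(X_prev, y_prev, seed=0):
    """Fit a classifier on the previous version (assumed: random forest)."""
    clf = RandomForestClassifier(random_state=seed)
    clf.fit(X_prev, y_prev)
    return clf

def predict_prob(clf, X_curr):
    """Probability that each warning in the current version is actionable."""
    pos_col = list(clf.classes_).index(1)
    return clf.predict_proba(X_curr)[:, pos_col]

def retrieve(prob, y_curr):
    """Inspect warnings by descending probability; track recall and cost."""
    order = np.argsort(-prob, kind="stable")
    hits = np.cumsum(y_curr[order])
    recall = hits / y_curr.sum()
    cost = np.arange(1, len(y_curr) + 1) / len(y_curr)
    return recall, cost

def make_version(rng, n_neg, n_pos):
    """Synthetic stand-in for one project version (hypothetical data)."""
    X = np.vstack([rng.normal(0.0, 1.0, (n_neg, 3)),
                   rng.normal(2.0, 1.0, (n_pos, 3))])
    y = np.array([0] * n_neg + [1] * n_pos)
    return X, y

rng = np.random.default_rng(0)
X_prev, y_prev = make_version(rng, 150, 20)   # plays the role of version n-1
X_curr, y_curr = make_version(rng, 150, 20)   # plays the role of version n

clf = train(X_prev, y_prev)
recall, cost = retrieve(predict_prob(clf, X_curr), y_curr)
```

Plotting `recall` against `cost` gives a learning curve of the kind shown in Figure 2.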

4.3 Machine Learning Algorithms

We choose three machine learning algorithms, i.e., Support Vector Machine (SVM), Random Forest (RF), and Decision Tree (DT). These classifiers are selected for their common use in the software engineering literature. All three algorithms are studied in Wang et al.'s paper wang2018there , where the best performance is obtained by Random Forest, followed by Decision Tree. SVM obtains the worst performance of the six algorithms reported by Wang et al. wang2018there , but due to its wide combination with active learning and its promising performance in many research areas such as image retrieval pasolli2013svm and text classification tong2001support , especially on imbalanced problems ertekin2007active , we also include this algorithm in our work. We now give a brief description of these algorithms and their use in this work.

Support Vector Machine. Support Vector Machine (SVM) cortes1995support is a supervised learning model for binary classification and regression analysis. The optimization objective of SVM is to maximize the margin, which is defined as the distance between the separating hyperplane (i.e., the decision boundary) and the training samples that are closest to the hyperplane (i.e., the support vectors). Support vector machines are powerful linear models; they can also tackle nonlinear problems through the kernel trick, which introduces multiple hyperparameters that can be tuned to make good predictions.

Random Forest. Random forests liaw2002classification can be viewed as an ensemble of decision trees. The idea behind ensemble learning is to combine weak learners to build a more robust model, a strong learner, that has a better generalization error and is less susceptible to over-fitting. Such forests can be used for both classification and regression problems, and can be used to measure the relative importance of each feature on the prediction (by counting how often attributes are used in each tree of the forest).

Decision Tree. Decision tree learners are known for their ability to decompose a complex decision process into small and simple sub-decisions safavian1991survey ; in this process, an associated multistage decision tree is developed hierarchically. Several tree-based approaches are widely used in software engineering, such as ID3, C4.5 and CART. Decision trees are computationally cheap to use and easy for developers or managers to interpret.
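The three learners described above can be set up in a few lines. The following is a minimal sketch, assuming scikit-learn; the exact library calls and parameters used in the experiments are not given in this section. The synthetic data from `make_classification` is only a stand-in for real warning features, and reading "weighted" SVM as `class_weight="balanced"` is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for warning features (assumption: 23 features,
# roughly 15% actionable); real data would come from extracted SA features.
X, y = make_classification(n_samples=400, n_features=23, n_informative=8,
                           weights=[0.85, 0.15], random_state=0)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

learners = {
    # "Weighted" is read here as class_weight="balanced" (an assumption).
    "SVM": SVC(kernel="linear", class_weight="balanced",
               probability=True, random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
}

probabilities = {}
for name, clf in learners.items():
    clf.fit(X_train, y_train)
    # Column 1 of predict_proba is the probability of the positive label (1).
    probabilities[name] = clf.predict_proba(X_test)[:, 1]
    print(name, "mean P(actionable) =", round(float(probabilities[name].mean()), 3))
```

Each learner exposes the same `fit` / `predict_proba` interface, so the ranking step that follows training does not depend on which classifier is chosen.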

5 Experiments

In this section, we answer the four research questions formulated in Section 1.

RQ1. What is the baseline rate for bad static warnings?

5.1 Research method

Static warning tools such as FindBugs, Jlint and PMD are widely used in static warning analysis. Previous research has shown that FindBugs is more reliable than other SA tools in discriminating between true and false positives wang2018there ; rahman2014comparing . FindBugs is also known as a cost-efficient SA tool that detects warnings at a combination of line-level, method-level and class-level granularity, and thus reports far fewer warnings, each covering noticeably more lines rahman2014comparing ; panichella2015would . Because of these merits, FindBugs has gained widespread popularity among individual users and major companies such as Google. (Footnote 4: In 2009, Google held a global fixit for UMD's FindBugs tool, aimed at gathering feedback on the 4,000 highest-confidence warnings reported by FindBugs. FindBugs has been downloaded more than a million times so far.)

For a baseline result, we used the default priority ranking reported by FindBugs. FindBugs classifies the warnings it generates into seven pattern categories shen2011efindbugs ; warnings with the same priority are treated as equally severe and are therefore ordered randomly within that priority. A higher priority denotes that FindBugs considers the warning report more likely to be actionable. This random ranking strategy provides a reasonable probabilistic bound on the time software developers need to find bugs, and it represents the scenario in which no information is available to prioritize warning reports heckman2011systematic ; kremenek2004correlation .

5.2 Research results

As shown in Figure 4, the dark blue dashed line denotes the learning curve of random selection generated from FindBugs reports. The curve grows diagonally, indicating that an end user without any historical warning information or auxiliary tool has to inspect 2507 warnings to identify only 121 actionable ones in the Derby dataset.
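To illustrate why a random ordering produces a diagonal curve, the sketch below simulates reading warnings in random order and records recall after each inspection. It reads the Derby figures above as 121 actionable warnings among 2507 in total, which is an assumption about the sentence; the function name and seed are hypothetical.

```python
import random

def random_order_recall(n_warnings, n_actionable, seed=0):
    """Recall after each inspection when warnings are read in random order."""
    rng = random.Random(seed)
    labels = [1] * n_actionable + [0] * (n_warnings - n_actionable)
    rng.shuffle(labels)
    found, curve = 0, []
    for label in labels:
        found += label
        curve.append(found / n_actionable)
    return curve

curve = random_order_recall(2507, 121)
halfway = curve[len(curve) // 2 - 1]
print(f"recall after inspecting half of the warnings: {halfway:.2f}")
# First position at which every actionable warning has been seen.
last_needed = next(i + 1 for i, r in enumerate(curve) if r == 1.0)
print("inspections needed for 100% recall:", last_needed)
```

On average, recall after inspecting k of N randomly ordered warnings is k/N, so nearly every warning must be read before the last actionable one is found.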

RQ2. What is the previous state-of-the-art method to tackle the prevalence of actionable warnings in SA tools?

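| Project | Active+SVM Median | Active+SVM IQR | Supervised_SVM Median (IQR) | SVM Median of Prior work | Active+RF Median | Active+RF IQR | Supervised_RF Median (IQR) | RF Median of Prior work | Active+DT Median | Active+DT IQR | Supervised_DT Median (IQR) | DT Median of Prior work |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Derby | 98 | 1 | 97 (2) | 50 | 96 | 7 | 97 (4) | 43 | 93 | 2 | 94 (4) | 44 |
| Mvn | 94 | 3 | 96 (7) | 50 | 93 | 2 | 97 (3) | 45 | 67 | 3 | 91 (2) | 45 |
| Lucence | 95 | 1 | 97 (3) | 50 | 85 | 9 | 99 (2) | 98 | 94 | 2 | 93 (4) | 98 |
| Phoenix | 97 | 2 | 97 (3) | 62 | 90 | 7 | 97 (3) | 71 | 90 | 2 | 91 (7) | 70 |
| Cass | 96 | 5 | 99 (3) | 67 | 96 | 4 | 98 (5) | 70 | 90 | 1 | 94 (4) | 69 |
| Jmeter | 94 | 1 | 95 (2) | 50 | 90 | 4 | 97 (2) | 86 | 86 | 2 | 91 (12) | 82 |
| Tomcat | 98 | 1 | 97 (3) | 50 | 92 | 5 | 96 (2) | 80 | 94 | 2 | 92 (6) | 64 |
| Ant | 95 | 2 | 98 (2) | 50 | 94 | 1 | 98 (3) | 44 | 84 | 3 | 94 (7) | 44 |
| Commons | 91 | 3 | 98 (3) | 50 | 93 | 1 | 92 (2) | 57 | 80 | 8 | 85 (14) | 56 |

Table 6: AUC (%) on 9 projects for 10 runs.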

5.3 Research method

Wang et al. wang2018there implement a Java tool to extract the values of 116 features collected from an exhaustive systematic literature review and use the machine learning utility Weka (footnote 5: ml/weka/) to build classifier models. An optimal set of 23 SA features is identified as the golden features, i.e., the set that obtains the best AUC values when evaluated with six machine learning classifiers. We reproduce the experiments with the three best-performing supervised learning models from that study, i.e., weighted linear SVM, random forest and decision tree, with default parameters in Python 3.7. The detailed process to reproduce our baseline is given in Algorithm 1.

The specific process is as follows. For each project, a supervised model (weighted SVM, Random Forest or Decision Tree) is trained on Version 4. We then test on Version 5 of the same project and obtain, for each warning reported by FindBugs, the probability that it is actionable. We sort these probabilities from most likely to least likely actionable and retrieve the warnings in this descending order to report total recall, cost and AUC as evaluation metrics.
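The retrieval step of this process can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes that cost is the fraction of warnings inspected so far and that AUC is the area under the recall-versus-cost curve; the function names and the toy probabilities are hypothetical.

```python
def retrieve(prob, labels):
    """Inspect warnings from most to least likely actionable.

    Returns (cost, recall): after each inspection, the fraction of warnings
    read so far and the fraction of actionable warnings found so far.
    """
    order = sorted(range(len(prob)), key=lambda i: prob[i], reverse=True)
    total_actionable = sum(labels)
    found, cost, recall = 0, [], []
    for rank, i in enumerate(order, start=1):
        found += labels[i]
        cost.append(rank / len(labels))
        recall.append(found / total_actionable)
    return cost, recall

def area_under_curve(cost, recall):
    """Trapezoidal area under the recall-vs-cost curve, starting at (0, 0)."""
    xs, ys = [0.0] + cost, [0.0] + recall
    return sum((xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2
               for k in range(1, len(xs)))

# Toy predictions on a test version: 1 = actionable, 0 = unactionable.
prob = [0.9, 0.2, 0.8, 0.1, 0.4, 0.3]
labels = [1, 0, 1, 0, 0, 1]
cost, recall = retrieve(prob, labels)
auc = area_under_curve(cost, recall)
print("recall:", [round(r, 2) for r in recall])
print("AUC:", round(auc, 3))
```

A ranking that places every actionable warning first pushes the area towards 1, while a random ranking gives an area near 0.5.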

5.4 Research results

Table 6 reports the median and IQR of the AUC scores over ten runs on nine projects. Of the three supervised learning methods explored, linear weighted Support Vector Machine and Random Forest both outperform Decision Tree. For the incrementally active learning algorithms, the best combination is Active Learning + Support Vector Machine, followed by Active Learning + Random Forest and Active Learning + Decision Tree.

We find that incrementally active learning obtains AUC no worse than supervised learning on most datasets. In Table 6, pink shading highlights median results for the active learning method that are better than, or at most 0.05 below, the median AUC of the state of the art.

The column "Prior Work" shows the results reported in Wang et al.'s prior research wang2018there . Note that our AUC scores for the supervised models reproduced in Python 3.7 are far higher than those of the prior work implemented in Weka. This difference may be caused by different parameter settings in the two tools.

RQ3. Does incrementally active learning reduce the cost to identify actionable Static Warning?

The purpose of this research question is to compare incrementally active learning with random selection and traditional supervised learning models.

5.5 Research method

Considering real-world scenarios in which a software project is at different stages of its life cycle, RQ3 is answered in two parts. We first contrast incrementally active learning (denoted as solid lines in Figure 4) with random ranking (the default ranking reported by FindBugs, denoted as the dark blue dashed line in Figure 4). Then, we compare the active learning results with supervised learning (denoted as purple, light blue and red dashed lines in Figure 4).

5.6 Research results

Results of the supervised learning methods are denoted as light blue, purple and red dashed lines. As Figure 4 reveals, Random Forest outperforms the other classifiers, followed by Linear SVM and Decision Tree.

Figure 4 provides an overall view of the experimental results that address Research Question 3. The nine subplots show the results of experiments repeated ten times on the fourth and fifth versions of the nine projects; we report only the median values here. The latest version, version 5, is used for incrementally active learning, while for the supervised learning models we use the two latest versions, learning patterns from version 4 for model construction and testing on version 5 for evaluation, so that the experimental results are comparable.

Figure 5 summarizes the ratio of real actionable warnings in version 5 of each project and the corresponding median cost when applying incrementally active learning to identify all of these real warnings.

As Figure 4 shows, incrementally active learning outperforms random order, which simulates the real time cost bound when an end user resorts to warning reports prioritized by FindBugs. Moreover, the learning curve of incrementally active learning, which uses no historical version, is almost as good as that of supervised learning, which relies on version history, in most of the nine projects. The test results on the nine datasets also suggest that Linear SVM is the best learner to combine with incrementally active learning, and that Random Forest is the winner among the supervised learners.

Overall, the above observations suggest that applying an incrementally active learning model to static warning identification can help retrieve actionable warnings with higher priority and reduce the effort needed to eliminate false alarms for software projects without adequate version history.
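The incremental loop behind these results can be sketched as below. This is not the authors' implementation: it assumes scikit-learn, synthetic data, a simulated human oracle that reveals the true label on inspection, and a particular sequence of random, uncertainty and certainty sampling. The switch point `switch_after`, the batch size, and the use of the signed SVM margin in place of a predicted probability are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def active_learning_order(X, y, batch=10, switch_after=10, seed=0):
    """Return the order in which warnings are shown to a (simulated) human.

    Assumed strategy: random sampling until both labels have been seen,
    uncertainty sampling until `switch_after` actionable warnings are found,
    then certainty sampling (largest margin towards "actionable" first).
    """
    rng = np.random.default_rng(seed)
    unlabeled = list(range(len(y)))
    labeled = []
    while unlabeled:
        idx = np.asarray(labeled, dtype=int)
        seen = y[idx]
        if len(set(seen.tolist())) < 2:
            size = min(batch, len(unlabeled))
            picks = rng.choice(unlabeled, size=size, replace=False).tolist()
        else:
            clf = SVC(kernel="linear", class_weight="balanced")
            clf.fit(X[idx], seen)
            margin = clf.decision_function(X[unlabeled])
            if seen.sum() < switch_after:
                ranking = np.argsort(np.abs(margin))  # most uncertain first
            else:
                ranking = np.argsort(-margin)         # most likely actionable first
            picks = [unlabeled[j] for j in ranking[:batch]]
        # The oracle (here, the known label in y) reveals labels on inspection.
        labeled.extend(picks)
        picked = set(picks)
        unlabeled = [i for i in unlabeled if i not in picked]
    return labeled

X, y = make_classification(n_samples=300, n_features=23, n_informative=8,
                           weights=[0.85, 0.15], random_state=1)
order = active_learning_order(X, y)
found = np.cumsum(y[order])
# Number of inspections needed to reach 90% recall of actionable warnings.
needed = int(np.argmax(found >= 0.9 * y.sum())) + 1
print(f"inspected {needed} of {len(y)} warnings for 90% recall")
```

Only the current project's warnings are used; no earlier version is needed to start the loop, which is the property the comparison with supervised learning relies on.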

Figure 4: Test Results. Panels: (a) commons, (b) tomcat, (c) jmeter, (d) cass, (e) derby, (f) phoenix, (g) lucence, (h) mvn, (i) ant.
Figure 5: Cost Results at different thresholds for Incrementally Active Learning.

RQ4. How many samples should be retrieved to identify all the actionable Static Warning?

How many samples to retrieve is a critical question when applying an active learning model to static warning identification. Stopping too early misses important actionable warnings, while stopping too late wastes running time and CPU resources.

In the following, we introduce the research method and analyze the experimental results to answer Research Question 4.

5.7 Research method

Figure 5 uses box plots to describe the cost, i.e., the percentage of samples visited, required by three classifiers (linear weighted SVM, Random Forest and Decision Tree) combined with the incrementally active learning algorithm. The horizontal axis represents the recall threshold, a mechanism that stops retrieving new potentially actionable warnings once the proportion of relevant samples found reaches the given threshold. The vertical axis shows the corresponding effort required to reach that recall, measured as the proportion of warnings visited.

5.8 Research results

Based on the results shown in Figure 5, the effort required grows gently and slowly as the threshold of relevant warnings visited increases from 70% to 90%. However, reaching the 100% threshold requires almost twice, or more than twice, the effort needed at the 90% threshold. An intuitive suggestion from Figure 5 is to learn from 20% to 30% of the warnings of each of these nine projects, in which case the active learning model can identify over 90% of actionable warnings.
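The cost at a given recall threshold can be computed from the order in which warnings were inspected. The sketch below uses a hypothetical ranking of 50 warnings with 10 actionable ones; the numbers are made up purely to show the computation and are not taken from the experiments.

```python
import math

def cost_at_recall(ranked_labels, threshold):
    """Fraction of warnings inspected when recall first reaches `threshold`.

    `ranked_labels` lists 1 (actionable) or 0 (unactionable) in the order
    the learner presents warnings for inspection.
    """
    total = sum(ranked_labels)
    # Number of actionable warnings required; the epsilon guards against
    # floating-point round-off in threshold * total.
    needed = math.ceil(threshold * total - 1e-9)
    found = 0
    for inspected, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            return inspected / len(ranked_labels)
    return 1.0

# Hypothetical inspection order: positions (1-based) of actionable warnings.
positions = {1, 2, 3, 4, 5, 6, 8, 10, 13, 30}
ranked = [1 if p in positions else 0 for p in range(1, 51)]

for threshold in (0.7, 0.8, 0.9, 1.0):
    print(f"recall {threshold:.0%}: cost {cost_at_recall(ranked, threshold):.2f}")
```

In this toy ranking the last actionable warning sits far down the list, so the cost of 100% recall is more than double the cost of 90% recall, the same shape of trade-off that the box plots describe.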

However, there is an exception. The results for lucence reveal that our model has to learn from more than 40% of the data to identify 90% of actionable warnings. Table 4 indicates that most of our projects are unbalanced data sets (the ratio of target points is less than 20 percent for derby, mvn, phoenix, cass, commons and ant, and slightly over 20 percent for jmeter and tomcat), while the ratio for lucence (about 35 percent) is relatively higher. Our study attempts to provide a solid guideline, but there is no general conclusion about the specific percentage of data that should be fed into the learner. It depends heavily on the degree of data imbalance and on the trade-off between missing target samples and reducing cost, since the cost can only be reduced at the expense of a lower threshold, which means missing some real actionable warnings.

In summary, our model has been shown to be an efficient way to handle the information retrieval problem of SA identification on extremely unbalanced data sets. It is also a good option for engineers and researchers who want to apply active learning to general problems, because it has a lower building cost, a wider application range and a higher efficiency than state-of-the-art supervised learning methods and random selection.

6 Discussion

6.1 Threats to validity

As with any empirical study, biases can affect the final results. Therefore, conclusions drawn from this work must be considered with threats to validity in mind. In this section, we discuss the validity of our work.

Learner bias. This work applies three classifiers, weighted linear SVM, Random Forest and Decision Tree, which are the best setting according to previous research wang2018there . However, this does not necessarily guarantee the best performance in other domains or on other static warning datasets. According to the No Free Lunch Theorems wolpert1997no , our framework would need to be applied to other areas before we can assert that our methods are also better in those domains.

Sampling bias. One of the most important threats to validity is sampling bias, since several sampling methods (random sampling, uncertainty sampling and certainty sampling) are used in combination. Many other sampling methods from the active learning literature could also be used, and different sampling strategies and combinations may result in better performance. This is a potential research direction.

Ratio bias. In this paper, we propose an ideal scale value for our learner to retrieve on nine static warning datasets, to effectively address the prevalence of false positives in warnings reported by SA tools. An obvious improvement is observed for this unbalanced problem, but it does not necessarily apply to balanced datasets.

Measurement bias. To evaluate the validity of the incrementally active learning method proposed in this paper, we employ two measurement metrics: total recall and cost. Several prior studies have demonstrated the necessity and effectiveness of these measurements yu2018improving ; yu2019fast2 . Nevertheless, many studies are still based on classic, traditional metrics, e.g., the confusion matrix (also known as the error matrix) landgrebe2008efficient . Many popular terms and derived measures come from the confusion matrix, such as false positives, F1 score and G measure. We cannot explore and include all the options in one article. Moreover, even for the same research methodology, conclusions drawn from different evaluation metrics may differ. However, in this research scenario, it is more efficient to report recall and cost for an effort-aware model.

6.2 Future Work

Estimation. In real-world problems, labeled data may be scarce or expensive to obtain, while unlabeled data may be abundant. In this case, the query process of our incremental learning model cannot safely stop at a given target threshold without knowing the actual number of actionable warnings in the data set beforehand. Therefore, estimation is required to guarantee that the algorithm stops at an appropriate stage: stopping too late causes unnecessary cost in exploring unactionable warnings and increases false alarms, while stopping too early may miss potentially important true warnings.

Ensemble of classifiers. Ensemble learning is a methodology for making decisions based on the inputs of multiple experts or classifiers zhang2012ensemble . It is a feasible and important scheme for reducing the variance of classifiers and improving the reliability and robustness of the decision system. The well-known No Free Lunch Theorems proposed by Wolpert et al. wolpert1997no give us intuitive guidance to resort to ensemble learners. It would be promising to make the best of incremental active learning by making precise predictions and pinpointing real actionable warnings with a generalized decision system.

7 Conclusion

Previous research shows that about 35% to 91% of the warnings reported as bugs by static analysis tools are actually unactionable (i.e., warnings that would not be acted on by developers because they are falsely suggested as bugs). Therefore, to make such systems usable by programmers, some mechanism is required to reduce those false alarms.

Arnold et al. arnold2009security warn that knowledge about what is an ignorable static code warning may not transfer from project to project. They advise that methods for managing static code warnings be tuned to different software projects. While we agree with that advice, it does create a knowledge acquisition bottleneck, since acquiring that knowledge can be a time-consuming and tedious task.

This paper explored methods for acquiring knowledge of which static code warnings can be ignored. Using a human-in-the-loop active learner, we conducted an empirical study with 9 software projects and 3 machine learning classifiers to verify how the performance of current SA tools could be improved by an efficient incrementally active learning method. We found that about 90% of actionable static warnings can be identified by inspecting only about 20% to 30% of warning reports, without using historical version information. Our study attempts to bridge the gap between supervised learning and effort-aware active learning models through an in-depth analysis of reducing the cost of static warning identification.

Our methods significantly decrease the cost of inspecting falsely reported warnings generated by static code analysis tools for software engineers (especially in the early stages of a software project's life cycle) and provide a meaningful guideline for improving the performance of current SA tools. Acceptance and adoption of future static analysis tools can be enhanced by combining SA feature extraction with self-adaptive incrementally active learning.


This work was partially funded by NSF grant #1908762.


  • [1] E. A. Feigenbaum, Knowledge engineering: the applied side of artificial intelligence., Tech. rep., STANFORD UNIV CA DEPT OF COMPUTER SCIENCE (1980).
  • [2] R. Hoekstra, The knowledge reengineering bottleneck, Semantic Web 1 (1, 2) (2010) 111–115.
  • [3] S. Kim, M. D. Ernst, Which warnings should I fix first?, in: Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, ACM, 2007, pp. 45–54.
  • [4] S. Heckman, L. Williams, On establishing a benchmark for evaluating static analysis alert prioritization and classification techniques, in: Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and measurement, ACM, 2008, pp. 41–50.
  • [5] S. Heckman, L. Williams, A systematic literature review of actionable alert identification techniques for automated static code analysis, Information and Software Technology 53 (4) (2011) 363–387.
  • [6] J. Wang, S. Wang, Q. Wang, Is there a golden feature set for static warning identification?: an experimental evaluation, in: Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ACM, 2018, p. 17.
  • [7] Z. Yu, T. Menzies, Fast2: An intelligent assistant for finding relevant papers, Expert Systems with Applications 120 (2019) 57–71.
  • [8] Z. Yu, C. Theisen, H. Sohn, L. Williams, T. Menzies, Cost-aware vulnerability prediction: the harmless approach. corr abs/1803.06545 (2018), arXiv preprint arXiv:1803.06545.
  • [9] J. Wang, S. Wang, Q. Cui, Q. Wang, Local-based active classification of test report to assist crowdsourced testing, in: 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, 2016, pp. 190–201.
  • [10] P. K. Murukannaiah, M. P. Singh, Platys: An active learning framework for place-aware application development and its evaluation, ACM Transactions on Software Engineering and Methodology (TOSEM) 24 (3) (2015) 19.
  • [11] J. F. Bowring, J. M. Rehg, M. J. Harrold, Active learning for automatic classification of software behavior, in: ACM SIGSOFT Software Engineering Notes, Vol. 29, ACM, 2004, pp. 195–205.
  • [12] J. Krall, T. Menzies, M. Davies, Gale: Geometric active learning for search-based software engineering, IEEE Transactions on Software Engineering 41 (10) (2015) 1001–1018.
  • [13] N. Ayewah, W. Pugh, D. Hovemeyer, J. D. Morgenthaler, J. Penix, Using static analysis to find bugs, IEEE software 25 (5) (2008) 22–29.
  • [14] D. Wijayasekara, M. Manic, J. L. Wright, M. McQueen, Mining bug databases for unidentified software vulnerabilities, in: 2012 5th International Conference on Human System Interactions, IEEE, 2012, pp. 89–96.
  • [15] S. S. Murtaza, W. Khreich, A. Hamou-Lhadj, A. B. Bener, Mining trends and patterns of software vulnerabilities, Journal of Systems and Software 117 (2016) 218–228.
  • [16] J. Arnold, T. Abbott, W. Daher, G. Price, N. Elhage, G. Thomas, A. Kaseorg, Security impact ratings considered harmful, arXiv preprint arXiv:0904.4058.
  • [17] D. Hovemeyer, W. Pugh, Finding bugs is easy, Acm sigplan notices 39 (12) (2004) 92–106.
  • [18] M. Yan, X. Zhang, L. Xu, H. Hu, S. Sun, X. Xia, Revisiting the correlation between alerts and software defects: A case study on myfaces, camel, and cxf, in: 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC), Vol. 1, IEEE, 2017, pp. 103–108.
  • [19] F. Thung, D. Lo, L. Jiang, F. Rahman, P. T. Devanbu, et al., To what extent could we detect field defects? an extended empirical study of false negatives in static bug-finding tools, Automated Software Engineering 22 (4) (2015) 561–602.
  • [20] P. Avgustinov, A. I. Baars, A. S. Henriksen, G. Lavender, G. Menzel, O. de Moor, M. Schäfer, J. Tibble, Tracking static analysis violations over time to capture developer characteristics, in: Proceedings of the 37th International Conference on Software Engineering-Volume 1, IEEE Press, 2015, pp. 437–447.
  • [21] B. Johnson, Y. Song, E. Murphy-Hill, R. Bowdidge, Why don’t software developers use static analysis tools to find bugs?, in: Proceedings of the 2013 International Conference on Software Engineering, IEEE Press, 2013, pp. 672–681.
  • [22] C. Boogerd, L. Moonen, Assessing the value of coding standards: An empirical study, in: 2008 IEEE International Conference on Software Maintenance, IEEE, 2008, pp. 277–286.
  • [23] P. Bhattacharya, M. Iliofotou, I. Neamtiu, M. Faloutsos, Graph-based analysis and prediction for software evolution, in: 2012 34th International Conference on Software Engineering (ICSE), IEEE, 2012, pp. 419–429.
  • [24] S. Wang, T. Liu, L. Tan, Automatically learning semantic features for defect prediction, in: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), IEEE, 2016, pp. 297–308.
  • [25] S. Shivaji, E. J. Whitehead Jr, R. Akella, S. Kim, Reducing features to improve bug prediction, in: 2009 IEEE/ACM International Conference on Automated Software Engineering, IEEE, 2009, pp. 600–604.
  • [26] T. Kremenek, K. Ashcraft, J. Yang, D. Engler, Correlation exploitation in error ranking, in: ACM SIGSOFT Software Engineering Notes, Vol. 29, ACM, 2004, pp. 83–93.
  • [27] S. Allier, N. Anquetil, A. Hora, S. Ducasse, A framework to compare alert ranking algorithms, in: 2012 19th Working Conference on Reverse Engineering, IEEE, 2012, pp. 277–285.
  • [28] H. Shen, J. Fang, J. Zhao, Efindbugs: Effective error ranking for findbugs, in: 2011 Fourth IEEE International Conference on Software Testing, Verification and Validation, IEEE, 2011, pp. 299–308.
  • [29] G. Liang, L. Wu, Q. Wu, Q. Wang, T. Xie, H. Mei, Automatic construction of an effective training set for prioritizing static analysis warnings, in: Proceedings of the IEEE/ACM international conference on Automated software engineering, ACM, 2010, pp. 93–102.
  • [30] S. Kim, M. D. Ernst, Prioritizing warning categories by analyzing software history, in: Proceedings of the Fourth International Workshop on Mining Software Repositories, IEEE Computer Society, 2007, p. 27.
  • [31] Q. Hanam, L. Tan, R. Holmes, P. Lam, Finding patterns in static analysis alerts: improving actionable alert ranking, in: Proceedings of the 11th Working Conference on Mining Software Repositories, ACM, 2014, pp. 152–161.
  • [32] S. Heckman, L. Williams, A model building process for identifying actionable static analysis alerts, in: 2009 International Conference on Software Testing Verification and Validation, IEEE, 2009, pp. 161–170.
  • [33] B. Settles, Active learning literature survey, Tech. rep., University of Wisconsin-Madison Department of Computer Sciences (2009).
  • [34] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in python, Journal of machine learning research 12 (Oct) (2011) 2825–2830.
  • [35] G. V. Cormack, M. R. Grossman, Autonomy and reliability of continuous active learning for technology-assisted review, arXiv preprint arXiv:1504.06868.
  • [36] B. C. Wallace, C. H. Schmid, J. Lau, T. A. Trikalinos, Meta-analyst: software for meta-analysis of binary, continuous and diagnostic data, BMC medical research methodology 9 (1) (2009) 80.
  • [37] B. C. Wallace, T. A. Trikalinos, J. Lau, C. Brodley, C. H. Schmid, Semi-automated screening of biomedical citations for systematic reviews, BMC bioinformatics 11 (1) (2010) 55.
  • [38] M. Miwa, J. Thomas, A. O’Mara-Eves, S. Ananiadou, Reducing systematic review workload through certainty-based screening, Journal of biomedical informatics 51 (2014) 242–253.
  • [39] Z. Yu, C. Theisen, L. Williams, T. Menzies, Improving vulnerability inspection efficiency using active learning, arXiv preprint arXiv:1803.06545.
  • [40] I. H. Witten, E. Frank, M. A. Hall, C. J. Pal, Data Mining: Practical machine learning tools and techniques, Morgan Kaufmann, 2016.
  • [41] E. Pasolli, F. Melgani, D. Tuia, F. Pacifici, W. J. Emery, Svm active learning approach for image classification using spatial information, IEEE Transactions on Geoscience and Remote Sensing 52 (4) (2013) 2217–2233.
  • [42] S. Tong, D. Koller, Support vector machine active learning with applications to text classification, Journal of machine learning research 2 (Nov) (2001) 45–66.
  • [43] S. Ertekin, J. Huang, C. L. Giles, Active learning for class imbalance problem, in: SIGIR, Vol. 7, 2007, pp. 823–824.
  • [44] C. Cortes, V. Vapnik, Support-vector networks, Machine learning 20 (3) (1995) 273–297.
  • [45] A. Liaw, M. Wiener, et al., Classification and regression by randomforest, R news 2 (3) (2002) 18–22.
  • [46] S. R. Safavian, D. Landgrebe, A survey of decision tree classifier methodology, IEEE transactions on systems, man, and cybernetics 21 (3) (1991) 660–674.
  • [47] F. Rahman, S. Khatri, E. T. Barr, P. Devanbu, Comparing static bug finders and statistical prediction, in: Proceedings of the 36th International Conference on Software Engineering, ACM, 2014, pp. 424–434.
  • [48] S. Panichella, V. Arnaoudova, M. Di Penta, G. Antoniol, Would static analysis tools help developers with code reviews?, in: 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), IEEE, 2015, pp. 161–170.
  • [49] D. H. Wolpert, W. G. Macready, et al., No free lunch theorems for optimization, IEEE transactions on evolutionary computation 1 (1) (1997) 67–82.
  • [50] T. C. Landgrebe, R. P. Duin, Efficient multiclass roc approximation by decomposition via confusion matrix perturbation analysis, IEEE transactions on pattern analysis and machine intelligence 30 (5) (2008) 810–822.
  • [51] C. Zhang, Y. Ma, Ensemble machine learning: methods and applications, Springer, 2012.