1 Introduction
Static code warnings comment on a range of potential defects such as common programming errors, code styling, inline comments, common programming antipatterns, and questionable coding decisions ayewah2008using. Static code warning tools are quite popular. For example, the FindBugs static code analysis tool (shown in Figure 1) has been downloaded over a million times.
One issue with static code warnings is that they generate a large number of false positives. Many programmers routinely ignore most static code warnings, finding them irrelevant or spurious wang2018there. Such warnings are considered “unactionable” since programmers never take action on them. Between 35% and 91% of the warnings from static analysis tools are known to be unactionable heckman2011systematic. Hence it is prudent to learn to recognize what kinds of warnings programmers usually act upon. With such a classifier, static code warning tools can be made more useful by first pruning away the unactionable warnings.
As shown in this paper, data mining methods can be used to generate very accurate models for this task. This paper searches for 5,675 actionable warnings within a sample of 31,058 static code warnings generated by FindBugs on nine open-source Java projects wang2018there. In our experiments (where we trained on one release and then tested on the next release), we built models (using linear SVM) that predicted actionable warnings with recalls over 87%, false alarms under 7%, and AUC over 97%. These results are a new high watermark in this area of research since they outperform a prior state-of-the-art result (the so-called “golden set” approach reported at ESEM’18 by Wang et al. wang2018there).
Apart from making specific conclusions about static code warnings, our research offers another, more general, lesson about how to use data mining for software analytics. Complex tasks, like vision systems in autonomous cars, need complex learners (e.g. deep learning). But as shown in this paper, simpler problems might be better addressed using much simpler learners such as linear SVM. At least for this task, complex methods like deep learners ran far slower and performed no better than much simpler methods. This was somewhat surprising since, to say the least, there are many advocates of deep learning for software analytics (e.g. lin2018sentiment; guo2017semantically; chen2019mining; gu2016deep; nguyen2017exploring; choetkiertikul2018deep; zhao2018deepsim; white2016deep).
To understand why our problem did not demonstrate the superiority of deep learning, we looked again at our problem. We found that our data has a low “intrinsic dimensionality”. That is to say, while our data sets have up to 58 raw features, those features can be approximated by fewer than two underlying dimensions (for details on intrinsic dimensionality, and how it might be calculated, see §4.1). We conjecture that for such intrinsically simple data, the sophistication of deep learning is unnecessary.
The rest of this paper is structured as follows. The background to this work is introduced in Section 2. Our methodology is described in Section 3. In Section 3.3 and Section 4, we analyse experiment results. Threats to validity and future work are discussed in Section 5. Our conclusions, drawn in Section 6, will be threefold:

It is possible and effective to augment static code warning tools with a postprocessor that prunes away the warnings that programmers will ignore.

Before selecting a data mining algorithm, always check the intrinsic dimensionality of the data.

After checking the intrinsic dimensionality, match the complexity of the learner to the complexity of the problem.
To facilitate other researchers in this area, all our scripts and data are freely available online at https://github.com/XueqiYang/intrinsic_dimension.
2 Background
2.1 Studying Static Code Warnings
Static code warning tools detect potential static code defects in source code or executable files at the stage of software product development. The distinguishing feature of these tools is that they make their comments without reference to a particular input. Nor do they use feedback from any execution of the code being studied. Examples of these tools include PMD (https://pmd.github.io/latest/index.html), Checkstyle (https://checkstyle.sourceforge.io/), and the FindBugs tool (http://findbugs.sourceforge.net) featured in Figure 1.
As mentioned in the introduction, previous research shows that 35% to 91% of the warnings reported as bugs by static warning analysis tools can be ignored by programmers heckman2011systematic. This high false alarm rate is one of the most significant barriers to developers using these tools thung2015extent; avgustinov2015tracking; johnson2013don. Various approaches have been tried to reduce these false alarms, including graph theory boogerd2008assessing; bhattacharya2012graph, statistical models chen2005novel, and ranking schemes kremenek2004correlation. For example, Allier et al. allier2012framework proposed a framework to compare 6 warning ranking algorithms and identified the best algorithms to rank warnings. Similarly, Shen et al. shen2011efindbugs employed a ranking technique to sort true error reports before anything else. Some other works also prioritize warnings by dividing the results into different categories of impact factors liang2010automatic or by analyzing software history kim2007prioritizing.
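Table 1: The eight categories of the “golden set” features (the feature names within each category are not reproduced here).
Warning combination
Code characteristics
Warning characteristics
File history
Code analysis
Code history
Warning history
File characteristics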
Another approach, and the one taken by this paper, utilizes machine learning algorithms to recognize which static code warnings programmers will act upon wang2016automatically; shivaji2009reducing; hanam2014finding. For example, when Heckman et al. applied 15 learning algorithms to 51 features derived from static analysis tools, they achieved recalls of 83-99% (average across 15 data sets) heckman2009model.
2.2 Wang et al.’s “Golden Set”
The data for this paper comes from a recent study by Wang et al. wang2018there. They conducted a systematic literature review to collect all publicly available static code features generated by widely used static code warning tools (116 in total):

All the values of these collected features were extracted from warning reports generated by FindBugs based on 60 revisions of 12 projects.

Six machine learning classifiers were then employed to automatically identify actionable static warnings (random forests, decision trees, a boosting algorithm, naive Bayes, linear regression, and support vector machines).

After applying a greedy backward selection algorithm to the results of those learners to eliminate non-effective features, they isolated 23 features as the most useful ones for identifying actionable warnings.

They called these features the “golden set”; i.e. the features most important for recognizing actionable static code warnings.
To the best of our knowledge, this is the most exhaustive research about static warning characteristics yet published.
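Table 2: Summary of the data sets. The training set and the test set each list the number of instances and the percentage of actionable warnings.
Dataset | Features | Training instances | Training actionable (%) | Test instances | Test actionable (%)
commons | 39 | 725 | 7 | 786 | 5
phoenix | 44 | 2235 | 18 | 2389 | 14
mvn | 47 | 813 | 8 | 818 | 3
jmeter | 49 | 604 | 25 | 613 | 24
cass | 55 | 2584 | 15 | 2601 | 14
ant | 56 | 1229 | 19 | 1115 | 5
lucene | 57 | 3259 | 37 | 3425 | 34
derby | 58 | 2479 | 9 | 2507 | 5
tomcat | 60 | 1435 | 28 | 1441 | 23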
As shown in Table 1, the “golden set” features fall into eight categories. These features are the independent variables used in this study.
To assign dependent labels, we applied the methods of Liang et al. liang2010automatic. They defined a specific warning as actionable if it is closed after the later revision interval. Otherwise, it is labeled as unactionable. Also, following Liang et al., anything labeled a “minor alert” is deleted and ignored.
By analyzing FindBugs output from two consecutive releases of nine software projects, then collecting the features of Table 1, then applying the Liang et al. definitions, we created the data of Table 2. In this table, the “training set” refers to the earlier of the two releases and the “test set” refers to the later one. In this study, we only employ the two latest releases.
Note that, for any particular data set, the 23 golden-set features of Table 1 can grow to more than 23 features. For example, consider the “return type” feature in the “code analysis” category. This can include numerous return types including void, int, URL, boolean, string, printStream, file, date (or a list of any of these). Hence, as shown in Table 2, the number of features in our data varied from 39 to 60.
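One way such growth can arise is sketched below. This is a minimal illustration that assumes the categorical values are one-hot encoded; the text does not state which encoding was actually used, and the function name `one_hot` and the sample values are hypothetical.

```python
def one_hot(values):
    """Expand one categorical feature into one 0/1 column per distinct value.

    Assumption: one-hot encoding is used here for illustration only.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical "return type" values for five warnings.
return_types = ["void", "int", "boolean", "void", "URL"]
columns, encoded = one_hot(return_types)

# One original feature has become len(columns) numeric features.
for value, row in zip(return_types, encoded):
    print(value, row)
```

Under this reading, a data set whose code uses more distinct return types ends up with more columns, which would be consistent with the per-project feature counts differing.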
Note also that one way to summarize the results of this paper is that the golden set is an inaccurate, verbose description of the attributes required to detect actionable static code warnings. As shown below, hiding within the 23 golden-set features of Table 1, there exist two synthetic dimensions, which can be found via a linear SVM.
2.3 Evaluation Metrics
Wang et al. reported their results in terms of AUC and running time:

AUC (Area Under the ROC Curve) measures the two-dimensional area under the Receiver Operating Characteristic (ROC) curve witten2016data; heckman2011systematic. It provides an aggregate evaluation of performance across all possible classification thresholds, and so reports the overall discrimination of a classifier wang2018there. This is a widely adopted measurement in software engineering, especially for imbalanced data liang2010automatic.

Running time measures the efficiency of the execution of an algorithm. In this paper, we use the running time of one run, from the start to the termination of algorithm execution, to compare the efficiency of different models.
Table 3 shows the AUC results achieved by Wang et al. wang2018there. In summary, Wang et al. reported Random Forest as the best learner to identify actionable static warnings.
In the software analytics literature, it is also common to assess learners via recall and false alarms:

Recall represents the ability of an algorithm to identify instances of the positive (actionable) class in a given data set. It is the ratio of detected actionable warnings to the total number of actionable warnings in the data set generated by static warning tools such as FindBugs.

False alarm (pf) measures how many of the warnings generated by static warning tools are classified by an algorithm as positive (actionable) when they are actually negative (unactionable). This is an important index used to measure the efficiency of a defect prediction model.
In the following, we will report results for all of these four evaluation measures.
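The three quality measures can be sketched in a few lines of plain Python. This is an illustrative sketch, not the scripts used in this paper: it assumes labels are coded as 1 for actionable and 0 for unactionable, takes false alarm to be the fraction of unactionable warnings flagged as actionable, and computes AUC through its rank interpretation (the chance that a random actionable warning is scored above a random unactionable one). All names and sample values are hypothetical.

```python
def recall_and_false_alarm(actual, predicted):
    """Return (recall, pf) for binary labels, where 1 = actionable."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    pf = fp / (fp + tn) if fp + tn else 0.0
    return recall, pf


def auc(actual, scores):
    """AUC via pairwise comparison of scores; ties count as one half."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores for eight warnings.
actual = [1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.3, 0.6, 0.1, 0.8, 0.2, 0.05]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

recall, pf = recall_and_false_alarm(actual, predicted)
print(recall, pf, auc(actual, scores))
```

Note that recall and false alarm depend on the chosen threshold, while AUC summarizes the ranking over all thresholds.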
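Table 3: AUC (%) reported by Wang et al. wang2018there.
Project | Random Forest | Decision Tree | SVM RBF
derby | 43 | 44 | 50
mvn | 45 | 45 | 50
lucene | 98 | 98 | 50
phoenix | 71 | 70 | 62
cass | 70 | 69 | 67
jmeter | 86 | 82 | 50
tomcat | 80 | 64 | 50
ant | 44 | 44 | 50
commons | 57 | 56 | 50
median | 70 | 64 | 50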
2.4 Learning to Recognize Actionable Static Code Warnings
Recall from the above that our data has two classes: actionable and unactionable. Technically speaking, our task is a binary classification problem. A recent survey by Ghotra et al. ghotra2015revisiting found that for software analytics, the performance of dozens of binary classifiers clusters into a handful of groups. Hence, by taking one classifier from each group, it is possible for just a few classifiers to act as representatives for a wide range of commonly used classifiers.
Decision trees quinlan1987generating seek splits on feature ranges that minimize the diversity of classes within each split. Once the best “splitter” is found, decision tree algorithms recurse on each split.
Random forests breiman1999random take the idea of decision trees one step further. Instead of building one tree, random forests build multiple trees (each time using a small random sample of the rows and columns from the original data). The final conclusion is then computed by a majority vote across all trees in the forest.
Support vector machines cortes1995support take another approach. With a kernel function, the data is mapped into a higher-dimensional space. Then, using quadratic programming, the algorithm finds the “support vectors”, which are the instances closest to the boundary that distinguishes the different classes.
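The split search used by tree learners, and the voting used by forests, can be sketched as follows. This is a minimal illustration under stated assumptions: it scores class diversity with Gini impurity (an assumption; other diversity measures, such as entropy, are also used by tree learners), it considers a single numeric feature, and all function names and data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure split)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # share of class 1
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Threshold on one numeric feature that minimizes weighted impurity."""
    best_threshold, best_score = None, float("inf")
    for threshold in sorted(set(values))[1:]:
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


def majority_vote(votes):
    """Combine 0/1 predictions from several trees, as a forest would."""
    return 1 if 2 * sum(votes) > len(votes) else 0


values = [1, 2, 3, 10, 11, 12]   # hypothetical feature values
labels = [0, 0, 0, 1, 1, 1]      # 1 = actionable
print(best_split(values, labels))
print(majority_vote([1, 0, 1]))
```

A full decision tree would apply `best_split` over every feature and then recurse on the two halves; a forest would repeat that on random samples of rows and columns before voting.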
2.5 Deep Learning
Since the Ghotra et al. survey ghotra2015revisiting was published in 2015, there has been much interest in the application of deep learning (DL) neural networks in software engineering. Applications for DL include bug localization huo2019deep; lin2018sentiment; guo2017semantically, API mining chen2019mining; gu2016deep; nguyen2017exploring, effort estimation for agile development choetkiertikul2018deep, code similarity detection zhao2018deepsim, code clone detection white2016deep, etc.
Deep neural networks are layers of connected units called neurons. A brief mechanism of a fully connected DNN model is shown in Figure 2. For this paper, SE artifacts are transformed into vectors and fed into the neural network as inputs in the input layer. Each neuron in the hidden and output layers multiplies its inputs by the weights of that neuron. The products are then summed and passed through a nonlinear transfer function, called the activation function, to yield a value. That value either serves as input to the next layer or is the final output of the network goh1995back.
Figure 2 illustrates a layered architecture of neurons where inputs at one layer are organized and synthesized as inputs at the next layer by the nonlinear transformations mentioned above. DL is known as an automatic feature engineering model which efficiently extracts the nonlinear and sophisticated patterns generally observed in real-world data such as speech, video, and audio. For instance, technology-intensive companies like Google and Facebook are utilizing massive volumes of raw data for commercial data analysis najafabadi2015deep. Within that layered architecture, only the most important signals from the inputs of one layer will make it through to the next layer. In this way, DL automates “feature engineering”, which is the synthesis of important new features using some part or combination of other features. This, in turn, means that predictors can be learned from very complex input signals with multiple features, without requiring manual preprocessing. For example, Lin et al. Lin2017StructuralDD replaced their mostly manual analysis of features extracted from a wavelet package with a deep learner that automatically synthesized significant features.
DL trains its networks by running its data repeatedly through the networks shown in Figure 2 in multiple “epochs”. Each epoch pushes all the data, batch by batch, over the network, and the resulting error on the output layer is computed. This repeats until the training error or loss function on the validation set is minimized. Error minimization is done via back propagation (BP). Parameters in DL (including neuron weights) are initialized randomly, and then these parameters are updated in each epoch of training using error back propagation. Hornik et al. hornik1991approximation have shown that with sufficient hidden neurons, a single hidden layer backpropagation neural network can accurately approximate any continuous function.
DL training may require hundreds to thousands of epochs for complicated problems. However, overtraining causes the model to overfit the training data and generalize poorly to the test set. Early stopping zhang2016understanding is a commonly used optimization strategy and regularizer in deep learning, which improves generalization and prevents overfitting. It stops training when performance on a validation dataset starts to degrade. We tried to prevent overfitting in our domain via early stopping. The maximum number of epochs is set to 100 and the patience of early stopping to 3; i.e. training stops if the performance on the training set does not improve for three consecutive epochs. After running our DLs, we could not improve performance after 8 to 30 epochs. Hence, all the results reported below come from 8 to 30 epochs.
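The patience rule can be sketched independently of any DL library. The following is a simplified stand-in rather than the Keras mechanism itself: it assumes that one monitored performance score per epoch is already available (higher is better), and the function name and sample history are hypothetical.

```python
def early_stopping_epoch(scores, patience=3, max_epochs=100):
    """Return the 1-based epoch at which training would stop.

    `scores` holds one monitored performance value per epoch (higher is
    better). Training halts once `patience` consecutive epochs fail to
    improve on the best score so far, or when `max_epochs` is reached.
    """
    best = float("-inf")
    stale = 0
    epoch = 0
    for epoch, score in enumerate(scores[:max_epochs], start=1):
        if score > best:
            best, stale = score, 0   # improvement: reset the counter
        else:
            stale += 1               # no improvement this epoch
            if stale >= patience:
                break
    return epoch


# Hypothetical per-epoch scores: the last improvement is at epoch 4.
history = [0.60, 0.70, 0.75, 0.78, 0.78, 0.77, 0.78, 0.80]
stop = early_stopping_epoch(history)
print(stop)
```

In Keras the comparable facility is the `EarlyStopping` callback, which takes a `patience` argument and a `monitor` argument naming the quantity to watch.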
3 Experiments
3.1 Learning Schemes
For this study, the non-DL learners came from Scikit-Learn pedregosa2011scikit while the DL methods came from the Keras package geron2019hands. For the three non-DL learners (random forests, decision trees, and linear support vector machines), we used the default control settings from Scikit-Learn. As to deep learning, we ran three DL schemes. As suggested in the literature review li2018deep, the (fully connected) deep neural network (DNN) and the convolutional neural network (CNN) are the most explored DL models in the SE area.
The first scheme is a fully connected deep neural network (DNN). For a description of this method, see Section 2.5. Starting with the defaults from Keras, we configure our DNN model as follows:

5 fully connected layers (with 30 neurons for each hidden layer) concatenated by dropout layers in between.

The activation functions for hidden layers were implemented using the ReLU function. ReLU stands for rectified linear unit, whose formula is $f(x) = \max(0, x)$. A common choice among activation functions, ReLU is known for merits such as being fast to compute, converging quickly in practice, and having gradients that do not vanish when $x > 0$ holds (i.e. when the current neuron is activated) li2017convergence. Batch normalization layers are applied before each activation function to avoid internal covariate shift (as the distribution of parameters changes while training deep neural networks, the current layer has to constantly readjust to new distributions) ioffe2015batch.
As said above, actionable warning identification is a binary problem. That is, any warning instance has a label $y \in \{0, 1\}$, where $0$ denotes that the warning is unactionable and $1$ denotes that it is actionable. Consequently, we use softmax as the activation function in the output layer of our network. Softmax takes the vector $z$ generated from the last hidden layer as input, exponentiates each entry (with base $e$), and normalizes the results into a probability distribution over all the candidate class labels: $p_k = e^{z_k} / \sum_j e^{z_j}$. For each instance, the softmax vector $(p_0, p_1)$ always sums to 1, where $p_0$ is the probability that this warning is unactionable and $p_1$ is the probability that it is actionable.
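The two activation functions named above are small enough to write out directly. The sketch below is illustrative only: the weights and inputs are made up, the helper `neuron` is hypothetical, and a real network would use a library implementation rather than this code.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, x)


def softmax(logits):
    """Map a list of real values to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by the ReLU activation."""
    return relu(sum(w * x for w, x in zip(weights, inputs)) + bias)


# Two hypothetical neurons fed the same two inputs.
hidden = [neuron([0.5, -1.0], [0.8, 0.2]),
          neuron([0.5, -1.0], [-0.4, 0.9])]

# Softmax turns the two outputs into class probabilities.
probs = softmax(hidden)
print(hidden, probs)
```

The second neuron's weighted sum is negative, so ReLU clips it to zero, and softmax then assigns the larger probability to the first output.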
Our second scheme is CNN (convolutional neural network) goodfellow2016deep, a widely used DL method which employs weight sharing and pooling schemes. Figure 3 illustrates the overall scheme of applying CNN to static warning analysis. Convolutional layers repeatedly slide a filter over the inputs to build a feature map; the principle is to look for correlation between the filter and the input feature matrix. Max pooling layers reduce the spatial size of the features by selecting the maximum value to represent a feature window. With weight sharing of filters and max pooling, CNNs can greatly reduce the number of parameters required in the training phase.
DNN_weighted is our third DL scheme. Its main structure is the same as the DNN mentioned above, but it also uses a weighting strategy. Table 2 shows that many of our data sets have unbalanced class distributions where our target class (actionable warnings) is very underrepresented (often less than 20%). To address this data imbalance problem, we reweight the minority (actionable) class. Specifically, we use the reciprocal of the class ratio to weight the loss function during the training phase. For instance, if the ratio of actionable samples in the training set is 0.25, the weighting scheme sets the weight of the actionable (minority) class to 4 and of the unactionable (majority) class to 1, to balance the contribution of the two classes to the training loss. Note that we used this reweighting scheme rather than some alternative method (e.g. duplicating instances of the minority class) since reusing many copies of one instance in the training set causes extra computational cost shalev2014understanding.
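The weighting rule from the worked example can be written as a short helper. This sketch assumes labels are coded as 1 for actionable and 0 for unactionable and that the actionable class is the minority; the function name is hypothetical.

```python
def reciprocal_class_weights(labels):
    """Weight the actionable class (1) by the reciprocal of its share of
    the training set; the unactionable class (0) keeps weight 1."""
    ratio = sum(labels) / len(labels)
    if ratio == 0:
        raise ValueError("no actionable examples to reweight")
    return {0: 1.0, 1: 1.0 / ratio}


# 25 actionable warnings out of 100, i.e. a ratio of 0.25.
labels = [1] * 25 + [0] * 75
weights = reciprocal_class_weights(labels)
print(weights)
```

A dictionary of this shape is the form that the `class_weight` argument of Keras's `fit` method accepts, so each training example's loss is scaled by the weight of its class.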
3.2 Statistical Tests
To select the “best” learning methods, this paper follows the advice of Rosenthal et al. rosenthal1994parametric. Specifically, given that all our numbers lie within 0..1, our experimental results are not prone to extreme outlier effects. Such extreme outliers are indicators of long-tail effects which, in turn, suggest that it might be better to use nonparametric methods. That is not ideal since nonparametric tests have less statistical power than parametric ones.
Rosenthal et al. discuss different parametric methods for asserting that one result is within some small effect of another (i.e. it is “close to”). They list dozens of effect size tests that divide into two groups: the $r$ family, which is based on the Pearson correlation coefficient; and the $d$ family, which is based on absolute differences normalized by (e.g.) the size of the standard deviation. Rosenthal et al. comment that “none is intrinsically better than the other”. Our paper uses the most direct method: using a $d$ family method, it can be concluded that one distribution is the same as another if their mean values differ by less than Cohen’s delta (= 30% * standard deviation). Note that this delta is computed separately for each evaluation measure (recall, false alarm, AUC).
To visualize that “close to” analysis, in all our results:

Any cell that is within Cohen’s delta of the best value will be highlighted in gray. All gray cells are regarded as “winners” and all the other cells are “losers”.

For recall and AUC, the “best” cells have “highest value” since the optimization goal is to maximize these values.

For false alarm, the “best” cells have “lowest value” since false alarms is to be minimized.
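The gray-cell rule can be sketched as follows. This is an illustration under an explicit assumption: the standard deviation is taken over the mean scores of all learners for one measure on one data set, which the text does not spell out. The function name and the scores are hypothetical.

```python
from statistics import stdev


def winners(results, higher_is_better=True, fraction=0.3):
    """Return the learners whose score is within fraction * std. dev.
    of the best score for this measure.

    Assumption: the standard deviation is computed over the scores of
    all learners being compared.
    """
    values = list(results.values())
    delta = fraction * stdev(values)
    best = max(values) if higher_is_better else min(values)
    return {name for name, v in results.items() if abs(v - best) <= delta}


# Hypothetical mean scores for four learners on one data set.
recall = {"svm": 0.90, "dnn": 0.89, "cnn": 0.80, "rf": 0.60}
false_alarm = {"svm": 0.05, "dnn": 0.06, "cnn": 0.15, "rf": 0.30}

print(winners(recall))                               # maximize recall
print(winners(false_alarm, higher_is_better=False))  # minimize false alarms
```

Learners returned by `winners` correspond to gray cells; everything else is a white cell.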
3.3 Results
In the text Empirical AI, Cohen advises that any method that uses a random number generator must be run multiple times, to allow for any effects introduced by the random number seed. For deterministic models, the same output is always produced for a given input. To dispel the bias between deterministic and nondeterministic models and eliminate the bias of uncertainty:

Ten times, we shuffled the training and test data into some random order.

Each time, the test data was divided into five bins, taking care to implement stratified sampling; i.e. ensuring that the class distribution of the whole data is replicated within each bin.

For each 20% test bin, we learned a model using 100% of the training set.
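The binning step of this rig can be sketched in plain Python. This is a minimal illustration, not the authors' script: it assumes 0/1 labels, deals the shuffled rows of each class out to the bins in turn, and uses hypothetical names and data.

```python
import random


def stratified_bins(labels, n_bins=5, seed=0):
    """Split row indices into n_bins bins that each mirror the overall
    class distribution."""
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for cls in sorted(set(labels)):
        rows = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(rows)
        for k, row in enumerate(rows):
            bins[k % n_bins].append(row)  # deal rows out round-robin
    return bins


# Hypothetical test set: 10 actionable and 40 unactionable warnings.
labels = [1] * 10 + [0] * 40

# Ten repeats, each with a different random order.
runs = [stratified_bins(labels, seed=s) for s in range(10)]

for test_bin in runs[0]:
    actionable = sum(labels[i] for i in test_bin)
    print(len(test_bin), actionable)
```

Each of the 10 x 5 bins would then serve as one test sample for a model learned from the full training set.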
Table 4 shows the results of our experiment rig. The gray cells show results that are either (a) the best values or (b) are as good as the best. Counting the winning gray cells and the other white cells, we can see that:

Linear SVM is often preferred (lower false alarms, higher recall and AUC).

The tree learners have many white cells; i.e. they perform worse than best.

The deep learners (DNN weighted, CNN, DNN) are often gray, but not as often as linear SVM.
Hence we say that linear SVM has the best allaround performance.
Another reason to prefer SVMs over deep learners is shown in Table 6. This table shows the runtimes of our different learners: deep learners were very much slower than the other learners (which ran at least 20 times faster).
Note that, compared with Table 3, our AUC results shown in Table 4 and Table 5 are much better than Wang et al.’s, which we explain as follows. Firstly, the default parameters in Weka (used by Wang et al.) are different to those used in SciKitLearn (the tool employed in our paper).
Secondly, we use a different SVM to Wang et al. In Table 4, Random Forest performs best among the baseline models from the perspective of AUC, which is consistent with Wang et al. The SVM results, however, differ significantly due to the different choice of kernel. (We also conducted an experiment on SVM with an RBF kernel and got a median AUC of 0.5.)
In summary, we can endorse the use of linear SVM in this domain, but not deep learners or tree learners.
4 Why Such Similar Performance?
A question raised by the above results is why different learners perform so similarly on all these data sets. Accordingly, this section explores that issue.
We will argue that the above results illustrate the Principle of Parsimony of Vandekerckhove et al. They warn that unnecessarily sophisticated models can damage the generalization capability of classifiers vandekerckhove2015model. This principle is a strategy that warns against overfitting (and is a fundamental principle of model selection). It suggests that simpler models are preferred over complex ones if those models obtain similar performance.
A convincing demonstration of the Principle of Parsimony has two parts:

We must show some damage to the generalization capability of a complex classifier. For example, in the above, we found that even though deep learner’s automatic feature engineering may account for irrelevant particulars (like noise in the data), they did not perform better than linear SVM.

We must also show that the data set has only very few dimensions; i.e. a complex learner is exploring an inherently simple set of data. In the rest of this section, using an intrinsic dimensionality calculator, we will show that the intrinsic dimensionality of our static warning data sets is never more than two and usually is less.
To say all that another way, since the problem explored in our study is inherently low dimensional, it is hardly surprising that the sophistication of deep learning was not useful in this domain.
4.1 What is “Intrinsic Dimensionality”?
Levina et al. levina2005maximum comment that the reason any data mining method works for high dimensions is that data embedded in a high-dimensional format can actually be converted into a more compressed space without major information loss. A traditional way to compute these intrinsic dimensions is PCA (Principal Component Analysis). But Levina et al. caution that, as real-world data becomes increasingly sophisticated and not linearly decomposable, PCA methods tend to overestimate the dimensions of a data set levina2005maximum.
Instead, Levina et al. propose a fractal-based method for calculating intrinsic dimensionality (and that method is now a standard technique in other fields such as astrophysics). The intrinsic dimension of a dataset with N items is found by computing the number of items found within radius r (where r is the distance between two configurations) while varying r. This measures the intrinsic dimensionality since:

If the items spread out in only one dimension, then we will only find linearly more items as r increases.

But if the items spread out in, say, d > 1 dimensions, then we will find polynomially more items as r increases.
As shown in Equation 1, Levina et al. normalize the number of items found according to the number of items being compared. They recommend reporting the number of intrinsic dimensions as the maximum value of the slope of $\ln C(r)$ versus $\ln r$, where $C(r)$ is computed as follows. Note that Equation 1 uses the L1-norm to calculate distance rather than the Euclidean L2-norm. As seen in Table 2, our raw data has up to 60 dimensions. Aggarwal et al. Aggarwal01 advise that for such high-dimensional data, L1 performs better than L2.
$$C(r) = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j=i+1}^{N} I\left(\lVert x_i - x_j \rVert_1 \le r\right) \qquad (1)$$
Here $I(\cdot)$ is 1 if its argument holds and 0 otherwise.
For example, in Figure 4, the intrinsic dimensionality of the blue curve is 1.6, as approximated by its maximum slope (the orange line).
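The idea of counting pairs within a growing radius and reading off a slope can be sketched in plain Python. This is not the paper's Algorithm 1: it is a simplified sketch that assumes the radii are taken at a few fixed quantiles of the pairwise L1 distances, counts pairs at distance at most r (so that tied distances on the toy grids are included), and reports the largest slope between neighboring points on the log-log curve. All names and the toy data are hypothetical.

```python
import math


def l1_distance(a, b):
    """L1 (Manhattan) distance between two rows."""
    return sum(abs(x - y) for x, y in zip(a, b))


def intrinsic_dimension(rows, quantiles=(0.01, 0.02, 0.04, 0.08, 0.16, 0.32)):
    """Largest slope of ln C(r) versus ln r over a handful of radii.

    Assumption: radii are the distinct non-zero pair distances found at
    the given quantiles; other ways of choosing radii are possible.
    """
    n = len(rows)
    dists = sorted(l1_distance(rows[i], rows[j])
                   for i in range(n) for j in range(i + 1, n))
    radii = sorted({dists[int(q * len(dists))] for q in quantiles} - {0})
    points = []
    for r in radii:
        within = sum(1 for d in dists if d <= r)
        c = 2.0 * within / (n * (n - 1))  # normalized pair count C(r)
        points.append((math.log(r), math.log(c)))
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in zip(points, points[1:])]
    return max(slopes) if slopes else 0.0


# Three columns, but every row lies on one line (one underlying dimension).
line = [(k, 2 * k, 3 * k) for k in range(100)]
# Two columns that vary independently (two underlying dimensions).
plane = [(i, j) for i in range(10) for j in range(10)]

line_estimate = intrinsic_dimension(line)
plane_estimate = intrinsic_dimension(plane)
print(line_estimate, plane_estimate)
```

On these toy grids the estimate for the line is close to one even though it has three columns, while the estimate for the plane is clearly larger. The plane's estimate falls short of two because such a small grid has many points near its edges, so this sketch is best read as a way to compare data sets rather than as an exact dimension count.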
Figure: Results of applying Algorithm 1 to synthetic data sets containing varying numbers of columns of random variables. Algorithm 1 came close to the actual number of columns when that number was small. Above that point, the algorithm seems to underestimate the number of columns, an effect we attribute to the “shotgun correlation effect” reported by Courtney et al. courtney93 in 1993. They reported that, due to randomly generated spurious correlations, the correlation between random variables will increase with the number of variables. Hence it is not surprising that in the (e.g.) 40-column plot of this figure, we find fewer than 40 dimensions.
4.2 Intrinsic Dimensionality and Static Code Warnings
Table 7 shows the results of applying our intrinsic dimensionality calculator to the static code warning data. In that table, we observe that:

The size of the data set is not associated with intrinsic dimensionality. Evidence: our largest data set (Lucene) has the lowest intrinsic dimensionality.

The intrinsic dimensionality of our data is very low (median value of less than one, never more than two).
This paper is not the first to suggest that several SE data sets are low dimensional. Menzies et al. also review a range of strange SE results, all of which indicate that the effective number of dimensions of SE data is very low menzies07. Also, Agrawal et al. agrawal2019dodge argued that the space of performance scores generated from some software effectively divides into just a few dozen regions, which is a claim we could restate as: that space is effectively low dimensional. Further, Hindle et al. Hindle:2016 made an analogous argument that:
“Programming languages, in theory, are complex, flexible and powerful, but the programs that real people actually write are mostly simple and rather repetitive, and thus they have usefully predictable statistical properties that can be captured in statistical language models and leveraged for software engineering tasks.”
That said, Hindle, Agrawal, and Menzies et al. only show that there can be a benefit in exploring SE data with tools that exploit low dimensionality. None of that work makes the point made in this paper: that it can be harmful to explore low dimensional SE data with tools designed for synthesizing models from high dimensional spaces (such as deep learners).
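Table 7: Intrinsic dimensionality of the static code warning data sets, sorted from lowest to highest.
Dataset | Features | Intrinsic dimensionality | Training instances
lucene | 57 | 0.15 | 3259
phoenix | 44 | 0.62 | 2235
tomcat | 60 | 0.73 | 1435
derby | 58 | 0.78 | 2479
ant | 56 | 0.82 | 1229
commons | 39 | 1.04 | 725
mvn | 47 | 1.10 | 813
jmeter | 49 | 1.54 | 604
cass | 55 | 1.94 | 2584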
4.3 Summary
After applying Algorithm 1 to our data, we can assert that recognizing actionable static code warnings is an inherently low dimensional problem. Specifically, our datasets can be characterized with fewer than two dimensions, as reported in Table 7. Hence, we believe that the reason deep learning performs so similarly to, or even worse than, conventional learners for static code warnings is that it is a very big hammer being applied to a very small nail.
5 Discussion
5.1 Threats to Validity
Sampling bias. In terms of threats to validity, our first comment is that all our conclusions are based on the data explored in the above experiments. For future work, we need to repeat this analysis using different data sets.
Our second comment is that while we deprecate deep learning, that warning only applies to low dimensional data. Deep learning is very useful for very high dimensional problems; e.g. vision systems in autonomous cars.
Measurement bias. To evaluate the efficiency of our learners, we employ three commonly used measurement metrics in the SE area: recall, false alarm, and AUC. There exist many other metrics widely adopted by the SE community, like F1 score, G measure and so forth. For the same research question, different conclusions may be drawn by using different evaluation metrics. In future work, we would use other evaluation metrics to have a more comprehensive analysis.
Parameter bias. This paper used the default settings for our learners (exception: we adjusted the number of epochs used in our deep learners). Recent work Tantithamthavorn16; agrawal2018better; agrawal2019dodge has shown that these defaults can be improved via hyperparameter optimization (i.e., learners applied to learners to learn better settings for the control parameters). In this study, we found that even with the default parameters we could outperform deep learning and prior state-of-the-art results wang2018there. Hence, we leave hyperparameter optimization for future work.
Learner bias. One of the most important threats to validity is learner bias, since there is no theoretical reason that any learner outperforms others in all test cases. Wolpert et al. wolpert1997no and Tu et al. Tu18Tuning proposed that no learner necessarily works better than others for all possible optimization problems. Moreover, there also exist many other DNN models developed in the deep learning revolution. Different models show significant advantages in different tasks. For instance, LSTM is utilized in Google Translate to translate between more than 100 languages efficiently, while CNN is widely used in tasks that analyze visual imagery. In this case, researchers may find that other deep neural networks work better on SE tasks. For future work, we need to repeat this analysis using different learners.
5.2 Future Work
In future work, it would be interesting to do more comparative studies of SE data using deep learning versus other kinds of learners. Those studies should pay particular attention to the issue raised here; i.e. does DL match the complexity of datasets in other SE areas?
Another interesting avenue for future work is whether we can exploit the deep learning effect described above to generate a new generation of better learners. In the literature, nonlinear mapping methods that can project complex features into a lower dimensional space are widely explored in the areas of statistics and computer vision krizhevsky2012imagenet. Such feature reduction can significantly reduce the computational overhead brought by complex algorithms such as DNN models. Therefore, the implementation of nonlinear feature mapping might dispel the concern of SE researchers and practitioners caused by the overwhelming running cost of deep learning models on big datasets (as well as contribute to the promotion of deep learning in the SE area). A comprehensive implementation of nonlinear feature mapping is left to future work.
6 Conclusion
Static code analysis tools produce many false positives which many programmers ignore. Such tools can be augmented with data mining algorithms to prune away the spurious reports, leaving behind just the warnings that cause programmers to take action to change their code. As seen by the above results, such data miners can be remarkably effective (and exhibit very low false alarm rates, very high AUC results, and respectably high recall results).
In this paper, we perform an empirical experiment that applies tree learners, linear SVM, and deep learning (with early stopping) to the task of predicting actionable static warnings on nine software projects. We find that deep learners mismatch the complexity of our static warning datasets and incur a high running cost. Using a dimension reduction algorithm, our static warning datasets are reported as inherently low dimensional. As suggested by the Principle of Parsimony, it is detrimental to employ sophisticated models (like deep learning) on data that is inherently low dimensional (like the data explored here). Hence, we endorse the use of linear SVM for predicting which static code warnings are actionable.
For future work in software analytics, we strongly suggest that analysts match the complexity of their analysis tools to the underlying complexity of their research problem.
7 Acknowledgment
This work was partially funded by an NSF award #1703487.