How to Recognize Actionable Static Code Warnings (Using Linear SVMs)

05/31/2020 ∙ by Xueqi Yang, et al. ∙ NC State University ∙ IEEE

Static code warning tools often generate warnings that programmers ignore. Such tools can be made more useful via data mining algorithms that select the "actionable" warnings; i.e. the warnings that are usually not ignored. But what is the best way to build those selection algorithms? To answer that question, we learn predictors for 5,675 actionable warnings seen in 31,058 static code warnings from FindBugs. Several data mining methods perform very well on this task. For example, linear Support Vector Machines achieved median recalls of 96% and AUCs of over 99%; other learners achieved similar results (usually, within 4%). On investigation, we found the reason all these learners perform very well: the data is intrinsically very simple. Specifically, while our data sets have up to 58 raw features, those features can be approximated by less than two underlying dimensions. For such intrinsically simple data, many different kinds of learners can generate useful models with similar performance. Based on the above, we conclude that it is both simple and effective to use data mining algorithms for selecting "actionable" warnings from static code analysis tools. Also, we recommend using linear SVMs to implement that selection process (since, at least in our sample, that learner ran relatively quickly and achieved the best all-around performance). Further, for any analytics task, it is important to match the complexity of the inference to the complexity of the data. For example, we would not recommend deep learning for finding actionable static code warnings since our data is intrinsically very simple.


1 Introduction

Figure 1: Static code analysis and FindBugs. From http://findbugs.sourceforge.net/.

Static code warnings comment on a range of potential defects such as common programming errors, code styling, in-line comments, common programming anti-patterns, style violations, and questionable coding decisions ayewah2008using. Static code warning tools are quite popular. For example, the FindBugs static code analysis tool (shown in Figure 1) has been downloaded over a million times.

One issue with static code warnings is that they generate a large number of false positives. Many programmers routinely ignore most static code warnings, finding them irrelevant or spurious wang2018there. Such warnings are considered "unactionable" since programmers never take action on them. Between 35% and 91% of the warnings from static analysis tools are known to be unactionable heckman2011systematic. Hence it is prudent to learn to recognize what kinds of warnings programmers usually act upon. With such a classifier, static code warning tools can be made more useful by first pruning away the unactionable warnings.

As shown in this paper, data mining methods can be used to generate very accurate models for this task. This paper searches for 5,675 actionable warnings within a sample of 31,058 static code warnings generated by FindBugs on nine open-source Java projects wang2018there. After our experiment (where we trained on an earlier release and then tested on the subsequent release), we built models (using linear SVM) that predicted actionable warnings with recalls over 87%, false alarms under 7%, and AUCs over 97%. These results are a new high watermark in this area of research since they outperform a prior state-of-the-art result (the so-called "golden set" approach reported at ESEM'18 by Wang et al. wang2018there).

Apart from making specific conclusions about static code warnings, our research offers another, more general, lesson about how to use data mining for software analytics. Complex tasks, like vision systems in autonomous cars, need complex learners (e.g. deep learning). But as shown in this paper, simpler problems might be better addressed using much simpler learners such as linear SVM. At least for this task, complex methods like deep learners ran far slower and performed no better than much simpler methods. This was somewhat surprising since, to say the least, there are many advocates of deep learning for software analytics (e.g. lin2018sentiment; guo2017semantically; chen2019mining; gu2016deep; nguyen2017exploring; choetkiertikul2018deep; zhao2018deepsim; white2016deep).

To understand why our problem did not demonstrate the superiority of deep learning, we looked again at our problem. We found that our data has a low “intrinsic dimensionality”. That is to say, while our data sets have up to 58 raw features, those features can be approximated by less than two underlying dimensions (for details on intrinsic dimensionality, and how it might be calculated, see §4.1). We conjecture that for such intrinsically simple data, the sophistication of deep learning is unnecessary.

The rest of this paper is structured as follows. The background to this work is introduced in Section 2. Our methodology is described in Section 3. In Section 3.3 and Section 4, we analyse our experimental results. Threats to validity and future work are discussed in Section 5. Our conclusions, drawn in Section 6, will be three-fold:

  1. It is possible and effective to augment static code warning tools with a post-processor that prunes away the warnings that programmers will ignore.

  2. Before selecting a data mining algorithm, always check the intrinsic dimensionality of the data.

  3. After checking the intrinsic dimensionality, match the complexity of the learner to the complexity of the problem.

To facilitate other researchers in this area, all our scripts and data are freely available on-line at https://github.com/XueqiYang/intrinsic_dimension.

2 Background

2.1 Studying Static Code Warnings

Static code warning tools detect potential defects in source code or executable files during software product development. The distinguishing feature of these tools is that they make their comments without reference to a particular input. Nor do they use feedback from any execution of the code being studied. Examples of these tools include PMD (https://pmd.github.io/latest/index.html), Checkstyle (https://checkstyle.sourceforge.io/), and the FindBugs (http://findbugs.sourceforge.net) tool featured in Figure 1.

As mentioned in the introduction, previous research shows that 35% to 91% of the warnings reported by static warning analysis tools can be ignored by programmers heckman2011systematic. This high false alarm rate is one of the most significant barriers to developers using these tools thung2015extent; avgustinov2015tracking; johnson2013don. Various approaches have been tried to reduce these false alarms, including graph theory boogerd2008assessing; bhattacharya2012graph, statistical models chen2005novel, and ranking schemes kremenek2004correlation. For example, Allier et al. allier2012framework proposed a framework to compare six warning ranking algorithms and identified the best algorithms to rank warnings. Similarly, Shen et al. shen2011efindbugs employed a ranking technique to sort true error reports before anything else. Other work prioritizes warnings by dividing the results into different categories of impact factors liang2010automatic or by analyzing software history kim2007prioritizing.

Category: Features

Warning combination: size content for warning type; size context in method, file, package; warning context in method, file, package; warning context for warning type; fix, non-fix change removal rate; defect likelihood for warning pattern; variance of likelihood; defect likelihood for warning type; discretization of defect likelihood; average lifetime for warning type

Code characteristics: method, file, package size; comment length; comment-code ratio; method, file depth; method callers, callees; methods in file, package; classes in file, package; indentation; complexity

Warning characteristics: warning pattern, type, priority, rank; warnings in method, file, package

File history: latest file, package modification; file, package staleness; file age; file creation; deletion revision; developers

Code analysis: call name, class, parameter signature, return type; new type, new concrete type; operator; field access class, field; catch; field name, type, visibility, is static/final; method visibility, return type, is static/final/abstract/protected; class visibility, is abstract/interface/array class

Code history: added, changed, deleted, growth, total, percentage of LOC in file in the past 3 months; added, changed, deleted, growth, total, percentage of LOC in file in the last 25 revisions; added, changed, deleted, growth, total, percentage of LOC in package in the past 3 months; added, changed, deleted, growth, total, percentage of LOC in package in the last 25 revisions

Warning history: warning modifications; warning open revision; warning lifetime by revision, by time

File characteristics: file type; file name; package name

Table 1: Categories of Wang et al. wang2018there's selected features. Eight categories are shown; each is followed by the features explored by Wang et al. (95 in total), 23 of which form the "golden set".

Another approach, and the one taken by this paper, utilizes machine learning algorithms to recognize which static code warnings programmers will act upon wang2016automatically; shivaji2009reducing; hanam2014finding. For example, when Heckman et al. applied 15 learning algorithms to 51 features derived from static analysis tools, they achieved recalls of 83-99% (averaged across 15 data sets) heckman2009model.

2.2 Wang et al.’s “Golden Set”

The data for this paper comes from a recent study by Wang et al. wang2018there. They conducted a systematic literature review to collect all publicly available static code features generated by widely-used static code warning tools (116 features in total):

  • All the values of these collected features were extracted from warning reports generated by FindBugs on 60 revisions of 12 projects.

  • Six machine learning classifiers were then employed to automatically identify actionable static warnings (random forests, decision trees, a boosting algorithm, naive bayes, linear regression, and support vector machines).

  • After applying a greedy backward selection algorithm to eliminate ineffective features from the results of those learners, they isolated 23 features as the most useful ones for identifying actionable warnings.

  • They called these features the "golden set"; i.e. the features most important for recognizing actionable static code warnings.

To the best of our knowledge, this is the most exhaustive research about static warning characteristics yet published.

Dataset | Features | Training instances | Training actionable (%) | Test instances | Test actionable (%)
commons | 39 | 725 | 7 | 786 | 5
phoenix | 44 | 2235 | 18 | 2389 | 14
mvn | 47 | 813 | 8 | 818 | 3
jmeter | 49 | 604 | 25 | 613 | 24
cass | 55 | 2584 | 15 | 2601 | 14
ant | 56 | 1229 | 19 | 1115 | 5
lucene | 57 | 3259 | 37 | 3425 | 34
derby | 58 | 2479 | 9 | 2507 | 5
tomcat | 60 | 1435 | 28 | 1441 | 23
Table 2: Summary of data distribution.

As shown in Table 1, the “golden set” features fall into eight categories. These features are the independent variables used in this study.

To assign dependent labels, we applied the methods of Liang et al. liang2010automatic. They defined a specific warning as actionable if it is closed in a later revision interval. Otherwise, it is labeled as unactionable. Also, following Liang et al., anything labeled a "minor alert" is deleted and ignored.

By analyzing FindBugs output from two consecutive releases of nine software projects, then collecting the features of Table 1, then applying the Liang et al. definitions, we created the data of Table 2. In this table, the "training set" refers to the earlier release and the "test set" refers to the later release. In this study, we only employ the two latest releases of each project.

Note that, for any particular data set, the 23 golden features of Table 1 can grow to more than 23 columns. For example, consider the "return type" feature in the "code analysis" category. This can include numerous return types such as void, int, URL, boolean, string, printStream, file, or date (or a list of any of these), as illustrated by the sketch below. Hence, as shown in Table 2, the number of features in our data varied from 39 to 60.
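For illustration only (this is our own example, not necessarily the exact encoding used by Wang et al. or FindBugs), the following sketch shows how one categorical feature such as "return type" can expand into several columns:

```python
# Illustration only: a single categorical "return type" feature expands into
# one column per observed value (one-hot encoding); data sets with more
# observed values therefore end up with more columns.
import pandas as pd

warnings = pd.DataFrame({"return_type": ["void", "int", "boolean", "String", "void"]})
print(pd.get_dummies(warnings, columns=["return_type"]))
```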

Note also that one way to summarize the results of this paper is that the golden set is an inaccurate, verbose description of the attributes required to detect actionable static code warnings. As shown below, hiding within the 23 golden features of Table 1 there exist at most two synthetic dimensions, which can be found via a linear SVM.

2.3 Evaluation Metrics

Wang et al. reported their results in terms of AUC and running time:

  • AUC (Area Under the ROC Curve) measures the two-dimensional area under the Receiver Operator Characteristic (ROC) curve witten2016data; heckman2011systematic. It provides an aggregate evaluation of performance across all possible classification thresholds, and hence reports the overall discrimination of a classifier wang2018there. This is a widely adopted measurement in software engineering, especially for imbalanced data liang2010automatic.

  • Running time measures the efficiency of the execution of one algorithm. In this paper, we use the running time of one run, from the start to the termination of algorithm execution, to compare the efficiency of different models.

Table 3 shows the AUC results achieved by Wang et al. wang2018there. In summary, Wang et al. reported Random Forest as the best learner to identify actionable static warnings.

In the software analytics literature, it is also common to assess learners via recall and false alarms:

  • Recall represents the ability of an algorithm to identify instances of the positive (actionable) class in the given data set. It denotes the ratio of detected actionable warnings to the total number of actionable warnings in the data set generated by static warning tools such as FindBugs.

  • False Alarms (pf) measures the proportion of warnings falsely classified by an algorithm as positive (actionable) that are actually negative (unactionable). This is an important index for measuring the efficiency of a defect prediction model.

In the following, we will report results for all of these four evaluation measures.
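To make these definitions concrete, here is a minimal sketch (our own helper, not the authors' script) of how all four measures could be computed with scikit-learn for one train/test split:

```python
import time
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(model, X_train, y_train, X_test, y_test):
    start = time.time()
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, predicted, labels=[0, 1]).ravel()
    # use a continuous score for AUC when the learner provides one
    if hasattr(model, "decision_function"):
        score = model.decision_function(X_test)
    else:
        score = model.predict_proba(X_test)[:, 1]
    return {"recall": tp / (tp + fn),          # detected actionable / all actionable
            "pf": fp / (fp + tn),              # false alarm rate
            "auc": roc_auc_score(y_test, score),
            "runtime": time.time() - start}    # seconds, start to termination
```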

Project | Random Forest | Decision Tree | SVM (RBF)
derby | 43 | 44 | 50
mvn | 45 | 45 | 50
lucene | 98 | 98 | 50
phoenix | 71 | 70 | 62
cass | 70 | 69 | 67
jmeter | 86 | 82 | 50
tomcat | 80 | 64 | 50
ant | 44 | 44 | 50
commons | 57 | 56 | 50
median | 70 | 64 | 50
Table 3: %AUC results reported in the prior state-of-the-art wang2018there using the proposed golden feature set.

2.4 Learning to Recognize Actionable Static Code Warnings

Recall from the above that our data has two classes: actionable and non-actionable. Technically speaking, our task is a binary classification problem. A recent survey by Ghotra et al. ghotra2015revisiting found that, for software analytics, the performance of dozens of binary classifiers clusters into a handful of groups. Hence, by taking one classifier from each group, it is possible for just a few classifiers to act as representatives for a wide range of commonly used classifiers.

Decision trees quinlan1987generating seek splits to feature ranges that most minimize the diversity of classes within each split. Once the best “splitter” is found, decision tree algorithms recurse on each split.

Random forests breiman1999random take the idea of decision trees one step further. Instead of building one tree, random forests build multiple trees (each time using a small random sample of the rows and columns from the original data). The final conclusion is then computed by a majority vote across all trees in the forest.

Support vector machines cortes1995support take another approach. Using a kernel function, the data is mapped into a higher-dimensional space. Then, using quadratic programming, the algorithm finds the "support vectors"; i.e. the instances closest to the boundary that distinguishes the different classes.
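The following sketch shows how the three representative learners above might be instantiated; we assume scikit-learn (the library used later in §3.1) with its default settings, and LinearSVC as the realization of "linear SVM":

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

learners = {
    "Decision Tree": DecisionTreeClassifier(),   # recursive best-splitter on feature ranges
    "Random Forest": RandomForestClassifier(),   # majority vote over many randomized trees
    "Linear SVM":    LinearSVC(),                # support vectors with a linear boundary
}
```

Each of these could then be passed to an evaluation helper such as the evaluate sketch in §2.3.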

2.5 Deep Learning

Since the Ghotra et al. survey ghotra2015revisiting was published in 2015, there has been much recent interest in the application of deep learning (DL) neural networks in software engineering. Applications of DL include bug localization huo2019deep, sentiment analysis lin2018sentiment; guo2017semantically, API mining chen2019mining; gu2016deep; nguyen2017exploring, effort estimation for agile development choetkiertikul2018deep, code similarity detection zhao2018deepsim, code clone detection white2016deep, etc.

Deep neural networks are layers of connected units called neurons. The basic mechanism of a fully connected DNN model is shown in Figure 2. For this paper, SE artifacts are transformed into vectors and fed into the neural network at the input layer. Each neuron in the hidden and output layers multiplies its inputs by the weights of that neuron. The products are then summed and passed through a nonlinear transfer function, called the activation function, to yield a value that either serves as input to the next layer or as the final output of the network goh1995back.

Figure 2 illustrates a layered architecture of neurons where the inputs at layer i are organized and synthesized into the inputs at layer i+1 by the non-linear transformations mentioned above. DL is known as an automatic feature engineering model that efficiently extracts the non-linear and sophisticated patterns generally observed in the real world, such as speech, video and audio. For instance, technology-intensive companies like Google and Facebook are utilizing massive volumes of raw data for commercial data analysis najafabadi2015deep. Within that layered architecture, only the most important signals from the inputs of one layer make it through to the next. In this way, DL automates "feature engineering", i.e. the synthesis of important new features using some part or combination of other features. This, in turn, means that predictors can be learned from very complex input signals with multiple features, without requiring manual pre-processing. For example, Lin et al. Lin2017StructuralDD replaced their mostly manual analysis of features extracted from a wavelet package with a deep learner that automatically synthesized significant features.

DL trains its networks by running its data repeatedly through the networks shown in Figure 2 over multiple "epochs". Each epoch pushes all the data, batch by batch, through the network, and the resulting error on the output layer is computed. This repeats until the training error, or the loss function on the validation set, is minimized. Error minimization is done via back propagation (BP). Parameters in DL (including neuron weights) are initialized randomly, and then these parameters are updated in each epoch of training using error back propagation. Hornik et al. hornik1991approximation have shown that, with sufficient hidden neurons, a single hidden layer back-propagation neural network can approximate any continuous function.

Figure 2: Illustration of DNN Model.

DL training may require hundreds to thousands of epochs for complicated problems. However, overtraining makes the model overfit the training dataset and generalize poorly to the test set. Early stopping zhang2016understanding is a commonly used optimization strategy and regularizer in deep learning, which improves generalization and prevents deep learning from overfitting. It stops training when performance on a validation dataset starts to degrade. We tried to prevent overfitting in our domain via early stopping. The maximum number of epochs is set to 100, and the patience of early stopping to 3; i.e. we stop training the DLs if performance does not improve for three consecutive epochs. After running our DLs, we could not improve performance after 8 to 30 epochs. Hence, all the results reported below come from 8 to 30 epochs.
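A minimal sketch of that early-stopping setup, assuming Keras (the library named in §3.1); here model, X_train and y_train are placeholders for any compiled Keras classifier and its training data, and the 20% validation split is our assumption:

```python
from tensorflow.keras.callbacks import EarlyStopping

# stop after 3 consecutive epochs without improvement, up to 100 epochs
stopper = EarlyStopping(monitor="val_loss", patience=3)
model.fit(X_train, y_train,
          validation_split=0.2,   # held-out data used to monitor the loss
          epochs=100,             # maximum number of epochs
          callbacks=[stopper])
```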

3 Experiments

3.1 Learning Schemes

For this study, the non-DL learners came from SciKit-Learn pedregosa2011scikit while the DL methods came from the Keras package geron2019hands. For the three non-DL learners (Random Forests, Decision Trees, linear Support Vector Machines), we used their default control settings from SciKit-Learn. As to deep learning, we ran three DL schemes. As suggested by the literature review of li2018deep, (fully-connected) deep neural networks (DNN) and convolutional neural networks (CNN) are the most explored DL models in the SE area.

The first scheme is a fully connected deep neural network (DNN). For a description of this method, see Section §2.5. Starting with the defaults from Keras, we configure our DNN model as follows:

  • 5 fully connected layers (with 30 neurons per hidden layer), connected by dropout layers in between.

  • The activation functions for the hidden layers were implemented using the Relu function (a rectified linear unit, i.e. f(x) = max(0, x)). As a popular choice among the various activation functions, Relu is known for being fast to compute, converging quickly in practice, and having gradients that do not vanish when x > 0 holds (i.e. when the current neuron is activated) li2017convergence. Batch normalization layers are applied before each activation function to avoid internal covariate shift (as the distribution of parameters changes while training deep neural networks, the current layer has to constantly readjust to new distributions) ioffe2015batch.

  • As said above, actionable warning identification is a binary problem. That is, for any warning instance, its label is y ∈ {0, 1}, where 0 denotes that the warning is unactionable and 1 that it is actionable. Consequently, we use softmax as the activation function of the output layer. Softmax takes the vector generated by the last hidden layer as input, exponentiates each element, and maps the result onto a probability distribution over the candidate label classes. For each instance, the softmax output vector (p0, p1) always sums to 1, where p0 is the probability that the warning is unactionable and p1 the probability that it is actionable. (A sketch of this DNN configuration follows this list.)
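A minimal Keras sketch of the DNN configuration listed above (five fully connected layers of 30 neurons, dropout between layers, batch normalization before each ReLU, and a softmax output); the dropout rate, optimizer, and loss are our assumptions since the paper does not state them:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(n_features):
    model = keras.Sequential()
    model.add(layers.InputLayer(input_shape=(n_features,)))
    for _ in range(5):                                  # 5 fully connected hidden layers
        model.add(layers.Dense(30))                     # 30 neurons per hidden layer
        model.add(layers.BatchNormalization())          # applied before the activation
        model.add(layers.Activation("relu"))
        model.add(layers.Dropout(0.2))                  # assumed dropout rate
    model.add(layers.Dense(2, activation="softmax"))    # (p0, p1): unactionable vs. actionable
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```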

Figure 3: Overview of CNN Model in Static Warning Identification.

Our second scheme is a CNN (convolutional neural network) goodfellow2016deep, a widely used DL method that employs weight sharing and pooling schemes. Figure 3 illustrates the overall scheme of applying a CNN to static warning analysis. Convolutional layers repeatedly slide a filter over the inputs to build a feature map; in essence, they look for correlations between the filter and the input feature matrix. Max pooling layers then reduce the spatial size of the features by selecting the maximum value to represent each feature window. With the weight sharing of filters and max pooling, CNNs greatly reduce the number of parameters required in the training phase. A sketch of such a network appears below.
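A minimal Keras sketch of such a CNN; it treats the tabular warning features as a length-n_features sequence with one channel, and the filter sizes and layer counts are our assumptions (Figure 3 shows the authors' actual architecture):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(n_features):
    model = keras.Sequential()
    model.add(layers.InputLayer(input_shape=(n_features, 1)))        # features as a 1-D "sequence"
    model.add(layers.Conv1D(32, kernel_size=3, activation="relu"))   # shared-weight filters
    model.add(layers.MaxPooling1D(pool_size=2))                      # keep the max of each window
    model.add(layers.Conv1D(16, kernel_size=3, activation="relu"))
    model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dense(2, activation="softmax"))                 # unactionable vs. actionable
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Note that, with this formulation, the inputs would need to be reshaped to (samples, n_features, 1) before training.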

DNN_weighted is our third DL scheme. Its main structure is the same as the DNN mentioned above, but it also uses a weighted strategy. Table 2 shows that many of our data sets have unbalanced class distributions where our target class (actionable warnings) is very under-represented (often less than 20%). To address this data imbalance problem, we re-weight the minority (actionable) class. Specifically, we use the reciprocal of the class ratio to weight the loss function during the training phase. For instance, if the ratio of actionable samples in the training set is 0.25, the weighting scheme sets the weight of the actionable (minority) class to 4 and of the unactionable (majority) class to 1, balancing the contribution of the two classes to the training loss. Note that we used this reweighting scheme rather than some alternative method (e.g. duplicating instances of the minority class) since reusing many copies of one instance in the training set incurs extra computational cost shalev2014understanding.
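A minimal sketch of that reweighting, following the worked example above (majority class weighted 1, minority class weighted by the reciprocal of its ratio); the helper name is ours:

```python
import numpy as np

def actionable_class_weights(y_train):
    """Weight the actionable (minority) class by the reciprocal of its ratio."""
    y = np.asarray(y_train)
    ratio_actionable = (y == 1).mean()     # e.g. 0.25 in the example above
    return {0: 1.0,                        # unactionable (majority)
            1: 1.0 / ratio_actionable}     # actionable (minority), e.g. 4.0

# e.g. model.fit(X_train, y_train, class_weight=actionable_class_weights(y_train), ...)
```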

3.2 Statistical Tests

To select the "best" learning methods, this paper takes the advice of Rosenthal et al. rosenthal1994parametric. Specifically, since all our numbers fall within 0..1, our experimental results are not prone to extreme outlier effects. Such extreme outliers are indicators of long-tail effects which, in turn, would suggest using non-parametric methods; that would not be ideal since non-parametric tests have less statistical power than parametric ones.

Rosenthal et al. discuss different parametric methods for asserting that one result is within some small effect of another (i.e. it is "close to" it). They list dozens of effect size tests that divide into two groups: the r family, based on the Pearson correlation coefficient, and the d family, based on absolute differences normalized by (e.g.) the size of the standard deviation. Rosenthal et al. comment that "none is intrinsically better than the other". The most direct method is used in this paper: using a d family method, we conclude that one distribution is the same as another if their mean values differ by less than Cohen's delta (d = 30% * standard deviation). Note that d is computed separately for each evaluation measure (recall, false alarm, AUC).

To visualize that “close to” analysis, in all our results:

  • Any cell that is within d of the best value will be highlighted in gray. All gray cells are treated as "winners" and all the other cells as "losers" (see the sketch below).

  • For recall and AUC, the "best" cells have the highest value, since the goal is to maximize these values.

  • For false alarm, the "best" cells have the lowest value, since false alarms are to be minimized.
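A minimal sketch of that "close to" test (our own helper); a cell counts as a winner if it lies within Cohen's delta, i.e. 0.3 times the standard deviation of the results for that measure, of the best value (computing the standard deviation over the scores passed in is our assumption):

```python
import numpy as np

def winners(results, larger_is_better=True):
    """results: {learner name: score} for one evaluation measure."""
    values = np.array(list(results.values()), dtype=float)
    delta = 0.3 * values.std()                        # Cohen's delta
    best = values.max() if larger_is_better else values.min()
    return {name: abs(score - best) <= delta          # True => gray "winner" cell
            for name, score in results.items()}

# recall/AUC: winners(scores); false alarm: winners(scores, larger_is_better=False)
```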

3.3 Results

In his text on empirical AI, Cohen advises that any method that uses a random number generator must be run multiple times, to allow for any effects introduced by the random number seed. Deterministic models, on the other hand, always produce the same output for a given input. To remove any bias between deterministic and non-deterministic models, and to control for this source of uncertainty:

  • Ten times, we shuffled the training and test data into some random order.

  • Each time, the test data was divided into five bins, taking care to use stratified sampling; i.e. ensuring that the class distribution of the whole data is replicated within each bin.

  • For each 20% test bin, we learned a model using 100% of the training set. (A sketch of this rig appears below.)
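A minimal sketch of that rig (our own code); make_learner is assumed to build a fresh learner and evaluate is, e.g., the helper sketched in §2.3:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def run_rig(make_learner, X_train, y_train, X_test, y_test, repeats=10):
    scores = []
    rng = np.random.RandomState()
    for _ in range(repeats):
        # shuffle the training and test data into some random order
        tr, te = rng.permutation(len(y_train)), rng.permutation(len(y_test))
        Xtr, ytr = X_train[tr], y_train[tr]
        Xte, yte = X_test[te], y_test[te]
        # divide the test data into five stratified bins
        for _, bin_idx in StratifiedKFold(n_splits=5).split(Xte, yte):
            model = make_learner()      # fresh learner, trained on 100% of the training set
            scores.append(evaluate(model, Xtr, ytr, Xte[bin_idx], yte[bin_idx]))
    return scores
```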

Table 4: Summary results of recall, false alarm and AUC on nine datasets. Cells in gray denote the "best" results in each row, where "best" means within d of the best value (and d is calculated as per §3.2).
Table 5: Comparing median results and IQR of recall, false alarm and AUC. Cells in gray denote the "best" median results in each row, where "best" means within d of the best value in each row (and d is calculated as per §3.2).
Table 6: Comparing the running times of the six learners on the nine projects, sorted by dataset size in descending order.

Table 4 shows the results of our experiment rig. The gray cells show results that are either (a) the best values or (b) are as good as the best. Counting the winning gray cells and the other white cells, we can see that:

  • Linear SVM is often preferred (lower false alarms, higher recall and AUC).

  • The tree learners have many white cells; i.e. they often perform worse than the best.

  • The deep learners (DNN weighted, CNN, DNN) are often gray, but not as often as linear SVM.

Hence we say that linear SVM has the best all-around performance.

Another reason to prefer SVMs over deep learners is shown in Table 6. This table shows the runtimes of our different learners: the deep learners were very much slower than the other learners (by a factor of at least 20).

Note that, compared with Table 3, our AUC results shown in Table 4 and Table 5 are much better than Wang et al.'s, which we explain as follows. Firstly, the default parameters in Weka (used by Wang et al.) differ from those used in SciKit-Learn (the tool employed in our paper). Secondly, we use a different SVM to Wang et al. In Table 4, Random Forest performs best among the baseline models in terms of AUC, which is consistent with Wang et al., while the SVM results differ significantly due to the different choice of kernel. (We also ran an experiment on an SVM with an RBF kernel and obtained a median AUC of 0.5.)

In summary, we can endorse the use of linear SVM in this domain, but not deep learners or tree learners.

4 Why Such Similar Performance?

A question raised by the above results is why different learners perform so similarly on all these data sets. Accordingly, this section explores that issue.

We will argue that the above results illustrate Vandekerckhove et al.'s Principle of Parsimony. They warn that unnecessarily sophisticated models can damage the generalization capability of classifiers vandekerckhove2015model. This principle is a strategy that warns against overfitting (and is a fundamental principle of model selection). It suggests that simpler models are preferred over complex ones if those models obtain similar performance.

A convincing demonstration of the Principle of Parsimony has two parts:

  1. We must show some damage to the generalization capability of a complex classifier. For example, in the above, we found that even though deep learning's automatic feature engineering may account for irrelevant particulars (like noise in the data), the deep learners did not perform better than linear SVM.

  2. We must also show that the data set has only very few dimensions; i.e. that a complex learner is exploring an inherently simple set of data. In the rest of this section, using an intrinsic dimensionality calculator, we show that the intrinsic dimensionality of our static warning data sets is never more than two and is usually less.

To say all that another way, since the problem explored in our study is inherently low dimensional, it is hardly surprising that the sophistication of deep learning was not useful in this domain.

4.1 What is “Intrinsic Dimensionality”?

Levina et al. levina2005maximum comment that the reason any data mining method works for high dimensional data is that data embedded in a high-dimensional format can actually be converted into a more compressed space without major information loss. A traditional way to compute these intrinsic dimensions is PCA (Principal Component Analysis). But Levina et al. caution that, as real-world data becomes increasingly sophisticated and non-linearly decomposable, PCA methods tend to overestimate the dimensionality of a data set levina2005maximum.

Instead, Levina et al. propose a fractal-based method for calculating intrinsic dimensionality (a method that is now a standard technique in other fields such as astrophysics). The intrinsic dimension of a dataset with N items is found by counting the number of pairs of items that lie within radius r of each other, while varying r. This measures the intrinsic dimensionality since:

  • If the items spread out in only one dimension, then we will only find linearly more items as r increases.

  • But if the items spread out in, say, d dimensions, then we will find polynomially (on the order of r^d) more items as r increases.

As shown in Equation 1, Levina et al. normalize the number of items found by the number of pairs being compared. They recommend reporting the number of intrinsic dimensions as the maximum value of the slope of log C(r) vs. log r, where C(r) is computed as follows. Note that Equation 1 uses the L1-norm to calculate distance rather than the Euclidean L2-norm. As seen in Table 2, our raw data has up to 60 dimensions. Aggarwal et al. Aggarwal01 advise that for such high dimensional data, L1 performs better than L2.

Figure 4: Intrinsic dimensionality is the maximum slope of the smoothed blue curve of log C(r) vs. log r (see the orange line).

C(r) = \frac{2}{N(N-1)} \sum_{1 \le i < j \le N} I\left[ \lVert x_i - x_j \rVert_1 < r \right]    (1)

For example, in Figure 4, the intrinsic dimensionality of the blue curve is 1.6, approximated by the maximum slope, which is shown by the orange line.

  # Input: dataset X with N rows (imported from Testdata.py), candidate radii R
  for r in R do
      count ← 0
      for each pair (x_i, x_j) with i < j do
          if ||x_i − x_j||_1 < r then        # L1 distance
              count ← count + 1
          end if
      end for
      C(r) ← 2 · count / (N(N−1))            # normalized pair count within r
  end for
  smooth the curve of log C(r) vs. log r
  return the maximum slope as the intrinsic dimensionality
Algorithm 1 Intrinsic Dimension by Box-counting Method
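A minimal Python sketch of Algorithm 1 (our own re-implementation, consistent with Equation 1; the number of radii and the smoothing window are our assumptions):

```python
import numpy as np
from scipy.spatial.distance import pdist

def intrinsic_dimension(X, n_radii=20):
    """Box-counting estimate of intrinsic dimensionality using L1 distances."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dists = pdist(X, metric="cityblock")                  # pairwise L1 distances
    radii = np.logspace(np.log10(dists[dists > 0].min()),
                        np.log10(dists.max()), n_radii)
    # C(r) = 2/(n(n-1)) * number of pairs closer than r   (Equation 1)
    C = np.array([(dists < r).sum() for r in radii]) * 2.0 / (n * (n - 1))
    keep = C > 0
    log_r, log_C = np.log(radii[keep]), np.log(C[keep])
    log_C = np.convolve(log_C, np.ones(3) / 3, mode="same")   # smooth the curve
    slopes = np.diff(log_C) / np.diff(log_r)
    return slopes.max()                                   # maximum slope = intrinsic dimension

# Sanity check in the spirit of Figure 5:
# intrinsic_dimension(np.random.rand(1000, 5)) should return a value close to 5.
```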
Figure 5: Algorithm 1 works well for up to 20 intrinsic dimensions (panels show d = 5, 10, 20, 40, each with s = 1000 samples). To show that, we randomly filled 1000 rows of tables of data with d columns of random variables. Algorithm 1 came close to the actual value of d for d ≤ 20. Above that point, the algorithm seems to underestimate the number of columns, an effect we attribute to the "shotgun correlation" effect reported by Courtney et al. courtney93 in 1993. They reported that, due to randomly generated spurious correlations, the correlation between random variables increases with d. Hence it is not surprising that, in the d = 40 plot of this figure, we find fewer than 40 dimensions.

Algorithm 1 shows the intrinsic dimensionality calculator used in this paper. Note that this calculator uses Equation 1 with an L1-norm. Figure 5 displays a verification study which shows that this algorithm works well for up to 20 intrinsic dimensions.

4.2 Intrinsic Dimensionality and Static Code Warnings

Table 7 shows the results of applying our intrinsic dimensionality calculator to the static code warning data. In that table, we observe that:


  • The size of the data set is not associated with intrinsic dimensionality. Evidence: our largest data set (Lucene) has the lowest intrinsic dimensionality.

  • The intrinsic dimensionality of our data is very low (median value of less than one, never more than two).

This paper is not the first to suggest that several SE data sets are low dimensional. Menzies et al. review a range of strange SE results, all of which indicate that the effective number of dimensions of SE data is very low menzies07. Also, Agrawal et al. agrawal2019dodge argued that the space of performance scores generated from some software effectively divides into just a few dozen regions, a claim we could restate as saying that that space is effectively low dimensional. Further, Hindle et al. Hindle:2016 made an analogous argument that:

“Programming languages, in theory, are complex, flexible and powerful, but the programs that real people actually write are mostly simple and rather repetitive, and thus they have usefully predictable statistical properties that can be captured in statistical language models and leveraged for software engineering tasks.”

That said, Hindle, Agrawal, and Menzies et al. only show that there can be a benefit in exploring SE data with tools that exploit low dimensionality. None of that work makes the point made in this paper: that, for SE data, it can be harmful to explore low dimensional SE data with tools designed for synthesizing models from high dimensional spaces (such as deep learners).

Dataset | Original dimensionality | Intrinsic dimensionality | Instance counts
lucene | 57 | 0.15 | 3259
phoenix | 44 | 0.62 | 2235
tomcat | 60 | 0.73 | 1435
derby | 58 | 0.78 | 2479
ant | 56 | 0.82 | 1229
commons | 39 | 1.04 | 725
mvn | 47 | 1.10 | 813
jmeter | 49 | 1.54 | 604
cass | 55 | 1.94 | 2584
Table 7: Summary of the dimensionality of the nine datasets, calculated using Equation 1.

4.3 Summary

After applying Algorithm 1 to our data, we can assert that recognizing actionable static code warnings is an inherently low dimensional problem. Specifically, our datasets can be characterized with less than two dimensions, as reported in Table 7. Hence, we believe that the reason deep learning performs so similarly to (or even worse than) conventional learners for static code warnings is that it is a very big hammer being applied to a very small nail.

5 Discussion

5.1 Threats to Validity

Sampling bias. In terms of threats to validity, our first comment is that all our conclusions are based on the data explored in the above experiments. For future work, we need to repeat this analysis using different data sets.

Our second comment is that, while we deprecate deep learning here, that warning only applies to low dimensional data. Deep learning is very useful for very high dimensional problems; e.g. vision systems in autonomous cars.

Measurement bias: To evaluate the efficiency of our learners, we employ three commonly used measurement metrics in the SE area: recall, false alarm, and AUC. There exist many other metrics widely adopted by the SE community, such as the F1 score, G measure and so forth. For the same research question, different conclusions may be drawn using different evaluation metrics. In future work, we will use other evaluation metrics to provide a more comprehensive analysis.

Parameter bias: This paper used the default settings for our learners (exception: we adjusted the number of epochs used in our deep learners). Recent work Tantithamthavorn16; agrawal2018better; agrawal2019dodge has shown that these defaults can be improved via hyperparameter optimization (i.e. learners applied to learners to learn better settings for the control parameters). In this study, we found that even with the default parameters we could outperform deep learning and prior state-of-the-art results wang2018there. Hence, we leave hyperparameter optimization for future work.

Learner bias. One of the most important threats to validity is learner bias, since there is no theoretical reason why any learner should outperform all others in all test cases. Wolpert et al. wolpert1997no and Tu et al. Tu18Tuning argue that no learner necessarily works better than all others across all possible optimization problems. Moreover, many other DNN models have been developed during the deep learning revolution, and different models show significant advantages on different tasks. For instance, LSTMs are utilized in Google Translate to translate between more than 100 languages efficiently, while CNNs are widely used in tasks that analyze visual imagery. Hence, researchers may find that other deep neural networks work better on SE tasks. For future work, we need to repeat this analysis using different learners.

5.2 Future Work

In future work, it would be interesting to do more comparative studies of SE data using deep learning versus other kinds of learners. Those studies should pay particular attention to the issue raised here; i.e. does DL match the complexity of datasets in other SE areas?

Another interesting avenue for future work is whether we can exploit the deep learning effects described above (i.e. automatic non-linear feature synthesis) to generate a new generation of better learners. In the literature, non-linear mapping methods that project complex features into a lower dimensional space are widely explored in the areas of statistics and computer vision krizhevsky2012imagenet. Such feature reduction can significantly reduce the computational overhead brought by complex algorithms such as DNN models. Therefore, the implementation of non-linear feature mapping might dispel the concerns of SE researchers and practitioners about the overwhelming running cost of deep learning models on big datasets (as well as contribute to the promotion of deep learning in the SE area). A comprehensive implementation of non-linear feature mapping is left to future work.

6 Conclusion

Static code analysis tools produce many false positives which many programmers ignore. Such tools can be augmented with data mining algorithms to prune away the spurious reports, leaving behind just the warnings that cause programmers to take action to change their code. As seen by the above results, such data miners can be remarkably effective (and exhibit very low false alarm rates, very high AUC results, and respectably high recall results).

In this paper, we perform an empirical experiment applying tree learners, linear SVM, and deep learning (with early stopping) to the task of predicting actionable static warnings on nine software projects. We find that deep learners mismatch the complexity of our static warning datasets while incurring high running costs. Using a dimension reduction algorithm, our static warning datasets are shown to be inherently low dimensional. As suggested by the Principle of Parsimony, it is detrimental to employ sophisticated models (like deep learning) on data that is inherently low dimensional (like the data explored here). Hence, we endorse the use of linear SVM for predicting which static code warnings are actionable.

For future work in software analytics, we strongly suggest that analysts match the complexity of their analysis tools to the underlying complexity of their research problem.

7 Acknowledgment

This work was partially funded by an NSF award #1703487.

References