How to "DODGE" Complex Software Analytics?

02/05/2019 ∙ by Amritanshu Agrawal, et al. ∙ NC State University ∙ IEEE

AI software is still software. Software engineers need better tools to make better use of AI software. For example, for software defect prediction and software text mining, the default tunings for software analytics tools can be improved with "hyperparameter optimization" tools that decide (for example) how many trees are needed in a random forest. Hyperparameter optimization is unnecessarily slow when optimizers waste time exploring redundant options (i.e., pairs of tunings with indistinguishably different results). By ignoring redundant tunings, the DODGE(ε) hyperparameter optimization tool can run orders of magnitude faster, yet still find better tunings than prior state-of-the-art algorithms (for software defect prediction and software text mining).


1 Introduction

Software analytics is becoming increasingly complicated. Fisher et al. [13] define software analytics as a workflow that distills large quantities of low-value data into smaller sets of higher-value data. Due to the complexities and computational cost of SE analytics, they say “the luxuries of interactivity, direct manipulation, and fast system response are gone”. In fact, they characterize modern cloud-based analytics as a throwback to the 1960s: batch-processing mainframes where jobs are submitted and then analysts wait, wait, wait for results with “little insight into what’s really going on behind the scenes, how long it will take, or how much it’s going to cost”. Fisher et al. document issues seen by industrial data scientists, one of whom says:

“Fast iteration is key, but incompatible with jobs … in the cloud. It’s frustrating to wait for hours, only to realize you need a slight tweak…”

One impediment to fast iteration is the use of hyperparameter optimizers that automatically tune the control options of data miners. Table I lists some tuning options for data pre-processing and machine learning for two well-studied SE tasks:

  • Software defect prediction (classifying software modules into “buggy” or otherwise [2, 8, 15, 47, 30, 17]);

  • Software bug report text mining (to find severity [2, 36]).

If numeric options divide into 10 sub-ranges, then Table I has over a billion options. With enough CPU, automatic hyperparameter optimizers can prune those options to find tunings that improve the performance of software quality predictors [2, 15, 25, 45, 47, 58, 48, 36]. For example, Figure 1 shows an example where tuning converts some very bad learners into outstandingly good ones.

The problem with hyperparameter optimization is that tuning requires the evaluation of hundreds to millions of different tuning options. The cost of running a data miner through all those options is very high, requiring days to weeks to decades of CPU [49, 47, 52, 48, 54]. For many years, we have addressed these long CPU times via cloud-based CPU farms. Fisher et al. [13] warn that cloud computation is a heavily monetized environment that charges for all their services (storage, uploads, downloads, and CPU time). While each small part of that service is cheap, the total annual cost to an organization can be exorbitant.

DATA PRE-PROCESSING
Software defect prediction:
  • StandardScaler

  • MinMaxScaler

  • MaxAbsScaler

  • RobustScaler(quantile_range=(a, b))

    • a,b = randint(0,50), randint(51,100)

  • KernelCenterer

  • QuantileTransformer(n_quantiles=a,
    output_distribution=c, subsample=b)

    • a, b = randint(100, 1000), randint(1000, 1e5)

    • c = randchoice([‘normal’,‘uniform’])

  • Normalizer(norm=a)

    • a = randchoice([‘l1’, ‘l2’,‘max’])

  • Binarizer(threshold=a)

    • a = randuniform(0,100)

  • SMOTE(a=n_neighbors, b=n_synthetics,
    c=Minkowski_exponent)

    • a, b = randint(1,20), randchoice([50,100,200,400])

    • c = randuniform(0.1,5)

Text mining:
  • CountVectorizer(max_df=a, min_df=b)

    • a, b = randint(100, 1000), randint(1, 10)

  • TfidfVectorizer(max_df=a, min_df=b, norm=c)

    • a, b,c = randint(100, 1000), randint(1, 10), randchoice([‘l1’, ‘l2’, None])

  • HashingVectorizer(n_features=a, norm=b)

    • a = randchoice([1000, 2000, 4000, 6000, 8000, 10000])

    • b = randchoice([‘l1’, ‘l2’, None])

  • LatentDirichletAllocation(n_components=a, doc_topic_prior=b,
    topic_word_prior=c, learning_decay=d, learning_offset=e,batch_size=f)

    • a, b, c = randint(10, 50), randuniform(0, 1), randuniform(0, 1)

    • d, e = randuniform(0.51, 1.0), randuniform(1, 50),

    • f = randchoice([150,180,210,250,300])

LEARNERS
Software defect prediction and text mining:

  • DecisionTreeClassifier(criterion=b, splitter=c, min_samples_split=a)

    • a, b, c= randuniform(0.0,1.0), randchoice([‘gini’,‘entropy’]), randchoice([‘best’,‘random’])

  • RandomForestClassifier(n_estimators=a,criterion=b, min_samples_split=c)

    • a,b,c = randint(50, 150), randchoice([’gini’, ’entropy’]), randuniform(0.0, 1.0)

  • LogisticRegression(penalty=a, tol=b, C=float(c))

    • a,b,c=randchoice([‘l1’,‘l2’]), randuniform(0.0,0.1), randint(1,500)

  • MultinomialNB(alpha=a)

    • a= randuniform(0.0,0.1)

  • KNeighborsClassifier(n_neighbors=a, weights=b, p=d, metric=c)

    • a, b,c = randint(2, 25), randchoice([‘uniform’, ‘distance’]), randchoice([‘minkowski’,‘chebyshev’])

    • if c==’minkowski’: d= randint(1,15) else: d=2

TABLE I: Hyperparameter tuning options explored in this paper. Options were selected by listing the learners seen in recent SE papers on hyperparameter optimization [17, 15, 2, 1], then consulting the documentation of a widely-used data mining library (Scikit-learn [40]) for options not explored in those studies. Note that we make no claim that this is a complete list of tuning options. Rather, we merely claim that a reader of the recent SE literature on hyperparameter optimization might be tempted to try some subset of these options.
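To make the scale of Table I concrete, here is a minimal sketch (our own illustration, not code from the studies cited above) of how one random pre-processor/learner pair might be drawn from a tiny subset of the defect-prediction options and evaluated with Scikit-learn; the helper names and the slightly narrowed min_samples_split range are our assumptions.

```python
import random
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, Binarizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# A tiny, illustrative subset of the Table I options tree.
def random_preprocessor():
    return random.choice([
        Normalizer(norm=random.choice(["l1", "l2", "max"])),
        Binarizer(threshold=random.uniform(0, 100)),
    ])

def random_learner():
    return random.choice([
        DecisionTreeClassifier(
            criterion=random.choice(["gini", "entropy"]),
            splitter=random.choice(["best", "random"]),
            min_samples_split=random.uniform(0.01, 1.0)),   # >0 required by sklearn
        RandomForestClassifier(
            n_estimators=random.randint(50, 150),
            criterion=random.choice(["gini", "entropy"]),
            min_samples_split=random.uniform(0.01, 1.0)),
    ])

def evaluate_one(X_train, y_train, X_test, y_test):
    """Draw one tuning at random and report its recall on the test data
    (labels are assumed to be 0/1 for 'clean'/'buggy')."""
    model = make_pipeline(random_preprocessor(), random_learner())
    model.fit(X_train, y_train)
    return recall_score(y_test, model.predict(X_test))
```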

Fig. 1: Effects of hyperparameter optimization on the control parameters of learners, from [47]. Blue dots and red triangles show the mean performance before and after tuning (respectively). The x-axis shows different learners. The y-axis shows the frequency at which a learner was selected to be “top-ranked” (by a statistical analysis). Vertical lines show the variance of that selection process over repeated runs. On the left-hand side of the chart we see pre-tuned learners that seem ineffective (C5.0 and AVNNet). Yet after tuning, these seemingly poor performers performed outstandingly well, defeating 23/26 of the other classifiers.

Recently, and surprisingly, it was discovered how to (a) save most of that CPU cost while at the same time (b) find better tunings. As discussed later, a very simple approach called FFtrees [42] outperforms the supposed state-of-the-art SE tuning algorithms for our two SE tasks [8] (where “simpler” means less CPU is required to build better software quality prediction models). This is a strange result since standard tuning sampled hundreds to thousands of options, while FFtrees explored just a dozen or so.

To explain this result, we observe that there is some ε variation in performance between training and test data (e.g., when the test data comes from new project data not available during training). If, say, ε = 0.2, then the space of recall vs. false alarms divides into just 5 × 5 = 25 cells (where cells nearer to high recall and low false alarms are preferred). In that space, if we explored more than 25 tunings, some would be redundant; i.e., certain pairs would have indistinguishably different outcomes.

Fig. 2: Number of evaluations required by different methods in this article. Note that these are the number of evaluations required to find just one tuning. In practice, many more evaluations will be required. For example, when exploring data sets using a 5*5 cross-val, these evaluations need to be repeated 125 times. For LDA-GA SVM (described in §4.3), that implies 125,000 evaluations.

It turns out there are better ways to avoid redundant tunings than FFtrees. Our DODGE(ε) tuning tool learns to ignore redundant tunings (those that fall within ε of other results). When tested on defect prediction and text mining, DODGE(ε) terminated after fewer evaluations than standard optimizers (such as those that generated Figure 1). Also, it produced better performance scores than recent state-of-the-art research articles (from the two well-studied SE tasks listed above [15, 2, 8, 1, 39, 17]). We conjecture that other methods perform relatively worse since they do not appreciate the simplicity of the output space. Hence, those other methods waste CPU as they struggle to cover billions of redundant tuning options like Table I (most of which yield indistinguishably different results).

This article introduces and evaluates DODGE(). §2 describes how FFtrees lead to the design of DODGE() (in §3). §4 then answers the following research questions.

RQ1: Is DODGE() too complicated? How hard is it to find appropriate values of ε and N? We cannot recommend a method if it is too complex to use. Fortunately, we show that it is easy to find DODGE()’s parameters since its success is not altered by large changes to ε or N.

RQ2: How does DODGE() compare to recent prominent defect prediction and hyperparameter optimization results? We show that DODGE() out-performs:

  • An ICSE’15 article that explored many different learners for defect prediction [17];

  • An IST’16 journal article that demonstrated the value of tuning for defect predictors [15];

  • An ICSE’18 article that advocated tuning data pre-processors [2].

  • The FSE’18 article that reported the FFtree results [8].

RQ3: Is DODGE() only useful for defect prediction? In order to stress test our methods, we must apply DODGE() to some harder task than defect prediction. Software bug report text mining is a harder task than defect prediction since the latter only processes a few dozen attributes while the former deals with tens of thousands of attributes. For text mining, we show DODGE() performs better than:

  • An IST’18 journal article that showed the value of tuning for SE text mining applications [1].

  • An earlier ICSE’13 article that applied genetic algorithms to learn the settings for a text miner [39].

From our findings, we have several reasons to endorse DODGE():

  1. DODGE() generates better quality predictors than recent SE state-of-the-art research articles [15, 2, 8, 1, 39, 17].

  2. Also, we endorse DODGE() over the FFtree method that inspired it. Figure 2 shows the number of evaluations required by the different frameworks under study; FFtrees uses fewer evaluations than DODGE(). However, our RQ2 and RQ3 results show that DODGE() typically defeats FFtrees.

  3. Further, DODGE() is a simpler way to build those predictors (where “simpler” is measured in terms of the CPU required to run the method). From Figure 2, DODGE() needs only 30 evaluations to explore the billions of choices in Table I, far fewer than the other frameworks, which each use a few thousand evaluations.

  4. Finally, and more fundamentally, DODGE() tests the theory that much simpler hyperparameter optimizers can be built by assuming the output space divides into just a few regions of size ε. DODGE() is one way to exploit this effect. We believe that further research could exploit it in many other ways (e.g., different learners; better visualizations and/or explanations of analytics).

Before beginning, we digress to make two points. Firstly, just to say the obvious, while these results are certainly promising, our approach needs to be tested using more SE tasks (for more on this, see External Validity in §5).

Secondly, when we say DODGE() does better than prior work, we also include several of our own papers. Based on the results of this paper, we can no longer endorse the specific conclusions (about how to do hyperparameter optimization) from [1, 2, 15]. When researchers discover ways to overturn their own results, they should feel duty-bound to declare what was learned from that self-refutation. This is an important methodological point since, as Matthew Strassler observes, “All the great revolutions in science start with an unexpected discrepancy that wouldn’t go away.” (https://en.wikiquote.org/wiki/Science)

2 FFtrees and Defect Prediction

This section describes the surprising results about FFtrees and defect prediction that lead to DODGE().

2.1 Why Study Defect Prediction?

Software developers are smart, but sometimes make mistakes. Hence, it is essential to test software before deployment [37, 4, 57, 34]. Software quality assurance budgets are finite, while assessment effectiveness increases exponentially with assessment effort [15]. Therefore, standard practice is to apply the best available methods to the code sections that seem most critical and bug-prone. Software bugs are not evenly distributed across a project [19, 23, 38, 32]. Hence, a useful way to perform software testing is to allocate most of the assessment budget to the more defect-prone parts of software projects. Software defect predictors are never 100% correct, but they can be used to suggest where to focus more expensive methods.

There is much commercial interest in defect prediction. In a survey of 395 practitioners from 33 countries and five continents, Wan et al. [51] found that over 90% of the respondents were willing to adopt defect prediction techniques. Other results from commercial deployments show the benefits of defect prediction. When Misirli et al. [32] built a defect prediction model for a telecommunications company, those models could predict 87 percent of code defects. Those models also decreased inspection efforts by 72 percent, and hence reduced post-release defects by 44 percent. Also, when Kim et al. [21] applied their REMI defect prediction model to the API development process at Samsung Electronics, they could predict bug-prone APIs with reasonable accuracy (0.68 F1 score) and reduce the resources required for executing test cases.

Software defect predictors not only save labor compared with traditional manual methods, but they are also competitive with certain automatic methods. In a recent ICSE’14 study, Rahman et al. [44] compared (a) static code analysis tools FindBugs, Jlint, and PMD against (b) static code defect predictors (which they called “statistical defect prediction”) built using logistic regression. They found no significant differences in the cost-effectiveness of these approaches.

Given this equivalence, it is significant to note that static code defect prediction can be quickly adapted to new languages by building lightweight parsers that extract static code metrics such as those of Table II. The same is not true for static code analyzers, which need extensive modification before they can be used in new languages.

2.2 Data and Algorithms for Defect Prediction

Our defect predictors use static code measurements from Table II and Table III. As shown in Table III, this data is available for multiple software versions (from http://tiny.cc/seacraft). This is important since an important principle of data mining is not to test on the data used in training. There are many ways to design an experiment that satisfies this principle. Some of those methods have limitations; e.g., leave-one-out is too slow for large data sets, and cross-validation mixes up older and newer data (such that data from the past may be used to test on future data). In this work, for each project, we set aside the latest version of the project data as the testing data and use all the older data as the training data. For example, for the Poi project of Table III, versions 1.5, 2.0, and 2.5 are used for training predictors, and the newer data, version 3.0, is left for testing.

Table III illustrates the variability of SE data. When we compare the % of defects in the training and test data, we see that the past can be very different from the future. Observe how the median defect percentage in the training data is 29% but in the test data it is 49% (i.e., nearly doubled). This tells us that software analytics will forever be an imprecise science (and one of the lessons of this paper is that imprecision can be used to simplify complex tasks like hyperparameter optimization).

Some of the data in Table III has imbalanced class frequencies. If the target class is not common (as in the camel, ivy, and jedit test data of Table III), it is difficult to generate a model that can locate it. A standard trick for class imbalance is SMOTE [7], which randomly deletes members of the majority class while synthetically creating members of the minority class. SMOTE is controlled by the parameters shown in Table I.
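As a hedged illustration of how the SMOTE options of Table I map onto a real implementation, the sketch below uses the imbalanced-learn library (our choice of library; the original studies may wire the n_synthetics and Minkowski-exponent options differently, so only the neighborhood size is shown):

```python
import random
from collections import Counter
from imblearn.over_sampling import SMOTE

def smote_rebalance(X_train, y_train):
    """Oversample the minority class with a randomly chosen neighborhood size,
    mirroring the a=n_neighbors option of Table I. Note that k_neighbors must
    be smaller than the number of minority-class examples."""
    k = random.randint(1, 20)                      # a = randint(1, 20)
    sampler = SMOTE(k_neighbors=k, random_state=1)
    X_bal, y_bal = sampler.fit_resample(X_train, y_train)
    print("before:", Counter(y_train), "after:", Counter(y_bal))
    return X_bal, y_bal
```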

amc: average method complexity
avg cc: average McCabe (cyclomatic) complexity
ca: afferent couplings
cam: cohesion among classes
cbm: coupling between methods
cbo: coupling between objects
ce: efferent couplings
dam: data access
dit: depth of inheritance tree
ic: inheritance coupling
lcom (lcom3): 2 measures of lack of cohesion in methods
loc: lines of code
max cc: maximum McCabe (cyclomatic) complexity
mfa: functional abstraction
moa: aggregation
noc: number of children
npm: number of public methods
rfc: response for a class
wmc: weighted methods per class
defects: Boolean: whether defects were found in bug-tracking
TABLE II: OO code metrics used for the defect prediction studies of this article. The last line denotes the dependent variable.
Project | Training: Versions | Training: % of Defects | Testing: Version | Testing: % of Defects
Poi | 1.5, 2.0, 2.5 | 426/936 = 46% | 3.0 | 281/442 = 64%
Lucene | 2.0, 2.2 | 235/442 = 53% | 2.4 | 203/340 = 60%
Camel | 1.0, 1.2, 1.4 | 374/1819 = 21% | 1.6 | 188/965 = 19%
Log4j | 1.0, 1.1 | 71/244 = 29% | 1.2 | 189/205 = 92%
Xerces | 1.2, 1.3 | 140/893 = 16% | 1.4 | 437/588 = 74%
Velocity | 1.4, 1.5 | 289/410 = 70% | 1.6 | 78/229 = 34%
Xalan | 2.4, 2.5, 2.6 | 908/2411 = 38% | 2.7 | 898/909 = 99%
Ivy | 1.1, 1.4 | 79/352 = 22% | 2.0 | 40/352 = 11%
Synapse | 1.0, 1.1 | 76/379 = 20% | 1.2 | 86/256 = 34%
Jedit | 3.2, 4.0, 4.1, 4.2 | 292/1257 = 23% | 4.3 | 11/492 = 2%
TABLE III: Statistics of the studied data sets.

As to machine learning algorithms, these are many and varied. At ICSE’15, Ghotra et al. [17] applied 32 different machine learning algorithms to defect prediction. In a result consistent with the theme of this article, they found that those 32 algorithms formed the four groups of Table IV (and the performance of two learners in any one group were statistically indistinguishable from each other).

Overall Rank | Classification Technique
1 (best) | Rsub+J48, SL, Rsub+SL, Bag+SL, LMT, RF+SL, RF+J48, Bag+LMT, Rsub+LMT, and RF+LMT
2 | RBFs, Bag+J48, Ad+SL, KNN, RF+NB, Ad+LMT, NB, Rsub+NB, and Bag+NB
3 | Ripper, EM, J48, Ad+NB, Bag+SMO, Ad+J48, Ad+SMO, and K-means
4 (worst) | RF+SMO, Ridor, SMO, and Rsub+SMO
TABLE IV: 32 defect predictors clustered by their performance rank by Ghotra et al. (ranked using their Scott-Knott statistical test) [17].

2.3 Evaluation of Defect Predictors

2.3.1 Measures of Performance

We choose not to evaluate defect predictors on any single criterion (e.g., not just recall) since succeeding on one criterion can damage another [15]. Also, we eschew precision and accuracy since these can be misleading for data sets where the target class is rare (which is common in defect prediction data sets) [29]. Instead, we evaluate our predictors on measures that aggregate multiple metrics.

D2h, or “distance to heaven”, shows how close scores fall to “heaven” (where recall=1 and false alarms=0) [8]:

\[ \mathit{Recall} = \frac{\mathit{TruePositives}}{\mathit{TruePositives} + \mathit{FalseNegatives}} \tag{1} \]

\[ \mathit{FalseAlarm} = \frac{\mathit{FalsePositives}}{\mathit{FalsePositives} + \mathit{TrueNegatives}} \tag{2} \]

\[ \mathit{d2h} = \frac{\sqrt{(1 - \mathit{Recall})^2 + (0 - \mathit{FalseAlarm})^2}}{\sqrt{2}} \tag{3} \]

Here, the \(\sqrt{2}\) term normalizes d2h to the range zero to one.

Popt(20) comments on the inspection effort required after a defect predictor triggers and humans have to read code, looking for errors. Popt = 1 − Δopt, where Δopt is the area between the effort (code-churn-based) cumulative lift charts of the optimal learner and the proposed learner. To calculate Popt(20), we divide all the code modules into those predicted to be defective (D) or not (N). Both sets are then sorted in ascending order of lines of code. The two sorted sets are then laid out across the x-axis, with D before N. This layout means that the x-axis extends from 0 to 100%, where modules at lower x-values are predicted to be more defective than those at higher values. On such a chart, the y-axis shows what percent of the defects would be recalled if we traverse the code in that x-axis order. Following the recommendations of Ostrand et al. [38], Popt is reported at the 20% point; this shows how many bugs are found if we inspect only a small portion (20%) of the code.

Kamei, Yang, et al. [56, 20, 33] normalize Popt using:

\[ P_{\mathit{opt}}(20) = 1 - \frac{S(\mathit{optimal}) - S(m)}{S(\mathit{optimal}) - S(\mathit{worst})} \tag{4} \]

where \(S(\mathit{optimal})\), \(S(m)\), and \(S(\mathit{worst})\) represent the area under the curve of the optimal learner, the proposed learner, and the worst learner, respectively. Note that the worst model is built by sorting all the changes according to the actual defect density in ascending order. After normalization, Popt(20) (like d2h) has the range zero to one. But note that:

  • Larger values of Popt(20) are better;

  • Smaller values of d2h are better.
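Putting Equations 1–3 into code, the following small helper (ours, assuming recall and false-alarm rates have already been computed from a confusion matrix) returns d2h in the zero-to-one range:

```python
import math

def d2h(recall, false_alarm):
    """Distance to 'heaven' (recall=1, false_alarm=0), per Equation 3.
    Dividing by sqrt(2) normalizes the result to the range [0, 1]."""
    return math.sqrt((1 - recall) ** 2 + (0 - false_alarm) ** 2) / math.sqrt(2)

# e.g. a predictor with recall=0.7 and a false alarm rate of 0.25
print(round(d2h(0.7, 0.25), 3))   # -> 0.276
```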

2.3.2 Comparing to a Sample

As to statistical methods, the following results use two approaches. Firstly, when comparing one result to a sample of others, we will sometimes see “small effects” (which can be ignored). To define “small effect”, we use Cohen’s delta [9]:

\[ \mathit{small\ effect} = 0.2 \times \sigma \tag{5} \]

i.e., 20% of the standard deviation \(\sigma\) of the measurements in the sample.

Secondly, other statistical tests are required when comparing results from two samples; e.g., when two variants of some stochastic process are applied, many times, to a population. For this second kind of comparison, we need a statistical significance test (to certify that the distributions are indeed different) and an effect size test (to check that the differences are more than a “small effect”). There are many ways to implement this second kind of test. Here, we use those which have been peer-reviewed in the literature [2, 1]. Specifically, we use Efron’s 95% confidence bootstrap procedure [11] and the A12 test [3]. In this second test, to say that one sample X is “worse” than another sample Y is to say:

  • The mean Popt(20) values of X are less than those of Y;

  • The mean D2h values of X are more than those of Y;

  • The populations are not statistically similar; i.e., (a) their mean difference is larger than a small effect (using A12) and (b) a statistical significance test (bootstrapping) has not rejected the hypothesis that they are different (at 95% confidence).

Note we do not use A12 or bootstrap for the first kind of test, since those methods are not defined for comparisons of individuals to a sample.
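For the two-sample comparisons, the A12 effect size is simple enough to sketch here (a minimal version written by us; the bootstrap significance test is omitted, and the "small effect" threshold in the comment is a common convention rather than a value taken from this article):

```python
def a12(xs, ys):
    """Vargha-Delaney A12: probability that a value drawn from xs
    is larger than one drawn from ys (ties count half)."""
    more = same = 0
    for x in xs:
        for y in ys:
            if x > y:
                more += 1
            elif x == y:
                same += 1
    return (more + 0.5 * same) / (len(xs) * len(ys))

# A12 near 0.5 means the two samples barely differ; a common convention
# treats |A12 - 0.5| < 0.06 as a "small" effect.
print(a12([0.70, 0.72, 0.75], [0.60, 0.65, 0.66]))  # -> 1.0
```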

2.4 Results from FFtrees

Fast and Frugal Trees (FFtrees) were developed by psychological scientists [27] trying to generate succinct, easily comprehensible models. We first used them as an explanation tool, but realized that they had broader implications.

FFtrees are binary trees that return a binary classification (e.g., true, false). Unlike standard decision trees, each level of an FFtree must have at least one leaf node. For example, Table V shows an FFtree generated from the log4j JAVA system of Table III. The goal of this tree is to classify a software module as “defective=true” or “defective=false”. The four nodes in the Table V FFtree reference four attributes: cbo, rfc, dam, and amc (defined in Table II).

              if      cbo <= 4    then false
              else if rfc > 32    then true
              else if dam >  0    then true
              else if amc < 32.25 then true
              else false
        
TABLE V: An example FFtree generated from Table III data sets. Attributes come from Table II. “True” means “predicted to be defective”.
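To show how such a tree is applied, here is the Table V decision list transcribed into a plain Python function (a direct transcription of the table; the dictionary-of-metrics calling convention is our own):

```python
def log4j_fftree(m):
    """Classify one module, given its Table II metrics, using the Table V FFtree.
    Returns True when the module is predicted to be defective."""
    if m["cbo"] <= 4:
        return False
    elif m["rfc"] > 32:
        return True
    elif m["dam"] > 0:
        return True
    elif m["amc"] < 32.25:
        return True
    else:
        return False

print(log4j_fftree({"cbo": 10, "rfc": 40, "dam": 0.5, "amc": 20.0}))  # -> True
```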

Following the advice of [42], we use trees of depth d = 4. This means that our FFtrees make their decisions using at most four attributes (where numeric ranges have been binarized by splitting at the median point).

Standard rule learners select ranges that best select for some goal (e.g., selecting for the “true” examples). This can lead to overfitting. To avoid overfitting, FFtrees use a somewhat unique strategy: at each level of the tree, FFtrees builds two trees using the ranges that most and least satisfy some goal; e.g., d2h or Popt20. That is, half the time, FFtrees will try to avoid the target class by building a leaf node that exits to “false”. Assuming a maximum tree depth of d = 4 and two choices at each level, FFtree builds 2^4 = 16 trees, then prunes away all but one, as follows:

  • First, we select a goal predicate; e.g., d2h or Popt20.

  • Next, while building one tree, at each level of the tree, FFtree scores each range according to how well that range {does, does not} satisfy that goal. The selected range becomes a leaf node. FFtree then calls itself recursively on all examples that do not fall into that range.

  • Finally, while assessing the 16 trees, we run the training data through each tree to find which examples are selected by that tree. Each tree is scored by passing the selected examples through the goal predicate. The tree with the best score is then applied to the test data.

In summary, FFtrees explores the data just a few dozen times (here, 16 candidate trees), trying different options for how to best model the data (i.e., which exit node to use at each level of the tree). After those few explorations, FFtrees deletes the worst models and uses the remaining model on the test data.
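The 2^d candidate trees come from trying every combination of “exit to true” or “exit to false” at each level. A compact way to enumerate those exit structures (a sketch of the enumeration only, not of FFtree's range-selection logic) is:

```python
from itertools import product

DEPTH = 4

# Each candidate FFtree is described by which class each level exits to;
# with two choices per level, depth 4 yields 2**4 = 16 candidates.
exit_structures = list(product([True, False], repeat=DEPTH))
print(len(exit_structures))        # -> 16
print(exit_structures[0])          # -> (True, True, True, True)
```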

Fig. 3: Defect prediction results for FFtrees vs untuned learners, from [8]. The left-hand columns show D2h (less is better) and the right-hand columns show Popt(20) (more is better); the size of a “small effect” is listed at the top of each column. FFtrees is almost never beaten by other methods (by more than a “small effect”). Exception: see the synapse+EM results in the left column.

Figure 3 shows results from Chen et al. [8] that compared FFtrees to standard defect predictors. In that comparison, the Ghotra et al. [17] rankings were used to guide learner selection: they found that 32 defect predictors group together into just four ranks (see Table IV). We picked learners at random from each of those ranks: SL=Simple Logistic, NB=Naive Bayes, EM=Expectation Maximization, and SMO=Sequential Minimal Optimization (a kind of support vector machine). We call these learners “standard” since, in Figure 3, we use them with their default settings from Scikit-learn [40]. In Figure 3:

  • Performance is evaluated using metrics from §2.3.

  • Data comes from Table III.

  • This data has the attributes of Table II.

  • For data with multiple versions, we test on the latest version and train on a combination of all the rest.

  • If FFtrees perform worse than any other learner by more than a “small effect” (defined using Equation 5), then that result is highlighted in red (see the synapse d2h results of Figure 3). For each column, the size of a “small effect” is listed at the top.

As shown in Figure 3, FFtrees nearly always performs as well, or better, than anything else.

3 From FFtrees to Dodge()

Figure 3 is a very strange result. How can something as simple as FFtree perform so well?

  • FFtrees explores very few alternate models (only 16).

  • Each model references only four attributes.

  • To handle numeric variables, a very simplistic discretization policy is applied at each level of the tree (numerics are separated at the median value).

  • Strangest of all, the FFtree overfitting mechanism will (half the time) try to avoid the target class when it selects a leaf node that exits to “false”.

Under what conditions would something that simple work as well as shown in Figure 3? One possible answer was offered in the introduction. If our data has a large ε in its output space, then:

  • The output space divides into just a few cells; so

  • If there are c cells and t tunings, then when t > c, some of those tunings will be redundant; i.e., they achieve results within ε of other results.

  • Which means that exploring around c times will cover much of the output space.

If that is true, then to do better than FFtrees:

  • Try exploring around the same small number of times, but across a wider range of options.

  • If some options result in a performance score β, then deprecate options that lead to results within ε of β.

To find a wider range of options, DODGE() uses the Table I tree of options. Leaves in that tree are either:

  • Single choices; e.g., DecisionTree, “splitter=random”; or

  • Numeric ranges; e.g., Normalizer, “norm=l2”.

Each node in the tree is assigned a weight, initialized to w = 0. When evaluating a branch, the options in that branch configure, then execute, a pre-processor/learner. Each evaluation selects one leaf from the learner sub-tree and one from the pre-processing sub-tree (defect prediction and text mining explore different pre-processing sub-trees; see Table I). If the evaluation score differs from all prior scores by more than ε, then all nodes in that branch are endorsed (w = w + 1). Otherwise, DODGE() deprecates them (w = w − 1). DODGE() uses these weights to select options via a recursive weighted descent where, at each level, it selects the sub-trees whose roots have the largest weights (i.e., those most endorsed).

The design conjecture of DODGE() is that exploring some tuning options matters but, given a large-ε output space, the details of those options are not so important. Hence, for a limited number of times N1, we pick some options at random. Having selected those options, for a further N2 samples, we learn which of the options should be most deprecated or endorsed.

The N2 stage also refines numeric ranges. When a range is initially evaluated, a random number within it is selected and its weight is set to zero. Subsequently, this weight is endorsed or deprecated as described above, with one refinement. When a new value is required (i.e., when the branch is evaluated again), DODGE() restricts the numeric range: if the best and worst weights seen so far (in this range) are associated with the values b and w (respectively), then the range is narrowed towards b. Important point: endorsing and deprecating is done each time a branch is evaluated, within both the N1 and N2 steps.

In summary, DODGE() is a method for learning what tunings are redundant; i.e. lead to results that are very similar to other tunings. It is controlled by two meta-parameters:

  • ε: results are “similar” if they differ by less than ε;

  • N: the number of sampled tunings.

Recall that N = N1 + N2, where:

  • For the first N1 samples, the set of tuning options grows;

  • For the remaining N2 samples, that set is frozen while we refine our understanding of what tunings to avoid.
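A toy sketch of that two-phase, endorse/deprecate loop is shown below. It is our own paraphrase of the description above, not the authors' implementation: the Table I options tree is flattened into a plain list, and the evaluate function fakes a model score with hidden per-option values.

```python
import random

# Hypothetical stand-ins: each "option" is just a name with a hidden score.
OPTIONS = {f"opt{i}": random.random() for i in range(50)}

def evaluate(chosen):
    """Pretend evaluation: average the hidden scores of the chosen options."""
    return sum(OPTIONS[o] for o in chosen) / len(chosen)

def dodge(epsilon=0.2, n1=12, n2=18, k=3):
    """Endorse options whose results differ from all prior results by more
    than epsilon; deprecate the rest (all weights start at zero)."""
    weights = {o: 0 for o in OPTIONS}
    results = []
    for step in range(n1 + n2):
        if step < n1:   # phase 1: random sampling grows the candidate set
            branch = random.sample(list(OPTIONS), k)
        else:           # phase 2: follow the most-endorsed options so far
            branch = sorted(OPTIONS, key=weights.get, reverse=True)[:k]
        score = evaluate(branch)
        novel = all(abs(score - r) > epsilon for r in results)
        for option in branch:
            weights[option] += 1 if novel else -1
        results.append(score)
    return sorted(OPTIONS, key=weights.get, reverse=True)[:k]

print(dodge())
```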

Fig. 4: RQ1 results. Defect prediction with DODGE(ε), keeping the number of samples constant at N = 30 while varying ε. Columns show D2h (less is better) and Popt(20) (more is better). As before, changing ε does not change learner performance by more than a “small effect”. This figure was generated using the same experimental set up as Figure 5.

Fig. 5: More RQ1 results. Defect prediction with DODGE(.2), varying the number of samples N. Columns show D2h (less is better) and Popt(20) (more is better). Note that, for any data set, all these results are very similar; i.e., changing the sample size does not change learner performance by more than a “small effect”. This figure was generated using the same experimental set up as Figure 3 (with tuning options taken from Table I).

4 Answers to Research Questions

Using DODGE(), we can now answer the research questions asked in this article’s introduction.

4.1 RQ1: Is Dodge() too complicated? How to find appropriate values of ε and N?

When proposing simplifications to software analytics, it is important to check if the newly proposed method is itself simple to apply. Accordingly, this research question asks if it is difficult to find useful values for N (the number of samples) and for the ε value used in the search. Figure 4 and Figure 5 explore different settings of ε and N:

  • Figure 4 varies ε but keeps N constant.

  • Figure 5 varies N but keeps ε constant.

As shown in these figures, these changes to ε and N alter the performance of DODGE() by less than a “small effect”. That is, (a) the output space for this data falls into a very small number of regions, so (b) a large number of samples across a fine-grained division of the output space performs just as well as a few samples over a coarse-grained division.

In summary, our answer to RQ1 is that the values of ε and N can be set very easily. Based on the results of Figure 4 and Figure 5, for the rest of this article we will use ε = 0.2 while taking N = 30 samples of the options tree.

INPUT:
  • A dataset, such as Table III;

  • A tuning goal g (e.g., Equation 3 or Equation 4 from §2.3);

  • DE parameters: population size np, mutation factor f, crossover probability cf, and a counter of remaining lives.

PROCEDURE:
  • Given np options, generate np tunings as the initial population Pop;

  • Score each tuning in the population with goal g;

  • While lives > 0:

    1. Set lives = lives − 1; i.e., we will lose a life (unless we find improvements);

    2. For i = 1 … np do

      1. Make a mutant m by extrapolating between three other members a, b, c of the population, at probability cf. For each decision k:

        1. m_k = a_k + f × (b_k − c_k) (continuous values).

        2. m_k = a_k or b_k or c_k, picked at random (discrete values).

      2. Build a learner with parameters m and the training data;

      3. Using the data and g, score that learner;

      4. If m beats Pop_i, replace Pop_i and do not lose a life (set lives = lives + 1).

Fig. 6: Storn’s differential evolution algorithm [46, 14].
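For concreteness, a stripped-down DE loop for purely numeric tunings is sketched below (our own simplification of Figure 6; the np, f, cf, lives, and max_gens values are illustrative defaults, not the settings used in the studies compared here):

```python
import random

def de(score, lo, hi, np_=20, f=0.3, cf=0.75, lives=10, max_gens=100):
    """Minimal differential evolution over numeric tunings in [lo, hi]^d.
    `score` maps a candidate (list of floats) to a value to be maximized.
    max_gens is a safety cap of our own, to guarantee termination."""
    dim = len(lo)
    pop = [[random.uniform(lo[k], hi[k]) for k in range(dim)] for _ in range(np_)]
    gens = 0
    while lives > 0 and gens < max_gens:
        gens += 1
        lives -= 1                               # lose a life unless we improve
        for i in range(np_):
            a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
            mutant = [a[k] + f * (b[k] - c[k]) if random.random() < cf else pop[i][k]
                      for k in range(dim)]
            mutant = [min(max(m, lo[k]), hi[k]) for k, m in enumerate(mutant)]
            if score(mutant) > score(pop[i]):    # keep the better candidate
                pop[i] = mutant
                lives += 1
    return max(pop, key=score)

# e.g. maximize a toy goal over two numeric options (optimum near [3, 1])
print(de(lambda x: -(x[0] - 3) ** 2 - (x[1] - 1) ** 2, lo=[0, 0], hi=[10, 10]))
```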

4.2 RQ2: How does Dodge() compare to recent prominent defect prediction and hyperparameter optimization results?

SMOTUNED is Agrawal et al.’s ICSE’18 [2] hyperparameter optimizer that tunes the SMOTE data pre-processor (recall that SMOTE is a tool for addressing class imbalance and was described in §2.2). Agrawal et al. report that SMOTUNED’s tunings greatly improved classifier performance. SMOTUNED uses the differential evolution algorithm described in Figure 6. SMOTUNED has the same control parameters as SMOTE (see Table I).

DE+RF is a hyperparameter optimizer proposed by Fu et al. [16] that uses differential evolution to tune the control parameters of random forests. The premise of RF (which is short for random forests) is “if one tree is useful, why not a hundred?”. RF quickly builds many trees, each time using a random selection of the attributes and examples. The final conclusion is then generated by polling across all the trees in the forest. RF’s control parameters are listed in Table I.

SMOTUNED and DE+RF used DE since DE can handle numeric and discrete options (see step 2a in Figure 6). Also, DE has been proven useful in prior SE tuning studies [15]. Further, other evolutionary algorithms (genetic algorithms [18], simulated annealing [22]) mutate each attribute in isolation. When two attributes are correlated, those algorithms can mutate variables inappropriately in different directions. DE, on the other hand, mutates attributes in tandem along known data trends. Hence, DE’s tandem search can outperform other optimizers such as (a) particle swarm optimization [50]; (b) the grid search used by Tantithamthavorn et al. [47] to tune their defect predictors [16]; or (c) the genetic algorithm used by Panichella et al. [39] to tune a text miner (that result is presented below).

Fig. 7: RQ2 results. Defect prediction results for DODGE(.2) vs FFtrees, SMOTUNED, DE+RF, and RANDOM. Columns show D2h (less is better) and Popt(20) (more is better); cells are mean results from 25 runs. In only a few cases (those highlighted in red) is DODGE(.2)’s performance worse than anything else (where “worse” is defined using the statistics of §2.3.2).

Figure 7 compares hyperparameter optimizers with DODGE(.2), FFtrees and (just for completeness) a random search method that picks 30 random options from Table I. These experiments make extensive use of stochastic algorithms whose behavior can significantly differ between each run (DE and Random30). Hence, Figure 7 shows mean results from 25 runs using 25 different seeds. In those results:

  • Usually, random performs badly and never defeats DODGE(). This result tells us that the reweighing scheme within DODGE() is useful.

  • In 16/20 cases (counting the d2h and Popt(20) results across all the data sets), DODGE(.2) is no worse than anything else (where “worse” is defined as per §2.3.2).

  • In two cases, DODGE(.2) is beaten by FFtrees (see the d2h results for jedit and log4j). That is, in 90% of these results, methods that explore a little around the results space do no worse than methods that try to extensively explore the space of tuning options.

In summary, our answer to RQ2 is that DODGE() often performs much better than recent prominent standard hyperparameter optimization results.

4.3 RQ3: Is Dodge() only useful for defect prediction?

DODGE() was designed in the context of defect prediction. This section checks if that design applies to a very different software analytics task; i.e., SE text mining.

4.3.1 Why Study SE Text Mining?

The defect predictors described above learned models from structured data; i.e., simple tables of data which include a target class such as buggy equals true or false. But many SE project artifacts come in the form of unstructured text such as word processing files, slide presentations, comments, Github issue reports, etc. According to White [53]:

  • 80% of business is conducted on unstructured data;

  • 85% of all data stored is held in an unstructured format;

  • Unstructured data doubles every three months.

Nadkarni and Yezhkova [35] say that most of the planet’s 1600 Exabytes of data does not appear in structured sources (databases, etc) and that each year, humans generate far less structured than unstructured artifacts.

Lately, there has been much interest in SE text mining [28, 31, 39, 1, 55, 26] since this covers a much wider range of SE activities. Text mining is harder than defect prediction due to the presence of free-form natural language, which is semantically very complex and may not conform to any known grammar. In practice, text documents require tens of thousands of attributes (one for each word in the natural language of the authors of those documents). For example, consider NASA’s software project and issue tracking system (PITS) [28, 31], which contains text discussing bugs and changes in source code, along with comments on software patches. As shown in Table VI, our text data contains tens to hundreds of thousands of words (even when reduced to unique words, there are still 10,000+ unique words).

When a free-text tool like PITS is used by a very broad community, it is hard to ensure that humans comment on artifacts in a consistent manner. To encourage a better and more uniform commenting system within PITS, Menzies & Marcus [31] developed a text miner that checks the validity of PITS severity reports.

  • After seeing an issue in some artifact, a human analyst assigns a severity level severityX.

  • Our text miner learns a predictor for issue severity level from logs of (notes, severity) pairs. This predictor is applied to the latest issue to assign a severity level severityY.

  • When severityY is different from severityX, a human supervisor reviews the dispute to, possibly, override the original severity ranking.

The rest of this section compares different methods for implementing this severity classifier.

4.3.2 Data and Algorithms for Text Mining

Dataset | No. of Documents | No. of Unique Words | Severe %
PitsA | 965 | 155,165 | 39
PitsB | 1650 | 104,052 | 40
PitsC | 323 | 23,799 | 56
PitsD | 182 | 15,517 | 92
PitsE | 825 | 93,750 | 63
PitsF | 744 | 28,620 | 64
TABLE VI: Dataset statistics. Data comes from the SEACRAFT repository: http://tiny.cc/seacraft
Topics= Top words in topic
01= command engcntrl section spacecraft unit icd tabl point referenc indic
02= softwar command test flight srobc srup memori script telemetri link
03= file variabl line defin messag code macro initi use redund
04= file includ section obc issu fsw code number matrix src
05= mode safe control state error power attitud obc reset boot
06= function eeprom send non uplink srup control load chang support
07= valu function cmd return list ptr curr tss line code
08= tabl command valu data tlm load rang line count type
09= flight sequenc link capabl spacecraft softwar provid time srvml trace
10= line messag locat column access symbol file referenc code bld
TABLE VII: Top 10 topics found by LDA for the PitsA dataset from Table VI. Within each topic, the weight of words decreases exponentially left to right across the order shown here. The words here are truncated (e.g., “software” becomes “softwar”) due to stemming.

Issue | Weights on the 10 Topics | Severe?
01 | .60 .10 .00 .15 .00 .05 .03 .04 .03 .00 | y
02 | .10 .03 .02 .00 .03 .02 .15 .65 .00 .00 | n
03 | .00 .20 .05 .05 .00 .60 .02 .03 .03 .02 | n
04 | .03 .01 .01 .10 .15 .00 .70 .00 .00 .00 | y
etc.
TABLE VIII: Document-topic distribution found by LDA for the PitsA dataset.

Table VI describes our PITS data, which comes from six different NASA systems (which we label PitsA, PitsB, …, etc.). For this study, all datasets were preprocessed using the usual text mining filters [12]. We implemented stop word removal using the NLTK toolkit [5] (to ignore very common short words such as “and” or “the”). Next, Porter’s stemming filter [43] was used to delete uninformative word endings (e.g., after stemming, all of the following words are rewritten to “connect”: “connection”, “connections”, “connective”, “connected”, “connecting”). After that, DODGE() selected any other pre-processors using the space of options from Table I.
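A hedged sketch of that pre-processing chain (NLTK stop-word removal followed by Porter stemming) is shown below; the whitespace tokenization is deliberately naive, and the NLTK stopwords corpus is assumed to have been downloaded already.

```python
from nltk.corpus import stopwords     # assumes nltk.download("stopwords") was run
from nltk.stem import PorterStemmer

STOPS = set(stopwords.words("english"))
STEM = PorterStemmer()

def preprocess(document):
    """Lower-case, drop stop words and non-alphabetic tokens, then stem."""
    tokens = document.lower().split()
    return [STEM.stem(t) for t in tokens if t not in STOPS and t.isalpha()]

print(preprocess("The connections were connecting and connected"))
# -> ['connect', 'connect', 'connect']
```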

A standard learner for text mining is the support vector machine (SVM). SVMs seek a hyperplane that separates the data while maximizing the distance of that plane to the examples [10]. A drawback with SVM is that its models may not be human comprehensible. Finding insights among unstructured text is difficult unless we can search, characterize, and classify the textual data in a meaningful way. One of the common techniques for finding related topics within unstructured text (an area called topic modeling) is Latent Dirichlet Allocation (LDA) [6]. LDA clusters text into “topics” defined by the high-frequency words in that cluster. For example, the topics found by LDA for one of our PITS data sets are shown in Table VII. We study LDA in this article since it is a widely-used technique in recent prominent SE research articles. For example, in the last decade, at least 39 articles using LDA have appeared at ICSE, FSE, TSE, OOPSLA, IST, JSS, ASE, MSR, ICPC, SANER, ICSME, ISSRE, and the Empirical SE journal [1].

LDA is controlled by various parameters (see Table I). At ICSE’13, Panichella et al. [39] used a genetic algorithm to tune their LDA text miners. More recently, in an IST’18 journal article, Agrawal et al. [1] showed that differential evolution can out-perform genetic algorithms for tuning LDA.

A standard pre-processor for text mining is vectorization; i.e., replace the raw observations that wordX appears in documentY with some more informative statistic. For example, Agrawal et al. converted the PITS text data into the vectors of Table VIII. The cells in that table show how much each issue report matches each topic (and the final column shows the issue severity of that report). Table I lists the options for LDA vectorization, plus three other vectorization methods.

The important thing about vectorization is that, after that conversion, standard machine learning algorithms can be applied to build text miners (e.g., see the learners of Table I).
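As an illustration of that vectorize-then-learn pattern, the sketch below (ours; the parameter values are picked from within the Table I ranges, not from any tuned configuration) feeds LDA topic vectors into one of the Table I learners:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

def lda_pipeline(n_topics=10):
    """Raw issue text -> term counts -> topic weights -> severity classifier."""
    return make_pipeline(
        CountVectorizer(min_df=2),
        LatentDirichletAllocation(n_components=n_topics, random_state=1),
        DecisionTreeClassifier(criterion="entropy"),
    )

# Usage (train_texts is a list of strings, train_labels the severe/not labels):
# model = lda_pipeline().fit(train_texts, train_labels)
# predictions = model.predict(test_texts)
```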

4.3.3 Evaluation of Text Miners

In this article we assess how well our text miners perform at recognizing the “severe” issue reports in PITS. Since this is a standard classification problem, it is appropriate to use the d2h metric of §2.3.

That said, we must adjust some of the evaluation methods used in this article’s previous work on defect prediction. We should not use Popt20 for these text mining studies since that is a specialized metric that reports the effectiveness of the source code reviews triggered by defect prediction.

Also, unlike the defect prediction data of Table III, the PITS data is not conveniently divided into versions. Hence, to generate separate train and test data sets, we use a cross-validation study where, five times, we randomize the order of the data and then divide it into five bins. Then, for each bin, we test on that bin after training on all the others.

Since cross-validation can significantly alter performance between different train/test pairs, we will show mean results from 25 runs using 25 different seeds.
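That repeated cross-validation can be written compactly with Scikit-learn. The sketch below is ours and assumes the 5x5 design mentioned in Figure 2; build_and_score is a hypothetical stand-in for one DODGE() or baseline run, and the per-repeat seeding is a simplification of the 25-seed scheme described above.

```python
import numpy as np
from sklearn.model_selection import KFold

def five_by_five(X, y, build_and_score, repeats=5, bins=5):
    """Repeat a 5-bin cross-validation 5 times with different shuffles,
    returning the mean and standard deviation over all 25 train/test splits."""
    scores = []
    for seed in range(repeats):
        folds = KFold(n_splits=bins, shuffle=True, random_state=seed)
        for train_idx, test_idx in folds.split(X):
            scores.append(build_and_score(X[train_idx], y[train_idx],
                                          X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)
```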

4.3.4 Results

D2h: less is better. Mean results from 25 runs.

Fig. 8: RQ3 results. Mean text mining prediction results using DODGE(.2) and the other treatments (results seen in 25 repeats of a cross-validation study). In only one case (PitsB) is DODGE()’s performance worse than anything else (where “worse” is defined as per §2.3.2). Same experimental set up as Figure 3 except that here, we use Efron’s 95% confidence bootstrap procedure [11] (to demonstrate significant differences), then the A12 effect size test [3] (to demonstrate that the observed delta is bigger than a “small effect”).

Figure 8 shows our text mining results. The treatments of Figure 8 divide into several groups. The first group, Group1, does no hyperparameter optimization. This group includes LDA-FFT and LDA-SVM; it uses LDA to vectorize the data, then applies either the FFtree or the SVM classifier. Here, we use LDA-SVM since that was found useful in prior studies [24]. Also, we use LDA-FFT since this is analogous to the defect prediction study discussed above.

The second group, Group2, tunes the pre-processor but not the learner. This group includes LDADE-FFT, LDADE-SVM, and LDA-GA-SVM. In this group, LDA is used for vectorization and is tuned by either DE (as done by [1]) or a genetic algorithm (as done by [39]). For the GA, we used the same control parameters as Panichella et al. in their text mining work [39].

The third group, Group3, which contains DODGE() and RANDOM, tunes both the pre-processor and the learner. RANDOM is included just for completeness. As to DODGE(), we use N = 30 samples and ε = 0.2. In those results, DODGE() was free to apply any learner, pre-processing, or vectorization procedure of Table I.

As seen in Figure 8, in only one case is DODGE()’s performance worse than anything else (where “worse” is defined as per §2.3.2). The LDA-FFT results for PitsF look a little better than DODGE(), but that difference was deemed insignificant by our statistical tests. And, just as with the Figure 7 results, when DODGE() fails, it is beaten by a treatment that uses FFtrees (see the PitsB LDA-FFT results). That is, in 100% of these results, methods that explore only a little of the results space do no worse than methods that try to extensively explore the space of tuning options (e.g., genetic algorithms and differential evolution).

In summary, our answer to RQ3 is that DODGE() is not just a defect prediction method. Its success with text mining makes it an interesting candidate for further experimentation with other SE tasks.

5 External Validity

DODGE() self-selects the tunings used in the pre-processors and data miners. Hence, by its very nature, this article avoids one threat to external validity (i.e., that important control parameter settings are not explored).

This paper reports results from two tasks (defect prediction and text mining) to show that the same effect holds in both; i.e., algorithms can be remarkably effective when they assume that the output space divides into a very small number of regions. Most software analytics papers report results from only one such task; i.e., either defect prediction or text mining. In that sense, the external validity of this paper is greater than that of most analytics papers.

On the other hand, this paper only reports results from two tasks. There are many more kinds of SE tasks that should be explored before it can be conclusively stated that DODGE() is widely applicable and useful.

Another threat to external validity is that this article compares DODGE() against existing baselines for hyperparameter optimization in the SE analytics literature. We do not compare our new approach against the kinds of optimizers we might find in the search-based SE literature [41]. There are two reasons for this. Firstly, search-based SE methods are typically CPU intensive and so do not address our simplicity goal. Secondly, the main point of this article is to document a previously unobserved feature of the output space of software analytics. It is an open question whether or not DODGE() is the best way to explore that output space. In order to motivate the community to explore it, some article must demonstrate its existence and offer baseline results showing that, with knowledge of the output space, it is possible to do better than past work. Hence, this article.

6 Conclusion

This article has discussed ways to reduce the CPU cost associated with hyperparameter optimization for software analytics. Tools like FFtrees or DODGE() were shown to work as well, or better, than numerous recent SE results. As stated in the introduction, we assert that other methods perform worse than DODGE() since they do not appreciate the simplicity of the output space. Hence, those other methods waste much CPU as they struggle to cover billions of tuning options like Table I (most of which yield indistinguishably different results).

Generalizing from our results, perhaps it is time for a new characterization of software analytics:

Software analytics is that branch of machine learning that studies problems with large outputs.

This new characterization is interesting since it means that machine learning algorithms developed in the AI community might not apply to SE. We suspect that understanding SE is a fundamentally different problem from understanding other problems that are more precisely controlled and restrained. Perhaps it is time to design new machine learning algorithms (like DODGE()) that are better suited to large SE problems. As shown in this article, such new algorithms can exploit the peculiarities of SE data to dramatically simplify software analytics.

We hope that this article inspires much future work on a next generation of SE data miners. For example, tools like DODGE() need to be applied to more SE tasks to check the external validity of these results. Another useful extension to this work would be to explore problems with three or more goals (e.g., reduce false alarms while at the same time improving precision and recall).

Also, there are many ways that DODGE() might be improved. For example, right now we only deprecate tunings that lead to similar results. Another approach would be to also deprecate tunings that lead to similar and worse results (perhaps to rule out larger parts of the output space, sooner). Further, for pragmatic reasons, it would be useful if the Table I list could be reduced to a smaller, faster-to-run set of learners. That is, here we would select learners that run fastest while generating the most varied kinds of models.

Acknowledgements

This work was partially funded by an NSF Grant #1703487.

References

  • [1] A. Agrawal, W. Fu, and T. Menzies, “What is wrong with topic modeling? and how to fix it using search-based software engineering,” Information and Software Technology, 2018.
  • [2] A. Agrawal and T. Menzies, “Is better data better than better data miners?: on the benefits of tuning smote for defect prediction,” in International Conference on Software Engineering, 2018.
  • [3] A. Arcuri and L. Briand, “A practical guide for using statistical tests to assess randomized algorithms in software engineering,” in International Conference on Software Engineering, 2011.
  • [4] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The oracle problem in software testing: A survey,” IEEE transactions on software engineering, 2015.
  • [5] S. Bird, “Nltk: the natural language toolkit,” in Proceedings of the COLING/ACL on Interactive presentation sessions, 2006.
  • [6] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” in the Journal of machine Learning research, 2003.
  • [7] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: Synthetic minority over-sampling technique,” J. Artif. Int. Res., 2002.
  • [8] D. Chen, W. Fu, R. Krishna, and T. Menzies, “Applications of psychological science for actionable analytics,” Foundations of Software Engineering, 2018.
  • [9] J. Cohen, “Statistical power analysis for the behavioral sciences. 1988, hillsdale, nj: L,” Lawrence Earlbaum Associates, vol. 2, 1988.
  • [10] C. Cortes and V. Vapnik, “Support-vector networks,” in Machine Learning, 1995, pp. 273–297.
  • [11] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap.   New York: Chapman & Hall, 1993.
  • [12] R.-S. Feldman, J, The Text Mining Handbook.   New York: Cambridge University Press, 2006.
  • [13] D. Fisher, R. DeLine, M. Czerwinski, and S. Drucker, “Interactions with big data analytics,” ACM interactions, 2012.
  • [14] W. Fu and T. Menzies, “Easy over hard: A case study on deep learning,” in Foundations of Software Engineering. ACM, 2017.
  • [15] W. Fu, T. Menzies, and X. Shen, “Tuning for software analytics: Is it really necessary?” Information and Software Technology, 2016.
  • [16] W. Fu, V. Nair, and T. Menzies, “Why is differential evolution better than grid search for tuning defect predictors?” arXiv preprint arXiv:1609.02613, 2016.
  • [17] B. Ghotra, S. McIntosh, and A. E. Hassan, “Revisiting the impact of classification techniques on the performance of defect prediction models,” in International Conference on Software Engineering, 2015.
  • [18] D. E. Goldberg, Genetic algorithms.   Pearson Education India, 2006.
  • [19] M. Hamill and K. Goseva-Popstojanova, “Common trends in software fault and failure data,” IEEE Transactions on Software Engineering, 2009.
  • [20] Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, and N. Ubayashi, “A large-scale empirical study of just-in-time quality assurance,” IEEE Transactions on Software Engineering, 2013.
  • [21] M. Kim, J. Nam, J. Yeon, S. Choi, and S. Kim, “Remi: defect prediction for efficient api testing,” in Foundations of Software Engineering.   ACM, 2015.
  • [22] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” science, vol. 220, no. 4598, pp. 671–680, 1983.
  • [23] A. G. Koru, D. Zhang, K. El Emam, and H. Liu, “An investigation into the functional form of the size-defect relationship for software modules,” IEEE Transactions on Software Engineering, 2009.
  • [24] R. Krishna, Z. Yu, A. Agrawal, M. Dominguez, and D. Wolf, “The ”bigse” project: Lessons learned from validating industrial text mining,” in Proceedings of the 2Nd International Workshop on BIG Data Software Engineering, ser. BIGDSE ’16, 2016, pp. 65–71.
  • [25] Y. Liu, T. M. Khoshgoftaar, and N. Seliya, “Evolutionary optimization of software quality modeling with multiple repositories,” IEEE Transactions on Software Engineering, 2010.
  • [26] S. Majumder, N. Balaji, K. Brey, W. Fu, and T. Menzies, “500+ times faster than deep learning (A case study exploring faster methods for text mining stackoverflow),” in Mining Software Repository, 2018.
  • [27] L. Martignon, K. V. Katsikopoulos, and J. K. Woike, “Categorization with limited resources: A family of simple heuristics,” Journal of Mathematical Psychology, vol. 52, no. 6, pp. 352–361, 2008.
  • [28] T. Menzies, “Improving iv&v techniques through the analysis of project anomalies: Text mining pits issue reports-final report,” Citeseer, 2008.
  • [29] T. Menzies, A. Dekhtyar, J. Distefano, and J. Greenwald, “Problems with precision: A response to ”comments on ’data mining static code attributes to learn defect predictors’”,” IEEE Transactions of Software Engineering, 2007.
  • [30] T. Menzies, J. Greenwald, and A. Frank, “Data mining static code attributes to learn defect predictors,” IEEE Transactions on Software Engineering, vol. 33, no. 1, 2007.
  • [31] T. Menzies and A. Marcus, “Automated severity assessment of software defect reports,” in International Conference on Software Maintenance.   IEEE, 2008.
  • [32] A. T. Misirli, A. Bener, and R. Kale, “Ai-based software defect predictors: Applications and benefits in a case study,” AI Magazine, vol. 32, no. 2, pp. 57–68, 2011.
  • [33] A. Monden, T. Hayashi, S. Shinoda, K. Shirai, J. Yoshida, M. Barker, and K. Matsumoto, “Assessing the cost effectiveness of fault prediction in acceptance testing,” IEEE Transactions on Software Engineering, vol. 39, no. 10, pp. 1345–1357, 2013.
  • [34] G. J. Myers, C. Sandler, and T. Badgett, The art of software testing.   John Wiley & Sons, 2011.
  • [35] A. Nadkarni and N. Yezhkova, “Structured versus unstructured data: The balance of power continues to shift,” IDC (Industry Development and Models) Mar, 2014.
  • [36] A. L. Oliveira, P. L. Braga, R. M. Lima, and M. L. Cornélio, “Ga-based method for feature selection and parameters optimization for machine learning regression applied to software effort estimation,” Information and Software Technology Journal, 2010.
  • [37] A. Orso and G. Rothermel, “Software testing: a research travelogue (2000–2014),” in Future of Software Engineering.   ACM, 2014.
  • [38] T. J. Ostrand, E. J. Weyuker, and R. M. Bell, “Where the bugs are,” in ACM SIGSOFT Software Engineering Notes.   ACM, 2004.
  • [39] A. Panichella, B. Dit, R. Oliveto, M. Di Penta, D. Poshyvanyk, and A. De Lucia, “How to effectively use topic models for software engineering tasks? an approach based on genetic algorithms,” in International Conference on Software Engineering, 2013.
  • [40] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,” Journal of machine learning research, 2011.
  • [41] J. Petke and T. Menzies, “Guest editorial for the special section from the 9th international symposium on search based software engineering,” Information and Software Technology, 2018.
  • [42] N. D. Phillips, H. Neth, J. K. Woike, and W. Gaissmaier, “Fftrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees,” Judgment and Decision Making, vol. 12, no. 4, p. 344, 2017.
  • [43] M. Porter, “The Porter Stemming Algorithm,” pp. 130–137, 1980. [Online]. Available: http://tartarus.org/martin/PorterStemmer/
  • [44] F. Rahman, S. Khatri, E. T. Barr, and P. Devanbu, “Comparing static bug finders and statistical prediction,” in International Conference on Software Engineering.   ACM, 2014.
  • [45] F. Sarro, S. Di Martino, F. Ferrucci, and C. Gravino, “A further analysis on the use of genetic algorithm to configure support vector machines for inter-release fault prediction,” in Symposium on applied computing.   ACM, 2012.
  • [46] R. Storn and K. Price, “Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces,” Journal of global optimization, vol. 11, no. 4, pp. 341–359, 1997.
  • [47] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, “Automated parameter optimization of classification techniques for defect prediction models,” in International Conference on Software Engineering.   IEEE, 2016.
  • [48] C. Treude and M. Wagner, “Per-corpus configuration of topic modelling for github and stack overflow collections,” arXiv preprint arXiv:1804.04749, 2018.
  • [49] H. Tu and V. Nair, “Is one hyperparameter optimizer enough?” in ACM SIGSOFT International Workshop on Software Analytics, 2018.
  • [50] J. Vesterstrøm and R. Thomsen, “A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems,” in Congress on Evolutionary Computation. IEEE, 2004.
  • [51] Z. Wan, X. Xia, A. E. Hassan, D. Lo, J. Yin, and X. Yang, “Perceptions, expectations, and challenges in defect prediction,” IEEE Transactions on Software Engineering, pp. 1–1, 2018.
  • [52] T. Wang, M. Harman, Y. Jia, and J. Krinke, “Searching for better configurations: a rigorous approach to clone evaluation,” in Foundations of Software Engineering.   ACM, 2013.
  • [53] C. White, “Consolidating, accessing and analyzing unstructured data,” 2005, http://www.b-eye-network.com/view/2098.
  • [54] T. Xia, R. Krishna, J. Chen, G. Mathew, X. Shen, and T. Menzies, “Hyperparameter optimization for effort estimation,” arXiv preprint arXiv:1805.00336, 2018.
  • [55] B. Xu, D. Ye, Z. Xing, X. Xia, G. Chen, and S. Li, “Predicting semantically linkable knowledge in developer online forums via convolutional neural network,” in International Conference on Automated Software Engineering. ACM, 2016.
  • [56] Y. Yang, Y. Zhou, J. Liu, Y. Zhao, H. Lu, L. Xu, B. Xu, and H. Leung, “Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models,” in Foundations of Software Engineering.   ACM, 2016.
  • [57] S. Yoo and M. Harman, “Regression testing minimization, selection and prioritization: a survey,” Software Testing, Verification and Reliability, vol. 22, no. 2, pp. 67–120, 2012.
  • [58] S. Zhong, T. M. Khoshgoftaar, and N. Seliya, “Analyzing software measurement data with clustering techniques,” IEEE Intelligent Systems, vol. 19, no. 2, pp. 20–27, 2004.