Applications of Psychological Science for Actionable Analytics

03/13/2018 ∙ by Di Chen, et al. ∙ NC State University

Actionable analytics are those that humans can understand and operationalize. What kind of data mining models generate such actionable analytics? According to psychological scientists, humans understand models that most match their own internal models, which they characterize as lists of "heuristics" (i.e., lists of very succinct rules). One such heuristic rule generator is the Fast-and-Frugal Trees (FFT) preferred by psychological scientists. Despite their successful use in many applied domains, FFTs have not been applied in software analytics. Accordingly, this paper assesses FFTs for software analytics. We find that FFTs are remarkably effective. Their models are very succinct (5 lines or less describing a binary decision tree). These succinct models outperform state-of-the-art defect prediction algorithms defined by Ghotra et al. at ICSE'15. Also, when we restrict training data to operational attributes (i.e., those attributes that are frequently changed by developers), FFTs perform much better than standard learners. Our conclusions are two-fold. Firstly, there is much that the software analytics community could learn from psychological science. Secondly, proponents of complex methods should always baseline those methods against simpler alternatives. For example, FFTs could be used as a standard baseline learner against which other software analytics tools are compared.


1. Introduction

Data mining tools have been applied to many applications in Software Engineering (SE). For example, they have been used to estimate how long it will take to integrate new code into an existing project (Czerwonka et al., 2011), where defects are most likely to occur (Ostrand et al., 2004; Menzies et al., 2007a), or how long it will take to develop a project (Turhan et al., 2011; Kocaguneli et al., 2012). Large organizations like Microsoft routinely practice data-driven policy development where organizational policies are learned from an extensive analysis of large datasets (Begel and Zimmermann, 2014; Theisen et al., 2015).

Despite these successes, there exist some drawbacks with current software analytics tools. At a recent workshop on “Actionable Analytics” at ASE’15, business users were very vocal in their complaints about analytics (Hihn and Menzies, 2015), saying that analytics rarely produce models that business users can understand or operationalize.

Accordingly, this paper explores methods for generating actionable analytics for:


  • Software defect prediction;

  • Predicting close time for Github issues.

There are many ways to define “actionable” but at the very least, we say that something is actionable if people can read and use the models it generates. Hence, for this paper, we assume:

Actionable = Comprehensible + Operational.

We show here that many algorithms used in software analytics generate models that are not actionable. Further, a data mining algorithm taken from psychological science (Czerlinski et al., 1999; Gigerenzer et al., 1999; Martignon et al., 2003; Brighton, 2006; Martignon et al., 2008; Gigerenzer, 2008; Gigerenzer and Gaissmaier, 2011; Neth and Gigerenzer, 2015), called Fast-and-Frugal Trees (FFTs), is very actionable. (The reader might be aware that FFT is also an acronym for “Fast Fourier Transform”. Apparently, the psychological science community was unaware of that acronym when they named this algorithm.)

Note that demanding that analytics be actionable also imposes certain restrictions on (a) the kinds of models that can be generated and (b) the data used to build the models.


  • Drawing on psychological science, we say an automatically generated model is comprehensible if:


    • The model matches the models used internally by humans; i.e., it comprises small rules.

    • Further, for expert-level comprehension, the rules should quickly lead to decisions (thus freeing up memory for other tasks).

    For more on this point, see Section 2.2.

  • As to operational, we show in the historical log of software projects that only a few of the measurable project attributes are often changed by developers. For a data mining algorithm to be operational, it must generate effective models even if restricted to using just those changed attributes.

Using three research questions, this paper tests if these restrictions damage our ability to build useful models.

RQ1: Do FFT models perform worse than the current state-of-the-art? We will find that:

For defect prediction, FFTs out-perform the state-of-the-art. When compared to state-of-the-art defect prediction algorithms surveyed by Ghotra et al. (Ghotra et al., 2015), FFTs are more effective (where “effective” is measured in terms of a recall/false alarm metric or the Popt(20) metric defined in §3.4).

RQ2: Are FFTs more operational than the current state-of-the-art? This research question tests what happens when we learn from less data; i.e., if we demand our models avoid using attributes that are rarely changed by developers. We show that:

When learning from less data, FFT performance is more stable than that of other learners. When data is restricted to attributes that developers often change, FFT performance changes only slightly, while the performance of some other learners can vary by alarmingly large amounts.

The observed superior performance of FFT raises the question:

RQ3: Why do FFTs work so well? Our answer to this question will be somewhat technical but, in summary, we will say:

FFTs match the structure of SE data: SE data divides into a few regions with very different properties, and FFTs are a good way to explore such data spaces.

In summary, the contributions of this paper are:


  • A novel inter-disciplinary contribution of the application of psychological science to software analytics.

  • A cautionary tale that, for software analytics, more complex learners can perform worse.

  • A warning that many current results in software analytics make the, possibly unwarranted, assumption that merely because an attribute is observable, it should be used in a model.

  • Three tests for “actionable analytics”: (a) Does a data mining algorithm produce succinct models? (b) Do those succinct models perform as well as, or better than, more complex methods? (c) If the data mining algorithm is restricted to just the few attributes that developers actually change, does the resulting model perform satisfactorily?

  • A demonstration that the restraints demanded by actionable analytics (very simple models, access to less data) need not result in models with poor performance.

  • A new, very simple baseline data mining method (FFTs) against which more complex methods can be compared.

  • A reproduction package containing all the data and algorithms of this paper, see http://url_blinded_for_review.

The rest of this paper is structured as follows. In Section 2, we introduce the concepts of “operational” and “comprehensible” as preliminaries. Our data, experimental settings, and evaluation measures are described in Section 3. In Section 4, we show our results and answer the research questions. Threats to the validity of our work are given in Section 5. In Section 6, we conclude this paper with the following:


  • There is much the software analytics community could learn from psychological science.

  • Proponents of complex methods should always baseline those methods against simpler alternatives.

Finally, we discuss future work.

2. Preliminaries

2.1. Operational

This paper assumes that for a data mining algorithm to be operational, it must generate effective models even if restricted to using just those attributes which, in practice, developers actually change. We have two reasons for making that assumption.

Figure 1. Only some metrics change between versions i and i+1 of a software system. For definitions of the metrics on the x-axis, see Table 1. To create this plot, we studied the 26 versions of the ten datasets in Table 3. For each metric m, we first initialize a pair counter n = 0 and a change counter changed(m) = 0; then, for all pairs of versions i and i+1 from the same data set, we (a) increment n by one; (b) collect the distributions of metric m seen in versions i and i+1 of the software; (c) check if those two distributions are different; and if so, (d) add one to changed(m). Afterwards, the y-axis of this plot is computed as changed(m)/n.

Firstly, this definition of operational can make a model much more acceptable to developers. If a model says that, say, a high value of some attribute “x” leads to defective code, then developers will ask for guidance on how to reduce “x” (in order to reduce the chances of defects). If we define “operational” as per this article, then it is a very simple matter to offer that developer numerous examples, from their own project’s historical log, of how “x” was changed.

Secondly, as shown in Figure 1 there exist attributes that are usually not changed from one version to the next. Figure 1 is important since, as shown in our RQ2 results, when we restrict model construction to just the 25% most frequently changed attributes, this can dramatically change the behavior of some data mining algorithms (but not FFTs).

Technical aside: in Figure 1, we defined “changed” using the A12 test (Vargha and Delaney, 2000), which declares two distributions different if they differ by more than a small effect. An ICSE’11 article (Arcuri and Briand, 2011) endorsed the use of A12 due to its non-parametric nature: it avoids any possibly incorrect Gaussian assumptions about the data.
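To make this procedure concrete, the following sketch (ours, not the authors’ exact rig) computes the Figure 1 y-axis for each metric, declaring a distribution “changed” when the Vargha-Delaney A12 effect size exceeds a small effect. The 0.6 cutoff is an assumption on our part; the text only says “more than a small effect”.

  from collections import defaultdict

  def a12(xs, ys):
      """Vargha-Delaney A12: probability that a value drawn from xs exceeds one from ys."""
      gt = eq = 0
      for x in xs:
          for y in ys:
              if x > y:
                  gt += 1
              elif x == y:
                  eq += 1
      return (gt + 0.5 * eq) / (len(xs) * len(ys))

  def changed(xs, ys, threshold=0.6):
      """True if the two distributions differ by more than a small effect (assumed cutoff)."""
      return abs(a12(xs, ys) - 0.5) > (threshold - 0.5)

  def change_frequency(versions, metrics):
      """versions: list of {metric: [values, ...]} dicts, one per release, oldest first.
      Returns {metric: fraction of consecutive version pairs whose distribution changed},
      i.e., the y-axis of Figure 1."""
      n, counts = 0, defaultdict(int)
      for old, new in zip(versions, versions[1:]):
          n += 1
          for m in metrics:
              if changed(old[m], new[m]):
                  counts[m] += 1
      return {m: counts[m] / n for m in metrics}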

Metric Name Description
amc average method complexity Number of JAVA byte codes
avg_cc average McCabe Average McCabe’s cyclomatic complexity seen in class
ca afferent couplings How many other classes use the specific class.
cam cohesion amongst classes Summation of number of different types of method parameters in every method divided by a multiplication of number of different method parameter types in whole class and number of methods.
cbm coupling between methods Total number of new/redefined methods to which all the inherited methods are coupled
cbo coupling between objects Increased when the methods of one class access services of another.
ce efferent couplings How many other classes are used by the specific class.
dam data access Ratio of private (protected) attributes to total attributes
dit depth of inheritance tree It’s defined as the maximum length from the node to the root of the tree
ic inheritance coupling Number of parent classes to which a given class is coupled (includes counts of methods and variables inherited)
lcom lack of cohesion in methods Number of pairs of methods that do not share a reference to an instance variable.
lcom3 another lack of cohesion measure If m and a are the number of methods and attributes in a class, and mu(a_j) is the number of methods accessing attribute a_j, then lcom3 = ((1/a) * sum_j mu(a_j) - m) / (1 - m).
loc lines of code Total lines of code in this file or package.
max_cc Maximum McCabe maximum McCabe’s cyclomatic complexity seen in class
mfa functional abstraction Number of methods inherited by a class plus number of methods accessible by member methods of the class
moa aggregation Count of the number of data declarations (class fields) whose types are user defined classes
noc number of children Number of direct descendants (subclasses) for each class
npm number of public methods npm metric simply counts all the methods in a class that are declared as public.
rfc response for a class Number of methods invoked in response to a message to the object.
wmc weighted methods per class A class with more member functions than its peers is considered to be more complex and therefore more error prone

defect defect Boolean: whether defects were found in post-release bug-tracking systems.
Table 1. The C-K OO metrics studied in Figure 1. Note that the last line, ‘defect’, denotes the dependent variable.

2.2. Comprehensible

Why Demand Comprehensibility?

This paper assumes that better data mining algorithms are better at explaining their models to humans. But is that always the case?

The obvious counter-argument is that if no human ever needs to understand or audit a model, then that model does not need to be comprehensible. For example, a neural net could control the carburetor of an internal combustion engine, since that carburetor will never dispute the model or ask for clarification of any of its reasoning.

On the other hand, if a model is to be used to persuade software engineers to change what they are doing, it needs to be comprehensible so humans can debate the merits of its conclusions. Several researchers demand that software analytics models be expressed in a simple way that is easy for software practitioners to interpret (Menzies, 2014; Lipton, 2016; Dam et al., 2018). According to Kim et al. (Kim et al., 2016), software analytics aims to obtain actionable insights from software artifacts that help practitioners accomplish tasks related to software development, systems, and users. Other researchers (Tan and Chan, 2016) argue that for software vendors, managers, developers and users, such comprehensible insights are the core deliverable of software analytics. Sawyer comments that actionable insight is the key driver for businesses to invest in data analytics initiatives (Sawyer, 2013). Accordingly, much research focuses on generating simple models, or on making black-box models more explainable, so that human engineers can understand and appropriately trust the decisions made by software analytics models (Fu and Menzies, 2017; Abdollahi and Nasraoui, 2016).

If a model is not comprehensible, there are some explanation algorithms that might mitigate that problem. For example:


  • In secondary learning, the examples given to a neural network are used to train a rule-based learner, and that learner could be said to “explain” the neural net (Craven and Shavlik, 2014).

  • In contrast set learning for instance-based reasoning, data is clustered and users are shown the difference between a few exemplars selected from each cluster (Krishna and Menzies, 2015).

Such explanation facilities are post-processors to the original learning method. An alternative simpler approach would be to use learners that generate comprehensible models in the first place.

The next section of this paper discusses one such alternate approach for creating simple comprehensible models.

  if      cbo <= 4     then false # 0
  else if rfc >  32    then true  # 1
  else if dam >  0     then true  # 1
  else if amc <  32.25 then true  # 1
  else false                      # 0

  if      cbo    <   4    then true # 1
  else if max_cc <   3    then true # 1
  else if wmc    <  10    then true # 1
  else if rfc    <= 41.5  then true # 1
  else false                        # 0

  if      dam > 0 then false # 0
  else if noc > 0 then false # 0
  else if wmc > 5 then false # 0
  else if moa > 0 then false # 0
  else true                  # 1
Table 2. Three example FFTs.

Theories of Expert Comprehension

Psychological science argues that models comprising small rules are more comprehensible. This section outlines that argument.

Larkin et al. (Larkin et al., 1980) characterize human expertise in terms of a very small short-term memory, or STM (used as a temporary scratch pad for current observations), and a very large long-term memory, or LTM. The LTM holds separate tiny rule fragments that explore the contents of STM to say “when you see THIS, do THAT”. When an LTM rule triggers, its consequence can rewrite STM contents which, in turn, can trigger other rules.

Short-term memory is very small, perhaps even as small as four to seven items (Miller, 1956; Cowan, 2001). (Recently, Ma et al. (Ma et al., 2014) used evidence from neuroscience and functional MRIs to argue that STM capacity might be better measured using factors other than “number of items”. But even they conceded that “the concept of a limited (STM) has considerable explanatory power for behavioral data”.) Experts are experts, says Larkin et al. (Larkin et al., 1980), because the patterns in their LTM dictate what to do, without needing to pause for reflection. Novices perform worse than experts, says Larkin et al., when they fill up their STM with too many to-do’s where they plan to pause and reflect on what to do next. Since experts post far fewer to-do’s in their STMs, they complete their tasks faster because (a) they are less encumbered by excessive reflection and (b) there is more space in their STM to reason about new information. While first proposed in 1980, this STM/LTM theory still remains relevant (Ma et al., 2014). This theory can be used to explain both expert competency and incompetency in software engineering tasks such as understanding code (Wiedenbeck et al., 1993).

Phillips et al. (Phillips et al., 2017) discuss how models containing tiny rule fragments can be quickly comprehended by doctors in emergency rooms making rapid decisions; or by soldiers on guard making snap decisions about whether to fire or not on a potential enemy; or by stockbrokers making instant decisions about buying or selling stock. That is, according to this psychological science theory (Czerlinski et al., 1999; Gigerenzer et al., 1999; Martignon et al., 2003; Brighton, 2006; Martignon et al., 2008; Gigerenzer, 2008; Phillips et al., 2017; Gigerenzer and Gaissmaier, 2011; Neth and Gigerenzer, 2015), humans best understand a model:


  • When they can “fit” it into their LTM; i.e., when that model comprises many small rule fragments;

  • Further, expert-level comprehension of some domain means having rules that very quickly lead to decisions, without clogging up memory.

Psychological scientists have developed FFTs as one way to generate comprehensible models consisting of separate tiny rules (Phillips et al., 2017; Gigerenzer, 2008; Martignon et al., 2008). An FFT is a decision tree with exactly two branches extending from each node, where either one or both branches is an exit branch leading to a leaf (Martignon et al., 2008). That is to say, in an FFT, every question posed by a node can trigger an immediate decision (so humans can read every leaf node as a separate rule).

For example, Table 2 (at left) is an FFT generated from the log4j JAVA system of Table 3. The goal of this tree is to classify a software module as “defective=true” or “defective=false”. The four nodes in this FFT reference four static code attributes: cbo, rfc, dam, and amc (these metrics are defined in Table 1).

FFT is a binary classification algorithm. To apply such classifiers to multi-class problems: (a) build one FFT for each class (classX versus not classX); (b) run all FFTs on the test example; then (c) select the conclusion with the most support (number of rows).

An FFT of depth d has a choice of two “exit policies” at each level: the exiting branch can select for the negation of the target (denoted “0”) or the target (denoted “1”). The left-hand-side log4j tree in Table 2 is hence a 01110 tree (written out as code after the list below) since:


  • The first level exits to the negation of the target: hence, “0”.

  • The next three tree levels exit to the target; hence, “111”.

  • And the final line of the model exits to the opposite of the penultimate line; hence, the final “0”.
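To make that reading concrete, here is the left-hand 01110 tree of Table 2 written as a small Python predicate (a sketch of ours; the thresholds are taken directly from Table 2):

  def log4j_fft_01110(cbo, rfc, dam, amc):
      """The left-hand FFT of Table 2: every line is an exit, i.e., a separate rule."""
      if cbo <= 4:
          return False     # level 1 exits to "not defective" (0)
      if rfc > 32:
          return True      # level 2 exits to "defective" (1)
      if dam > 0:
          return True      # level 3 exits to "defective" (1)
      if amc < 32.25:
          return True      # level 4 exits to "defective" (1)
      return False         # final default: opposite of level 4 (0)

  # For example, a class with cbo=10, rfc=50, dam=0, amc=40 is predicted defective:
  print(log4j_fft_01110(cbo=10, rfc=50, dam=0, amc=40))   # True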

To build one FFT, select a maximum depth d, then follow the steps described in Table 4 (a code sketch of these steps appears after the table).

Data Set    Training Versions    Training Cases    Test Version    Test Cases    % Defective
jedit 3.2, 4.0, 4.1, 4.2 1257 4.3 492 2
ivy 1.1, 1.4 352 2.0 352 11
camel 1.0, 1.2, 1.4 1819 1.6 965 19
synapse 1.0, 1.1 379 1.2 256 34
velocity 1.4, 1.5 410 1.6 229 34
lucene 2.0, 2.2 442 2.4 340 59
poi 1.5, 2, 2.5 936 3.0 442 64
xerces 1.0, 1.2, 1.3 1055 1.4 588 74
log4j 1.0, 1.1 244 1.2 205 92
xalan 2.4, 2.5, 2.6 2411 2.7 909 99
Table 3. Some open-source JAVA systems. Used for training and testing showing different details for each. All data available on-line at http://tiny.cc/seacraft.
(1) First, discretize all attributes; e.g., split numerics on their median value.
(2) For each discretized range, find the rows it selects in the training data. Using those rows, score each range with some user-supplied function, e.g., recall, false alarm, or the Popt(20) measure defined in §3.4.
(3) Divide the data on the best range.
(4) If the exit policy at this level is (0, 1), then exit to (false, true) using the range that scores highest assuming that the target class is (false, true), respectively.
(5) If the current level is at the maximum depth d, add one last exit node predicting the opposite of step 4, then terminate.
(6) Else, take the data selected by the non-exit range and go to step 1 to build the next level of the tree.
Table 4. Steps for building FFTs
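The following is a minimal sketch (ours, not the authors’ reference implementation) of the Table 4 procedure for building a single FFT given one exit policy. The score(rows, predicts_true) signature, the dictionary representation of rows, and the tie-breaking details are our assumptions.

  import statistics

  def build_fft(rows, attributes, policy, score, depth=4):
      """Build one FFT for an exit policy such as "0111", following Table 4.
      rows: list of dicts holding attribute values plus a boolean "target" field.
      score(selected_rows, predicts_true): user-supplied, higher is better.
      Returns a list of (attribute, op, cutoff, prediction) rules; the last rule
      is the default (its attribute is None)."""
      tree, data = [], list(rows)
      for level in range(depth):
          exit_to_true = policy[level] == "1"
          best = None
          for attr in attributes:
              cutoff = statistics.median(r[attr] for r in data)       # step (1)
              for op in ("<=", ">"):                                  # two ranges per split
                  selected = [r for r in data
                              if (r[attr] <= cutoff) == (op == "<=")]
                  if not selected:
                      continue
                  s = score(selected, exit_to_true)                   # step (2)
                  if best is None or s > best[0]:
                      best = (s, attr, op, cutoff, selected)          # step (3)
          _, attr, op, cutoff, selected = best
          tree.append((attr, op, cutoff, exit_to_true))               # step (4): exit rule
          data = [r for r in data if r not in selected]               # step (6)
          if not data:
              break
      tree.append((None, None, None, policy[depth - 1] != "1"))       # step (5): default
      return tree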

For trees of depth d = 4, there are 2^4 = 16 possible trees, which we denote 00001, 00010, 00101, …, 11110. Here, the first four digits denote the 16 exit policies and the last digit denotes the last line of the model (which makes the opposite conclusion to the line above). For example:


  • A “00001” tree does all it can to avoid the target class. Only after clearing away all the non-defective examples it can at levels one, two, three, and four does it make a final “true” conclusion. Table 2 (right) shows the log4j 00001 tree. Note that all the exits, except the last, are to “false”.

  • As to “11110” trees, these fixate on finding the target. Table 2 (center) shows the log4j 11110 tree. Note that all the exits, except the last, are to “true”.

During FFT training, we generate all 16 trees then, using the score predicate defined in §3.4, select the best one (using the training data). This single best tree is then applied to the test data.
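Reusing the build_fft sketch above, that selection step can be sketched as follows; evaluate(tree, rows) stands for whichever score function (e.g., dis2heaven or Popt(20)) is applied to the training data, and is our assumed interface rather than the paper’s exact code.

  from itertools import product

  def train_fft(rows, attributes, score, evaluate, depth=4):
      """Build one FFT per exit policy (2**depth of them) and keep the tree whose
      training-data evaluation is best; that single tree is later applied to test data."""
      best_tree, best_val = None, None
      for bits in product("01", repeat=depth):
          tree = build_fft(rows, attributes, "".join(bits), score, depth)
          val = evaluate(tree, rows)    # assumed helper: apply the tree to rows, score the result
          if best_val is None or val > best_val:
              best_tree, best_val = tree, val
      return best_tree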

Following the advice of Phillips et al. (Phillips et al., 2017), for all the experiments of this paper, we use a depth of d = 4. Note that FFTs of such small depths are very succinct (see the above examples). Many other data mining algorithms used in software analytics are far less succinct and far less comprehensible (see Table 5).

For very high dimensional data, there is some evidence that complex deep learning algorithms have advantages for software engineering applications (Yang et al., 2015; White et al., 2015; Gu et al., 2016). However, since they do not readily support explainability, they have been criticized as “data mining alchemy” (Synced, 2017).

Support vector machines and principal component methods achieve their results after synthesizing new dimensions which are totally unfamiliar to human users (Menzies et al., 2009).

Other methods that are heavily based on mathematics can be hard to explain to most users. For example, in our experience, it is hard for users to determine the minimal changes to a project that most affect defect-proneness just by browsing the internal frequency tables of a Naive Bayes classifier or the coefficients found via linear regression/logistic regression (Menzies et al., 2009).
When decision tree learners are many pages long, they are hard to browse and understand (Friedl and Brodley, 1997).
Random forests are even harder to understand than decision trees since the problems of reading one tree are multiplied many times over, once for each member of the forest (Liaw et al., 2002).
Instance-based methods do not compress their training data; instead, they produce conclusions by finding older exemplars closest to the new example. Hence, for such instance-based methods, it is hard to generalize and make a conclusion about what kinds of future projects might be (e.g.) most defect-prone (Aha et al., 1991).
Table 5. Comprehension issues with models generated by data mining algorithms used in software analytics.

The value of models such as FFTs comprising many small rules has been extensively studied:


  • These models use very few attributes from the data. Hence they tend to be robust against overfitting, especially on small and noisy data, and have been found to predict data at levels comparable with regression. See for example (Martignon et al., 2008; Woike et al., 2017; Czerlinski et al., 1999).

  • Other work has shown that these rule-based models can perform comparably well to more complex models in a range of domains e.g., public health, medical risk management, performance science, etc. (Jenny et al., 2013; Laskey and Martignon, 2014; Raab and Gigerenzer, 2015).

  • Neth and Gigerenzer argue that such rule-bases are tools that work well under conditions of uncertainty (Neth and Gigerenzer, 2015).

  • Brighton showed that rule-based models can perform better than complex nonlinear algorithms such as neural networks, exemplar models, and classification/regression trees (Brighton, 2006).

3. Methods

The use of models comprising many small rules has not been explored in the software analytics literature. This section describes the methods used by this paper to assess FFTs.

3.1. Data

3.1.1. Defect Data:

To assess the FFTs, we perform our experiments using the publicly available SEACRAFT data (Jureczko and Madeyski, 2010), gathered by Jureczko et al. for object-oriented JAVA systems (Jureczko and Madeyski, 2010). The “Jureczko” data records the number of known defects for each class using a post-release defect tracking system. The classes are described in terms of nearly two dozen metrics such as number of children (noc), lines of code (loc), etc (see Table 1). For details on the Jureczko data, see Table 3. The nature of collected data and its relevance to defect prediction is discussed in greater detail by Madeyski & Jureczko (Madeyski and Jureczko, 2015).

We selected these data sets since they have at least three consecutive releases (where release i+1 was built after release i). This is important for our experimental rig (see Section 3.2).

Commit: nCommitsByActorsT, nCommitsByCreator, nCommitsByUniqueActorsT, nCommitsInProject, nCommitsProjectT
Comment: meanCommentSizeT, nComments
Issue: issueCleanedBodyLen, nIssuesByCreator, nIssuesByCreatorClosed, nIssuesCreatedInProject, nIssuesCreatedInProjectClosed, nIssuesCreatedProjectClosedT, nIssuesCreatedProjectT
Misc.: nActors, nLabels, nSubscribedByT
Table 6. Metrics used in issue lifetimes data

3.1.2. Issue Lifetime Data:

This paper will conclude that FFTs are remarkably effective. To check the external validity of that conclusion, we will apply FFTs to another SE domain (Rahul Krishna, 2018; Rees-Jones et al., 2018). Our Github issue lifetime data (https://doi.org/10.5281/zenodo.197111) consists of 8 projects used to study issue lifetimes. In raw form, the data consisted of sets of JSON files for each repository, each file containing one type of data regarding the software repository (issues, commits, code contributors, changes to specific files), as shown in Table 6. In order to extract data specific to issue lifetime, we performed preprocessing and feature extraction on the raw datasets similar to that suggested by (Rees-Jones et al., 2018).

3.2. Experimental Rig

For the defect prediction data, we use the multiple versions of the software systems in Table 3. Using the training versions, we track which attributes change from version i to version i+1 (using the calculation shown in Figure 1). Then we build a model using either all the attributes or just the top 25% most changed attributes. Note that this implements our definition of “operational”, as discussed in our introduction.

After building a model, we use the latest version for testing and the older versions for training. In this way, we can assert that all our predictions use past data to predict the future.

For the issue lifetime data, we do not have access to multiple versions of the data. Hence, for this data we cannot perform the operational test. Instead, for that data, we conduct a 5*10 cross-validation experiment that ensures that the train and test sets are different. For that cross-validation, we divide the data into ten bins; then, for each bin i, we train on the other nine bins and test on bin i. To control for order effects (where the conclusions are altered by the order of the input examples) (Agrawal et al., 2018), this process is repeated five times, using different random orderings of the data.
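A minimal sketch of that cross-validation rig (our code; the fold-construction details are assumptions, since the paper only specifies ten bins and five random repeats):

  import random

  def cross_val_splits(rows, bins=10, repeats=5, seed=1):
      """Yield (train, test) pairs for a repeats*bins cross-validation: shuffle the
      rows, cut them into `bins` folds, and use each fold in turn as the test set."""
      rows = list(rows)
      rng = random.Random(seed)
      for _ in range(repeats):
          rng.shuffle(rows)                          # a new random ordering per repeat
          folds = [rows[i::bins] for i in range(bins)]
          for i in range(bins):
              test = folds[i]
              train = [r for j, fold in enumerate(folds) if j != i for r in fold]
              yield train, test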

3.3. Data Mining Algorithms

Overall Rank    Classification Technique    Median Rank    Average Rank    Standard Deviation
1
Rsub+J48, SL, Rsub+SL,
Bag+SL, LMT, RF+SL,
RF+J48, Bag+LMT,
Rsub+LMT, and RF+LMT
1.7 1.63 0.33
2
RBFs, Bag+J48, Ad+SL,
KNN, RF+NB, Ad+LMT,
NB, Rsub+NB, and Bag+NB
2.8 2.84 0.41
3
Ripper, EM, J48, Ad+NB,
Bag+SMO, Ad+J48,

Ad+SMO, and K-means

5.1 5.13 0.46
4
RF+SMO, Ridor, SMO,
and Rsub+SMO
6.5 6.45 0.25
Table 7. For the purposes of predicting software defects, Ghotra et al. (Ghotra et al., 2015) found that many learners have similar performance. Here are their four clusters of 32 data mining algorithms. For our work, we selected four learners at random, one from each cluster (see the underlined entries).

The results shown below compare FFTs to state-of-the-art algorithms from software analytics. For a list of state-of-the-art algorithms, we used the ICSE’15 paper from Ghotra et al. (Ghotra et al., 2015), which compared 32 classifiers for defect prediction. Their statistical analysis showed that the performance of these classifiers clustered into the four groups shown in Table 7. For our work, we selected one classifier at random from each of their clusters; i.e., Simple Logistic (SL), Naive Bayes (NB), Expectation Maximization (EM), and Sequential Minimal Optimization (SMO).

Simple Logistic and Naive Bayes fall into the 1st and 2nd ranking layers. They are both statistical techniques based on a probability model (Kotsiantis et al., 2007). These techniques are used to find patterns in datasets and build diverse predictive models (Berson et al., 2004). Simple Logistic is a generalized linear regression model that uses a logit function. Naive Bayes is a probability-based technique that assumes that all of the predictors are independent of each other.

Clustering techniques like EM divide the training data into small groups such that the similarity within groups is greater than across groups (Hammouda and Karray, 2000). Expectation Maximization (EM) (Fraley and Raftery, 2007) automatically splits a dataset into an (approximately) optimal number of clusters (Bettenburg et al., 2012).

Support Vector Machines (SVMs) use a hyperplane to separate two classes (i.e., defective or not). In this paper, following the results of Ghotra et al., we use the Sequential Minimal Optimization (SMO) SVM technique. SMO analytically solves the large Quadratic Programming (QP) optimization problem which occurs in SVM training by dividing it into a series of smaller QP sub-problems (Zeng et al., 2008).
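For readers wanting a comparable baseline, the four comparison learners can be approximated with off-the-shelf implementations. The sketch below uses scikit-learn stand-ins of our choosing (the original study used the Weka-style implementations surveyed by Ghotra et al.), so the specific classes and parameters are assumptions, not the paper’s configuration:

  from sklearn.linear_model import LogisticRegression   # stand-in for Simple Logistic (SL)
  from sklearn.naive_bayes import GaussianNB             # Naive Bayes (NB)
  from sklearn.mixture import GaussianMixture            # EM-style clustering (EM)
  from sklearn.svm import SVC                             # SVM trained with an SMO-type solver

  learners = {
      "SL":  LogisticRegression(max_iter=1000),
      "NB":  GaussianNB(),
      "EM":  GaussianMixture(n_components=2),   # cluster, then label clusters by majority class
      "SMO": SVC(kernel="linear"),
  }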

3.4. Evaluation Measures

Our rig assesses learned models using an evaluation function called score. For FFTs, this function is called three times:


  • Once to rank discretized ranges;

  • Then once again to select the best FFT out of the 16 trees generated during training.

  • Then, finally, score is used to evaluate what happens when that best FFT is applied to the test data.

For all the other learners, score is applied only on the test data. For this work, we use two measures: dis2heaven and Popt(20).

Ideally, a perfect learner will have perfect recall (100%) with no false alarms.

Recall = TP / (TP + FN)    (1)
FAR = FP / (FP + TN)    (2)
where TP, FN, FP, and TN denote true positives, false negatives, false positives, and true negatives, respectively.

We combine these two into a “distance to heaven” measure called dis2heaven that reports how far a learner falls away from the ideal point of Recall=1 and FAR=0:

dis2heaven = sqrt((1 - Recall)^2 + FAR^2) / sqrt(2)    (3)
Figure 2. Effort-based cumulative lift chart (Yang et al., 2016).

As to Popt(20), Ostrand et al. (Ostrand et al., 2005) report that their quality predictors can find 20% of the files that contain, on average, 80% of all defects in the project. Although there is nothing magical about the number 20%, it has been used as a cutoff value to set the effort required for defect inspection when evaluating defect learners (Kamei et al., 2013; Mende and Koschke, 2010; Monden et al., 2013; Yang et al., 2016). That is, Popt(20) reports how many defects can be detected by the learner after (a) the code is sorted by the learner from “most likely to be buggy” to “least likely”; and (b) humans inspect 20% of the code (measured in lines of code). This measure is widely used in the defect prediction literature (Kamei et al., 2013; Menzies et al., 2007c; Menzies et al., 2010; Monden et al., 2013; Yang et al., 2016; Zimmermann et al., 2007).

Popt is defined as 1 - Delta_opt, where Delta_opt is the area between the effort-based cumulative lift charts of the optimal model and the prediction model (as shown in Figure 2). In this chart, the x-axis is the percentage of effort required to inspect the code and the y-axis is the percentage of defects found in the selected code. In the optimal model, all the changes are sorted by the actual defect density in descending order, while for the predicted model, all the changes are sorted by the predicted value in descending order. According to (Kamei et al., 2013; Monden et al., 2013; Yang et al., 2016), Popt can be normalized as follows:

Popt(m) = 1 - (S(optimal) - S(m)) / (S(optimal) - S(worst))    (4)

where S(optimal), S(m), and S(worst) represent the area under the effort-based cumulative lift curve of the optimal model, the predicted model, and the worst model, respectively. The worst model is built by sorting all the changes according to the actual defect density in ascending order.

Note that for our two score functions (both are sketched in code after this list):


  • For dis2heaven, the lower values are better.

  • For Popt(20), higher values are better.
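The two measures can be sketched as follows (our reading of Equations 1–4; note that the paper’s Popt(20) applies a 20% inspection-effort cutoff, whereas this simplified sketch integrates over the whole lift curve):

  import math

  def dis2heaven(tp, fn, fp, tn):
      """Eq. (3): distance from (Recall, FAR) to the ideal point (1, 0), scaled to [0, 1]."""
      recall = tp / (tp + fn) if (tp + fn) else 0.0
      far = fp / (fp + tn) if (fp + tn) else 0.0
      return math.sqrt((1 - recall) ** 2 + far ** 2) / math.sqrt(2)

  def _lift_area(rows, key):
      """Area under the effort-based cumulative lift chart: x = fraction of LOC inspected,
      y = fraction of defects found, with modules sorted in descending order of `key`."""
      total_loc = sum(loc for loc, _, _ in rows) or 1
      total_bugs = sum(bugs for _, bugs, _ in rows) or 1
      area = y = 0.0
      for loc, bugs, _ in sorted(rows, key=key, reverse=True):
          dx, y_new = loc / total_loc, y + bugs / total_bugs
          area += dx * (y + y_new) / 2          # trapezoidal rule
          y = y_new
      return area

  def popt(rows):
      """Eq. (4): rows are (loc, actual_defects, predicted_score) triples.
      Popt = 1 - (S(optimal) - S(model)) / (S(optimal) - S(worst))."""
      density = lambda r: (r[1] / r[0]) if r[0] else 0.0
      s_model = _lift_area(rows, key=lambda r: r[2])          # sorted by prediction
      s_optimal = _lift_area(rows, key=density)                # actual density, descending
      s_worst = _lift_area(rows, key=lambda r: -density(r))    # actual density, ascending
      denom = s_optimal - s_worst
      return 1.0 - (s_optimal - s_model) / denom if denom else 0.0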

Figure 3. On the left, in the dis2heaven results, less is better. On the right, in the Popt(20) results, more is better. On both sides, the FFT results are better than those from state-of-the-art defect prediction algorithms (as defined by Ghotra et al. (Ghotra et al., 2015)).

4. Results

4.1. RQ1: Do FFT models perform worse than the current state-of-the-art?

Figure 3 compares the performance of FFT versus learners taken from Ghotra et al. In this figure, datasets are sorted left to right based on the FFT performance scores. With very few exceptions:


  • FFT’s dis2heaven results are lower, hence better, than those of the other learners.

  • FFT’s Popt(20) results are much higher, hence better, than those of the other learners.

Therefore our answer to RQ1 is:

For defect prediction, FFTs out-perform the state-of-the-art. When compared to state-of-the-art defect prediction algorithms surveyed by Ghotra et al., FFTs are more effective (where “effective” is measured in terms of a recall/false alarm metric or Popt(20)).
Figure 4. For each learner in Figure 3, this plot shows the difference between the results obtained using the top 25% or all (100%) of the attributes. For (dis2heaven, Popt(20)), values that are (lower, higher), respectively, are better. Note that all the 100% results were also shown in Figure 3.

Figure 5. Deltas between the results using 25% and 100% of the data, computed from Figure 4. Calculated such that larger values are better; i.e., for (dis2heaven, Popt(20)) we report (25%-100%, 100%-25%) since (lower, higher) values are better, respectively. All values for each learner are sorted independently.

4.2. RQ2: Are FFTs more operational than the current state-of-the-art?

Please recollect from before that a model is operational if its performance is not affected after avoiding attributes that are rarely changed by developers.

Figure 4 compares model performance when we learn from all 100% of the attributes or just the 25% most changed attributes. For this study, this 25% group (of most changed attributes) was computed separately for each data set. Note that:


  • The top row of Figure 4 shows the dis2heaven results;

  • The bottom row of Figure 4 shows the Popt(20) results.

Figure 5 reports the deltas in performance scores seen between using 25% and 100% of the data. These deltas are computed such that larger values are better; i.e., for (dis2heaven, Popt(20)) we report (25%-100%, 100%-25%) since (lower, higher) values are better, respectively.

There are several key features for these results:


  • The FFT’s red dots for dis2heaven are below the rest; also, FFT’s orange dots for Popt(20) are above the rest. This means that, regardless of whether we use all attributes or just the most changed attributes, the FFT results are nearly always better than those of the other methods.

  • As seen in Figure 5, the deltas between using all the data and just some of the data are smallest for FFTs and EM (the clustering algorithm). In the Popt(20) results, those deltas are very small indeed (the FFT and EM results lie right on the y-axis for most of that plot).

  • Also, as seen in Figure 5, the deltas for the other learners can be highly variable. While, for the most part, using just the 25% most changed attributes improves performance, SMO, SL, and NB all have large negative results for at least some of the data sets.

In summary, the learners studied here fall into three groups:

  1. Those that exhibited a wide performance variance after restricting the learning to just the frequently changed data (SL, NB, SMO), and those that did not (FFT, EM);

  2. Those with best performance across the two performance measures studied here (FFT), and the rest (SL, NB, EM, SMO);

  3. Those that generate tiny models (FFT), and the rest (SL, NB, EM, SMO).

Accordingly, FFT is the recommended learner since it both performs well and is unaffected by issues such as whether or not the data is restricted to just the most operational attributes. In summary:

When learning from less data, FFT performance is more stable than that of other learners. When data is restricted to attributes that developers often change, FFT performance changes only slightly, while the performance of some other learners can vary by alarmingly large amounts.
Best FFT exit policy    Total    25% D2H    25% Popt    100% D2H    100% Popt
00001 0 0 0 0 0
00010 0 0 0 0 0
00101 0 0 0 0 0
00110 0 0 0 0 0
01001 0 0 0 0 0
01010 0 0 0 0 0
01101 1 0 0 0 1
01110 0 0 0 0 0
10001 14 6 0 7 1
10010 8 4 2 2 0
10101 3 0 1 1 1
10110 5 0 3 0 2
11001 0 0 0 0 0
11010 3 0 1 0 2
11101 2 0 0 0 2
11110 4 0 3 0 1
Totals 40 10 10 10 10
Table 8. Frequency heatmap of the best exit policies seen for FFT and defect prediction.

4.3. RQ3: Why do FFTs work so well?

To explain the success of FFTs, recall that during training, FFTs explore all 16 candidate models, then select the model whose exit policy achieves the best performance (exit policies were introduced in Section 2.2). The exit policies selected by FFTs are like a trace of the reasoning jumping around the data. For example, a 11110 policy shows a model always jumping towards sections of the data containing the most defects. Also, a 00001 policy shows another model trying to jump away from defects until, in its last step, it makes one final jump towards defects. Table 8 shows what exit policies were seen in the experiments of the last section:


  • The 11110 policy was used sometimes.

  • A more common policy is 10001 which shows a tree first jumping to some low hanging fruit (see the first “1”), then jumping away from defects three times (see the next “000”) before a final jump into defects (see the last “1”).

  • That said, while 10001 was the most common, many other exit policies appear in Table 8. For example, the Popt(20) policies are particularly diverse.

Table 8 suggests that software data is “lumpy”; i.e., it divides into a few separate regions, each with different properties. Further, the number and importance of the “lumps” is specific to the data set and the goal criteria. In such a “lumpy” space, a learning policy like FFT works well since its exit policies let a learner discover how to best jump between the “lumps”. Other learners fail in this coarse-grained lumpy space when they:


  • Divide the data too much; e.g. like RandomForests, which finely divide the data multiple times down the branches of the trees and across multiple trees;

  • Fit some general model across all the different parts of the data; e.g. like simple logistic regression.

In summary, in answer to the question “why do FFTs work so well”, we reply:

FFTs match the structure of SE data: SE data divides into a few regions with very different properties, and FFTs are a good way to explore such data spaces.

5. Threats to Validity

5.1. Sampling Bias

This paper shares the same sampling bias problem as every other data mining paper. Sampling bias threatens any classification experiment; what matters in one case may or may not hold in another case. For example, even though we use 10 open-source datasets in this study which come from several sources, they were all supplied by individuals.

As researchers, we can adopt two tactics to reduce the sampling bias problem. First we can document our tools and methods, then post an executable reproduction package for all the experiments (that package for this paper is available at url_blind_for_review).

Secondly, when new data becomes available, we can test our methods on the new data. For example, Table 9 shows results where FFTs and four different state-of-the-art learners, i.e., Decision Tree, Random Forest, Logistic Regression, and K-Nearest Neighbors, were applied to the task of predicting issue close time (these four learners were used since that was the technology recommended in a recent study in that domain (Rahul Krishna, 2018; Rees-Jones et al., 2018)). Unlike the defect prediction data, we did not have multiple versions of the code so, for this domain, we used a 5*10-way cross-validation analysis. White cells show where the FFT results were statistically different from and better than all of the state-of-the-art learners’ results. Note that, in most cases (43/56 = 77%), FFTs performed better.

While this result does not prove that FFTs works well in all domains, it does show that there exists more than one domain where this is a useful approach.

Data (# of instances)    Days till closed (one column per issue-lifetime class, shortest to longest)
cloudstack (1551) FFT FFT FFT FFT FFT DT LR
node (6207) FFT FFT FFT FFT FFT DT LR
deeplearning (1434) FFT FFT FFT FFT FFT FFT RF
cocoon (2045) FFT FFT FFT FFT FFT FFT FFT
ofbiz (6177) FFT FFT FFT FFT FFT FFT FFT
camel (5056) RF/KNN KNN FFT/KNN/DT FFT FFT FFT FFT
hadoop (12191) KNN DT DT FFT FFT FFT FFT
qpid (5475) DT DT/RF DT FFT FFT FFT FFT
  • The goal here is to classify an issue according to how long it will take to close; i.e. less than 1 day, less than 7 days, and so on. Values collected via a 5x10 cross-validation procedure.

  • Cells with a (white, gray) background mean FFTs are statistically (better, worse) than (all, any) of the state-of-the-art learners (as determined by a Mann-Whitney test, 95% confidence), respectively. KNN, DT, RF and LR represent K-Nearest Neighbors, Decision Tree, Random Forest and Logistic Regression, respectively.

Table 9. Which learners performed better (in terms of median dis2heaven) in 5*10 cross-validation experiments predicting different classes of “how long to close a Github issue”. Gray areas denote experiments where FFTs were out-performed by other learners. Note that, in (43/56 = 77%) of the experiments, FFT performed better than the prior state-of-the-art in this area (Rahul Krishna, 2018).

5.2. Learner Bias

For building the defect predictors in this study, we elected to use Simple Logistic, Naive Bayes, Expectation Maximization, and Support Vector Machines. We chose these learners because past studies show that, for defect prediction tasks, these four learners represent four different levels of performance among a wide range of learners (Ghotra et al., 2015; Agrawal and Menzies, 2018). Thus, they were selected as the state-of-the-art learners to be compared with FFTs on the defect prediction data. For Table 9, K-Nearest Neighbors, Decision Tree, Random Forest, and Logistic Regression were used to compare against FFTs, because recent work has identified these as the best learners applied to the issue lifetime data.

5.3. Evaluation Bias

This paper uses two performance measures, i.e., Popt(20) and dis2heaven, as defined in Equations 4 and 3. Other quality measures are often used in software engineering to quantify the effectiveness of prediction (Menzies et al., 2007b; Menzies et al., 2005; Jorgensen, 2004). A comprehensive analysis using those measures may be performed with our replication package. Additionally, other measures can easily be added to extend this replication package.

5.4. Order Bias

For the performance evaluation, the order in which the data is trained and tested can affect the results.

For the defect prediction datasets, we deliberately chose an ordering that mimics how software projects release versions; so, for those experiments, we would say that such an ordering bias was required and needed.

For the issue close time results of Table 9, to mitigate this order bias, we ran our rig using a 10-bin cross-validation repeated five times, randomly changing the order of the data each time.

6. Conclusions

This paper has shown that a data mining algorithm called Fast-and-Frugal trees (FFTs), developed by psychological scientists, is remarkably effective for creating actionable software analytics. Here, “actionable” was defined as a combination of comprehensible and operational.

Measured in terms of comprehensibility, the FFT examples of Table 2 show that FFTs satisfy requirements raised by psychological scientists for “easily understandable at an expert level”; i.e., they comprise several short rules and those rules can be quickly applied (recall that each level of an FFT has an exit point which, if used, means humans can ignore the rest of the tree).

Despite their brevity, FFTs are remarkably effective:


  • Measured in terms of Popt(20), FFTs are much better than other standard algorithms (see Figure 3).

  • Measured in terms of distance to the “heaven” point of 100% recall and no false alarms, FFTs are usually better than the other standard algorithms used in software analytics (Random Forests, Naive Bayes, EM, Logistic Regression, and SVM). This result holds for at least two SE domains: defect prediction (see Figure 3) and issue close time prediction (see Table 9).

As to being operational, we found that if learning is restricted to just the attributes changed most often, then the behavior of other learning algorithms can vary wildly (see Figure 5). The behavior of FFTs, on the other hand, remains remarkably stable across that treatment.

From the above, our conclusions are two-fold:

  1. There is much the software analytics community could learn from psychological science. FFTs, based on psychological science principles, out-perform a wide range of learners in widespread use.

  2. Proponents of complex methods should always baseline those methods against simpler alternatives. For example, FFTs could be used as a standard baseline learner against which other software analytics tools are compared.

7. Future Work

Numerous aspects of the above deserve more attention.

7.1. More Data

This experiment with issue close time shows that FFTs are useful for more than just defect prediction data. That said, for future work, it is important to test many other SE domains to learn when FFTs are useful. For example, at this time we are exploring text mining of StackOverflow data.

7.2. More Learners

The above experiments should be repeated, comparing FFTs against more learners. For example, at this time, we are comparing FFTs against deep learning for SE datasets. As yet, there is nothing definitive to report about those results.

7.3. More Algorithm Design

These results may have implications beyond SE. Indeed, they might be insightful to another field: machine learning. For the reader familiar with the machine learning literature, we note that FFTs are a decision-list rule-covering model. FFTs restrict (a) the number of conditions per rule to only one comparison and (b) the total number of rules to a small number (often just five, as in the depth d = 4 trees used here). Other decision-list approaches such as PRISM (Cendrowska, 1987), INDUCT (Witten and Frank, 2002), RIPPER (Cohen, 1995), and RIPPLE-DOWN-RULES (Gaines and Compton, 1995) produce far more complex models since they impose no such restrictions. Perhaps the lesson of FFT is that PRISM, INDUCT, RIPPER, etc. could be simplified with a few simple restrictions on the models they learn.

Also, the success of FFT might be credited to its ensemble-style training; i.e., train multiple trees, then select the best. A comparison between FFTs and other ensemble methods like bagging and boosting (Quinlan et al., 1996) could be useful future work.

7.4. Applications to Delta Debugging

There is a potential connection between the Figure 5 results and the delta debugging results of Zeller (Zeller, 2002). As shown above, we found that focusing on the values that change most can sometimes lead to better defect predictors (though, caveat emptor, sometimes it can actually make matters worse; see the large negative results in Figure 5). Note that this parallels Zeller’s approach, which he summarizes as “Initially, variable v1 was x1, thus variable v2 became x2, thus variable v3 became x3 … and thus the program failed”. In future work, we will explore further applications of FFTs to delta debugging.

References

  • Abdollahi and Nasraoui (2016) Behnoush Abdollahi and Olfa Nasraoui. 2016. Explainable restricted Boltzmann machines for collaborative filtering. arXiv preprint arXiv:1606.07129 (2016).
  • Agrawal et al. (2018) Amritanshu Agrawal, Wei Fu, and Tim Menzies. 2018. What is Wrong with Topic Modeling?(and How to Fix it Using Search-based Software Engineering). Information and Software Technology (2018).
  • Agrawal and Menzies (2018) Amritanshu Agrawal and Tim Menzies. 2018. Is ”Better Data” Better than ”Better Data Miners”? (Benefits of Tuning SMOTE for Defect Prediction). International Conference on Software Engineering (2018).
  • Aha et al. (1991) David W Aha, Dennis Kibler, and Marc K Albert. 1991. Instance-based learning algorithms. Machine learning 6, 1 (1991), 37–66.
  • Arcuri and Briand (2011) A. Arcuri and L. Briand. 2011. A practical guide for using statistical tests to assess randomized algorithms in software engineering. In 2011 33rd International Conference on Software Engineering (ICSE). 1–10. DOI:http://dx.doi.org/10.1145/1985793.1985795 
  • Begel and Zimmermann (2014) Andrew Begel and Thomas Zimmermann. 2014. Analyze this! 145 questions for data scientists in software engineering. In Proceedings of the 36th International Conference on Software Engineering. ACM, 12–23.
  • Berson et al. (2004) Alex Berson, Stephen Smith, and Kurt Thearling. 2004. An overview of data mining techniques. Building Data Mining Application for CRM (2004).
  • Bettenburg et al. (2012) Nicolas Bettenburg, Meiyappan Nagappan, and Ahmed E Hassan. 2012. Think locally, act globally: Improving defect and effort prediction models. In Proceedings of the 9th IEEE Working Conference on Mining Software Repositories. IEEE Press, 60–69.
  • Brighton (2006) Henry Brighton. 2006. Robust Inference with Simple Cognitive Models.. In AAAI spring symposium: Between a rock and a hard place: Cognitive science principles meet AI-hard problems. 17–22.
  • Cendrowska (1987) Jadzia Cendrowska. 1987. PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies 27, 4 (1987), 349–370.
  • Cohen (1995) William W. Cohen. 1995. Fast Effective Rule Induction. In ICML’95. 115–123.
  • Cowan (2001) N. Cowan. 2001. The magical number 4 in short-term memory: a reconsideration of mental storage capacity. Behav Brain Sci 24, 1 (Feb 2001), 87–114.
  • Craven and Shavlik (2014) Mark W Craven and Jude W Shavlik. 2014. Learning symbolic rules using artificial neural networks. In Proceedings of the Tenth International Conference on Machine Learning. 73–80.
  • Czerlinski et al. (1999) Jean Czerlinski, Gerd Gigerenzer, and Daniel G Goldstein. 1999. How good are simple heuristics? (1999).
  • Czerwonka et al. (2011) J. Czerwonka, R. Das, N. Nagappan, A. Tarvo, and A. Teterev. 2011. CRANE: Failure Prediction, Change Analysis and Test Prioritization in Practice – Experiences from Windows. In Software Testing, Verification and Validation (ICST), 2011 IEEE Fourth International Conference on. 357 –366.
  • Dam et al. (2018) Hoa Khanh Dam, Truyen Tran, and Aditya Ghose. 2018. Explainable Software Analytics. arXiv preprint arXiv:1802.00603 (2018).
  • Fraley and Raftery (2007) Chris Fraley and Adrian E Raftery. 2007. Bayesian regularization for normal mixture estimation and model-based clustering. Journal of classification 24, 2 (2007), 155–181.
  • Friedl and Brodley (1997) Mark A Friedl and Carla E Brodley. 1997. Decision tree classification of land cover from remotely sensed data. Remote sensing of environment 61, 3 (1997), 399–409.
  • Fu and Menzies (2017) Wei Fu and Tim Menzies. 2017. Easy over hard: a case study on deep learning. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 49–60.
  • Gaines and Compton (1995) B. R. Gaines and P. Compton. 1995. Induction of Ripple-down Rules Applied to Modeling Large Databases. J. Intell. Inf. Syst. 5, 3 (Nov. 1995), 211–228. DOI:http://dx.doi.org/10.1007/BF00962234 
  • Ghotra et al. (2015) Baljinder Ghotra, Shane McIntosh, and Ahmed E Hassan. 2015. Revisiting the impact of classification techniques on the performance of defect prediction models. In Proceedings of the 37th International Conference on Software Engineering-Volume 1. IEEE Press, 789–800.
  • Gigerenzer (2008) Gerd Gigerenzer. 2008. Why heuristics work. Perspectives on psychological science 3, 1 (2008), 20–29.
  • Gigerenzer et al. (1999) Gerd Gigerenzer, Jean Czerlinski, and Laura Martignon. 1999. How good are fast and frugal heuristics. Decision science and technology: Reflections on the contributions of Ward Edwards (1999), 81–103.
  • Gigerenzer and Gaissmaier (2011) Gerd Gigerenzer and Wolfgang Gaissmaier. 2011. Heuristic decision making. Annual review of psychology 62 (2011), 451–482.
  • Gu et al. (2016) Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2016. Deep API learning. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 631–642.
  • Hammouda and Karray (2000) Khaled Hammouda and Fakhreddine Karray. 2000. A comparative study of data clustering techniques. University of Waterloo, Ontario, Canada (2000).
  • Hihn and Menzies (2015) J. Hihn and T. Menzies. 2015. Data Mining Methods and Cost Estimation Models: Why is it So Hard to Infuse New Ideas?. In 2015 30th IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW). 5–9. DOI:http://dx.doi.org/10.1109/ASEW.2015.27 
  • Jenny et al. (2013) Mirjam A Jenny, Thorsten Pachur, S Lloyd Williams, Eni Becker, and Jürgen Margraf. 2013. Simple rules for detecting depression. Journal of Applied Research in Memory and Cognition 2, 3 (2013), 149–157.
  • Jorgensen (2004) Magne Jorgensen. 2004. Realism in assessment of effort estimation uncertainty: It matters how you ask. IEEE Transactions on Software Engineering 30, 4 (2004), 209–217.
  • Jureczko and Madeyski (2010) Marian Jureczko and Lech Madeyski. 2010. Towards identifying software project clusters with regard to defect prediction. In Proceedings of the 6th International Conference on Predictive Models in Software Engineering. ACM, 9.
  • Kamei et al. (2013) Yasutaka Kamei, Emad Shihab, Bram Adams, Ahmed E Hassan, Audris Mockus, Anand Sinha, and Naoyasu Ubayashi. 2013. A large-scale empirical study of just-in-time quality assurance. IEEE Transactions on Software Engineering 39, 6 (2013), 757–773.
  • Kim et al. (2016) Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel. 2016. The Emerging Role of Data Scientists on Software Development Teams. In Proceedings of the 38th International Conference on Software Engineering (ICSE ’16). ACM, New York, NY, USA, 96–107. DOI:http://dx.doi.org/10.1145/2884781.2884783 
  • Kocaguneli et al. (2012) E. Kocaguneli, T. Menzies, A. Bener, and J. Keung. 2012. Exploiting the Essential Assumptions of Analogy-Based Effort Estimation. IEEE Transactions on Software Engineering 28 (2012), 425–438. Issue 2. Available from http://menzies.us/pdf/11teak.pdf.
  • Kotsiantis et al. (2007) Sotiris B Kotsiantis, I Zaharakis, and P Pintelas. 2007. Supervised machine learning: A review of classification techniques. (2007).
  • Krishna and Menzies (2015) Rahul Krishna and Tim Menzies. 2015. Actionable = Cluster + Contrast?. In Automated Software Engineering Workshop (ASEW), 2015 30th IEEE/ACM International Conference on. IEEE, 14–17.
  • Larkin et al. (1980) Jill Larkin, John McDermott, Dorothea P. Simon, and Herbert A. Simon. 1980. Expert and Novice Performance in Solving Physics Problems. Science 208, 4450 (1980), 1335–1342. DOI:http://dx.doi.org/10.1126/science.208.4450.1335  arXiv:http://science.sciencemag.org/content/208/4450/1335.full.pdf
  • Laskey and Martignon (2014) K Laskey and Laura Martignon. 2014. Comparing fast and frugal trees and Bayesian networks for risk assessment. In Proceedings of the 9th International Conference on Teaching Statistics, Flagstaff, Arizona.
  • Liaw et al. (2002) Andy Liaw, Matthew Wiener, and others. 2002. Classification and regression by randomForest. R news 2, 3 (2002), 18–22.
  • Lipton (2016) Zachary C Lipton. 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490 (2016).
  • Ma et al. (2014) Wei Ji Ma, Masud Husain, and Paul M Bays. 2014. Changing concepts of working memory. Nature neuroscience 17, 3 (2014), 347–356.
  • Madeyski and Jureczko (2015) Lech Madeyski and Marian Jureczko. 2015. Which process metrics can significantly improve defect prediction models? An empirical study. Software Quality Journal 23, 3 (2015), 393–422.
  • Martignon et al. (2008) Laura Martignon, Konstantinos V Katsikopoulos, and Jan K Woike. 2008. Categorization with limited resources: A family of simple heuristics. Journal of Mathematical Psychology 52, 6 (2008), 352–361.
  • Martignon et al. (2003) Laura Martignon, Oliver Vitouch, Masanori Takezawa, and Malcolm R Forster. 2003. Naive and yet enlightened: From natural frequencies to fast and frugal decision trees. Thinking: Psychological perspectives on reasoning, judgment and decision making (2003), 189–211.
  • Mende and Koschke (2010) Thilo Mende and Rainer Koschke. 2010. Effort-aware defect prediction models. In Software Maintenance and Reengineering (CSMR), 2010 14th European Conference on. IEEE, 107–116.
  • Menzies (2014) Tim Menzies. 2014. Occam’s razor and simple software project management. In Software Project Management in a Changing World. Springer, 447–472.
  • Menzies et al. (2007a) Tim Menzies, Alex Dekhtyar, Justin Distefano, and Jeremy Greenwald. 2007a. Problems with Precision: A Response to "Comments on 'Data Mining Static Code Attributes to Learn Defect Predictors'". IEEE Transactions on Software Engineering 33, 9 (Sept. 2007), 637–640. DOI:http://dx.doi.org/10.1109/TSE.2007.70721
  • Menzies et al. (2007b) Tim Menzies, Alex Dekhtyar, Justin Distefano, and Jeremy Greenwald. 2007b. Problems with Precision: A Response to "Comments on 'Data Mining Static Code Attributes to Learn Defect Predictors'". IEEE Transactions on Software Engineering 33, 9 (2007), 637–640.
  • Menzies et al. (2007c) Tim Menzies, Jeremy Greenwald, and Art Frank. 2007c. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering 33, 1 (2007), 2–13.
  • Menzies et al. (2010) Tim Menzies, Zach Milton, Burak Turhan, Bojan Cukic, Yue Jiang, and Ayşe Bener. 2010. Defect prediction from static code features: current results, limitations, new approaches. Automated Software Engineering 17, 4 (2010), 375–407.
  • Menzies et al. (2009) Tim Menzies, Osamu Mizuno, Yasunari Takagi, and Tohru Kikuno. 2009. Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects. Journal of Software Engineering and Applications 2 (2009), 221–236.
  • Menzies et al. (2005) Tim Menzies, Dan Port, Zhihao Chen, and Jairus Hihn. 2005. Simple software cost analysis: safe or unsafe?. In ACM SIGSOFT Software Engineering Notes, Vol. 30. ACM, 1–6.
  • Miller (1956) George A Miller. 1956. The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological review 63, 2 (1956), 81.
  • Monden et al. (2013) Akito Monden, Takuma Hayashi, Shoji Shinoda, Kumiko Shirai, Junichi Yoshida, Mike Barker, and Kenichi Matsumoto. 2013. Assessing the cost effectiveness of fault prediction in acceptance testing. IEEE Transactions on Software Engineering 39, 10 (2013), 1345–1357.
  • Neth and Gigerenzer (2015) Hansjörg Neth and Gerd Gigerenzer. 2015. Heuristics: Tools for an uncertain world. Emerging trends in the social and behavioral sciences: An interdisciplinary, searchable, and linkable resource (2015).
  • Ostrand et al. (2004) Thomas J. Ostrand, Elaine J. Weyuker, and Robert M. Bell. 2004. Where the bugs are. In ISSTA ’04: Proceedings of the 2004 ACM SIGSOFT international symposium on Software testing and analysis. ACM, New York, NY, USA, 86–96.
  • Ostrand et al. (2005) Thomas J Ostrand, Elaine J Weyuker, and Robert M Bell. 2005. Predicting the location and number of faults in large software systems. IEEE Transactions on Software Engineering 31, 4 (2005), 340–355.
  • Phillips et al. (2017) Nathaniel D Phillips, Hansjoerg Neth, Jan K Woike, and Wolfgang Gaissmaier. 2017. FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees. Judgment and Decision Making 12, 4 (2017), 344–368.
  • Quinlan et al. (1996) J Ross Quinlan and others. 1996. Bagging, boosting, and C4.5. In AAAI/IAAI, Vol. 1. 725–730.
  • Raab and Gigerenzer (2015) Markus Raab and Gerd Gigerenzer. 2015. The power of simplicity: a fast-and-frugal heuristics approach to performance science. Frontiers in psychology 6 (2015).
  • Rahul Krishna (2018) Rahul Krishna and Tim Menzies. 2018. Bellwethers: A Baseline Method For Transfer Learning. arXiv preprint arXiv:1703.06218v4 (2018).
  • Rees-Jones et al. (2018) Mitch Rees-Jones, Matthew Martin, and Tim Menzies. 2018. Better predictors for issue lifetime. Journal of Software and Systems, submitted. arXiv preprint arXiv:1702.07735 (2018).
  • Sawyer (2013) Robert Sawyer. 2013. BI’s Impact on Analyses and Decision Making Depends on the Development of Less Complex Applications. In Principles and Applications of Business Intelligence Research. IGI Global, 83–95.
  • Synced (2017) Synced (AI Technology & Industry Review). 2017. LeCun vs Rahimi: Has Machine Learning Become Alchemy? (2017). https://medium.com/@Synced/lecun-vs-rahimi-has-machine-learning-become-alchemy-21cb1557920d
  • Tan and Chan (2016) Shiang-Yen Tan and Taizan Chan. 2016. Defining and conceptualizing actionable insight: a conceptual framework for decision-centric analytics. arXiv preprint arXiv:1606.03510 (2016).
  • Theisen et al. (2015) Christopher Theisen, Kim Herzig, Patrick Morrison, Brendan Murphy, and Laurie Williams. 2015. Approximating Attack Surfaces with Stack Traces. In ICSE’15.
  • Turhan et al. (2011) Burak Turhan, Ayşe Tosun, and Ayşe Bener. 2011. Empirical evaluation of mixed-project defect prediction models. In Software Engineering and Advanced Applications (SEAA), 2011 37th EUROMICRO Conference on. IEEE, 396–403.
  • Vargha and Delaney (2000) András Vargha and Harold D Delaney. 2000. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25, 2 (2000), 101–132.
  • White et al. (2015) Martin White, Christopher Vendome, Mario Linares-Vásquez, and Denys Poshyvanyk. 2015. Toward deep learning software repositories. In Mining Software Repositories (MSR), 2015 IEEE/ACM 12th Working Conference on. IEEE, 334–345.
  • Wiedenbeck et al. (1993) Susan Wiedenbeck, Vikki Fix, and Jean Scholtz. 1993. Characteristics of the mental representations of novice and expert programmers: an empirical study. International Journal of Man-Machine Studies 39, 5 (1993), 793–812.
  • Witten and Frank (2002) Ian H. Witten and Eibe Frank. 2002. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. SIGMOD Rec. 31, 1 (March 2002), 76–77. DOI:http://dx.doi.org/10.1145/507338.507355 
  • Woike et al. (2017) Jan K Woike, Ulrich Hoffrage, and Laura Martignon. 2017. Integrating and testing natural frequencies, naïve Bayes, and fast-and-frugal trees. Decision 4, 4 (2017), 234.
  • Yang et al. (2015) Xinli Yang, David Lo, Xin Xia, Yun Zhang, and Jianling Sun. 2015. Deep learning for just-in-time defect prediction. In Software Quality, Reliability and Security (QRS), 2015 IEEE International Conference on. IEEE, 17–26.
  • Yang et al. (2016) Yibiao Yang, Yuming Zhou, Jinping Liu, Yangyang Zhao, Hongmin Lu, Lei Xu, Baowen Xu, and Hareton Leung. 2016. Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 157–168.
  • Zeller (2002) Andreas Zeller. 2002. Isolating Cause-effect Chains from Computer Programs. In Proceedings of the 10th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT ’02/FSE-10). ACM, New York, NY, USA, 1–10. DOI:http://dx.doi.org/10.1145/587051.587053 
  • Zeng et al. (2008) Zhi-Qiang Zeng, Hong-Bin Yu, Hua-Rong Xu, Yan-Qi Xie, and Ji Gao. 2008. Fast training support vector machines using parallel sequential minimal optimization. In Intelligent System and Knowledge Engineering, 2008. ISKE 2008. 3rd International Conference on, Vol. 1. IEEE, 997–1001.
  • Zimmermann et al. (2007) Thomas Zimmermann, Rahul Premraj, and Andreas Zeller. 2007. Predicting defects for eclipse. In Proceedings of the third international workshop on predictor models in software engineering. IEEE Computer Society, 9.