Can You Explain That, Better? Comprehensible Text Analytics for SE Applications

04/27/2018 ∙ by Amritanshu Agrawal, et al. ∙ NC State University

Text mining methods are used for a wide range of Software Engineering (SE) tasks. The biggest challenge of text mining is high dimensional data, i.e., a corpus of documents can contain 10^4 to 10^6 unique words. To address this complexity, some very convoluted text mining methods have been applied. Is that complexity necessary? Are there simpler ways to quickly generate models that perform as well as the more convoluted methods and are also human-readable? To answer these questions, we explore a combination of LDA (Latent Dirichlet Allocation) and FFTs (Fast and Frugal Trees) to classify NASA software bug reports from six different projects. Designed using principles from psychological science, FFTs return very small models that are human-comprehensible. When compared to the commonly used text mining method and a recent state-of-the-art system (a search-based SE method that automatically tunes the control parameters of LDA), these FFT models are very small (a binary tree of depth d = 4 that references only 4 topics) and hence easy to understand. They were also faster to generate and produced similar or better severity predictions. Hence we conclude that, at least for the datasets explored here, convoluted text mining models can be deprecated in favor of simpler methods such as LDA+FFTs. At the very least, we recommend LDA+FFTs (a) when humans need to read, understand, and audit a model or (b) as an initial baseline method for SE researchers exploring text artifacts from software projects.







I Introduction

Software analytics has focused on applying adept, state-of-the-art data miners to find optimal results. One sub-topic of software analytics is the use of sophisticated text mining techniques [1]. Text mining is a much more complex task since it involves dealing with high dimensional textual data that is inherently unstructured [2, 1]. These complex methods often generate models that are not comprehensible to humans (e.g., ones using synthetic dimensions generated by an SVM kernel [3]). This complexity might not be necessary if simpler methods can achieve the same performance while generating easy-to-understand models [4]. We define the terms “simple” and “comprehensible” in this paper as:


  • simple - (1) has low dimensionality of features (in the 10s, not 100s to 1,000s); (2) generates a small set of theories; and (3) is not computationally expensive. Otherwise, we call it “complex”.

  • comprehensible - (1) comprises small rules and (2) rules that quickly lead to decisions.

Moeyersoms et al. [5] comment that predictive models not only need to be accurate but also comprehensible, demanding that the user can understand the motivation behind the model’s prediction. They further remark that comprehensibility is often sacrificed to obtain such predictive performance, and vice-versa. Do simpler methods perform worse? Martens et al. [6] define comprehensibility as how well humans grasp the induced classifier, or how strong the mental fit of the classifier is. Dejaeger et al. [7] note that comprehensible models are often needed to inspire confidence in a business setting and improve model acceptance. Business users are vocal in their complaints about analytics [8], stating that there are rarely producible models that business users can comprehend.

Researchers in SE use complex methods, such as Support Vector Machines (SVMs) with 1,000s to 10,000s of Term Frequency (TF) or Term Frequency-Inverse Document Frequency (TFIDF) features, in order to achieve high prediction performance. Yet they do not try to comprehend the model itself [9, 10, 11, 12, 13, 14], making business users more hesitant to adopt their methodologies and losing the value of their work. Latent Dirichlet Allocation (LDA) uses fewer features, but still requires 100s of topics to find an optimal, human-comprehensible model [15, 16]. An alternative, better, search-based SE method (LDADE) was proposed recently, which tries to find optimal parameters of LDA that make the model more stable and achieve optimal results [17]. The problem with this method is that it is quite expensive in terms of CPU usage and still needs 100s of features to be comprehensible. We need a simple method which: 1) offers comparable performance; and 2) is human comprehensible.

This paper studies a simple data miner taken from the psychological science literature, i.e., FFT, which outputs small trees (and generally, smaller means more comprehensible [18, 19]). In this study, FFT builds its trees using LDA features generated with LDA’s default parameters, which requires no expensive optimization to find an optimal number of topics K. We seek a few rules through FFT that can report severe and non-severe classes for the datasets under study. We compared this method against the complex and most commonly used methods in the SE literature, which are 1) TFIDF+SVM [9, 11, 20, 21]; and 2) a recent state-of-the-art system, LDADE+SVM [17]. Based on this comparative analysis, we answer two research questions:

RQ1: How does the simpler method perform against the most common sophisticated method and a recent state-of-the-art Search-Based SE (SBSE) method?

For software analytics, most text mining techniques use high dimensional TF or TFIDF features with complex classifiers like SVM [9, 10, 11, 12, 13, 14, 22, 20, 23, 21, 24]. These features are large in number, in the range of 1,000s to 10,000s, making any classifier complex. Researchers have shifted their focus to LDA features for text mining, since LDA is a good method for dimensionality reduction [25, 15, 26]. SBSE methods were recently introduced to find optimal parameters, at the expense of heavy runtime [17, 27, 28]. Agrawal et al. [17] tuned the parameters of LDA to find the optimal number of topics (K), which is then used by SVM for the classification task (the state-of-the-art SBSE method).

We show that FFT (with a depth of d = 4) uses just 10 topics from LDA (the simpler method) to achieve performance comparable to SVM with TFIDF features (the sophisticated method) as well as LDADE with SVM (the SBSE method). Building the simpler method is about 10 minutes slower than the sophisticated method, but this may not be an arduous increase given the gains from its power of comprehensibility; meanwhile, the simpler method is 100 times faster than the SBSE method. Hence, we conclude that:

Result 1

The simpler method (LDA+FFT) offers performance similar to the sophisticated method (TFIDF+SVM) and the SBSE method (LDADE+SVM). Though LDA+FFT takes an extra 10 minutes over the baseline, it is orders of magnitude faster than the SBSE method.

RQ2: Is the simpler method more explainable or comprehensible than the most common sophisticated method and the recent state-of-the-art SBSE method?

Having shown that the simpler method performs comparably to the sophisticated and SBSE methods, we now dive into the core of our study: comprehensibility. Why do we need comprehensible models? We need them to obtain actionable insights from the model, which boosts the confidence of businesses to accept the model for their software.

Certain representative characteristics help a model to be more explainable: it should be small, easily visualized, and comprised of fewer rules that can quickly lead to decisions. Feature counts in the range of 1,000s to 10,000s make any classifier big and non-comprehensible by default. LDA features offer more comprehensibility than TFIDF or TF features [26, 15].

We show that FFT with LDA features, referencing only 4 topics (depth d = 4), provides an explainable model satisfying the characteristics mentioned earlier. Also, we do not need an SBSE method that is orders of magnitude slower to find an optimal K, when a simpler method can provide a well comprehensible model. Hence, we conclude that:

Result 2

The few rules generated by FFT, referencing only 4 topics found by LDA, are far more comprehensible than the most common sophisticated method and the SBSE method.

In summary, the main contributions of this paper are:


  • A novel inter-disciplinary contribution of the application of psychological science in comprehensibility of text mining models.

  • LDA+FFTs offer performance comparable to a common text mining method, TFIDF+SVM.

  • LDA+FFTs are better, faster, and more comprehensible than the recent state-of-the-art method, LDADE+SVM.

  • A new, very simple baseline data mining method (LDA+FFTs) against which more complex methods can be compared.

  • A reproduction package containing all the data and algorithms of this paper, see

The rest of this paper is structured as follows: Section II discusses the background and theory of comprehensibility. Section III describes the experimental setup of this paper, and the research questions above are answered in Section IV. Lastly, we discuss the validity of our results, followed by a section describing our conclusions.

II Motivation and Related Work

This section discusses the theory of comprehensibility, the most commonly used text mining method for bug report classification, the curse of dimensionality, and the power of computationally faster methods. We also show how FFTs, a strong alternative to the existing approaches, are generated.

II-A Theory of Comprehensibility

For software analytics, it is a necessity to find models that can produce simple and actionable insights that software practitioners can interpret and act upon [29]. Models are effectively useless if they cannot be interpreted by researchers, developers, and testers [4]. Business users have been vocal in their complaints about analytics [8], saying that there are rarely producible models that they can comprehend. According to several researchers [30, 31, 32], actionable insights from software artifacts are the core deliverable of software analytics. These insights are then used to enhance productivity, measured in terms of the tasks that are accomplished. However, is model comprehensibility taken into consideration in the process of development?

Machine learners generate theories and people read theories. But how many such learners generate the kind of theories that machine learning practitioners can read? In practice, big data and tremendous amounts of information are available, yet the time and resources to explore them are limited: consider a manager rushing to meet a deadline to release software, or a stockbroker making instant decisions about buying or selling stocks. In such critical situations, a person might instead want just enough expert-level comprehension of the domain to achieve the most benefit. It therefore follows that machine learning for these practical cases should not strive for elaborate theories or expressive power of language. A better goal for machine learning would be to find the smallest set of theories with the most impact and benefit.


Also, in today’s businesses, the problem is not accessing data but ignoring the irrelevant data. Most modern businesses can electronically access large amounts of data such as transactions for the past two years or the state of their assembly line. The trick is effectively using the available data. In practice, this means summarizing large datasets to find the “pearls in the dust” - that is, the data that really matters [33].

That is why Gleicher [34] developed a framework of comprehensibility and concluded that many researchers do not consider the power of comprehensibility and miss out on important aspects of their results. According to Gleicher:

  1. Comprehensibility lets us understand a prediction well enough to appropriately trust it, or a predictive process well enough to trust its ability to make predictions.

  2. Comprehensibility helps in prescriptiveness, which is the quality of a model that allows its user to act on a result, e.g., its ability to inform action.

  3. Understanding a model can drive iterative refinement that improves predictive accuracy, efficiency, and robustness.

  4. While a statistical model usually uncovers correlations rather than causality, it can also be a useful starting point for theory building, or an approach to testing theory.

  5. Comprehensibility makes it easy to interpret what a model can do and where it can be applied.

  6. It can generalize modeling to other situations, which can become part of other (future) applications.

  7. Identifying the successes (or failures) in one model, modeling application, or modeling process can help us improve our practices for future applications.

Comprehensibility is defined as the ability of the various stakeholders to understand relevant aspects of the modeling process. How can a model be comprehensible? According to various researchers [34, 4, 35, 36], a comprehensible model can be represented with rule-based learning [37, 38], a smaller output model [39], or better visualization [34].

According to Phillips et al. [37], a model is comprehensible enough for humans when it fits into Long Term Memory (LTM) [40] and when the rules within it can efficiently lead to decisions. Consider the SVM model shown in Figure 1: a human would not be able to reason from such a sophisticated output for two reasons: 1) the model is mostly points of transformed data in a new multi-dimensional feature space automatically inferred by some kernel function; due to the arcane nature of these kernels, it is hard for humans to attribute meaning to these points [41, 3]; and 2) the model infers a decision boundary or hyperplane (as shown in Figure 1) without any generalization [42]. An SVM defines its decision boundary in terms of the vectors nearest that boundary. These “support vectors” are just points in space, so understanding any one of them incurs the same problems.

Fig. 1: An example of an SVM model from [43]
  if      topic 1 ≥ 0.80   then false
  else if topic 7 ≥ 0.60   then true
  else if topic 3 ≥ 0.65   then true
  else if topic 5 ≥ 0.50   then true
  else                     false
Fig. 2: Example of a much simpler FFT model. How this FFT is generated is explained in Section II-E. The premise of this paper is that such simple models can perform as well as, or better than, more complex models that use extra dimensions, like Figure 1. This example was created by our proposed method on the PitsA dataset under study.

Further, SVMs offer much less support for understanding the entire set of these points than, say, a rule-based representation (as in Figure 2, an example created by our proposed method on the dataset under study). To understand this, consider a condition that might be found in a rule-based representation: within the hyperspace of all data, that inequality defines a region within which certain conclusions are true, regardless of other attributes. That is, the condition is a generalization across a large space of examples, a region that humans can understand as “within this space, certain properties exist”. The same is not true for support vectors. Such vectors do not tell humans which attributes are most important for selecting one conclusion over another, nor can they divide a space of examples into multiple regions. Rule-based representations do not have that limitation. They can divide the space into multiple sectors within which humans know how far they can adjust a few key attributes in order to move from one classification to another.

Consequently, psychological scientists have developed FFTs as rule-based models that are quickly comprehensible, comprising few rules. An FFT is a binary tree classifier where one or both children of each node is a terminating decision node. Each question asked (each topic feature tested) thus triggers an immediate understanding and action. As Figure 2 shows, the complex model of Figure 1 can be made comprehensible using an FFT that is just 5 lines of rules. We study FFTs in greater detail in Section II-E.

Menzies et al. [44] obtained similar Decision Tree (DT) rules for PitsA, the same dataset under study in this paper. A condensed example of their rules is shown in Figure 3. The conditions in those rules are at the term-occurrence level, whereas our FFT example (Figure 2) is at the topic level. The term-occurrence conditions failed to provide any generalized intuition or expert comprehension of how such a rule could be used to classify bug reports automatically. In our proposed FFT, by contrast, if topic 3 scores above its threshold of 0.65, the report is classified as severe. The top terms denoting topic 3 are “messag unsign bit code file byte word ptr”, and we can say these terms generalize to a “type conversion” topic.

Developers can now use this information to avoid future mistakes in code where type conversion happens. We contacted the original users of the PITS data [44] to look at the topics we generated (and the conditions in which they were found). They agreed that their rules were not generalizable, i.e., they could not use those rules to improve their systems, but that the topics we generated are highly relevant and practical. This validates and motivates our claim that the rules generated by our FFT at the topic-occurrence level are more comprehensible.

  if      sr 0 & rvm 1 & l4 1 & cdf 1      then 4
  else if sr 1 & issu 1 & code 3           then 4
  else if control 1 & code 1 & attitud 4   then 2
  else if l3 2 & obc 0 & perform 0         then 2
  else if script 1 & trace 1               then 3
  else                                     3
Fig. 3: Similar Decision Tree (DT) rules obtained by Menzies et al. [44] for PitsA (the dataset under study in this paper).

While this paper places high value on comprehensibility, we note that much prior work has ignored this issue. In March 2018, we searched Google Scholar for papers published in the last decade that do text mining to build defect/bug predictors and also discuss comprehensibility. From that list, we selected “highly-cited” papers, which we defined as having more than 5 citations per year. After reading the titles and abstracts of those papers, and skimming the contents of the potentially interesting ones, we found the 16 papers shown in Table I that motivate our study.

From Table I, we can see that, despite the importance of method comprehensibility as pointed out by Gleicher [34] and others, all 16 “highly-cited” papers discuss comprehensibility in some form, but none offers a few browsable rules that can fit into a human’s LTM.

II-B Bug Reports Classification

The case studies used in this paper come from the text classification of bug reports. This section describes those case studies.

Much SE text mining research has been done on classifying bug reports to categorize descriptions of fault occurrences in a software system. Zhou et al. [26] found the top 20, 50, and 100 terms and used these as features for Naive Bayes and Logistic Regression classifiers. They reported precision, recall, and f-score, and concluded that their method significantly improved over other proposed methods. Yet they did not use these top terms to comprehend the prediction model. Menzies et al. [44] used TFIDF featurization with a Naive Bayes classifier to predict the severity of defect reports, but did not show how to interpret such a method. A few researchers [13, 14] used only top TF features to build an SVM classifier but did not provide interpretability of the method.

Reference Year Citations
[44] 2008 191
[9] 2011 119
[23] 2012 66
[26] 2016 51
[22] 2013 50
[16] 2012 47
[13] 2012 38
[10] 2015 28
[45] 2013 25
[14] 2012 25
[12] 2014 22
[20] 2015 20
[24] 2017 16
[11] 2014 14
[21] 2016 8
[15] 2016 6
TABLE I: Highly Cited Papers

Many other researchers used SVM as a classifier with a high number of TF features for bug/defect prediction [9, 10, 11, 12], providing top significant terms to explain the causes of these bugs. In other work, some researchers used SVM with many TF features but did not report terms or provide any explanation [22, 20, 23, 21].

Researchers have also used LDA’s document-topic distribution as features to build bug report prediction models [45, 15, 16, 24]. Xia et al. [24] worked on LDA features with an SVM classifier but offered no interpretability. Pingclasai et al. [45] compared different topic counts for LDA against different numbers of top TF features, and reported the topic count that yielded the best f-score. Layman et al. [15] used different numbers of topics to identify the severity of bug reports in 6 NASA Space System Problem datasets. They also showed comprehensibly what these reports were talking about; the problem was that they chose a high number of topics. Also, Chen et al. [16] used LDA to identify whether a defect-prone module stays defect-prone in future versions. They showed top topics with top words related to defects. But the same problem as in Layman et al. existed: they used a high number of topics.

We looked at recent studies that use high dimensional features combined with different classifiers, such as Naive Bayes, SVM, and Logistic Regression [9, 26, 44], to accurately model the data. Of these, SVM is the most commonly used classifier: from Table I, we can see that 11/16 (about 70%) of the highly cited papers used SVM as the classifier. Therefore, we chose SVM as the complex baseline learner to compare against the simple FFT model.

II-C Curse of Dimensionality

All text mining techniques model high dimensional data, i.e., a corpus of documents containing 10^4 to 10^6 unique words. The common problem associated with such data is that as the dimensionality increases, the volume of the space increases drastically, and the available data becomes sparse [41]. This sparsity is problematic when we try to find statistically sound and reliable results: the amount of data needed to support a result often grows exponentially with the dimensionality. Also, modeling such high dimensional data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient [46].
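The “sparse and dissimilar” effect can be seen in a small experiment of our own (not from the paper’s datasets): as dimensionality grows, the relative spread of pairwise distances collapses, so every point looks roughly equally far from every other point:

```python
import numpy as np

rng = np.random.default_rng(0)
for dims in (2, 200, 20000):
    pts = rng.random((100, dims))                    # 100 random points
    d = np.linalg.norm(pts[:50] - pts[50:], axis=1)  # 50 pairwise distances
    # As dims grows, distances concentrate around their mean.
    print(f"{dims:6d} dims: relative spread of distances = {d.std() / d.mean():.3f}")
```

This distance concentration is one concrete reason why similarity-based structure is hard to exploit in raw high dimensional text features.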

High dimensional data also increases the complexity of data modeling, and is a curse for finding comprehensible models. Researchers use TF and TFIDF feature extraction techniques [44, 12], which provide 1,000s to 10,000s of features for a learner to model. So many features will not yield small, concise, comprehensible models. From Table I, we can see that all 16 papers use high dimensional features, driving us to find alternative methods for reducing dimensionality.

To tackle the curse of dimensionality, researchers employ different dimensionality reduction techniques such as feature transformation (Principal Component Analysis, Latent Dirichlet Allocation), sampling, feature selection, and many more [47, 48]. For text mining, researchers have mostly used a feature transformation or feature selection technique to reduce the feature space and find the top words from the corpus, which can then be used in classifiers [26, 13, 14].

Latent Dirichlet Allocation (LDA) is a common dimensionality reduction technique in text mining [15, 17]. LDA provides topics that are comprehensible enough that researchers can browse through them to make decisions, as shown by Agrawal et al. [17]. We agree with their work and their motivation for choosing such a feature extraction technique. That is why we chose LDA as our feature extraction technique (since we get concise topics); after combining it with FFT (depth d = 4), we get a few rules that are comprehensible while giving better or comparable classification performance.

II-D Computational Inexpensiveness

There is always a trade-off between the effectiveness of a method and the cost of running it. A method should not be expensive to apply (measured in terms of required CPU or runtime). Before a community can adopt a method, we need to first ensure that the method executes quickly. Some methods, especially those used to solve the problem of hyperparameter optimization (choosing a set of optimal parameters for a learning algorithm), can require hours to days to years of CPU time to terminate [49, 17]. Hence, unlike such methods, we need to select baseline methods that are reasonably fast.

One such resource-expensive method was recently introduced by Agrawal et al. [17], who optimized the hyperparameters of LDA to find the optimal settings. They optimized LDA for a score measuring how stable the generated topics are. They showed that stable topics are needed if developers/users are using these topics for further analysis, especially in unsupervised learning. They also used these stable topics for supervised learning and showed that the prediction performance is comparable to the commonly used text mining technique of TFIDF with an SVM classifier. The major drawback of their method is that it is computationally expensive, about three to five times slower, for two reasons: 1) use of a computationally expensive optimizer (Differential Evolution), and 2) the number of topics, which relates directly to runtime, i.e., the more topics, the longer the run time.

As previously mentioned, the reason for choosing LDA features was their power of comprehensibility. But we do not want to use an expensive technique like LDADE when we have the option of using default parameters without sacrificing performance, while achieving much better comprehensibility with FFT.

II-E How are FFTs generated?

Psychological scientists have developed FFTs (Fast and Frugal Trees) as one way to generate comprehensible models consisting of separate tiny rules [37, 29, 50]. An FFT is a decision tree made for binary classification problems, with exactly two branches extending from each node, where one or both branches is an exit branch leading to a leaf [50]. That is to say, in an FFT, every question posed by a node triggers an immediate decision (so humans can read every leaf node as a separate rule).

We used an implementation of FFT similar to that offered by Fu and Chen et al. [51, 29]. An FFT of depth d has a choice of two “exit policies” at each level: the exit branch can select for the negation of the target (denoted “0”), i.e., non-severe, or for the target (denoted “1”), i.e., severe. The right-hand-side tree in Figure 4 is 01110 since:


  • The first level found a rule that exits to the negation of the target: hence, “0”.

  • While the next three tree levels found rules that exit first to the target; hence, “111”.

  • And the final line of the model exits to the opposite of the penultimate line; hence, the final “0”.

  if      topic 1 ≥ 0.80   then false   #0
  else if topic 7 ≥ 0.60   then true    #1
  else if topic 3 ≥ 0.65   then true    #1
  else if topic 5 ≥ 0.50   then true    #1
  else                     false        #0
Fig. 4: Example of an FFT

Following the advice of [51, 29, 37], for all the experiments of this paper we use a depth of d = 4. For trees of depth d = 4, there are 2^4 = 16 possible trees, which can be denoted 00001, 00010, 00101, …, 11110. During FFT training, all 16 trees are generated, then we select the best one (using the training data). This single best tree is then applied to the test data. Note that FFTs of such small depth are very succinct (see the examples in Figures 2 and 4). Such FFTs generate rules that lead to a decision of severe or non-severe for the datasets under study. Many other data mining algorithms used in software analytics are far less succinct and far less comprehensible, as explained in Section II-A.
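The training procedure above can be sketched as follows. This is a simplified rendition, not the Fu and Chen et al. tool: for brevity we cycle through features and split at the median, whereas a real implementation chooses each level’s feature and threshold from the data:

```python
from itertools import product
import numpy as np

def grow_fft(X, y, policy):
    """Grow one fast-and-frugal tree for a given exit policy.
    Each level tests one feature against its median and exits to
    policy[level] on the high side; the rest fall through.
    (Feature choice and the median cut are simplifications.)"""
    tree, rows = [], np.ones(len(y), dtype=bool)
    for level, exit_class in enumerate(policy):
        feat = level % X.shape[1]              # simplification: cycle features
        cut = np.median(X[rows, feat])
        tree.append((feat, cut, exit_class))
        rows = rows & ~(X[:, feat] >= cut)     # survivors go to the next level
    tree.append((None, None, 1 - policy[-1]))  # final leaf: opposite last exit
    return tree

def predict(tree, x):
    for feat, cut, label in tree[:-1]:
        if x[feat] >= cut:
            return label
    return tree[-1][2]

def build_fft(X, y, depth=4):
    """Enumerate all 2^depth exit policies; keep the best tree on training data."""
    best_tree, best_acc = None, -1.0
    for policy in product([0, 1], repeat=depth):   # 2^4 = 16 policies
        tree = grow_fft(X, y, policy)
        acc = float(np.mean([predict(tree, x) for x in X] == y))
        if acc > best_acc:
            best_tree, best_acc = tree, acc
    return best_tree
```

Note that the whole search space is just 16 candidate trees, which is why FFT training is cheap compared to hyperparameter optimization.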

III Experimentation

All our data, experiments, and scripts are available for download from

III-A Dataset

PITS is a widely used text mining dataset in SE studies [44, 52, 15]. The dataset is generated from NASA software project and issue tracking system (PITS) reports [52, 44]. This text discusses bugs and changes found in bug reports and review patches. Such issues are used to manage quality assurance and to support communication between developers. Text mining techniques can be used to predict each severity separately [15]. Note that this data comes from six different NASA projects, which we label as PitsA, PitsB, and so on. For this study, we converted the severities into a binary classification, where the severity with the maximum number of reports is labeled as the positive class and the rest as negative. We employed the usual preprocessing steps from the text mining literature [17, 53]: tokenization, stop-word removal, and stemming. Table II shows the number of documents, the feature size, and the percentage of severe classes after preprocessing.

Dataset  No. of Documents  Feature Size  Severe %
PitsA    965               2001          39
PitsB    1650              1685          40
PitsC    323               544           56
PitsD    182               557           92
PitsE    825               1628          63
PitsF    744               1431          64

TABLE II: Dataset statistics. Data comes from the SEACRAFT repository:

III-B Feature Extraction

Textual data is a series of words. In order to run machine learning algorithms, we need to convert the text into numerical feature vectors. We used two feature extraction techniques:


  • Term Frequency-Inverse Document Frequency (TFIDF): If a word occurs w times and is found in d documents, and W and D are the total numbers of words and documents respectively [17], then TFIDF is scored as (w/W) × log(D/d).

  • Topic Information Features (LDA): We need to decide the number of topics K before applying the LDA model to generate topic information features. To identify the number of topics, we employed two strategies: 1) manually set topic counts (10, 25, 50, 100), and 2) choosing an optimal K using the LDADE method [17]. The LDA model produces the probability of a document belonging to each topic, which is used as the feature vector. Normally, the number of topics is significantly smaller than the number of terms, so LDA can effectively reduce the feature dimension.
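Both featurizations are available in scikit-learn. A minimal sketch, where the toy corpus and the topic count K=2 are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["memori leak crash heap alloc",
        "ui button render slow paint",
        "memori alloc fail crash dump"]

# TFIDF: one column per unique term (1,000s-10,000s on a real corpus).
tfidf = TfidfVectorizer().fit_transform(docs)

# LDA: one column per topic -- a far smaller feature space.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=1)
topics = lda.fit_transform(counts)   # each row is a document's topic mixture

print(tfidf.shape, topics.shape)
```

On this toy corpus the TFIDF matrix has one column per term while the LDA matrix has just K columns, which is the dimensionality reduction the text describes.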


Fig. 5: Comparison of LDA (K=10, 25, 50, 100) with FFT against TFIDF+SVM, LDADE+SVM, and LDADE+FFT. Columns represent the different datasets under study, scored on precision and recall. We show median and IQR (inter-quartile range, 75th-25th percentile) values. Color coding shows the results of the Scott-Knott procedure. The statistical comparison is across rows, to find which method performs best.

III-C Classifier

For this study we used two machine learning algorithms: 1) Support Vector Machines (SVM) and 2) Fast and Frugal Trees (FFTs). We use these as explained earlier in Section II. There are other available choices, like deep learning, Decision Trees (DT), and Random Forests (RF), which have been shown to be powerful in SE applications [54, 27, 55, 56]. However, deep learners do not readily support explainability, have been criticized as “data mining alchemy” [57], and a recent study by Majumder et al. [58] suggests they may not be the most useful for SE data. DT or RF can generate a small set of rules, but performance can be sacrificed. Camilleri et al. [59] showed that DT accuracy increased significantly as the depth of the tree grew from 0, meaning that the rules generated also moved from few to many. Hence, DT and RF may not be useful for this study.

Using a dataset, a performance measure, and a classifier, this experiment conducts a 5*5 stratified cross-validation study [60, 61] to make our results more robust and reliable. This checks the amount of variance of such learners; the variance should be as minimal as possible. To control the randomization, a seed is set so that the results can be reproduced. For the implementation of SVM and other methods, we used the open-source tool Scikit-Learn [62] and relied upon its default parameters as our baseline. Our stratified cross-validation study [60, 27], which includes the process of DE, is defined as follows:


  • We randomized the order of the dataset five times. This reduces sampling bias, whereby some random ordering of examples in the data could conflate our results.

  • Each time, we divided the data into five bins.

  • For each bin (the test), we trained on the other four bins (the rest) and then tested on the held-out bin.

  • When using LDADE, we further divide those four bins of training data: three bins are used for training the model, and one bin is used for validation in DE. DE is run to improve the performance measure when LDA is applied to the training data. Important point: when tuning, this rig never uses test data.

  • The model is applied to the test data to collect scores.
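The rig above can be sketched as follows (a minimal illustration: synthetic data and an untuned SVM stand in for the paper's datasets and the DE tuning step):

```python
# Sketch of the 5*5 stratified cross-validation rig: five seeded
# re-orderings, five stratified bins each, train on four, test on one.
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=1)  # toy data

scores = []
for repeat in range(5):  # five random re-orderings, seeded for reproducibility
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=repeat)
    for train_idx, test_idx in skf.split(X, y):
        clf = SVC().fit(X[train_idx], y[train_idx])   # train on four bins
        scores.append(clf.score(X[test_idx], y[test_idx]))  # test on one

print(len(scores))  # 25 scores: 5 repeats x 5 folds
```

Seeding both the shuffles and the learner is what makes the variance check repeatable across runs.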

III-D Evaluation Measure

The problem studied in this paper is a binary classification task. The performance of a binary classifier can be assessed via a confusion matrix, as shown in Table III, where a “positive” output is the positive class under study and a “negative” output is the negative one.

                  Prediction
                  false    true
  Actual  false    TN       FP
          true     FN       TP
TABLE III: Results Matrix

Further, “false” means the learner got it wrong and “true” means the learner correctly identified a positive or negative class. Hence, Table III has four quadrants containing, e.g., FP, which denotes “false positive”.

We used the following 2 measures that can be defined from this matrix as:

  • Recall = TP / (TP + FN)

  • Precision = TP / (TP + FP)

No evaluation criterion is “best”, since different criteria are appropriate in different real-world contexts. Specifically, in order to optimize the performance of the released software, management would maximize precision, which would reduce recall. When dealing with safety-critical applications, management may be “risk averse” and hence may elect to maximize recall, regardless of the time wasted exploring false alarms [27]. Precision and recall cannot both be maximized at the same time. We assume this holds true in the context of this paper, i.e., a business user wants to maximize either precision or recall, which is why we evaluate FFT on the individual scores.
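The two measures above follow directly from the quadrants of Table III; a small sketch (the TP/FP/FN counts below are hypothetical):

```python
# Precision and recall computed from confusion-matrix quadrants.
def precision(tp, fp):
    # Of everything predicted positive, how much really was positive?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of everything really positive, how much did we predict?
    return tp / (tp + fn)

tp, fp, fn = 40, 10, 20   # hypothetical counts
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # ~0.667
```

Note the trade-off discussed above: raising the decision threshold typically shrinks FP (better precision) while growing FN (worse recall), so the two cannot both be maximized.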

III-E Statistical Analysis

We compared our results using a statistical significance test and an effect size test. A significance test is useful for detecting whether two populations differ merely by random noise; the Scott-Knott procedure was used as the significance test [63, 54, 64].

Effect sizes are useful for checking whether two populations differ by more than just a trivial amount; the A12 effect size test was used [65]. Our tests report results as different only if they are statistically significant with 95% confidence and not a “small” effect.

IV Results

RQ1: How do simpler methods perform against the most common sophisticated method and a recent state-of-the-art Search-Based SE (SBSE) method?

As discussed in Section II-B, we found that the most common text mining technique for binary classification in software engineering is TFIDF as the feature extraction method with SVM as the classifier. In recent studies [17, 15], LDA feature extraction has been shown to be a great alternative, achieving similar performance while reducing dimensionality.

Some researchers have also adopted hyperparameter tuning to optimize performance, but it comes at the expense of heavy runtimes [28, 54, 17, 27]. Agrawal et al. [17] showed LDADE with SVM (an SBSE method) to achieve better performance for classification tasks. LDADE finds optimal values of K, α, and β, but K matters the most for supervised learning [17].

FFT has been shown to be a good classifier when dealing with low dimensionality in defect prediction studies [51, 29]. We used LDA topics as features for FFT due to their power to explain the text. That is why we compared the sophisticated method (TFIDF+SVM) as well as the SBSE method (LDADE+SVM) against the proposed simpler method (LDA+FFT). We also compared LDADE+FFT against LDA+FFT, and tried different variants of FFTs with different topic sizes (K = 10, 25, 50, 100), changing K manually rather than via an automatic but expensive technique like LDADE, to see what improvement we could find.

Figure 5 offers a statistical analysis of the different results achieved by TFIDF+SVM, LDADE+SVM, and LDADE+FFT against 10_FFT, 25_FFT, 50_FFT, and 100_FFT. Each column represents a different dataset and each sub-figure shows precision and recall scores. We assume that business users want to maximize either precision or recall, which is why we run FFTs separately on the individual scores. We report median and IQR (inter-quartile range, 75th-25th percentile) values; the darker the cell, the statistically better the performance. For example, in the sub-figure where we report precision values, consider the column for the pitsA dataset; we read across the rows to learn which method works best. In this case, TFIDF_SVM is better than the other methods. The other datasets’ results can be read similarly. Also, if the same color appears across a row, the methods are either statistically indistinguishable or differ only via a small effect (as determined by the statistical methods described in Section III-E).

For recall, we observe that 10_FFT, 25_FFT, 50_FFT, and 100_FFT (LDA_FFTs) perform statistically similarly on all 6 datasets, whereas for precision, 10_FFT, 25_FFT, 50_FFT, and 100_FFT perform similarly on 4 out of 6 datasets and 10_FFT wins on the remaining 2. This came as a surprise, since the value of K has been shown to affect classification performance in a recent SBSE method [17], whereas for FFT the choice of K has minimal effect. That is why, from now on, all our comparisons are with 10_FFT.

We note that the simpler method (10_FFT) is statistically better or similar on 5 out of 6 datasets against TFIDF+SVM (the sophisticated method) when compared on recall, but performs similarly on only 2 out of 6 datasets when we look at precision. This tells us that the simple FFT method has comparable performance against the complex method.

We also found that 10_FFT wins on precision by a big margin on all 6 datasets when compared against LDADE_SVM. On the other hand, 10_FFT offered comparable performance on the 6 datasets for recall. This changes a recent study’s conclusion [17], in which Agrawal et al. showed LDADE_SVM, a newer state-of-the-art method, defeating the sophisticated method (TFIDF+SVM). The datasets under study are different from those Agrawal et al. used, which might have affected our results. Our findings say that:

LDADE+SVM is worse than LDA+FFT and TFIDF+SVM but LDA+FFT is similar to TFIDF+SVM.


  • LDA_FFT with K = 10 offers comparable performance against TFIDF+SVM.

  • LDA_FFT with K = 10 is winning against LDADE+SVM in the majority of cases.

With any empirical study, besides classification power, we have to look at runtimes as another criterion for evaluating the methods’ performance. Table IV shows the runtimes in minutes. From the table, it can be observed that LDA+FFT is only somewhat slower than TFIDF+SVM, which may not be an arduous increase given the gains in comprehensibility discussed in RQ2. However, the LDA+FFT combination is orders of magnitude faster (100 fold) than the SBSE method (LDADE+SVM). This shows that the SBSE method is quite expensive and that our alternative solution, LDA+FFT, is a promising candidate.

Dataset  TFIDF+SVM  LDA+FFT  LDADE+SVM
PitsA        1         8        900
PitsB        1         9        500
PitsC        1         3        200
PitsD        1         2        150
PitsE        1         7        400
PitsF        1         8        400

TABLE IV: Runtimes (in minutes)

Lastly, we would like to make the point that a complex and time-costly model like LDADE, or other values of K, is not needed. We can use K = 10 as the number of topics to build a simple FFT model. Hence,

Result 1

The simpler method (LDA+FFT) offers similar performance to the sophisticated method (TFIDF+SVM) and the SBSE method (LDADE+SVM). Though the simpler LDA+FFT method takes an extra 10 minutes over the baseline, it is orders of magnitude faster than the SBSE method.

PITS_A Dataset:
   if      topic 1 ≥ 0.80  then false
   else if topic 7 ≥ 0.60  then true
   else if topic 3 ≥ 0.65  then true
   else if topic 5 ≥ 0.50  then true
   else                         false
   Topic 1: type data line code statu packet word function
   Topic 7: mode point control project attitud rate error prd
   Topic 3: messag unsign bit code file byte word ptr
   Topic 5: file variabl code symbol messag line initi access
PITS_B Dataset:
   if      topic 2 ≥ 0.70  then true
   else if topic 4 ≥ 0.75  then false
   else if topic 7 ≥ 0.65  then true
   else if topic 6 ≥ 0.80  then true
   else                         false
   Topic 2: command gce counter step bgi test state antenna
   Topic 4: line code function file declar comment return use
   Topic 7: ace command fsw shall level state trace packet
   Topic 6: test interfac plan file dmr document section data
PITS_C Dataset:
   if      topic 1 ≥ 0.70  then false
   else if topic 6 ≥ 0.55  then true
   else if topic 8 ≥ 0.73  then true
   else if topic 2 ≥ 0.85  then false
   else                         false
   Topic 1: requir fsw command specif state specifi shall ground
   Topic 6: tim trace section document traceabl matrix rqt requir
   Topic 8: appropri thermal field integr test valid ram violat
   Topic 2: header zero posit network indic action spacecraft base
PITS_D Dataset:
   if      topic 6 ≥ 0.50  then false
   else if topic 1 ≥ 0.80  then true
   else if topic 4 ≥ 0.85  then false
   else if topic 9 ≥ 0.60  then false
   else                         true
   Topic 6: essenti record heater occurr indic includ rollov
   Topic 1: fsw csc trace data field fpa tabl command
   Topic 4: enabl wheel use disabl respons control protect fault
   Topic 9: line cpp case switch default projectd file fsw
PITS_E Dataset:
   if      topic 8 ≥ 0.75  then true
   else if topic 5 ≥ 0.70  then false
   else if topic 7 ≥ 0.50  then false
   else if topic 10 ≥ 0.90 then false
   else                         true
   Topic 8: line file function cmd paramet ccu fsw vml
   Topic 5: inst phx test project set document softwar verifi
   Topic 7: ptr size time prioriti ega defin data null
   Topic 10: word fsw enabl capabl follow vagu present emic
PITS_F Dataset:
   if      topic 5 ≥ 0.80  then false
   else if topic 8 ≥ 0.75  then true
   else if topic 2 ≥ 0.50  then true
   else if topic 9 ≥ 0.65  then true
   else                         false
   Topic 5: requir projectf tabl ref boot bsw fsw section
   Topic 8: fsw requir test projectf procedur suffici softwar
   Topic 2: code variabl test point build defin float valu
   Topic 9: number byte word limit buffer dump ffp error
Fig. 6: Comprehensible models generated by FFT for all 6 datasets

RQ2: Is the simpler method more explainable or comprehensible than the most common sophisticated method and the recent state-of-the-art SBSE method?

Despite the comparable performance of the simpler method against the most common sophisticated method and the recent SBSE method, it would bring no merit to software analytics practice without explainable insights that can be easily interpreted from the model. Representative characteristics that make a model more explainable include a small architecture, easy visualization, and a few rules that can quickly lead to decisions. From Table II, with large feature sizes ranging from 550 to 2000 features across the six datasets of our study, a classifier built on top of them will be too big and complex. Since 2013, researchers have focused on using LDA features instead of TFIDF to offer a comprehensible aspect to the models. However, LDA features only provide a better sense of interpretability if we have 10s of features, not 100s. Researchers have shown the top keywords from TFIDF or LDA features [15, 17, 10, 11, 12] in an attempt to compensate for the comprehensibility of the model, but there was no simple decision-making process embedded with them, so the models are not actionable.

For this study, support vector machines were picked as the most common sophisticated method in text mining. SVMs achieve their results after synthesizing new dimensions through a kernel function, dimensions which are totally unfamiliar to human users. Hence, the model is hard to explain to users.

The proposed simple model of FFT with LDA topics, of depth d = 4, references the trend of only 4 LDA topics. At each level of the FFT tree, the existing branch can select either the severe target, i.e., true (denoted “1”), or the non-severe target, i.e., false (denoted “0”), as its exit policy. The exit policies selected by FFT are a trace of the model sampling the space, moving toward the sections of the data containing the severe class of bug reports. With this architecture, LDA+FFT is more explainable for text mining to determine the severity of a bug.

Figure 6 demonstrates how our models can be explainable. The right hand side of the figure shows the four most important topics as a list of top relevant words per dataset. The left hand side includes decision rules of the best performing FFT tree that fit with the LDA generated topics. Some of the possible interpretations of the FFT models from Figure 6 include:


  • Consider the FFT tree for the PitsC dataset at depth 1: the exit policy says that when a report has a probability of topic 1 higher than 0.70, that report is classified as non-severe.

  • In another case, the exit policies for the PitsE FFT are “10001”. It starts off deciding severeness by targeting some low-hanging fruit of severe bug reports. Only after clearing away the non-severe examples at levels two, three, and four does it make a final “true” conclusion. Note that all the exits, except the first and the last, are “false”.

  • The PitsF FFT’s exit policies are “01110”. This is similar to “10001”, except that “01110” starts off clearing away non-severe examples, then commits to finding the target classes, and then clears the rest of the non-severe examples. Note that all the exits, except the first and the last, are “true”.

In practice, business users/experts can use this explainable and comprehensible method to classify a new unseen/unlabeled report as severe or non-severe, reducing the time and cost spent by the business in labeling these reports [66, 67]. For example, once an FFT tree is built on the seen examples using LDA, LDA will automatically produce topic probabilities for a new bug report instance (like topic 1 = 0.7, topic 2 = 0.02, and so on). We can then use these probabilities to traverse the built FFT tree to classify the severeness of the bug report automatically. With the comparable performance demonstrated in RQ1, this method gives those experts an actionable and intuitive, yet more scientific, way to quickly label the severeness of a bug report.
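The traversal just described can be sketched as follows (an illustrative sketch: the cue order and thresholds mimic our reading of Figure 6’s PITS_A tree, and the “≥” comparison direction plus the helper `fft_classify` are our assumptions, not the authors’ code):

```python
# Classify a bug report by walking a depth-4 FFT over LDA topic probabilities.
def fft_classify(topic_prob, cues):
    """topic_prob: dict topic -> probability for one new bug report.
    cues: (topic, threshold, exit_label) per level, then a default label."""
    *levels, default = cues
    for topic, threshold, exit_label in levels:
        if topic_prob.get(topic, 0.0) >= threshold:
            return exit_label          # exit early at this level
    return default                     # fell through every cue

pits_a = [("topic1", 0.80, False),     # exit "false": non-severe
          ("topic7", 0.60, True),      # exit "true": severe
          ("topic3", 0.65, True),
          ("topic5", 0.50, True),
          False]                       # default exit

report = {"topic1": 0.05, "topic7": 0.72}  # hypothetical LDA output
print(fft_classify(report, pits_a))        # True -> severe
```

The four cues and one default are the entire model, which is why a domain expert can audit it line by line.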

Moreover, the comprehensibility of the model also lets experts test theories appropriately. For instance, some of the top words from topic 6 generated for the PitsB dataset (Figure 6) include “test, plan, document, data”, from which a test-planning topic can easily be inferred. By following the respective FFT model, the development team would now take test planning more seriously in the software development lifecycle to minimize future severe bugs in the software. The team has the autonomy to easily refine the method or generalize it for future applications, which are the two strongly suggested characteristics of the power of comprehensibility according to Gleicher [34].

On the other hand, the models generated by the complex or SBSE methods will look like Figure 1. As discussed earlier in Section II-A, an SVM model generates a synthetic feature space and an imaginary hyperplane boundary, which deprives such a model of explanatory power for humans. We cannot use such a decision space to reason from or make actionable.

Altogether, our proposed LDA+FFT method is more actionable and comprehensible than TFIDF+SVM, our most sophisticated method, and LDADE+SVM, the SBSE method. Moreover, the cost of running LDA+FFT in RQ1 is compensated by the interpretability of the model. Hence,

Result 2

The few rules generated by FFT, referencing only 4 topics found by LDA, are far more comprehensible than the most common sophisticated and SBSE methods.

V Discussion

We found that FFT with the small feature space (10 features) found by LDA works as well as SVM with 100s to 1000s of TFIDF features, and much better than the combination of LDADE and SVM, which makes it important to discuss why FFT works. There could be two reasons:


  1. The exit policies selected by FFTs are like a trace of the reasoning jumping around the data. For example, a tree with an 11110 policy jumps toward sections of the data containing the most severe reports. Also, a 00001 tree shows another model trying to jump away from severe reports until, in its last step, it makes one final jump toward severe. This tells us that software data could be “lumpy”, i.e., it divides into a few separate regions, each with different properties. In such a “lumpy” space, a learning policy like FFT works well since its exit policies let a learner discover how best to jump between the “lumps”, while other learners fail in this coarse-grained lumpy space [29, 51].

  2. FFT combines good and bad attributes together to find the best decision policy [37]. FFT finds a rule by identifying the exit policy with the highest probability of leading to a particular class, even if the rule covers a mixed class distribution. On the other hand, learners like SVM transform the data into a different feature space, which can still contain a noisy relationship between the transformed space and the decisions.

Based on the above discussion, we will need to extend the use of FFT to other software analytics tasks on more complex data, to see whether the results of this paper hold there as well.
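Under our reading of the exit-policy notation above (each of the d = 4 levels chooses an exit class, and the final default exit is the complement of the last level’s choice), the candidate trees FFT searches can be enumerated as:

```python
# Enumerate candidate FFT exit policies for depth d = 4: 2^4 = 16 trees,
# written as 5-symbol strings like "11110" or "00001".
from itertools import product

depth = 4
policies = []
for bits in product("01", repeat=depth):
    default = "0" if bits[-1] == "1" else "1"  # complement of the last exit
    policies.append("".join(bits) + default)

print(len(policies))  # 16 candidate trees; FFT keeps the best on training data
```

This matches the paper’s observation that 16 trees are built in total; selecting among them is cheap compared to tuning a high-dimensional learner.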

Vi Threats to Validity

As with any empirical study, biases can affect the final results. Therefore, any conclusions made from this work must keep the following issues in mind.

Order bias: Within each dataset, how the data samples are distributed into the training and testing sets is completely random, so at times all the good samples might be binned into the training set. To mitigate this order bias, we run the experiment 25 times, randomly changing the order of the data samples each time.

Sampling bias threatens any classification experiment, i.e., what matters here may not be true there. For example, the datasets used here come from the SEACRAFT repository and were supplied by one individual. These datasets have been used in various case studies by various researchers [15, 44, 52], i.e., our results are no more biased than many other studies in this arena. That said, our 6 open-source datasets are mostly from NASA, so it remains an open issue whether our results will hold for both proprietary and open-source projects from other sources. Also, our FFT results could be affected by the size of each dataset. These datasets have smaller corpus sizes, so in the future we plan to extend this analysis to larger and higher-dimensional datasets.

Learner bias: For LDADE, we used the default parameters provided by Agrawal et al. [17], but for some datasets tuning them could yield larger improvements. We only used SVM as the comparison classifier, but other classifiers could change our conclusions. Data mining is a large and active field, and any single study can only use a small subset of the known data miners.

Evaluation bias: This paper uses topic similarity for LDADE, and precision and recall for the classifiers, but there are other measures used in software engineering, including perplexity, accuracy, etc. Moreover, based on our experiment, optimizing for precision incurs a loss in recall performance and vice versa. Assessing the performance of both metrics together, showing their trade-offs, is left for future work.

We would also like to point out that FFTs here handle only binary classification; however, FFTs can be extended to accommodate multi-class problems. Also, FFTs do not scale well to 1000s of features and become computationally expensive, which can be further improved. In this study, we used a default depth of 4 to build the trees (in total, 16 trees are built to find the best one), but we also need to try other depth sizes to see what performance changes result, making this a clear focus for future work.

Vii Conclusion

This paper has shown that a simple and comprehensible data mining algorithm, called Fast and Frugal Trees (FFTs), developed by psychological scientists, is remarkably effective at creating a few decision rules that are actionable and browsable.

Despite their succinctness, LDA+FFTs are remarkably effective, showing comparable recall and precision when compared against the most common technique of TFIDF with SVM as well as the state-of-the-art SBSE method (LDADE+SVM). It can also be said that we do not need computationally expensive methods to find succinct models.

From the above, we conclude that there is much the software analytics community could learn from psychological science. Proponents of complex methods should always baseline against simpler alternatives. For example, FFTs could be used as a standard baseline learner against which other software analytics tools are compared.


  • [1] A.-H. Tan et al., “Text mining: The state of the art and the challenges,” in Proceedings of the PAKDD 1999 Workshop on Knowledge Disocovery from Advanced Databases, vol. 8.   sn, 1999, pp. 65–70.
  • [2] W. Zhang, T. Yoshida, and X. Tang, “Text classification based on multi-word with support vector machine,” Knowledge-Based Systems, vol. 21, no. 8, pp. 879–886, 2008.
  • [3] T. Menzies, O. Mizuno, Y. Takagi, and T. Kikuno, “Explanation vs performance in data mining: A case study with predicting runaway projects.” JSEA, vol. 2, no. 4, pp. 221–236, 2009.
  • [4] A. Vellido, J. D. Martín-Guerrero, and P. J. Lisboa, “Making machine learning models interpretable.” in ESANN, vol. 12.   Citeseer, 2012, pp. 163–172.
  • [5] J. Moeyersoms, E. J. de Fortuny, K. Dejaeger, B. Baesens, and D. Martens, “Comprehensible software fault and effort prediction: A data mining approach,” Journal of Systems and Software, vol. 100, pp. 80–90, 2015.
  • [6] D. Martens, J. Vanthienen, W. Verbeke, and B. Baesens, “Performance of classification models from a user perspective,” Decision Support Systems, vol. 51, no. 4, pp. 782–793, 2011.
  • [7] K. Dejaeger, T. Verbraken, and B. Baesens, “Toward comprehensible software fault prediction models using bayesian network classifiers,” IEEE Transactions on Software Engineering, vol. 39, no. 2, pp. 237–257, 2013.
  • [8] J. Hihn and T. Menzies, “Data mining methods and cost estimation models: Why is it so hard to infuse new ideas?” in Automated Software Engineering Workshop (ASEW), 2015 30th IEEE/ACM International Conference on.   IEEE, 2015, pp. 5–9.
  • [9] A. Lamkanfi, S. Demeyer, Q. D. Soetens, and T. Verdonck, “Comparing mining algorithms for predicting the severity of a reported bug,” in Software Maintenance and Reengineering (CSMR), 2011 15th European Conference on.   IEEE, 2011, pp. 249–258.
  • [10] X. Xia, D. Lo, E. Shihab, X. Wang, and B. Zhou, “Automatic, high accuracy prediction of reopened bugs,” Automated Software Engineering, vol. 22, no. 1, pp. 75–109, 2015.
  • [11] P. S. Kochhar, F. Thung, and D. Lo, “Automatic fine-grained issue report reclassification,” in Engineering of Complex Computer Systems (ICECCS), 2014 19th International Conference on.   IEEE, 2014, pp. 126–135.
  • [12] X. Xia, D. Lo, W. Qiu, X. Wang, and B. Zhou, “Automated configuration bug report prediction using text mining,” in Computer Software and Applications Conference (COMPSAC), 2014 IEEE 38th Annual.   IEEE, 2014, pp. 107–116.
  • [13] K. Chaturvedi and V. Singh, “Determining bug severity using machine learning techniques,” in Software Engineering (CONSEG), 2012 CSI Sixth International Conference on.   IEEE, 2012, pp. 1–6.
  • [14] M. Sharma, P. Bedi, K. Chaturvedi, and V. Singh, “Predicting the priority of a reported bug using machine learning techniques and cross project validation,” in Intelligent Systems Design and Applications (ISDA), 2012 12th International Conference on.   IEEE, 2012, pp. 539–545.
  • [15] L. Layman, A. P. Nikora, J. Meek, and T. Menzies, “Topic modeling of nasa space system problem reports: research in practice,” in Mining Software Repositories (MSR), 2016 IEEE/ACM 13th Working Conference on.   IEEE, 2016, pp. 303–314.
  • [16] T.-H. Chen, S. W. Thomas, M. Nagappan, and A. E. Hassan, “Explaining software defects using topic models,” in Proceedings of the 9th IEEE Working Conference on Mining Software Repositories.   IEEE Press, 2012, pp. 189–198.
  • [17] A. Agrawal, W. Fu, and T. Menzies, “What is wrong with topic modeling? and how to fix it using search-based software engineering,” Information and Software Technology, 2018.
  • [18] H. Brighton, “Robust inference with simple cognitive models.” in AAAI spring symposium: Between a rock and a hard place: Cognitive science principles meet AI-hard problems, 2006, pp. 17–22.
  • [19] G. Gigerenzer, J. Czerlinski, and L. Martignon, “How good are fast and frugal heuristics?” in Decision science and technology.   Springer, 1999, pp. 81–103.
  • [20] Y. Tian, D. Lo, X. Xia, and C. Sun, “Automated prediction of bug report priority using multi-factor analysis,” Empirical Software Engineering, vol. 20, no. 5, pp. 1354–1383, 2015.
  • [21] Y. Tian, N. Ali, D. Lo, and A. E. Hassan, “On the unreliability of bug severity data,” Empirical Software Engineering, vol. 21, no. 6, pp. 2298–2323, 2016.
  • [22] Y. Tian, D. Lo, and C. Sun, “Drone: Predicting priority of reported bugs by multi-factor analysis,” in Software Maintenance (ICSM), 2013 29th IEEE International Conference on.   IEEE, 2013, pp. 200–209.
  • [23] F. Thung, D. Lo, and L. Jiang, “Automatic defect categorization,” in Reverse Engineering (WCRE), 2012 19th Working Conference on.   IEEE, 2012, pp. 205–214.
  • [24] X. Xia, D. Lo, Y. Ding, J. M. Al-Kofahi, T. N. Nguyen, and X. Wang, “Improving automated bug triaging with specialized topic model,” IEEE Transactions on Software Engineering, vol. 43, no. 3, pp. 272–297, 2017.
  • [25] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.
  • [26] Y. Zhou, Y. Tong, R. Gu, and H. Gall, “Combining text mining and data mining for bug report classification,” Journal of Software: Evolution and Process, vol. 28, no. 3, pp. 150–176, 2016.
  • [27] A. Agrawal and T. Menzies, “Is “better data” better than “better data miners” (benefits of tuning smote for defect prediction),” International Conference on Software Engineering, 2018.
  • [28] W. Fu, T. Menzies, and X. Shen, “Tuning for software analytics: Is it really necessary?” Information and Software Technology, vol. 76, pp. 135–146, 2016.
  • [29] D. Chen, W. Fu, R. Krishna, and T. Menzies, “Applications of psychological science for actionable analytics,” arXiv preprint arXiv:1803.05067, 2018.
  • [30] M. Kim, T. Zimmermann, R. DeLine, and A. Begel, “The emerging role of data scientists on software development teams,” in Proceedings of the 38th International Conference on Software Engineering, ser. ICSE ’16.   New York, NY, USA: ACM, 2016, pp. 96–107. [Online]. Available:
  • [31] H. K. Dam, T. Tran, and A. Ghose, “Explainable software analytics,” CoRR, vol. abs/1802.00603, 2018. [Online]. Available:
  • [32] Z. C. Lipton, “The mythos of model interpretability,” CoRR, vol. abs/1606.03490, 2016. [Online]. Available:
  • [33] T. Menzies and Y. Hu, “Data mining for very busy people,” Computer, vol. 36, no. 11, pp. 22–29, 2003.
  • [34] M. Gleicher, “A framework for considering comprehensibility in modeling,” Big data, vol. 4, no. 2, pp. 75–88, 2016.
  • [35] D. Martens, B. Baesens, T. Van Gestel, and J. Vanthienen, “Comprehensible credit scoring models using rule extraction from support vector machines,” European journal of operational research, vol. 183, no. 3, pp. 1466–1476, 2007.
  • [36] D. Martens and F. Provost, “Explaining data-driven document classifications,” Management Information Systems Quarterly, vol. 38, no. 1, pp. 73–99, 2014.
  • [37] N. D. Phillips, H. Neth, J. K. Woike, and W. Gaissmaier, “Fftrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees,” Judgment and Decision Making, vol. 12, no. 4, p. 344, 2017.
  • [38] H. Brighton, “Robust inference with simple cognitive models,” in Between a rock and a hard place: Cognitive science principles meet AI-hard problems: Papers from the AAAI Spring Symposium.   AAAI Press, 2006, pp. 17–22.
  • [39] O. Maimon and L. Rokach, “Decomposition methodology for knowledge discovery and data mining,” in Data mining and knowledge discovery handbook.   Springer, 2005, pp. 981–1003.
  • [40] J. Larkin, J. McDermott, D. P. Simon, and H. A. Simon, “Expert and novice performance in solving physics problems,” Science, vol. 208, no. 4450, pp. 1335–1342, 1980.
  • [41] N. M. Nasrabadi, “Pattern recognition and machine learning,” Journal of electronic imaging, vol. 16, no. 4, p. 049901, 2007.
  • [42] B. Haasdonk, “Feature space interpretation of svms with indefinite kernels,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 4, pp. 482–492, 2005.
  • [43] Cathy Yeh, “Support Vector Machines for classification,” 2015, online; accessed 24 April 2018.
  • [44] T. Menzies and A. Marcus, “Automated severity assessment of software defect reports,” in Software Maintenance, 2008. ICSM 2008. IEEE International Conference on.   IEEE, 2008, pp. 346–355.
  • [45] N. Pingclasai, H. Hata, and K.-i. Matsumoto, “Classifying bug reports to bugs and other requests using topic modeling,” in Software Engineering Conference (APSEC), 2013 20th Asia-Pacific, vol. 2.   IEEE, 2013, pp. 13–18.
  • [46] J. H. Friedman, “On bias, variance, 0/1—loss, and the curse-of-dimensionality,” Data mining and knowledge discovery, vol. 1, no. 1, pp. 55–77, 1997.
  • [47] L. Van Der Maaten, E. Postma, and J. Van den Herik, “Dimensionality reduction: a comparative,” J Mach Learn Res, vol. 10, pp. 66–71, 2009.
  • [48] I. K. Fodor, “A survey of dimension reduction techniques,” Lawrence Livermore National Lab., CA (US), Tech. Rep., 2002.
  • [49] T. Wang, M. Harman, Y. Jia, and J. Krinke, “Searching for better configurations: a rigorous approach to clone evaluation,” in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering.   ACM, 2013, pp. 455–465.
  • [50] L. Martignon, K. V. Katsikopoulos, and J. K. Woike, “Categorization with limited resources: A family of simple heuristics,” Journal of Mathematical Psychology, vol. 52, no. 6, pp. 352–361, 2008.
  • [51] W. Fu, T. Menzies, D. Chen, and A. Agrawal, “Building better quality predictors using “-dominance”,” arXiv preprint arXiv:1803.04608, 2018.
  • [52] T. Menzies, “Improving iv&v techniques through the analysis of project anomalies: Text mining pits issue reports-final report,” Citeseer, 2008.
  • [53] R. Feldman and J. Sanger, Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data.   New York, NY, USA: Cambridge University Press, 2006.
  • [54] B. Ghotra, S. McIntosh, and A. E. Hassan, “Revisiting the impact of classification techniques on the performance of defect prediction models,” in Proceedings of the 37th International Conference on Software Engineering-Volume 1.   IEEE Press, 2015, pp. 789–800.
  • [55] M. White, C. Vendome, M. Linares-Vásquez, and D. Poshyvanyk, “Toward deep learning software repositories,” in Mining Software Repositories (MSR), 2015 IEEE/ACM 12th Working Conference on.   IEEE, 2015, pp. 334–345.
  • [56] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, “Deep learning for just-in-time defect prediction,” in Software Quality, Reliability and Security (QRS), 2015 IEEE International Conference on.   IEEE, 2015, pp. 17–26.
  • [57] Synced (AI Technology &amp; Industry Review). (2017) LeCun vs Rahimi: Has machine learning become alchemy? [Online].
  • [58] S. Majumder, N. Balaji, K. Brey, W. Fu, and T. Menzies, “500+ times faster than deep learning (a case study exploring faster methods for text mining stackoverflow),” in Mining Software Repositories (MSR), 2018 IEEE/ACM 15th International Conference on.   ACM, 2018.
  • [59] M. Camilleri and F. Neri, “Parameter optimization in decision tree learning by using simple genetic algorithms,” WSEAS Transactions on Computers, vol. 13, pp. 582–591, 2014.
  • [60] P. Refaeilzadeh, L. Tang, and H. Liu, “Cross-validation,” in Encyclopedia of database systems.   Springer, 2009, pp. 532–538.
  • [61] R. Kohavi et al., “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in IJCAI, vol. 14, no. 2.   Stanford, CA, 1995, pp. 1137–1145.
  • [62] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.
  • [63] N. Mittas and L. Angelis, “Ranking and clustering software cost estimation models through a multiple comparisons algorithm,” IEEE Transactions on software engineering, vol. 39, no. 4, pp. 537–551, 2013.
  • [64] A. Agrawal, A. Rahman, R. Krishna, A. Sobran, and T. Menzies, “We don’t need another hero? the impact of “heroes” on software development,” in Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice Track, ser. ICSE-SEIP ’18, 2018, to appear, preprint:
  • [65] A. Arcuri and L. Briand, “A practical guide for using statistical tests to assess randomized algorithms in software engineering,” in Software Engineering (ICSE), 2011 33rd International Conference on.   IEEE, 2011, pp. 1–10.
  • [66] J. Deng, O. Russakovsky, J. Krause, M. S. Bernstein, A. Berg, and L. Fei-Fei, “Scalable multi-label annotation,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.   ACM, 2014, pp. 3099–3102.
  • [67] D. Chen, K. T. Stolee, and T. Menzies, “Replicating and scaling up qualitative analysis using crowdsourcing: A github-based case study,” arXiv preprint arXiv:1702.08571, 2017.