Software analytics has focused on applying sophisticated, state-of-the-art data miners in order to find optimal results. One sub-topic of software analytics is the use of sophisticated text mining techniques. Text mining is a much more complex task, as it involves dealing with high-dimensional textual data that is inherently unstructured [2, 1]. These complex methods often generate models that are not comprehensible to humans (e.g., models that use synthetic dimensions generated by an SVM kernel). This complexity might not be necessary if simpler methods can achieve the same performance while also generating easy-to-understand models. We define the terms “simple” and “comprehensible” in this paper as:
simple - (1) has low dimensionality of features (in the 10s, not the 100s to 1,000s); (2) generates a small set of theories; and (3) is not computationally expensive. Otherwise, we call it “complex”.
comprehensible - (1) comprises small rules and (2) has rules that quickly lead to decisions.
Moeyersoms et al. comment that predictive models need to be not only accurate but also comprehensible, so that the user can understand the motivation behind the model’s prediction. They further remark that, to obtain such predictive performance, comprehensibility is often sacrificed, and vice versa. Do simpler methods perform worse? Martens et al. refer to comprehensibility as how well humans grasp the induced classifier, or how strong the mental fit of the classifier is. Dejaeger et al. state that comprehensible models are often needed to inspire confidence in a business setting and to improve model acceptance. Business users are vocal in their complaints about analytics, stating that analytics rarely produces models that business users can comprehend.
Researchers in SE use complex methods, such as Support Vector Machines (SVM) with 1,000 to 10,000s of Term Frequency (TF) or Term Frequency - Inverse Document Frequency (TFIDF) features, in order to achieve high prediction performance. Yet they do not try to comprehend the model itself [9, 10, 11, 12, 13, 14], which makes business users more hesitant to adopt their methodologies and diminishes the value of their work. Latent Dirichlet Allocation (LDA) uses fewer features, but it still requires 100s of features to find an optimal model and be human-comprehensible [15, 16]. An alternative, better, search-based SE method (LDADE) was proposed recently, which tries to find optimal parameters of LDA that make the model more stable and achieve optimal results. The problem with this method is that it is quite expensive in terms of CPU usage and still needs 100s of features to be comprehensible. We need a simple method that 1) offers comparable performance and 2) is human-comprehensible.
This paper studies a simple data miner taken from the psychological science literature, i.e., FFT, which outputs small trees (and, generally, smaller is more comprehensible [18, 19]). In this study, FFT builds its trees from LDA features generated with default parameters, which does not require any expensive optimization to find the optimal K. We seek a few rules through FFT that can label reports as severe or non-severe for the datasets under study. We compared this method against complex and commonly used methods in the SE literature, which are 1) TFIDF+SVM [9, 11, 20, 21]; and 2) a recent state-of-the-art system, LDADE+SVM. Based on this comparative analysis, we answer two research questions:
RQ1: How does a simpler method perform against the most common sophisticated and recent state-of-the-art Search Based SE (SBSE) methods?
For software analytics, most text mining techniques use high dimensional TF or TFIDF features with complex classifiers like SVM [9, 10, 11, 12, 13, 14, 22, 20, 23, 21, 24]. These features are large in number, in the range of 1,000 to 10,000s, which makes any classifier complex. Researchers shifted their focus to LDA features in text mining, since LDA is a good means of dimensionality reduction [25, 15, 26]. SBSE was recently introduced to find the optimal parameters, at the expense of heavy runtime [17, 27, 28]. Agrawal et al. tuned the parameters of LDA to find the optimal number of topics (K), which is then used by SVM for the classification task (the state-of-the-art SBSE method).
We show that FFT (with a depth, d = 4) uses just 10 topics from LDA (the simpler method) to achieve performance comparable to SVM with TFIDF features (the sophisticated method) as well as LDADE with SVM (the SBSE method). Building the simpler method takes about 10 minutes longer than the sophisticated method, but this may not be an arduous increase given the gains in comprehensibility, and the simpler method is 100 times faster than the SBSE method. Hence, we conclude that,
The simpler method (LDA+FFT) offers performance similar to the sophisticated method (TFIDF+SVM) and the SBSE method (LDADE+SVM). Although LDA+FFT takes an extra 10 minutes compared with the TFIDF+SVM baseline, it is orders of magnitude faster than the SBSE method.
RQ2: Is a simpler method more explainable or comprehensible than the most common sophisticated and recent state-of-the-art SBSE methods?
RQ1 showed that the simpler method offers comparable performance against the sophisticated and SBSE methods. Now we turn to the core of our study, which is comprehensibility. Why do we need comprehensible models? We need them to obtain actionable insights from the model, which boosts the confidence of businesses in accepting the model for their software.
Certain characteristics make a model more explainable: being small, being easy to visualize, and comprising few rules that quickly lead to decisions. A feature range of 1,000 to 10,000s makes any classifier large and non-comprehensible by default. LDA features make a model more comprehensible than TFIDF or TF features do [26, 15].
We show that FFT with LDA features, referencing only 4 topics (depth, d = 4), provides an explainable model satisfying the characteristics mentioned earlier. Also, we do not need an SBSE method, which is orders of magnitude slower, to find the optimal K when a simpler method can provide a highly comprehensible model. Hence, we conclude that
FFT generates fewer rules, referencing only 4 topics found by LDA, that are far more comprehensible than the most common sophisticated and SBSE methods.
In summary, the main contributions of this paper are:
A novel inter-disciplinary contribution: the application of psychological science to the comprehensibility of text mining models.
LDA+FFTs offer comparable performance against a common text mining method, TFIDF+SVM.
LDA+FFTs are better, faster, and more comprehensible than the recent state-of-the-art method, LDADE+SVM.
A new, very simple baseline data mining method (LDA+FFTs) against which more complex methods can be compared.
A reproduction package containing all the data and algorithms of this paper, see https://github.com/ai-se/LDA_FFT.
The rest of this paper is structured as follows: Section II covers the background and theory of comprehensibility. Section III describes the experimental setup of this paper, and the above research questions are answered in Section IV. Lastly, we discuss the validity of our results and present our conclusions.
II Motivation and Related Work
This section covers the theory of comprehensibility, the most commonly used text mining methods for bug report classification, the curse of dimensionality, and the power of computationally faster methods. We also show how FFTs, a strong alternative to the existing approaches, are generated.
II-A Theory of Comprehensibility
For software analytics, it is necessary to find models that can produce simple and actionable insights for software practitioners to interpret and act upon. Models are effectively useless if they cannot be interpreted by researchers, developers, and testers. Business users have been vocal in their complaints about analytics, saying that analytics rarely produces models that they can comprehend. According to several researchers [30, 31, 32], actionable insights from software artifacts are the core deliverable of software analytics. Users then apply these insights to enhance their productivity, which is measured in terms of the tasks that are accomplished. However, is model comprehensibility taken into consideration in the process of development?
Machine learners generate theories and people read theories. But how many such learners generate the kind of theories that machine learning practitioners can read? In practice, big data and tremendous amounts of information are available, yet the time and resources to explore them are limited; consider a manager rushing to meet a software release deadline, or stockbrokers making instant decisions about buying or selling stocks. In such critical situations, a person might just want the minimum expert-level comprehension of the domain that achieves the most benefit. It therefore follows that machine learning for these practical cases should not strive for elaborate theories or for expressive power of the language. A better goal for machine learning would be to find the smallest set of theories with the most impact and benefit.
Also, in today’s businesses, the problem is not accessing data but ignoring the irrelevant data. Most modern businesses can electronically access large amounts of data, such as transactions for the past two years or the state of their assembly line. The trick is effectively using the available data. In practice, this means summarizing large datasets to find the “pearls in the dust” - that is, the data that really matters.
That is why Gleicher developed a framework of comprehensibility and concluded that many researchers do not consider the power of comprehensibility and miss out on important aspects of their results. According to Gleicher:
Comprehensibility makes us understand a prediction to appropriately trust it, or a predictive process to trust in its ability to make predictions.
Comprehensibility helps in prescriptiveness, which is the quality of a model that allows its user to act on something with a result, e.g., its ability to inform action.
Understanding of a model can drive iterative refinement that is applied to improve predictive accuracy, efficiency, and robustness.
While a statistical model usually uncovers correlations, discovers causality, it can also be a useful starting point for theory building, or an approach towards testing theory.
Comprehensibility helps characterize a model by making it easy to interpret what the model can do and where it can be applied.
It can generalize modeling to other situations which can be part of other (future) applications.
It can identify the success (or failures) in one model, modeling application, or modeling process, that can help us to improve our practices for future applications.
Comprehensibility is defined as the ability of the various stakeholders to understand relevant aspects of the modeling process. How can a model be comprehensible? According to various researchers [34, 4, 35, 36], a model can be made comprehensible through rule-based learning [37, 38], a smaller output size (i.e., smaller models), or better visualization.
According to Phillips et al., a model is comprehensible enough for a human when the human can fit the model into their Long Term Memory (LTM) and when the rules within the model can efficiently lead to decisions. Consider the SVM model shown in Figure 1. A human would not be able to reason from such a sophisticated output for 2 reasons: 1) The model consists mostly of points of transformed data in a new multi-dimensional feature space automatically inferred by some kernel function. Due to the arcane nature of these kernels, it is hard for humans to attribute meaning to these points [41, 3]; and 2) The model infers a decision boundary or hyperplane (as shown in Figure 1) without any generalization. An SVM defines its decision boundary in terms of the vectors nearest that boundary. All these “support vectors” are just points in space, so understanding any one of them incurs the problems described in the first reason.
Further, SVMs offer much less support for understanding the entire set of these points than, say, some rule-based representation (as shown in Figure 2, which is an example created by our proposed method on the dataset under study). To understand this, consider a condition (an inequality on a single attribute) that might be found in a rule-based representation. Within the hyperspace of all data, this inequality defines a region within which certain conclusions are true, regardless of other attributes. That is, this condition is a generalization across a large space of examples, a region that humans can understand as “within this space, certain properties exist”. The same is not true for support vectors. Such vectors do not tell humans which attributes are most important for selecting one conclusion over another, nor can they divide a space of examples into multiple regions. Rule-based representations do not have that limitation. They can divide space into multiple sectors within which humans know how far they can adjust a few key attributes in order to move from one classification to another.
Consequently, psychological scientists have developed FFT as a rule-based model that is quickly comprehensible and comprises few rules. An FFT is a binary tree classifier in which one or both branches of every node terminate in a decision. In effect, each question being asked (here, each topic information feature) triggers an immediate understanding and action. As shown in Figure 2, the complex model of Figure 1 can be made comprehensible using an FFT that is just 5 lines of rules. We study FFT in greater detail in Section II-E.
Menzies et al. obtained similar Decision Tree (DT) rules for the same dataset, PitsA, which is under study in this paper. A condensed example of their rules is shown in Figure 3. The conditions in those rules are at the term occurrence level, whereas the conditions in our FFT example (Figure 2) are at the topic information level. The term occurrence conditions failed to provide any generalized intuition or expert comprehension of how to use such a rule to classify bug reports automatically. In our proposed FFT, by contrast, we observed that if the topic 3 score meets the rule’s 0.65 threshold, then the report can be classified as severe. The top terms denoting topic 3 are messag unsign bit code file byte word ptr, and we can say these terms generalize to a “type conversion” topic.
Developers can now use this information to avoid future mistakes in code where type conversion happens. We contacted the original users of the PITS data and asked them to look at the topics we generated (and the conditions where they were found). They agreed that their rules were not generalizable, i.e., they could not use those rules to improve their systems, but that the topics we generated are highly relevant and practical. This supports our claim that the rules generated by our FFT at the topic occurrence level are more comprehensible.
While this paper places high value on comprehensibility, we note that much prior work has ignored this issue. In March 2018, we searched Google Scholar for papers published in the last decade that use text mining to build defect/bug predictors and also discuss comprehensibility. From that list, we selected “highly-cited” papers, which we defined as having more than 5 citations per year. After reading the titles and abstracts of those papers, and skimming the contents of the potentially interesting ones, we found the 16 papers shown in Table I, which motivate our study.
II-B Bug Reports Classification
The case studies used in this paper come from text classification of bug reports. This section describes those case studies.
Much SE text mining research has addressed bug report classification, which categorizes the description of a fault occurrence in a software system. Zhou et al. found the top 20, 50, and 100 terms and used these as features to build Naive Bayes and Logistic Regression classifiers. They reported precision, recall, and f-score, and concluded that their method had a significant improvement over other proposed methods. Yet they did not use these top terms to comprehend the prediction model. Menzies et al. used the TFIDF featurization technique with a Naive Bayes classifier to predict the severity of defect reports, but did not show how to interpret such a method. A few researchers [13, 14] used only the top TF features to build an SVM classifier but did not provide interpretability of the method.
Many other researchers used SVM as a classifier with a high number of TF features to do bug/defect prediction [9, 10, 11, 12], and they provided the top significant terms to explain the cause of these bugs. In other works, a few researchers used SVM with a high number of TF features but did not report terms to provide any explanation [22, 20, 23, 21].
Researchers have also used LDA’s document topic distribution as features to build bug report prediction models [45, 15, 16, 24]. Xia et al. used LDA features with an SVM classifier, but their approach offered no interpretability. Pingclasai et al. compared the different topic sizes needed by LDA against different numbers of top TF features, and reported the topic size at which LDA yields the best f-score. Layman et al. used different numbers of topics to identify the severity of bug reports on 6 NASA Space System Problem datasets. They also showed, comprehensibly, what these reports were talking about. The problem is that they chose a high number of topics. Also, Chen et al. used LDA to identify whether a defect-prone module stays defect-prone in future versions. They showed the top topics with the top words related to defects. However, like Layman et al., they used a high number of topics.
We looked at recent studies that use high dimensional features combined with different classifiers, such as Naive Bayes, SVM, and Logistic Regression [9, 26, 44], to accurately model the data. Of these, SVM is the most commonly used classifier. From Table I, we can see that 11/16 (about 70%) of the highly cited papers used SVM as a classifier. Therefore, we chose SVM as the complex baseline learner to compare against the simple FFT model.
II-C Curse of Dimensionality
All text mining techniques model high dimensional data, i.e., a corpus of documents that contains a large vocabulary of unique words. The common problem with such data is that, as the dimensionality increases, the volume of the space increases drastically, which makes the available data sparse. This sparsity is problematic when we try to find statistically sound and reliable results, because the amount of data needed to support a result often grows exponentially with the dimensionality. Also, modeling such high dimensional data often relies on detecting areas where objects form groups with similar properties. In high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
High dimensional data also increases the complexity of data modeling and is a curse for finding comprehensible models. Researchers use TF and TFIDF feature extraction techniques [44, 12], which provide 1,000 to 10,000s of features for a learner to model. Such numerous features do not yield small, concise, comprehensible models. From Table I, we can see that all 16 papers use high dimensional features, which drives us to find alternative methods for reducing dimensionality.
To tackle the curse of dimensionality, researchers employ different dimensionality reduction techniques, such as feature transformation (Principal Component Analysis, Latent Dirichlet Allocation), sampling, feature selection, and many more [47, 48]. For text mining, researchers have mostly used a feature transformation or feature selection technique to reduce the feature space and find the top words from the corpus, which can then be used in classifiers [26, 13, 14].
Latent Dirichlet Allocation (LDA) is a common text mining technique for dimensionality reduction [15, 17]. LDA provides topics that are comprehensible enough for researchers to browse through and make decisions, as shown by Agrawal et al. We agree with their work and their motivation for choosing such a feature extraction technique. That is why we chose LDA as our feature extraction technique (since we get concise topics). After combining it with FFT (depth, d = 4), we get a few rules that are comprehensible enough while achieving better or comparable classification performance.
II-D Computationally Inexpensiveness
There always exists a trade-off between the effectiveness of a method and the cost of running it. A method should not be expensive to apply (measured in terms of required CPU or runtime). Before a community can adopt a method, we need to ensure that the method executes quickly. Some methods, especially those used for hyperparameter optimization (the problem of choosing a set of optimal parameters for a learning algorithm), can require hours to days to years of CPU time to terminate [49, 17]. Hence, unlike such methods, we need to select baseline methods that are reasonably fast.
One such resource-expensive method was recently introduced by Agrawal et al., who optimized the hyperparameters of LDA to find the optimal settings. They optimized LDA for a score that measures how stable the generated topics are. They showed that stable topics are needed if developers/users use these topics for further analysis, especially in unsupervised learning. They also used these stable topics for supervised learning and showed that the prediction performance is comparable to the commonly used text mining technique of TFIDF with an SVM classifier. The major drawback of their method is that it is computationally expensive: it is about three to five times slower. This expense has 2 causes: 1) the use of a computationally expensive optimizer (Differential Evolution); and 2) the number of topics, which directly affects the runtime, i.e., the more topics, the longer the runtime.
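To illustrate the first source of cost, the following is a minimal sketch of a generic Differential Evolution loop (DE/rand/1/bin). It is not the LDADE implementation: the function name `differential_evolution`, its parameter values, and the toy `loss` are illustrative assumptions. The point is the evaluation count: every candidate costs one call to `loss`, and in a tuner such as LDADE each such call would involve fitting LDA.

```python
import random


def differential_evolution(loss, bounds, pop_size=10, f=0.7, cr=0.3,
                           generations=3, seed=1):
    """Minimize `loss` over box `bounds`; returns (best, score, evaluations)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [loss(p) for p in pop]
    evaluations = pop_size
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct candidates other than the current one.
            others = [p for j, p in enumerate(pop) if j != i]
            a, b, c = rng.sample(others, 3)
            keep = rng.randrange(len(bounds))  # force at least one mutated gene
            trial = []
            for k, (lo, hi) in enumerate(bounds):
                if k == keep or rng.random() < cr:
                    value = a[k] + f * (b[k] - c[k])
                    value = min(max(value, lo), hi)  # clip to the bounds
                else:
                    value = pop[i][k]
                trial.append(value)
            trial_score = loss(trial)  # the expensive step in a real tuner
            evaluations += 1
            if trial_score < scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best], evaluations


# Toy usage: a cheap quadratic stands in for "fit LDA and score its topics".
best, score, evaluations = differential_evolution(
    lambda p: (p[0] - 3.0) ** 2, bounds=[(0.0, 10.0)])
```

With these illustrative settings the loop already makes `pop_size * (generations + 1)` loss evaluations, which is why a tuner built on it multiplies the cost of whatever model it wraps.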
As previously mentioned, we chose LDA features for their power of comprehensibility. However, we do not want to use an expensive technique like LDADE when we have the option of using default parameters, which does not sacrifice performance and achieves much better comprehensibility with FFT.
II-E How are FFTs generated?
Psychological scientists have developed FFTs (Fast and Frugal Trees) as one way to generate comprehensible models consisting of separate tiny rules [37, 29, 50]. An FFT is a decision tree for binary classification problems with exactly two branches extending from each node, where one or both branches is an exit branch leading to a leaf. That is to say, in an FFT, every question posed by a node triggers an immediate decision (so humans can read every leaf node as a separate rule).
We used an implementation of FFT similar to that offered by Fu and Chen et al. [51, 29]. An FFT of depth d has a choice of two “exit policies” at each level: the exit branch can select for the negation of the target, i.e., non-severe (denoted “0”), or for the target, i.e., severe (denoted “1”). The right-hand-side tree in Figure 4 is 01110 since:
The first level found a rule that exits to the negation of the target: hence, “0”.
The next three levels found rules that exit to the target; hence, “111”.
And the final line of the model exits to the opposite of the penultimate line; hence, the final “0”.
Following the advice of [51, 29, 37], for all the experiments of this paper, we use a depth d = 4. For trees of depth d = 4, there are $2^4 = 16$ possible trees, which can be denoted as 00001, 00010, 00101, … , 11110. During FFT training, all 16 trees are generated, and we then select the best one (using the training data). This single best tree is then applied to the test data. Note that FFTs of such small depths are very succinct (see examples in Figures 2 and 4). Such FFTs generate rules that lead to the decision of labeling a report as severe or non-severe for the datasets under study. Many other data mining algorithms used in software analytics are far less succinct and far less comprehensible, as explained in Section II-A.
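The enumeration and selection described above can be sketched as follows. This is a minimal illustration, not the implementation of [51, 29]: the helper names, the fixed per-level `(feature, threshold)` rules, the `>=` condition, the accuracy-based selection, and the depth of 4 (inferred from the five-character policy strings) are all assumptions.

```python
from itertools import product


def exit_policies(depth):
    """Enumerate exit-policy strings for an FFT of the given depth.

    Each level chooses an exit class ("0" or "1"); the final character is
    forced to be the opposite of the last level's exit.
    """
    policies = []
    for bits in product("01", repeat=depth):
        last = "0" if bits[-1] == "1" else "1"
        policies.append("".join(bits) + last)
    return policies


def fft_predict(row, rules, policy):
    """Classify one row. `rules` holds one (feature, threshold) per level."""
    for (feature, threshold), exit_class in zip(rules, policy):
        fired = row[feature] >= threshold  # treated as evidence for class 1
        if exit_class == "1" and fired:
            return 1  # exit to the target
        if exit_class == "0" and not fired:
            return 0  # exit to the negation of the target
    return int(policy[-1])  # fall through the last level


def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def best_policy(rows, labels, rules, score=accuracy):
    """Try every exit policy on the training data and keep the best one."""
    return max(
        exit_policies(len(rules)),
        key=lambda pol: score([fft_predict(r, rules, pol) for r in rows], labels),
    )


# Depth 4 gives 16 candidate trees, including the "01110" example.
policies = exit_policies(4)
```

In this sketch only the exit policy is searched; a fuller implementation would also choose the feature and threshold at each level, and could swap `accuracy` for precision or recall.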
All our data, experiments, and scripts are available for download from https://github.com/ai-se/LDA_FFT.
PITS is a widely used text mining dataset in SE studies [44, 52, 15]. The dataset is generated from NASA software project and issue tracking system (PITS) reports [52, 44]. The text discusses bugs and changes found in bug reports and review patches. Such issues are used to manage quality assurance and to support communication between developers. Text mining techniques can be used to predict each severity separately. The dataset can be downloaded from http://tiny.cc/seacraft. Note that this data comes from six different NASA projects, which we label as PitsA, PitsB, and so on. For this study, we converted the severities into a binary classification in which the severity with the largest number of reports is labeled as the positive class and the rest as negative. We employed the usual preprocessing steps mentioned in the text mining literature [17, 53], which are tokenization, stop-word removal, and stemming. Table II shows the number of documents, the feature size, and the percentage of severe classes after preprocessing.
III-B Feature Extraction
Textual data is a series of words. To run machine learning algorithms, we need to convert the text into numerical feature vectors. We used 2 types of feature extraction techniques:
Term Frequency-Inverse Document Frequency (TFIDF): If a word occurs $w$ times and is found in $d$ documents, and there are $W$ and $D$ total words and documents respectively, then TFIDF is scored as follows:
$$\mathrm{TFIDF}(w, d) = \frac{w}{W} \log\left(\frac{D}{d}\right)$$
Topic Information Features (LDA): We need to decide the number of topics before applying the LDA model to generate topic information features. To identify the number of topics, we employed 2 strategies: 1) manually setting the number of topics (10, 25, 50, 100); and 2) choosing an optimal K using the LDADE method. The LDA model produces the probability of a document belonging to each topic, which is used as a feature vector. Normally, the number of topics is significantly smaller than the number of terms, so LDA can effectively reduce the feature dimension.
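As an illustration of the first feature type, the sketch below scores TFIDF in plain Python, assuming the common form (w / W) · log(D / d). Here w is the count of a term in a document, W is taken to be the number of words in that document, D is the number of documents, and d is the number of documents containing the term. The function name `tfidf` and the choice of W are assumptions made for this example, not details taken from our scripts.

```python
import math
from collections import Counter


def tfidf(docs):
    """Score each term of each tokenized document.

    docs: list of token lists. Returns one {term: score} dict per document.
    """
    n_docs = len(docs)
    # d: number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)  # w: occurrences of each term in this document
        total = len(doc)       # W: total words in this document (assumption)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy usage with stemmed tokens; a term found in every document scores zero.
features = tfidf([["bit", "byte", "bit"], ["file", "byte"]])
```

Every distinct term becomes one feature, which is how such representations reach thousands of dimensions, whereas topic features have one dimension per topic.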
For this study we used 2 machine learning algorithms: 1) Support Vector Machine (SVM) and 2) Fast and Frugal Trees (FFTs). We use these as explained earlier in Section II [54, 27, 55, 56]. However, deep learning does not readily support explainability; it has been criticized as “data mining alchemy”, and a recent study by Majumder et al. suggests it may not be the most useful for SE data. DT or RF can generate a small set of rules, but performance can be sacrificed. Camilleri et al. showed that DT accuracy increased significantly as the depth of the tree increased, meaning that the rules generated also moved from few to many. Hence, DT or RF may not be useful for this study.
We used a stratified cross-validation study to make our results more robust and reliable. This checks the amount of variance of such learners, which should be as small as possible. To control the randomization, the seed is set so that the results are reproducible. For the implementation of SVM and other methods, we used the open source tool Scikit-Learn and relied upon its default parameters as our baseline. Our stratified cross-validation study [60, 27], which includes the process of DE, is defined as follows:
We randomized the order of the dataset five times. This reduces sampling bias, i.e., the risk that some random ordering of examples in the data skews our results.
Each time, we divided the data into five bins.
For each bin (the test), we trained on four bins (the rest) and then tested on the test bin.
When using LDADE, we further divide those four bins of training data: three bins are used for training the model, and one bin is used for validation in DE. DE is run to improve the performance measure when LDA is applied to the training data. Important point: when tuning, this rig never uses test data.
The model is applied to the test data to collect scores.
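The rig above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than our actual scripts: it assumes five bins (so that four remain once the test bin is held out), and the names `stratified_folds` and `cross_val_rig` are hypothetical.

```python
import random


def stratified_folds(labels, n_bins, rng):
    """Split row indices into n_bins folds with similar class ratios."""
    folds = [[] for _ in range(n_bins)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize the order within each class
        for position, index in enumerate(indices):
            folds[position % n_bins].append(index)
    return folds


def cross_val_rig(labels, repeats=5, n_bins=5, seed=1):
    """Yield (train, validation, test) index lists for each repeat and bin."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    for _ in range(repeats):
        folds = stratified_folds(labels, n_bins, rng)
        for k in range(n_bins):
            test = folds[k]
            rest = [folds[j] for j in range(n_bins) if j != k]
            validation = rest[-1]  # one bin reserved for tuning (e.g., DE)
            train = [i for fold in rest[:-1] for i in fold]
            yield train, validation, test


# Toy usage: 25 rows with a 60/40 class split give 5 x 5 = 25 splits.
splits = list(cross_val_rig([0] * 15 + [1] * 10))
```

When no tuning is performed, the validation bin can simply be merged back into the training rows; in either case the test bin is never seen during training or tuning.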
III-D Evaluation Measure
The problem studied in this paper is a binary classification task. The performance of a binary classifier can be assessed via a confusion matrix, as shown in Table III, where a “positive” output is the positive class under study and a “negative” output is the negative one.
Further, “false” means the learner got it wrong and “true” means the learner correctly identified a positive or negative class. Hence, Table III has four quadrants containing, e.g., FP, which denotes “false positive”.
We used the following 2 measures, which can be defined from this matrix as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
No evaluation criterion is “best”, since different criteria are appropriate in different real-world contexts. Specifically, in order to optimize the performance of the released software, management would maximize precision, which would reduce recall. When dealing with safety-critical applications, management may be “risk averse” and hence may elect to maximize recall, regardless of the time wasted exploring false alarms. Precision and recall cannot both be maximized at the same time. We assume that this holds true in the context of this paper and that a business user wants to maximize either precision or recall, which is why we evaluate FFT on individual scores.
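For concreteness, precision and recall can be computed from the confusion matrix counts as in the sketch below. The function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def confusion_counts(predicted, actual, positive=1):
    """Return (TP, FP, FN, TN) for a binary task."""
    tp = fp = fn = tn = 0
    for p, a in zip(predicted, actual):
        if p == positive and a == positive:
            tp += 1  # correctly flagged as positive
        elif p == positive:
            fp += 1  # false alarm
        elif a == positive:
            fn += 1  # missed positive
        else:
            tn += 1  # correctly flagged as negative
    return tp, fp, fn, tn


def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of true positives that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


# Toy usage: 1 marks the positive class.
tp, fp, fn, tn = confusion_counts([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
scores = precision(tp, fp), recall(tp, fn)
```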
III-E Statistical Analysis
We compared our results using a statistical significance test and an effect size test. A significance test is useful for detecting whether two populations differ merely by random noise. The Scott-Knott procedure was used as the significance test [63, 54, 64].
Effect sizes are useful for checking whether two populations differ by more than just a trivial amount. The A12 effect size test was used. We report results as different only when they are statistically significant with 95% confidence and the effect is not “small”.
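The A12 statistic (the Vargha-Delaney effect size) estimates the probability that a value drawn from one population is larger than a value drawn from the other, with ties counted as half. A minimal sketch follows; the function name `a12` is an assumption, and the cutoff for a “small” effect is deliberately left to the caller.

```python
def a12(xs, ys):
    """Probability that a random x exceeds a random y (ties count half)."""
    more = sum(1 for x in xs for y in ys if x > y)
    same = sum(1 for x in xs for y in ys if x == y)
    return (more + 0.5 * same) / (len(xs) * len(ys))


# Toy usage: identical samples give 0.5, i.e., no effect either way.
no_effect = a12([1, 2, 3], [1, 2, 3])
```

Values near 0.5 indicate that neither population tends to be larger, while values near 0 or 1 indicate a strong effect in one direction.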
RQ1: How does a simpler method perform against the most common sophisticated and recent state-of-the-art Search Based SE (SBSE) methods?
As discussed in Section II-B, we found that the most common text mining technique for binary classification in software engineering uses TFIDF as the feature extraction method with SVM as the classifier. In recent studies [17, 15], LDA feature extraction is shown to be a great alternative, since it achieves similar performance while reducing dimensionality.
Some researchers have also adopted hyperparameter tuning to optimize performance, but it comes at the expense of heavy runtime [28, 54, 17, 27]. Agrawal et al. showed that LDADE with SVM (the SBSE method) achieves better performance for classification tasks. LDADE finds the optimal K, α, and β, but K matters the most for supervised learning.
FFT has been shown to be a good classifier when dealing with low dimensionality in defect prediction studies [51, 29]. We used LDA features for FFT because of their power to explain the text. That is why we compared the sophisticated method (TFIDF+SVM) and the SBSE method (LDADE+SVM) against the proposed simpler method (LDA+FFT). We also compared LDADE+FFT against LDA+FFT, and tried different variants of FFT with different topic sizes (10, 25, 50, 100), changing K manually rather than using an expensive automatic technique like LDADE, to see what improvement we could find.
Figure 5 offers a statistical analysis of the results achieved by TFIDF+SVM, LDADE+SVM, and LDADE+FFT against 10_FFT, 25_FFT, 50_FFT, and 100_FFT. Each column represents a different dataset, and each sub-figure shows precision or recall scores. We assume that business users want to maximize either precision or recall, which is why we run FFTs separately on individual scores. We report median and IQR (inter-quartile range, 75th-25th percentile) values; the darker the cell, the statistically better the performance. For example, in the sub-figure that reports precision values, consider the column for the pitsA dataset and read across the rows to see which method works best. In this case, TFIDF_SVM is better than the other methods. The results for the other datasets can be read similarly. Also, if cells share the same color, the corresponding results are either statistically indistinguishable or differ only by a small effect (as determined by the statistical methods described in Section III-E).
For recall, we observe that 10_FFT, 25_FFT, 50_FFT, and 100_FFT (the LDA_FFTs) perform statistically similarly on all 6 datasets, whereas for precision they perform similarly on 4 out of 6 datasets and 10_FFT wins on the remaining 2. This came as a surprise, since the value of K has been shown to affect classification performance in a recent SBSE method, whereas for FFT the value of K has minimal effect. For this reason, all our remaining comparisons use 10_FFT.
We note that the simpler method (10_FFT) is statistically better than or similar to TFIDF+SVM (the sophisticated method) on 5 out of 6 datasets for recall, but similar on 2 out of 6 datasets for precision. This tells us that the simple FFT method has comparable performance to the complex method.
We also found that 10_FFT wins on precision by a big margin on all 6 datasets when compared against LDADE_SVM. For recall, 10_FFT offered comparable performance on the 6 datasets. This changes the conclusion of a recent study in which Agrawal et al. showed LDADE_SVM, a new simpler state-of-the-art method, defeating the sophisticated method (TFIDF+SVM). The datasets under study differ from those Agrawal et al. used, which might have affected our results. Nevertheless, our findings say that:
LDADE+SVM is worse than LDA+FFT and TFIDF+SVM, but LDA+FFT is similar to TFIDF+SVM.
LDA_FFT with K = 10 offers comparable performance to TFIDF+SVM.
LDA_FFT with K = 10 wins against LDADE+SVM in the majority of cases.
As with any empirical study, besides classification power, we have to consider runtime as another criterion for evaluating a method's performance. Table IV shows the runtimes in minutes. The table shows that LDA+FFT is only somewhat slower than TFIDF+SVM, which may not be an arduous increase given the gains in comprehensibility discussed in RQ2. However, LDA+FFT is orders of magnitude (100-fold) faster than the SBSE method (LDADE+SVM). We conclude that the SBSE method is quite expensive and that our chosen alternative, LDA+FFT, is a promising candidate.
Lastly, we would like to make the point that a complex and time-costly model like LDADE, or other values of K, is not needed. We can use K = 10 as the number of features to build a simple FFT model. Hence,
The simpler method (LDA+FFT) offers similar performance to the sophisticated method (TFIDF+SVM) and the SBSE method (LDADE+SVM). Although LDA+FFT takes 10 minutes longer than the baseline, it is orders of magnitude faster than the SBSE method.
Figure 6: FFT decision rules and the top words of the LDA topics they reference, per dataset.

PITS_A dataset: if topic 1 > 0.80 then false else if topic 7 > 0.60 then true else if topic 3 > 0.65 then true else if topic 5 > 0.50 then true else false
Topic 1: type data line code statu packet word function
Topic 7: mode point control project attitud rate error prd
Topic 3: messag unsign bit code file byte word ptr
Topic 5: file variabl code symbol messag line initi access

PITS_B dataset: if topic 2 > 0.70 then true else if topic 4 > 0.75 then false else if topic 7 > 0.65 then true else if topic 6 > 0.80 then true else false
Topic 2: command gce counter step bgi test state antenna
Topic 4: line code function file declar comment return use
Topic 7: ace command fsw shall level state trace packet
Topic 6: test interfac plan file dmr document section data

PITS_C dataset: if topic 1 > 0.70 then false else if topic 6 > 0.55 then true else if topic 8 > 0.73 then true else if topic 2 > 0.85 then false else false
Topic 1: requir fsw command specif state specifi shall ground
Topic 6: tim trace section document traceabl matrix rqt requir
Topic 8: appropri thermal field integr test valid ram violat
Topic 2: header zero posit network indic action spacecraft base

PITS_D dataset: if topic 6 > 0.50 then false else if topic 1 > 0.80 then true else if topic 4 > 0.85 then false else if topic 9 > 0.60 then false else true
Topic 6: essenti record heater occurr indic includ rollov
Topic 1: fsw csc trace data field fpa tabl command
Topic 4: enabl wheel use disabl respons control protect fault
Topic 9: line cpp case switch default projectd file fsw

PITS_E dataset: if topic 8 > 0.75 then true else if topic 5 > 0.70 then false else if topic 7 > 0.50 then false else if topic 10 > 0.90 then false else true
Topic 8: line file function cmd paramet ccu fsw vml
Topic 5: inst phx test project set document softwar verifi
Topic 7: ptr size time prioriti ega defin data null
Topic 10: word fsw enabl capabl follow vagu present emic

PITS_F dataset: if topic 5 > 0.80 then false else if topic 8 > 0.75 then true else if topic 2 > 0.50 then true else if topic 9 > 0.65 then true else false
Topic 5: requir projectf tabl ref boot bsw fsw section
Topic 8: fsw requir test projectf procedur suffici softwar
Topic 2: code variabl test point build defin float valu
Topic 9: number byte word limit buffer dump ffp error
RQ2: Is the simpler method more explainable or comprehensible than the most common sophisticated method and the recent state-of-the-art SBSE method?
Besides offering performance comparable to the most common sophisticated method and the recent SBSE method, the simpler method would bring little merit to software analytics practice without explainable insights that can be easily interpreted from the model. Characteristics that make a model more explainable include a small architecture, easy visualization, and few rules that quickly lead to decisions. As Table II shows, the six datasets in our study have large feature sets of 550 to 2000 features, so a classifier built on top of them will be too big and complex. Since 2013, researchers have started using LDA features instead of TFIDF to make models more comprehensible. However, LDA features only improve interpretability if we have 10s of features, not 100s. Researchers have shown the top keywords from TFIDF or LDA features [15, 17, 10, 11, 12] in an attempt to compensate for the lack of model comprehensibility, but no simple decision-making process was embedded with them, so the models are not actionable.
For this study, support vector machines were picked as the most common sophisticated method in text mining. SVMs reach their results by synthesizing new dimensions through the kernel function, and these dimensions are totally unfamiliar to human users. Hence, the resulting model is hard to explain to users.
The proposed simple model, an FFT of depth 4 built on LDA topics, references only 4 LDA topics. At each level of the FFT, the exit branch can select either the severe target, i.e., true (denoted “1”), or the non-severe target, i.e., false (denoted “0”), as its exit policy. The exit policies selected by FFT are a trace of the model sampling around the space toward the sections of the data containing the target class of severe bug reports. With this architecture, LDA+FFT is more explainable for text mining to determine the severity of a bug.
Figure 6 demonstrates how our models can be explained. For each dataset, the figure shows the four most important topics, each as a list of its top relevant words, together with the decision rules of the best performing FFT fitted to the LDA-generated topics. Some possible interpretations of the FFT models from Figure 6 include:
For the FFT from the PitsC dataset, the exit policy at depth 1 says that when a report has a probability of topic 1 higher than 0.70, that report is non-severe.
In another case, the exit policy for the PitsE FFT is “10001”. It starts off by deciding severity for some low-hanging fruit of severe bug reports. Only after clearing away the non-severe examples at levels two, three, and four does it make a final “true” conclusion. Note that all the exits, except the first and the last, are “false”.
The PitsF FFT has the exit policy “01110”. It mirrors “10001”: it starts off by clearing away non-severe examples, then commits to finding the target class, and finally clears away the remaining non-severe examples. Note that all the exits, except the first and the last, are “true”.
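The exit-policy strings discussed above can be read directly off the decision rules in Figure 6. The following Python sketch is illustrative only: the tuple representation of a rule and the function name are assumptions, not the authors' implementation. The topic numbers, thresholds, and exit labels are copied from the PitsE and PitsF rules in Figure 6.

```python
from typing import List, Tuple

# (topic id, threshold, label returned when the rule fires); hypothetical representation.
Rule = Tuple[int, float, bool]


def exit_policy(rules: List[Rule], default: bool) -> str:
    """Encode each exit as '1' (severe/true) or '0' (non-severe/false)."""
    bits = ["1" if label else "0" for _, _, label in rules]
    bits.append("1" if default else "0")  # the final fall-through exit
    return "".join(bits)


PITS_E = [(8, 0.75, True), (5, 0.70, False), (7, 0.50, False), (10, 0.90, False)]
PITS_F = [(5, 0.80, False), (8, 0.75, True), (2, 0.50, True), (9, 0.65, True)]

print(exit_policy(PITS_E, default=True))   # 10001
print(exit_policy(PITS_F, default=False))  # 01110
```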
In practice, business users and experts can use this explainable and comprehensible method to classify a new, unlabeled report as severe or non-severe, reducing the time and cost the business spends labeling these reports [66, 67]. For example, once an FFT is built on the seen examples using LDA, LDA automatically produces the topic probabilities of a new bug report (such as topic 1 = 0.7, topic 2 = 0.02, and so on). We can then use these probabilities to traverse the built FFT and classify the severity of the bug report automatically. With the comparable performance demonstrated in RQ1, this method gives those experts an actionable, intuitive, and yet scientific way to quickly label the severity of a bug report.
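The traversal described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the rule representation and the function name are hypothetical, the comparison is assumed to be a strict "greater than" (following the "higher than" wording used for the PitsC rule), and the topic probabilities of the example report are invented. The rules themselves are those listed for the PitsC dataset in Figure 6.

```python
from typing import Dict, List, Tuple

# (topic id, threshold, label returned when the rule fires); hypothetical representation.
Rule = Tuple[int, float, bool]

PITS_C_RULES: List[Rule] = [
    (1, 0.70, False),
    (6, 0.55, True),
    (8, 0.73, True),
    (2, 0.85, False),
]
PITS_C_DEFAULT = False  # label used when no rule fires


def classify(topic_probs: Dict[int, float], rules: List[Rule], default: bool) -> bool:
    """Walk the FFT top-down and exit at the first rule whose cue is satisfied."""
    for topic, threshold, label in rules:
        # Assumption: a cue fires when the topic probability exceeds the threshold.
        if topic_probs.get(topic, 0.0) > threshold:
            return label
    return default


# Invented topic probabilities for a new, unlabeled report.
report = {1: 0.10, 6: 0.60, 8: 0.05, 2: 0.02}
print("severe" if classify(report, PITS_C_RULES, PITS_C_DEFAULT) else "non-severe")
```

Because the tree exits at the first satisfied cue, most reports are labeled after checking only one or two topic probabilities.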
Moreover, the comprehensibility of the model also lets experts test theories appropriately. For instance, the top words of topic 6 generated for the PitsB dataset (Figure 6) include “test, plan, document, data”, from which a test planning topic can easily be inferred. By following the respective FFT model, the development team could take test planning into more serious consideration in the software development lifecycle to minimize future severe bugs. The team has the autonomy to easily refine the method or generalize it to future applications, which are two strongly suggested characteristics of the power of comprehensibility according to Gleicher.
On the other hand, the models generated by the complex or SBSE methods look like Figure 1. As discussed in Section II-A, an SVM model generates a synthetic feature space and an imaginary hyperplane boundary, which cannot readily be explained to humans. We cannot reason from such a decision space or make it actionable.
Altogether, our proposed LDA+FFT method is more actionable and comprehensible than TFIDF+SVM, our sophisticated method, and LDADE+SVM, the SBSE method. Moreover, the runtime cost of LDA+FFT reported in RQ1 is compensated by the interpretability of the model. Hence,
FFT generates fewer rules, referencing only 4 topics found by LDA, and these rules are far more comprehensible than the models of the most common sophisticated and SBSE methods.
We found that FFT with a small feature space (10 features) found by LDA works as well as SVM with 100s to 1000s of TFIDF features, and much better than the combination of LDADE and SVM, which makes it important to discuss why FFT works. There could be two reasons behind this:
The exit policies selected by FFTs are like a trace of the reasoning jumping around the data. For example, a tree with the 11110 policy jumps towards sections of the data containing mostly severe reports. In contrast, a 00001 tree shows a model trying to jump away from severe reports until, in its last step, it makes one final jump towards severe. This tells us that software data could be “lumpy”, i.e., it divides into a few separate regions, each with different properties. In such a “lumpy” space, a learning policy like FFT works well, since its exit policies let a learner discover how best to jump between the “lumps”, whereas other learners fail in this coarse-grained lumpy space [29, 51].
FFT combines good and bad attributes together to find the best decision policy. FFT finds a rule by identifying the exit policy under which that rule has the highest probability of leading to a particular class, even if the rule covers a mixed class distribution. On the other hand, learners like SVM transform the data into a different feature space, which could still contain noisy relationships between the transformed space and the decisions.
Based on the above discussion, we need to extend the use of FFT to other software analytics tasks on more complex data to see whether the results of this paper hold for them.
VI Threats to Validity
As with any empirical study, biases can affect the final results. Therefore, any conclusions drawn from this work must keep the following issues in mind.
Order bias: For each dataset, the distribution of data samples between the training and testing sets is completely random. However, there could be times when all good samples are binned into the training set. To mitigate this order bias, we run the experiment 25 times, randomly changing the order of the data samples each time.
Sampling bias threatens any classification experiment, i.e., what matters here may not be true there. For example, the datasets used here come from the SEACRAFT repository and were supplied by one individual. These datasets have been used in various case studies by various researchers [15, 44, 52], i.e., our results are no more biased than many other studies in this arena. That said, our 6 open-source datasets are mostly from NASA. Hence it is an open issue whether our results hold for both proprietary and open-source projects from other sources. Our FFT results can also be affected by the size of each dataset. These datasets are small in corpus size, so in future work we plan to extend this analysis to larger and higher dimensional datasets.
Learner bias: For LDADE, we used the default parameters provided by Agrawal et al. However, there could be datasets where tuning them yields larger improvements. We only used SVM as the classifier, but other classifiers could change our conclusions. Data mining is a large and active field, and any single study can only use a small subset of the known data miners.
Evaluation bias: This paper uses topic similarity for LDADE, and precision and recall for classifiers, but other measures are used in software engineering, including perplexity, accuracy, etc. Moreover, in our experiments, optimizing for precision leads to a loss in recall, and vice versa. Assessing both metrics together and showing their trade-offs is left for future work.
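The mitigation for order bias described above, repeating the experiment 25 times with a fresh random ordering each time, can be sketched as follows. This is an illustration under assumptions: the function name, the 80/20 train/test fraction, and the fixed seed are hypothetical and are not taken from the study, which may have used a different validation scheme.

```python
import random
from typing import Iterator, List, Tuple


def shuffled_splits(n_samples: int, repeats: int = 25, train_fraction: float = 0.8,
                    seed: int = 1) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists, reshuffling the sample order on every repeat."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    cut = int(train_fraction * n_samples)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        yield order[:cut], order[cut:]


for train, test in shuffled_splits(n_samples=10, repeats=3):
    print(len(train), len(test))
```

Each repeat would train and score a model on its own split, and the resulting scores would then be summarized across repeats.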
We would also like to point out that FFTs handle only binary classification; however, they can be extended to accommodate multi-class problems. Also, FFTs do not scale well to 1000s of features, where they become computationally expensive; this can be improved further. In this study, we used a default depth of 4 to build the trees (in total, 16 trees are built to find the best one), but we also need to try other depths to see how performance changes, which is a clear focus for future work.
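The count of 16 trees follows from choosing one of two exits ("0" or "1") at each of the 4 levels, i.e., 2^4 = 16 candidates. The Python sketch below enumerates these choices; the function name is hypothetical, and the sketch does not model the extra final fall-through exit that appears as the fifth character in strings such as “10001”.

```python
from itertools import product
from typing import List


def exit_policies(depth: int = 4) -> List[str]:
    """All combinations of per-level exits, '0' (non-severe) or '1' (severe)."""
    return ["".join(bits) for bits in product("01", repeat=depth)]


policies = exit_policies(4)
print(len(policies))  # 16
```

Trying a larger depth doubles the number of candidate trees with every added level, which is one reason a small default depth keeps the search cheap.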
This paper has shown that a simple and comprehensible data mining algorithm, called Fast and Frugal Trees (FFTs), developed by psychological scientists, is remarkably effective for creating a few decision rules that are actionable and browsable.
Despite their succinctness, LDA+FFTs show comparable recall and precision against the most common technique, TFIDF with SVM, as well as the state-of-the-art SBSE method (LDADE+SVM). It can also be said that we do not need computationally expensive methods to find succinct models.
From the above, we conclude that there is much the software analytics community could learn from psychological science. Proponents of complex methods should always baseline against simpler alternative methods. For example, FFTs could be used as a standard baseline learner against which other software analytics tools can compare.
-  A.-H. Tan et al., “Text mining: The state of the art and the challenges,” in Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, vol. 8. sn, 1999, pp. 65–70.
-  W. Zhang, T. Yoshida, and X. Tang, “Text classification based on multi-word with support vector machine,” Knowledge-Based Systems, vol. 21, no. 8, pp. 879–886, 2008.
-  T. Menzies, O. Mizuno, Y. Takagi, and T. Kikuno, “Explanation vs performance in data mining: A case study with predicting runaway projects.” JSEA, vol. 2, no. 4, pp. 221–236, 2009.
-  A. Vellido, J. D. Martín-Guerrero, and P. J. Lisboa, “Making machine learning models interpretable.” in ESANN, vol. 12. Citeseer, 2012, pp. 163–172.
-  J. Moeyersoms, E. J. de Fortuny, K. Dejaeger, B. Baesens, and D. Martens, “Comprehensible software fault and effort prediction: A data mining approach,” Journal of Systems and Software, vol. 100, pp. 80–90, 2015.
-  D. Martens, J. Vanthienen, W. Verbeke, and B. Baesens, “Performance of classification models from a user perspective,” Decision Support Systems, vol. 51, no. 4, pp. 782–793, 2011.
-  K. Dejaeger, T. Verbraken, and B. Baesens, “Toward comprehensible software fault prediction models using bayesian network classifiers,” IEEE Transactions on Software Engineering, vol. 39, no. 2, pp. 237–257, 2013.
-  J. Hihn and T. Menzies, “Data mining methods and cost estimation models: Why is it so hard to infuse new ideas?” in Automated Software Engineering Workshop (ASEW), 2015 30th IEEE/ACM International Conference on. IEEE, 2015, pp. 5–9.
-  A. Lamkanfi, S. Demeyer, Q. D. Soetens, and T. Verdonck, “Comparing mining algorithms for predicting the severity of a reported bug,” in Software Maintenance and Reengineering (CSMR), 2011 15th European Conference on. IEEE, 2011, pp. 249–258.
-  X. Xia, D. Lo, E. Shihab, X. Wang, and B. Zhou, “Automatic, high accuracy prediction of reopened bugs,” Automated Software Engineering, vol. 22, no. 1, pp. 75–109, 2015.
-  P. S. Kochhar, F. Thung, and D. Lo, “Automatic fine-grained issue report reclassification,” in Engineering of Complex Computer Systems (ICECCS), 2014 19th International Conference on. IEEE, 2014, pp. 126–135.
-  X. Xia, D. Lo, W. Qiu, X. Wang, and B. Zhou, “Automated configuration bug report prediction using text mining,” in Computer Software and Applications Conference (COMPSAC), 2014 IEEE 38th Annual. IEEE, 2014, pp. 107–116.
-  K. Chaturvedi and V. Singh, “Determining bug severity using machine learning techniques,” in Software Engineering (CONSEG), 2012 CSI Sixth International Conference on. IEEE, 2012, pp. 1–6.
-  M. Sharma, P. Bedi, K. Chaturvedi, and V. Singh, “Predicting the priority of a reported bug using machine learning techniques and cross project validation,” in Intelligent Systems Design and Applications (ISDA), 2012 12th International Conference on. IEEE, 2012, pp. 539–545.
-  L. Layman, A. P. Nikora, J. Meek, and T. Menzies, “Topic modeling of nasa space system problem reports: research in practice,” in Mining Software Repositories (MSR), 2016 IEEE/ACM 13th Working Conference on. IEEE, 2016, pp. 303–314.
-  T.-H. Chen, S. W. Thomas, M. Nagappan, and A. E. Hassan, “Explaining software defects using topic models,” in Proceedings of the 9th IEEE Working Conference on Mining Software Repositories. IEEE Press, 2012, pp. 189–198.
-  A. Agrawal, W. Fu, and T. Menzies, “What is wrong with topic modeling? and how to fix it using search-based software engineering,” Information and Software Technology, 2018.
-  H. Brighton, “Robust inference with simple cognitive models.” in AAAI spring symposium: Between a rock and a hard place: Cognitive science principles meet AI-hard problems, 2006, pp. 17–22.
-  G. Gigerenzer, J. Czerlinski, and L. Martignon, “How good are fast and frugal heuristics?” in Decision science and technology. Springer, 1999, pp. 81–103.
-  Y. Tian, D. Lo, X. Xia, and C. Sun, “Automated prediction of bug report priority using multi-factor analysis,” Empirical Software Engineering, vol. 20, no. 5, pp. 1354–1383, 2015.
-  Y. Tian, N. Ali, D. Lo, and A. E. Hassan, “On the unreliability of bug severity data,” Empirical Software Engineering, vol. 21, no. 6, pp. 2298–2323, 2016.
-  Y. Tian, D. Lo, and C. Sun, “Drone: Predicting priority of reported bugs by multi-factor analysis,” in Software Maintenance (ICSM), 2013 29th IEEE International Conference on. IEEE, 2013, pp. 200–209.
-  F. Thung, D. Lo, and L. Jiang, “Automatic defect categorization,” in Reverse Engineering (WCRE), 2012 19th Working Conference on. IEEE, 2012, pp. 205–214.
-  X. Xia, D. Lo, Y. Ding, J. M. Al-Kofahi, T. N. Nguyen, and X. Wang, “Improving automated bug triaging with specialized topic model,” IEEE Transactions on Software Engineering, vol. 43, no. 3, pp. 272–297, 2017.
-  D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.
-  Y. Zhou, Y. Tong, R. Gu, and H. Gall, “Combining text mining and data mining for bug report classification,” Journal of Software: Evolution and Process, vol. 28, no. 3, pp. 150–176, 2016.
-  A. Agrawal and T. Menzies, “Is “better data” better than “better data miners” (benefits of tuning smote for defect prediction),” International Conference on Software Engineering, 2018.
-  W. Fu, T. Menzies, and X. Shen, “Tuning for software analytics: Is it really necessary?” Information and Software Technology, vol. 76, pp. 135–146, 2016.
-  D. Chen, W. Fu, R. Krishna, and T. Menzies, “Applications of psychological science for actionable analytics,” arXiv preprint arXiv:1803.05067, 2018.
-  M. Kim, T. Zimmermann, R. DeLine, and A. Begel, “The emerging role of data scientists on software development teams,” in Proceedings of the 38th International Conference on Software Engineering, ser. ICSE ’16. New York, NY, USA: ACM, 2016, pp. 96–107. [Online]. Available: http://doi.acm.org/10.1145/2884781.2884783
-  H. K. Dam, T. Tran, and A. Ghose, “Explainable software analytics,” CoRR, vol. abs/1802.00603, 2018. [Online]. Available: http://arxiv.org/abs/1802.00603
-  Z. C. Lipton, “The mythos of model interpretability,” CoRR, vol. abs/1606.03490, 2016. [Online]. Available: http://arxiv.org/abs/1606.03490
-  T. Menzies and Y. Hu, “Data mining for very busy people,” Computer, vol. 36, no. 11, pp. 22–29, 2003.
-  M. Gleicher, “A framework for considering comprehensibility in modeling,” Big data, vol. 4, no. 2, pp. 75–88, 2016.
-  D. Martens, B. Baesens, T. Van Gestel, and J. Vanthienen, “Comprehensible credit scoring models using rule extraction from support vector machines,” European journal of operational research, vol. 183, no. 3, pp. 1466–1476, 2007.
-  D. Martens and F. Provost, “Explaining data-driven document classifications,” Management Information Systems Quarterly, vol. 38, no. 1, pp. 73–99, 2014.
-  N. D. Phillips, H. Neth, J. K. Woike, and W. Gaissmaier, “Fftrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees,” Judgment and Decision Making, vol. 12, no. 4, p. 344, 2017.
-  H. Brighton, “Robust inference with simple cognitive models,” in Between a rock and a hard place: Cognitive science principles meet AI-hard problems: Papers from the AAAI Spring Symposium. AAAI Press, 2006, pp. 17–22.
-  O. Maimon and L. Rokach, “Decomposition methodology for knowledge discovery and data mining,” in Data mining and knowledge discovery handbook. Springer, 2005, pp. 981–1003.
-  J. Larkin, J. McDermott, D. P. Simon, and H. A. Simon, “Expert and novice performance in solving physics problems,” Science, vol. 208, no. 4450, pp. 1335–1342, 1980.
-  N. M. Nasrabadi, “Pattern recognition and machine learning,” Journal of electronic imaging, vol. 16, no. 4, p. 049901, 2007.
-  B. Haasdonk, “Feature space interpretation of svms with indefinite kernels,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 4, pp. 482–492, 2005.
-  Cathy Yeh, “Support Vector Machines for classification,” http://efavdb.com/svm-classification/, 2015, online accessed 24 April 2018.
-  T. Menzies and A. Marcus, “Automated severity assessment of software defect reports,” in Software Maintenance, 2008. ICSM 2008. IEEE International Conference on. IEEE, 2008, pp. 346–355.
-  N. Pingclasai, H. Hata, and K.-i. Matsumoto, “Classifying bug reports to bugs and other requests using topic modeling,” in Software Engineering Conference (APSEC), 2013 20th Asia-Pacific, vol. 2. IEEE, 2013, pp. 13–18.
-  J. H. Friedman, “On bias, variance, 0/1—loss, and the curse-of-dimensionality,” Data mining and knowledge discovery, vol. 1, no. 1, pp. 55–77, 1997.
-  L. Van Der Maaten, E. Postma, and J. Van den Herik, “Dimensionality reduction: a comparative,” J Mach Learn Res, vol. 10, pp. 66–71, 2009.
-  I. K. Fodor, “A survey of dimension reduction techniques,” Lawrence Livermore National Lab., CA (US), Tech. Rep., 2002.
-  T. Wang, M. Harman, Y. Jia, and J. Krinke, “Searching for better configurations: a rigorous approach to clone evaluation,” in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM, 2013, pp. 455–465.
-  L. Martignon, K. V. Katsikopoulos, and J. K. Woike, “Categorization with limited resources: A family of simple heuristics,” Journal of Mathematical Psychology, vol. 52, no. 6, pp. 352–361, 2008.
-  W. Fu, T. Menzies, D. Chen, and A. Agrawal, “Building better quality predictors using “ε-dominance”,” arXiv preprint arXiv:1803.04608, 2018.
-  T. Menzies, “Improving iv&v techniques through the analysis of project anomalies: Text mining pits issue reports-final report,” Citeseer, 2008.
-  R. Feldman and J. Sanger, Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. New York, NY, USA: Cambridge University Press, 2006.
-  B. Ghotra, S. McIntosh, and A. E. Hassan, “Revisiting the impact of classification techniques on the performance of defect prediction models,” in Proceedings of the 37th International Conference on Software Engineering-Volume 1. IEEE Press, 2015, pp. 789–800.
-  M. White, C. Vendome, M. Linares-Vásquez, and D. Poshyvanyk, “Toward deep learning software repositories,” in Mining Software Repositories (MSR), 2015 IEEE/ACM 12th Working Conference on. IEEE, 2015, pp. 334–345.
-  X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, “Deep learning for just-in-time defect prediction,” in Software Quality, Reliability and Security (QRS), 2015 IEEE International Conference on. IEEE, 2015, pp. 17–26.
-  A. T. . I. R. Synced. (2017) Lecun vs rahimi: Has machine learning become alchemy? [Online]. Available: https://medium.com/@Synced/lecun-vs-rahimi-has-machine-learning-become-alchemy-21cb1557920d
-  S. Majumder, N. Balaji, K. Brey, W. Fu, and T. Menzies, “500+ times faster than deep learning (a case study exploring faster methods for text mining stackoverflow),” in Mining Software Repositories (MSR), 2018 IEEE/ACM 15th International Conference on. ACM, 2018.
-  M. Camilleri and F. Neri, “Parameter optimization in decision tree learning by using simple genetic algorithms,” WSEAS Transactions on Computers, vol. 13, pp. 582–591, 2014.
-  P. Refaeilzadeh, L. Tang, and H. Liu, “Cross-validation,” in Encyclopedia of database systems. Springer, 2009, pp. 532–538.
-  R. Kohavi et al., “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in Ijcai, vol. 14, no. 2. Stanford, CA, 1995, pp. 1137–1145.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.
-  N. Mittas and L. Angelis, “Ranking and clustering software cost estimation models through a multiple comparisons algorithm,” IEEE Transactions on software engineering, vol. 39, no. 4, pp. 537–551, 2013.
-  A. Agrawal, A. Rahman, R. Krishna, A. Sobran, and T. Menzies, “We don’t need another hero? the impact of “heroes” on software development,” in Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice Track, ser. ICSE-SEIP ’18, 2018, to Appear, preprint: https://arxiv.org/abs/1710.09055.
-  A. Arcuri and L. Briand, “A practical guide for using statistical tests to assess randomized algorithms in software engineering,” in Software Engineering (ICSE), 2011 33rd International Conference on. IEEE, 2011, pp. 1–10.
-  J. Deng, O. Russakovsky, J. Krause, M. S. Bernstein, A. Berg, and L. Fei-Fei, “Scalable multi-label annotation,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2014, pp. 3099–3102.
-  D. Chen, K. T. Stolee, and T. Menzies, “Replicating and scaling up qualitative analysis using crowdsourcing: A github-based case study,” arXiv preprint arXiv:1702.08571, 2017.