How should we reason about software quality? Should we use general models that hold over many projects? Or must we use an ever-changing set of ideas that are continually adapted to the task at hand? Or does the truth lie somewhere in-between? To say that another way:
Are there general principles we can use to guide project management, software standards, education, tool development, and legislation about software?
Or is software engineering some “patchwork quilt” of ideas and methods where it only makes sense to reason about specific, specialized, and small sets of projects?
If the latter were true, then there would be no stable conclusions about what is best practice for SE (since those best practices would keep changing as we move from project to project). As discussed in section 2.1, such conclusion instability has detrimental implications for generality, trust, insight, training, and tool development.
Finding general lessons across multiple projects is a complex task. A new approach, that shows much promise, is the “bellwether” method for transferring conclusions between projects [krishna2018bellwethers, krishna16a, mensah18z, mensah2017stratification, mensah2017investigating] (the bellwether is the leading sheep of a flock, with a bell on its neck). That method:
Finds a “bellwether” project that is exemplar for the rest;
Draws conclusions from that project.
This approach has been successfully applied to defect prediction, software effort estimation, bad smell detection, issue lifetime estimation, and configuration optimization. As a method of transferring lessons learned from one project to another, bellwethers have worked better than the Burak filter [turhan09], Ma et al.’s transfer naive Bayes (TNB) [Ma2012], and Nam et al.’s TCA and TCA+ algorithms [Nam13, Nam2015].
In terms of transfer and lessons learned, such bellwethers have tremendous practical significance. When new projects arrive, then even before there is much experience with those new projects, lessons learned from other projects can be applied to the new project (just by studying the bellwether). Further, since the bellwether is equivalent to the other models, then when new projects appear, their quality can be evaluated even before there is an extensive experience base within that particular project (again, just by studying the bellwether).
But existing methods for bellwether transfer are very slow. When applied to the 697 projects studied here, they took 60 days of CPU to find and certify the bellwethers. There are many reasons for that, including how the models were certified (20 repeats with different train/test sets) and the complexity of the analysis procedure (which includes fixing class imbalance and feature selection). But the major cause of this slowdown was that those methods required an $O(N^2)$ comparison between projects.
This paper reports a novel approach that dramatically improves on existing bellwether methods. Our GENERAL method uses a hierarchical clustering model to recursively divide a large number of projects into smaller clusters. Starting at the leaves of that tree of clusters, GENERAL finds the bellwethers within sibling projects. That bellwether is then promoted up the tree. The output of GENERAL is the project promoted to the top of the tree of clusters.
This paper evaluates the model built from the project found by GENERAL. We will find that the predictions from this model are as good as, or better than, those found via “within” learning (where models are trained and tested on local data) or “all-pairs” learning (where models are found after building models from all pairs of projects). That is:
Learning from many other projects can be better than learning just from your own local project data.
GENERAL’s clustering methods divide projects into groups. In that space, GENERAL only needs to compare projects to find the bellwether. Theoretically and empirically, this means GENERAL will scale up much better than traditional methods. For example:
This paper applies GENERAL and traditional bellwether to 697 projects. GENERAL and the traditional approach terminated in 1.5 and 72 hours (respectively).
Figure 1 shows a hypothetical cost comparison in AWS between standard bellwethers and GENERAL when running for 100 to 1,000,000 projects. Note that GENERAL is inherently more scalable.
Using GENERAL, we can explore these research questions:
RQ1: Can hierarchical clustering tame the complexity of bellwether-based reasoning?
RQ2: Is this faster bellwether effective?
RQ3: Does learning from too many projects have a detrimental effect?
RQ4: What exactly did we learn from all those projects?
Overall, the contributions of this paper are:
Hierarchical bellwethers for transfer learning: We offer a novel hierarchical clustering bellwether algorithm called GENERAL (described in section 3) that finds bellwethers in hierarchical clusters, then promotes those bellwethers to upper levels. The final project that is promoted to the root of the hierarchy is returned as “the” bellwether.
Showing inherent generality in SE: In this study we discover a source data set for transfer learning from a large number of projects, hence proving generality in the SE datasets (where some datasets can act as exemplars for the rest of them for defect prediction).
Lessons about software quality that are general to hundreds of software projects: As said above, in this sample of 697 projects, we find that code interface issues are the dominant factor on software defects.
Replication Package: We have made available a replication package (http://tiny.cc/bellwether). The replication package consists of all the datasets used in this paper, in addition to mechanisms for computation of other statistical measures.
The rest of this paper is structured as follows. Some background and related work are discussed in section 2. Our algorithm GENERAL is described in section 3. Data collection and experimental setup are in section 4, followed by evaluation criteria in section 4.4 and performance measures in section 4.5. The results and answers to the research questions are presented in section 5, which is followed by threats to validity in section 6. Finally, the conclusion is provided in section 7.
2 Background and Related Work
2.1 Why Seek Generality?
There are many reasons to seek stable general conclusions in software engineering. If our conclusions about best practices for SE projects keep changing, that will be detrimental to generality, trust, insight, training, and tool development.
Generality: Data science for software engineering cannot be called a “science” unless it makes general conclusions that hold across multiple projects. If we cannot offer general rules across a large number of software projects, then it is difficult to demonstrate such generality.
Trust: Hassan [Hassan17] cautions that managers lose faith in software analytics if its models keep changing since the assumptions used to make prior policy decisions may no longer hold.
Insight: Kim et al. [Kim2016], say that the aim of software analytics is to obtain actionable insights that help practitioners accomplish software development goals. For Tan et al. [tan2016defining], such insights are a core deliverable. Sawyer et al. agree, saying that insights are the key driver for businesses to invest in data analytics initiatives [sawyer2013bi]. Bird, Zimmermann, et al. [Bird:2015] say that such insights occur when users reflect, and react, to the output of a model generated via software analytics. But if new models keep being generated in new projects, then that exhausts the ability of users to draw insight from new data.
Training: Another concern is what we should teach novice software engineers or newcomers to a project. If our models are not stable, then it is hard to teach what factors most influence software quality.
Tool development: Further to the last point, if we are unsure what factors most influence quality, it is difficult to design, implement, and deploy tools that can successfully improve that quality.
2.2 Why Shun Generality?
Just to balance the above argument, we add that sometimes it is possible to learn from too much data. Petersen and Wohlin [Petersen2009] argue that for empirical SE, context matters. That is, they would predict that one model will not cover all projects and that tools that report generality over many software projects need to also know the communities within which those conclusions apply. Hence, this work divides into (a) automated methods for finding sets of projects in the same community; and (b) within each community, finding the model that works best.
The size of the communities found in this way would have a profound impact on how we should reason about software engineering. Consider the hypothetical results of Figure 2. The BLUE curve shows some quality predictor that (hypothetically) gets better, the more projects it learns from (i.e. higher levels in the hierarchical cluster). After learning from about 1000 projects, the BLUE curve’s growth stops and we would say that community size here was around cluster size in level 1. In this case, while we could not offer a single model that covers all of SE, we could offer a handful of models, each of which would be relevant to project clusters at that level.
Now consider the hypothetical RED curve of Figure 2. Here, we see that (hypothetically) learning from more projects makes quality predictions worse, which means that our 10,000 projects break up into “communities” of size one. In this case, (a) principles about what is “best practice” for different software projects would be constantly changing (whenever we jump from small community to small community); and (b) the generality issues would become open and urgent concerns for the SE analytics community.
In summary, the above two sections lead to our research question RQ3: does learning from too many projects have detrimental effects? Later in this paper, we will return to this issue.
2.3 Why Transfer Knowledge?
In this section, we ask “Why even bother to transfer lessons learned between projects?”. In several recent studies [bettenburg2012think, menzies2012local, posnett2011ecological] with readily-available data from SE repositories, numerous authors report the locality effect in SE; i.e. general models are outperformed by specialized models localized to particular parts of the data. For example, Menzies et al. explored local vs global learning in defect prediction and effort estimation [menzies2012local] and found that learning rules from specific local data was more effective than learning rules from the global space.
On the other hand, Herbold et al. [herbold2017global] offered an opposite conclusion. In their study regarding global vs local models for cross-project defect prediction, they saw that local models offered little to no improvement over models learned from all the global data. One explanation for this discrepancy is the number of projects that they explored. Menzies, Herbold et al. explored fewer than two dozen projects, which raises issues of external validity in their conclusions. Accordingly, here, we explore nearly 700 projects. As shown below, the results of this paper agree more with Herbold et al. than Menzies et al. since we show that one global model (learned from a single bellwether project) does just as well as anything else.
Apart from the above discrepancy in research results, there are many other reasons to explore learning from many projects. Those reasons fall into four groups:
(a) The lesson of big data is that the more training data, the better the learned model. Vapnik [vapnik14] discusses examples where model accuracy improves to nearly 100%, just by training on times as much data. This effect has yet to be seen in SE data [menzies2013guest] but that might just mean we have yet to use enough training data (hence, this study).
(b) We need to learn from more data since there is very little agreement on what has been learned so far: Another reason to try generalizing across more SE data is that, among developers, there is little agreement on many issues relating to software:
According to Passos et al. [passos11], developers often assume that the lessons they learn from a few past projects are general to all their future projects. They comment, “past experiences were taken into account without much consideration for their context” [passos11].
Jørgensen & Gruschke [Jo09] offer a similar warning. They report that the supposed software engineering “gurus” rarely use lessons from past projects to improve their future reasoning and that such poor past advice can be detrimental to new projects [Jo09].
Other studies have shown some widely-held views are now questionable given new evidence. Devanbu et al. examined responses from 564 Microsoft software developers from around the world. They comment that programmer beliefs can vary with each project, but do not necessarily correspond with actual evidence in that project [De16].
The good news is that using software analytics, we can correct the above misconceptions. If data mining shows that doing XYZ is bug prone, then we could guide developers to avoid XYZ. But will developers listen to us? If they ask “are we sure XYZ causes problems?”, can we say that we have mined enough projects to ensure that XYZ is problematic?
It turns out that developers are not the only ones confused about how various factors influence software projects. Much recent research calls into question the “established wisdoms” of the SE field. For example, here is a list of recent conclusions that contradict prior conclusions:
In stark contrast to much prior research, pre- and post- release failures are not connected [fenton2000quantitative];
Static code analyzers perform no better than simple statistical predictors [Fa13];
The language construct GOTO, as used in contemporary practice, is rarely considered harmful [nagappan2015empirical];
Strongly typed languages are not associated with successful projects [ray2014large];
Test-driven development is not any better than “test last” [fucci2017dissection];
Delayed issues are not exponentially more expensive to fix [menzies2017delayed].
Note that if the reader disputes any of the above, then we ask: how would you challenge the items on this list? Where would you get the data, from enough projects, to successfully refute the above? And how would you draw conclusions from that large set? Note that the answers to these questions require learning from multiple projects. Hence, this paper.
(c) Imported data can be more useful than local data:
Another benefit of importing data from other projects is that, sometimes, that imported data can be better than the local information. For example, Rees-Jones reports in one study [rees2017better] that, when building predictors for GitHub close time for open source projects, using data from other projects performs much better than building models using local learning (because there is better information there than here).
(d) When there is insufficient local data, learning from other projects is very useful: When developing new software in novel areas, it is useful to draw on the relevant experience from related areas with a larger experience base. This is particularly true when developers are doing something that is novel to them, but has been widely applied elsewhere. For example, Clark and Madachy [clark15] discuss 65 types of software they see under development by the US Defense Department in 2015. Some of these types are very common (e.g. 22 ground-based communication systems) but other types are very rare (e.g. only one avionics communication system). Developers working in an uncommon area (e.g. avionics communications) might want to transfer in lessons from more common areas (e.g. ground-based communication).
2.4 How to Transfer Knowledge
This art of moving data and/or lessons learned from one project to another is Transfer Learning. When there is insufficient current data to apply data miners to learn defect predictors, transfer learning can be used to transfer lessons learned from other source projects S to the target project T.
Initial experiments with transfer learning offered very pessimistic results. Zimmermann et al. [zimmermann2009cross] tried to port models between two web browsers (Internet Explorer and Firefox) and found that cross-project prediction was still not consistent: a model built on Firefox was useful for Explorer, but not vice versa, even though both of them are similar applications. Turhan’s initial experimental results were also very negative: given data from 10 projects, training on S = 9 source projects and testing on T = 1 target project resulted in alarmingly high false positive rates (60% or more).
Subsequent research realized that data had to be carefully sub-sampled and possibly transformed before quality predictors from one source are applied to a target project. Successful transfer learning can have two variants:
Heterogeneous Transfer Learning: This type of transfer learning operates on source and target data that contain different attributes.
Homogeneous Transfer Learning: This kind of transfer learning operates on source and target data that contain the same attributes. This paper explores scalable methods for homogeneous transfer.
Another way to divide transfer learning is by the approach that is followed. Two approaches are frequently used in the research literature: similarity-based approaches and dimensional transforms.
Similarity-Based Approaches: In this approach we transfer some or all of the rows or columns of data from source to target. For example, the Burak filter [turhan09] builds its training sets by finding the k = 10 nearest code modules in S for every code module in T. However, the Burak filter suffered from the all too common instability problem (here, whenever the source or target is updated, data miners will learn a new model since different code modules will satisfy the k = 10 nearest neighbor criteria). Other researchers [kocaguneli2012, kocaguneli2011find] doubted that a fixed value of k was appropriate for all data. That work recursively bi-clustered the source data, then pruned the cluster sub-trees with greatest “variance” (where the “variance” of a sub-tree is the variance of the conclusions in its leaves). This method combined row selection with row pruning (of nearby rows with large variance). Other similarity methods [Zhang16aa] combine domain knowledge with automatic processing: e.g. data is partitioned using engineering judgment before automatic tools cluster the data. To address variations of software metrics between different projects, the original metric values were discretized by rank transformation according to similar degree of context factors.
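The row-selection idea behind the Burak filter can be sketched in a few lines. This is a minimal illustration and not the implementation used in the cited work: it assumes every module is a tuple of numeric metrics on comparable scales, it uses plain Euclidean distance, and the function name `burak_filter` is hypothetical.

```python
import math

def burak_filter(source, target, k=10):
    """Return the source rows that are among the k nearest
    neighbours of at least one target row (assumed numeric tuples)."""
    selected = set()
    for t in target:
        # Rank source rows by Euclidean distance to this target row.
        ranked = sorted(range(len(source)),
                        key=lambda i: math.dist(source[i], t))
        selected.update(ranked[:k])
    # Keep the original source order; duplicates are removed by the set.
    return [source[i] for i in sorted(selected)]

# Toy data: two target modules, each pulling in its 2 nearest source modules.
source = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
target = [(0.4,), (10.6,)]
training_rows = burak_filter(source, target, k=2)
```

The instability noted above is visible here: adding or removing a single row in `source` or `target` can change which rows fall inside the k nearest, and hence the training set.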
Dimensional Transformation: In this approach we manipulate the raw source data until it matches the target. An initial attempt at transfer learning with a dimensionality transform was undertaken by Ma et al. [Ma2012] with an algorithm called transfer naive Bayes (TNB). This algorithm used information from all of the suitable attributes in the training data. Based on the estimated distribution of the target data, this method transferred the source information to weight instances of the training data. The defect prediction model was constructed using these weighted training data. Nam et al. [Nam13] originally proposed a transform-based method that used TCA-based dimensionality rotation, expansion, and contraction to align the source dimensions to the target. They also proposed a new approach called TCA+, which selected suitable normalization options for TCA. When there are no overlapping attributes (in heterogeneous transfer learning), Nam et al. [Nam2015] found they could dispense with the optimizer in TCA+ by combining feature selection on the source/target followed by a Kolmogorov-Smirnov test to find associated subsets of columns. Other researchers take a similar approach but prefer instead a canonical-correlation analysis (CCA) to find the relationships between variables in the source and target data [jing2015heterogeneous].
Considering all the attempts at transfer learning sampled above, we see a surprising lack of consistency in the choice of datasets, learning methods, and statistical measures when reporting results of transfer learning. This issue was addressed by the “Bellwether” approach suggested by Krishna et al. [krishna2017simpler, krishna16], a simple transfer learning technique that is defined in two parts, namely the Bellwether effect and the Bellwether method:
The Bellwether effect states that, when a community works on multiple software projects, then there exists one exemplary project, called the bellwether, which can define predictors for the others.
The Bellwether method is where we search for the exemplar bellwether project and construct a transfer learner with it. This transfer learner is then used to predict for effects in future data for that community.
In their paper Krishna et al. performed experiments with communities of 3, 5 and 10 projects each, and showed that (a) bellwethers are not rare, (b) their prediction performance is better than local learning, and (c) they do fairly well when compared with the state-of-the-art transfer learning methods discussed above. This motivated us to use bellwethers as our choice of method for transfer learning to search for generality in SE datasets.
That said, Krishna et al. warn that in order to find the bellwether we need to do an all-pairs comparison; i.e. standard bellwethers have complexity $O(N^2)$ (Equation 1), where $N$ is the number of projects in the community.
The goal of this paper is to find ways to reduce the Equation 1 complexity.
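To see where the quadratic cost comes from, the standard bellwether search can be sketched as a double loop over projects. This is an illustrative sketch only: the single scalar `score(src, tgt)` is a hypothetical stand-in for "train on src, test on tgt" (the paper scores models on several goals at once), and the toy `quality` table is invented for the example.

```python
def all_pairs_bellwether(projects, score):
    """Pick the project whose model does best, on average, on all others.

    projects: list of project names.
    score(src, tgt): hypothetical scalar, higher is better.
    Returns (bellwether, number_of_train/test_comparisons).
    """
    best, best_mean, comparisons = None, float("-inf"), 0
    for src in projects:
        others = [t for t in projects if t != src]
        if not others:                 # a lone project is its own bellwether
            return src, comparisons
        total = 0.0
        for tgt in others:
            total += score(src, tgt)   # one train-on-src, test-on-tgt run
            comparisons += 1
        mean = total / len(others)
        if mean > best_mean:
            best, best_mean = src, mean
    return best, comparisons

# Toy stand-in: each project has a fixed "transfer quality" (invented numbers).
quality = {"a": 0.5, "b": 0.9, "c": 0.4, "d": 0.7}
winner, runs = all_pairs_bellwether(list(quality), lambda s, t: quality[s])
```

With $N$ projects the inner loop runs $N(N-1)$ times, so the number of train/test runs grows quadratically with the size of the community.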
3 About GENERAL
Our proposed improvement to bellwethers is called GENERAL. The core intuition of this new approach is that if many projects are similar, then we do not need to run comparisons between all pairs of projects. When such similar projects exist, it may suffice to just compare a small number of representative examples.
Accordingly, the rest of this paper performs the following experiment:
Using some clustering algorithm, group all our data into sets of similar projects.
The groups are themselves grouped into super-groups, then super-super-groups, etc., to form a tree. This step requires a hierarchical clustering algorithm (see §3.2).
Once the bellwether from each group is pushed up the tree, steps 4,5 are repeated, recursively.
The project pushed to the root of the tree is then used as the bellwether for all the projects.
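The promotion scheme in the steps above can be sketched as a short recursion over a tree of clusters. This is a hedged illustration, not the authors' code: the tree is assumed to be given as nested Python lists (in the paper it comes from hierarchical clustering), `score(src, tgt)` is a hypothetical scalar stand-in for the multi-goal model comparison, and `tally` simply counts train/test runs.

```python
def best_of(candidates, score, tally):
    """All-pairs round inside one small group; tally[0] counts runs."""
    if len(candidates) == 1:
        return candidates[0]

    def mean_score(src):
        others = [t for t in candidates if t != src]
        tally[0] += len(others)
        return sum(score(src, t) for t in others) / len(others)

    return max(candidates, key=mean_score)

def general(tree, score, tally):
    """tree is a project name (leaf) or a list of sub-trees.
    Each cluster promotes its local bellwether to its parent."""
    if not isinstance(tree, list):
        return tree
    promoted = [general(child, score, tally) for child in tree]
    return best_of(promoted, score, tally)

# Toy example: six projects in two leaf clusters (invented quality numbers).
quality = {"a": 0.5, "b": 0.9, "c": 0.4, "d": 0.7, "e": 0.6, "f": 0.3}
tree = [["a", "b", "c"], ["d", "e", "f"]]
tally = [0]
root_bellwether = general(tree, lambda s, t: quality[s], tally)
```

In this toy run the recursion performs 14 train/test runs (6 in each leaf cluster, 2 at the root), whereas an all-pairs search over the same six projects would need 30.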
Note that when the clustering algorithm divides the data into clusters, then the complexity of this method (which we call GENERAL) is:
Figure 1 contrasts the computational cost of Equation 2 with Equation 1 (and that figure assumes , which is the division constant we used in these experiments– see below.). As seen in that figure, the analysis is inherently more scalable than the analysis required by standard bellwether.
To operationalize steps 1,2,3,4,5 listed above, we need to make some lower-level engineering decisions. The rest of this section documents those decisions.
3.1 Feature Extraction
Prior to anything else, we must summarize our projects. Xenos [Xenos] distinguishes between product metrics (e.g. counts of lines of code) and process metrics about the development process (e.g. number of file revisions). Using the Understand tool [visualize], we calculated 21 product and 5 process metrics to build defect prediction models (see Table I). The product metrics are calculated from snapshots of the data taken every six months. The process metrics are computed from the change history in the six-month period before the split date, using data collected via scripts. The data collected for this project is summarized in Figure 3 and Figure 4.
Understand is a widely used tool in software analytics [Zhang16aa, gizas2012comparative, fontana2011experience, orru2015curated, pattison2008talk, malloy2002testing]. The advantage of using this tool is that much of the tooling needed for this kind of large scale analysis is already available. On the other hand, it also means that we can only reason about the features that Understand can report– which could be a threat to the validity of the conclusions reached. As shown below, the Table I metrics were shown to be effective for our task. Nevertheless, in future work, this study needs to be repeated whenever new tools allow for the widespread collection of different kinds of features.
|Metric|Metric level|Metric Name|Metric Description|
|---|---|---|---|
|Product|File|LOC|Lines of Code|
|Product|File|NSTMT|Number of Statements|
|Product|File|NFUNC|Number of Functions|
|Product|File|RCC|Ratio Comments to Code|
|Product|File|MNL|Max Nesting Level|
|Product|Class|WMC|Weighted Methods per Class|
|Product|Class|DIT|Depth of Inheritance Tree|
|Product|Class|RFC|Response For a Class|
|Product|Class|NOC|Number of Immediate Subclasses|
|Product|Class|CBO|Coupling Between Objects|
|Product|Class|LCOM|Lack of Cohesion in Methods|
|Product|Class|NIV|Number of instance variables|
|Product|Class|NIM|Number of instance methods|
|Product|Class|NOM|Number of Methods|
|Product|Class|NPBM|Number of Public Methods|
|Product|Class|NPM|Number of Protected Methods|
|Product|Class|NPRM|Number of Private Methods|
|Product|Methods|CC|McCabe Cyclomatic Complexity|
|Product|Methods|FANIN|Number of Input Data|
|Product|Methods|FANOUT|Number of Output Data|
|Process|File|NREV|Number of revisions|
|Process|File|NFIX|Number of revisions a file|
|Process|File|ADDED LOC|Lines added|
|Process|File|DELETED LOC|Lines deleted|
|Process|File|MODIFIED LOC|Lines modified|
3.2 Hierarchical Clustering
After data collection comes the hierarchical clustering needed for step 2. For this purpose, we followed the advice from the scikit-learn [scikit-learn] documentation, which recommends the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm for hierarchical clustering of large sample datasets that might contain spurious outliers [zhang1996birch]. BIRCH has the ability to incrementally and dynamically cluster incoming, multi-dimensional data in an attempt to maintain best quality clustering. BIRCH also has the ability to identify data points that are not part of the underlying pattern (so it can effectively identify and avoid outliers). Google Scholar reports that the original paper proposing BIRCH has been cited over 5,400 times. For this experiment we used the defaults proposed by [zhang1996birch]: a branching factor of 20 and a “new cluster creation” threshold of 0.5.
3.3 Data Mining
The bellwether analysis of step 3 requires a working data miner. Three requirements for that learner are:
Since it will be called thousands of times, it must run quickly. Hence, we did not use any methods that require neural nets or ensembles.
Since some projects have relatively few defects, before learning, some over-sampling is required to increase the number of defective examples in the training sets.
Since one of our research questions (RQ4) asks “what did we learn from all these projects”, we needed a learning method that generates succinct models. Accordingly, we used feature selection to check which subset of Table I mattered the most.
Accordingly, this study used:
The logistic regression learner (since it is relatively fast);
The SMOTE class imbalance correction algorithm [Chawla2002], which we run on the training data (footnote: SMOTE, the Synthetic Minority Over-Sampling Technique, sub-samples the majority class (i.e., deletes examples) while over-sampling the minority class until all classes have the same frequency. To over-sample, new examples are synthesized by extrapolating between known examples (of the minority class) and their nearest neighbors. While it is useful to artificially boost the number of target examples in the training data [Chawla2002, Pelayo2007, mensah2017investigating], it is a methodological error to also change the distributions in the test data [agrawal17]. Hence, for our work, we take care to only resample the training data);
and Hall’s CFS feature selector [hall1999correlation] (footnote: CFS is based on the heuristic that “good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other”. Using this heuristic, CFS performs a best-first search to discover interesting sets of features. Each subset is scored via
$$merit_s = \frac{k\, r_{cf}}{\sqrt{k + k(k-1)\, r_{ff}}}$$
where $merit_s$ is the value of some subset $s$ of the features containing $k$ features; $r_{cf}$ is a score describing the connection of that feature set to the class; and $r_{ff}$ is the mean score of the feature to feature connection between the items in $s$. Note that for this to be maximal, $r_{cf}$ must be large and $r_{ff}$ must be small. That is, features have to connect more to the class than each other).
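The CFS scoring heuristic can be made concrete with a small worked example. The sketch below assumes the usual form of Hall's merit score, $k\,r_{cf} / \sqrt{k + k(k-1)\,r_{ff}}$, for a subset of $k$ features with mean feature-class correlation $r_{cf}$ and mean feature-feature correlation $r_{ff}$. The function name and the correlation values are illustrative and are not taken from the paper's data.

```python
import math

def cfs_merit(k, r_cf, r_ff):
    """Merit of a feature subset under the CFS heuristic.

    k    : number of features in the subset
    r_cf : mean feature-to-class correlation (assumed in [0, 1])
    r_ff : mean feature-to-feature correlation (assumed in [0, 1])
    """
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

# Two hypothetical 4-feature subsets with the same link to the class label:
# one with nearly independent features, one with highly redundant features.
independent = cfs_merit(k=4, r_cf=0.6, r_ff=0.2)
redundant = cfs_merit(k=4, r_cf=0.6, r_ff=0.8)
```

The subset with less inter-feature correlation receives the higher merit, which is how CFS favors features that connect more to the class than to each other.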
We selected these tools since, in the domain of software analytics, the use of LR (logistic regression) and SMOTE is endorsed by recent ICSE papers [Rahman:2013, ghotra2015revisiting, agrawal17]. As to CFS, we found that without it, our recalls were very low and we could not identify which metrics mattered the most. Also, extensive studies have found that CFS is more useful than many other feature subset selection methods such as PCA or InfoGain or RELIEF [hall1999correlation].
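The over-sampling half of SMOTE can be sketched as follows. This is a simplified illustration under stated assumptions, not the cited implementation: rows are numeric tuples, neighbours are found with Euclidean distance, each synthetic row is placed at a random point on the line segment between a minority example and one of its nearest minority neighbours, and the majority-class sub-sampling step is omitted. The function name `smote_oversample` is hypothetical.

```python
import math
import random

def smote_oversample(minority, n_new, k=5, seed=0):
    """Synthesize n_new minority rows (needs at least 2 minority rows)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The k nearest minority-class neighbours of x (excluding x itself).
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: math.dist(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Apply to TRAINING rows only; test rows keep their original distribution.
train_defective = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_rows = smote_oversample(train_defective, n_new=4, k=2)
```

Keeping the resampling inside the training fold, as in the usage lines, matches the caution above about not changing the test distribution.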
3.4 Select the Best Model
As discussed below, the defect models assessed in these experiments
To find the bellwether, our method must compare many models and select the best one. As discussed below, we score model performance according to five goals:
Maximize recall and precision and popt(20);
While minimizing false alarms and ifa_auc.
(For definitions and details for these criteria, and why we selected them, see §4.5.)
In such multi-objective problems, one model is better than another if it satisfies a “domination predicate”. We use the Zitzler indicator dominance predicate [zit02] to select our bellwether (since this is known to select better models for 5-goal optimization [Sayyad:2013, Sayyad:2013:SPL]). This predicate favors model $y$ over model $x$ if $x$ “loses” most:
$$worse(x,y) = loss(x,y) > loss(y,x)$$
$$loss(x,y) = \sum_{j}^{n} -e^{\Delta(j,x,y,n)} / n$$
$$\Delta(j,x,y,n) = w_j (o_{j,x} - o_{j,y}) / n \qquad (3)$$
where $n$ is the number of objectives (for us, $n=5$) and $w_j \in \{-1, 1\}$ depending on whether we seek to maximize goal $j$.
An alternative to the Zitzler indicator is “boolean domination”, which says that one thing is better than another if it is no worse on any criterion and better on at least one criterion. We prefer Equation 3 to boolean domination since we have a 5-goal optimization problem and it is known that boolean domination often fails for 3 or more goals [Wagner:2007, Sayyad:2013].
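An indicator-based comparison of this kind can be sketched as below. The sketch assumes a commonly used form of the predicate: each goal difference is weighted by $w_j$ (+1 to maximize, -1 to minimize), scaled by the number of goals $n$, passed through an exponential, and averaged. It also assumes goal values are already normalized to 0..1. The sign convention, the function names, and the example scores are assumptions for illustration.

```python
import math

def loss(x, y, weights):
    """Mean exponential loss of x relative to y over all goals."""
    n = len(weights)
    return sum(-math.exp(w * (a - b) / n)
               for w, a, b in zip(weights, x, y)) / n

def better(x, y, weights):
    """x is preferred when it 'loses' less than y does."""
    return loss(x, y, weights) < loss(y, x, weights)

# Goals: recall, precision, popt(20) are maximized (+1);
# false alarm and ifa_auc are minimized (-1).
WEIGHTS = (1, 1, 1, -1, -1)
model_a = (0.8, 0.7, 0.6, 0.2, 0.1)   # invented scores
model_b = (0.6, 0.5, 0.5, 0.4, 0.3)   # invented scores
a_wins = better(model_a, model_b, WEIGHTS)
```

Unlike boolean domination, this comparison still returns a preference when one model is slightly worse on a single goal but clearly better on the rest, which is why it remains discriminating as the number of goals grows.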
4 Experimental Methods
4.1 Data Collection
To perform our experiments we chose to work with defect prediction datasets. We use the data collected by Zhang et al. [zhang15]. This data has the features of Table I. Originally, this data was collected by Mockus et al. [mockus2009amassing] from SourceForge and GoogleCode. The dataset contains the full history of about 154,000 projects that are hosted on SourceForge and 81,000 projects that are hosted on GoogleCode, up to the date they were collected. In the original dataset each file contained the revision history and commit logs linked using a unique identifier. Although there were 235,000 projects in the original database, many of these are trivially small or are not software development projects. Zhang et al. cleaned the dataset using the following criteria:
Avoid projects with a small number of commits:
Zhang et al. removed any projects with fewer than 32 commits (using the 25% quantile of the number of commits as the threshold).
Avoid projects with lifespan less than one year: Zhang et al. filtered out any projects with a lifespan less than one year.
Avoid projects with limited defect data: Zhang et al. in their study counted the number of fix-inducing and non-fixing commits from a one-year period and removed any projects with 75 % quantile of the number of fix-inducing and non-fixing commits.
Avoid projects without fix-inducing commits: Zhang et al. filtered out projects that have no fix-inducing commits during six months as abnormal projects, as projects in defect prediction studies need to contain both defective and non-defective commits.
On top of that, we also applied the following additional filters:
Use mainstream programming languages: the tool we used (Understand [visualize]) only supported mainstream languages in widespread industrial use; specifically, object-oriented languages with the file extensions *.c, *.cpp, *.cxx, *.cc, *.cs, *.java, and *.pas.
Avoid projects with fewer than 50 rows: We removed any project with fewer than 50 rows, as such projects are too small to build a meaningful predictor.
Avoid projects with too few errors: We pruned projects that did not have enough fix-inducing vs non-fixing data points to create a stratified k=5 fold cross-validation.
These filters resulted in a training set of 697 projects (available at http://tiny.cc/bellwether_data). Figures 3 and 4 show the distribution of projects by defect percentage, data set size, lines of code, number of files, and project language, to confirm that the selected projects come from a wide variety of sources and are representative of a software community. From these selected projects, the data was labeled using issue tracking systems and commit messages. If a project used an issue tracking system to maintain its issue/defect history, the data was labeled using that. Like Zhang et al., we found that nearly half of the projects did not use an issue tracking system. For these projects, labels were created by analyzing commit messages, tagging a commit as fix-inducing if its commit message matches the following regular expression:
(bug|fix|error|issue|crash|problem|fail|defect|patch)
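As an illustration of this labeling rule, here is a minimal sketch using Python's standard `re` module. The case-insensitive matching and the function name `is_fix_commit` are assumptions made for this example; the source does not say how case was handled.

```python
import re

# Keyword pattern from the text; IGNORECASE is an assumption of this sketch.
FIX_PATTERN = re.compile(
    r"(bug|fix|error|issue|crash|problem|fail|defect|patch)",
    re.IGNORECASE,
)

def is_fix_commit(message):
    """Tag a commit as fix-inducing if its message contains any keyword."""
    return FIX_PATTERN.search(message) is not None

for msg in ["Fix null pointer crash in parser", "Add README and license"]:
    print(msg, "->", is_fix_commit(msg))
```

Note that a plain keyword search like this also matches substrings (for example, "prefix" contains "fix"), so labels produced this way are approximate.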
4.2 Experimental Setup
Figure 5 illustrates our experimental rig. The following process was repeated 20 times, with different random seeds used each time.
Projects were divided randomly into train_1 and test_1 as a 90:10 split.
The projects in train_1 were used to find the bellwether.
Each project in test_1 was then divided into train_2 and test_2 (using a 2:1 split).
Logistic regression (LR), feature selection, and SMOTE were then used to build two models: one from the train_1 bellwether and one from the train_2 data.
Both models were then applied to the test_2 data.
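The data splits in the steps above can be sketched as follows. This is only an outline of the splitting logic under simplifying assumptions: projects are held in a dictionary mapping a project name to its list of rows, splits are unstratified, and the bellwether search and model building (LR, feature selection, SMOTE) are omitted. The names `split` and `one_repeat` are hypothetical.

```python
import random

def split(items, train_fraction, rng):
    """Shuffle a copy of items and cut it into (train, test) parts."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def one_repeat(projects, seed):
    """One repeat of the rig's splits for a given random seed."""
    rng = random.Random(seed)
    # Project-level 90:10 split into train_1 and test_1.
    train_1, test_1 = split(sorted(projects), 0.9, rng)
    # Row-level 2:1 split of each held-out project into train_2 and test_2.
    local_splits = {}
    for name in test_1:
        train_2, test_2 = split(projects[name], 2 / 3, rng)
        local_splits[name] = (train_2, test_2)
    return train_1, test_1, local_splits

# Toy data: 20 projects with 30 rows each, repeated with 20 different seeds.
projects = {f"project{i}": list(range(30)) for i in range(20)}
for seed in range(20):
    train_1, test_1, local_splits = one_repeat(projects, seed)
print(len(train_1), len(test_1))
```

Seeding a dedicated `random.Random` per repeat keeps each of the 20 repeats reproducible while still differing from one another.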
In this study, we applied the following learners:
Self: (a.k.a. local learning). This is the standard method used in software analytics [menzies2013software, zhang2013software]. In this approach, the local project data is divided into a 90% training set (which we call train_2) and a 10% test set (which we call test_2). After that, some learner builds a model from the training data (using the methods of §4.2), which is then assessed on the test data.
As we shall see, this approach produces competent defect predictors. Recalling the motivation of this paper: we do not seek a better predictor that is (say) more accurate than self. That is, hierarchical bellwethers can be recommended even if they perform no better than self. Rather:
As listed in the motivations of §2.1, we seek ways to make conclusions across a wide number of projects.
That is, our goal is to test if hierarchical bellwethers can quickly find a small set of adequate conclusions that hold across a large space of projects.
So, here, by “adequate”, we mean conclusions that perform no worse than those found by other methods.
ZeroR: In his textbook on “Empirical AI”, Cohen [Cohen:1995] recommends base-lining new methods against some simpler approach. For that purpose, we use the ZeroR learner. This learner labels every test instance according to the majority class of the training data. Note that if anything we do performs worse than ZeroR, then there is no point to any of the learning technology explored in this paper.
Global: Another baseline against which we compare our methods is a Global learner built using all the data in train_1. Note that if this learner performs best, then this would mean that we could replace GENERAL with a much simpler system.
Bellwether0: This learner is the bellwether method proposed by Krishna et al. [krishna16a]. What we will show is that GENERAL does better than Bellwether0 in three ways: (a) GENERAL is inherently more scalable; (b) GENERAL is (much) faster; and (c) GENERAL produces better predictions. That is, our new GENERAL method is a significant improvement over the prior state-of-the-art.
GENERAL_level2: GENERAL finds bellwethers at various levels of the BIRCH cluster tree. GENERAL_level2 results show the performance of the model learned from the bellwether found in the leaves of the BIRCH cluster tree. That is, these results come from a bellwether generated from 15 to 30 projects. For this process:
First, we tag each leaf cluster with its associated bellwether;
Second, we use the test procedure built into BIRCH; i.e. a test case is presented to the root of the cluster tree and BIRCH returns its relevant leaf; i.e. the cluster closest to that test case.
We then apply the bellwether tagged at that leaf.
GENERAL_level1: GENERAL_level1 results show the performance of the model learned from the bellwether found between the root and the leaves of the BIRCH cluster tree. In practice, BIRCH divides our data only twice, so there is only one GENERAL_level1 between root and leaves. For this process, we use the same procedure as GENERAL_level2 but this time we use the bellwether tagged in the parent cluster of the relevant leaf. Note that these level1 results come from an analysis of between 50 and 200 projects (depending on the shape of the cluster tree generated via BIRCH).
GENERAL_level0: In the following, the GENERAL_level0 results show the performance of the model learned from the bellwether found at the root of the BIRCH cluster tree. Note that these results come from an analysis of over 600 projects.
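The level-by-level lookup described for GENERAL_level2, GENERAL_level1, and GENERAL_level0 can be sketched as below. This is a simplified stand-in, not BIRCH itself: it assumes each cluster is summarized by a single centroid and already carries a bellwether tag, and it descends greedily to the closest child by Euclidean distance. The `Cluster` class and the bellwether names are hypothetical.

```python
import math

class Cluster:
    """A node in a (simplified) cluster tree, tagged with its bellwether."""
    def __init__(self, centroid, bellwether, children=()):
        self.centroid = centroid
        self.bellwether = bellwether
        self.children = list(children)

def path_to_leaf(root, point):
    """Descend from the root, always taking the child closest to the point."""
    path, node = [root], root
    while node.children:
        node = min(node.children, key=lambda c: math.dist(c.centroid, point))
        path.append(node)
    return path

def bellwether_at(root, point, level):
    """Bellwether tagged at the given level (0 = root) on the path to the leaf."""
    path = path_to_leaf(root, point)
    return path[min(level, len(path) - 1)].bellwether

# A toy two-level tree with made-up centroids and bellwether names.
tree = Cluster((0, 0), "root_bw", [
    Cluster((0, 0), "mid_a", [Cluster((-1, 0), "leaf_a1"), Cluster((1, 0), "leaf_a2")]),
    Cluster((10, 10), "mid_b", [Cluster((9, 10), "leaf_b1"), Cluster((11, 10), "leaf_b2")]),
])
test_case = (10.8, 10)
for level in (2, 1, 0):
    print(level, bellwether_at(tree, test_case, level))
```

The same path serves all three variants: the leaf gives the level2 bellwether, its parent gives level1, and the root gives level0.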
4.4 Statistical Tests
When comparing the results of different models in this study, we used a statistical significance test and an effect size test. Significance tests are useful for detecting if two populations differ merely by random noise. Effect sizes are useful for checking that two populations differ by more than just a trivial amount. For the significance test, we use the Scott-Knott procedure recommended at TSE’13 [mittas2013ranking] and ICSE’15 [ghotra2015revisiting]. This technique recursively bi-clusters a sorted set of numbers. If any two clusters are statistically indistinguishable, Scott-Knott reports them both as one group. Scott-Knott first looks for a break in the sequence that maximizes the expected value of the difference in the means before and after the break. More specifically, it splits $l$ values into sub-lists $m$ and $n$ in order to maximize the expected value of differences in the observed performances before and after divisions. For example, for lists $l$, $m$ and $n$ of size $ls$, $ms$ and $ns$ where $l = m \cup n$, Scott-Knott divides the sequence at the break that maximizes:
$E(\Delta) = \frac{ms}{ls}\,\mathrm{abs}(m.\mu - l.\mu)^2 + \frac{ns}{ls}\,\mathrm{abs}(n.\mu - l.\mu)^2$
Scott-Knott then applies some statistical hypothesis test $H$ to check if $m$ and $n$ are significantly different. If so, Scott-Knott then recurses on each division. For this study, our hypothesis test $H$ was a conjunction of the A12 effect size test (endorsed by [arcuri2011practical]) and non-parametric bootstrap sampling [efron94]; i.e., our Scott-Knott divided the data if both bootstrapping and an effect size test agreed that the division was statistically significant (90% confidence) and not a “small” effect.
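The split criterion and the effect size test can be sketched as follows. This is a partial illustration only: it finds a single best break in a sorted list (using the weighted squared distance of each sub-list mean from the overall mean) and computes the Vargha-Delaney A12 statistic; the bootstrap test and the recursion are omitted. Function names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def best_split(values):
    """Return (cut, gain), where cut divides the sorted values into
    values[:cut] and values[cut:] so as to maximize the weighted squared
    distance of each part's mean from the overall mean."""
    l = sorted(values)
    mu = mean(l)
    best_cut, best_gain = None, -1.0
    for cut in range(1, len(l)):
        m, n = l[:cut], l[cut:]
        gain = (len(m) / len(l)) * abs(mean(m) - mu) ** 2 \
             + (len(n) / len(l)) * abs(mean(n) - mu) ** 2
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain

def a12(xs, ys):
    """Vargha-Delaney A12: chance that a value drawn from xs is larger than
    one drawn from ys, with ties counted as half."""
    more = sum(1 for x in xs for y in ys if x > y)
    same = sum(1 for x in xs for y in ys if x == y)
    return (more + 0.5 * same) / (len(xs) * len(ys))

scores = [1, 2, 3, 10, 11, 12]
cut, gain = best_split(scores)
low, high = sorted(scores)[:cut], sorted(scores)[cut:]
print(cut, gain, a12(high, low))
```

An A12 near 0.5 indicates that the two groups are hard to tell apart, while values near 0 or 1 indicate a large effect.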
4.5 Performance Measures
In this section, we introduce the 5 evaluation measures used in this study to evaluate the performance of machine learning models. Suppose we have a dataset with M changes and N defects. After inspecting 20% LOC, we inspected m changes and found n defects. Also, when we find the first defective change, we have inspected k changes. Using this data, we can define 5 evaluation measures as follows:
(1) Recall: This is the proportion of inspected defective changes among all the actual defective changes; i.e. n/N. Recall is used in many previous studies [kamei2012large, yang2016effort, yang2017tlel, xia2016collective, yang2015deep].
(2) Precision: This is the proportion of inspected defective changes among all the inspected changes; i.e. n/m. A low Precision indicates that developers would encounter more false alarms, which may have a negative impact on developers’ confidence in the prediction model.
(3) pf: This is the proportion of all suggested defective changes which are not actual defective changes among all the suggested defective changes. A high pf suggests developers will encounter more false alarms, which may have a negative impact on developers’ confidence in the prediction model.
(4) popt20: This is the proportion of suggested defective changes among all suggested defective changes when 20% of the LOC modified by all changes are inspected. A high popt20 value means that developers can find most bugs in a small percent of the code. To compute popt20, we divided the test set into the modules predicted to be faulty (set1) and those predicted to be bug-free (set2). Each set was then sorted in ascending order by lines of code. We then ran down set1, then set2, until 20% of the total lines of code were reached, at which point popt20 is the percent of buggy modules seen up to that point.
(5) ifa_auc: Number of initial false alarms encountered before we find the first defect. Inspired by previous studies on fault localization [parnin2011automated, kochhar2016practitioners, xia2016automated], we caution that if the top-k changes recommended by the model are all false alarms, developers would be frustrated and would be unlikely to continue inspecting the other changes. For example, Parnin and Orso [parnin2011automated] found that developers would stop inspecting suspicious statements, and turn back to traditional debugging, if they could not get promising results within the first few statements they inspect. Using the nomenclature reported above, ifa = k. In this study we use a modified version of ifa called ifa_auc, which calculates ifa based on the effort spent on inspecting the code. We gradually increment the effort spent by increasing the total LOC inspected and calculate ifa on each iteration to get the area under the curve (auc); here the x-axis is the percentage of effort spent on inspection and the y-axis is ifa.
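The popt20 procedure described in item (4) can be sketched as below. The sketch assumes each module is a dictionary with hypothetical keys `loc`, `predicted`, and `actual`, and it assumes that a module which would push the inspected code past the 20% budget is not counted; the source does not state how that boundary case is handled.

```python
def inspection_order(modules):
    """Predicted-faulty modules first (set1), then predicted-clean (set2),
    each sorted in ascending order of lines of code."""
    set1 = sorted((m for m in modules if m["predicted"]), key=lambda m: m["loc"])
    set2 = sorted((m for m in modules if not m["predicted"]), key=lambda m: m["loc"])
    return set1 + set2

def popt20(modules, effort=0.2):
    """Fraction of all buggy modules seen within the given share of total LOC."""
    budget = effort * sum(m["loc"] for m in modules)
    total_bugs = sum(1 for m in modules if m["actual"])
    spent, found = 0, 0
    for m in inspection_order(modules):
        if spent + m["loc"] > budget:
            break  # assumption: stop before exceeding the budget
        spent += m["loc"]
        found += 1 if m["actual"] else 0
    return found / total_bugs if total_bugs else 0.0

modules = [
    {"loc": 10, "predicted": True, "actual": True},
    {"loc": 20, "predicted": True, "actual": False},
    {"loc": 30, "predicted": False, "actual": True},
    {"loc": 140, "predicted": False, "actual": False},
]
print(popt20(modules))
```

In this toy example the budget is 40 of 200 lines, which covers the two predicted-faulty modules and so reveals one of the two actual defects.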
RQ1: Can hierarchical clustering tame the complexity of bellwether-based reasoning?
Figure 1 showed that, theoretically, GENERAL is an inherently faster approach than traditional bellwether methods. To test that theoretical conclusion, we ran the rig of Figure 5 on a four-core machine running at 2.3GHz with 8GB of RAM.
Figure 6 shows the mean runtimes for one run of GENERAL versus traditional bellwether. For certification purposes, this had to be repeated 20 times. In that certification run:
The analysis of the traditional bellwether0 approach needed 60 days of CPU time.
The analysis of GENERAL needed 30 hours. That is, in an empirical result consistent with the theoretical predictions of Figure 1, GENERAL runs much faster than traditional bellwether.
All the other methods required another 6 hours of computation.
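As a quick check on the scale of these differences, the ratios implied by the runtimes listed above can be worked out directly. This sketch assumes the three figures are directly comparable totals for the same 20-repeat certification run; the variable names are illustrative.

```python
HOURS_PER_DAY = 24

bellwether0_hours = 60 * HOURS_PER_DAY   # 60 days of CPU time
general_hours = 30                       # GENERAL
other_methods_hours = 6                  # all the other methods

# How many times faster GENERAL is than the traditional bellwether.
speedup_over_bellwether0 = bellwether0_hours / general_hours

# How many times slower GENERAL is than the remaining methods.
slowdown_vs_others = general_hours / other_methods_hours

print(speedup_over_bellwether0, slowdown_vs_others)
```

Under that assumption, GENERAL is roughly 48 times faster than the traditional bellwether and about five times slower than the remaining methods.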
If we were merely seeking conclusions from one project, then we would recommend ignoring bellwethers and just using results from each project. That said, we still endorse the bellwether method since we seek lessons that hold across many projects.
In summary, based on these results, we conclude that:
RQ2: Is this faster bellwether effective?
The speed improvements reported in RQ1 are only useful if this faster method can also deliver adequate predictions (i.e. predictions that are not worse than those generated by other methods).
Figure 7 shows the distribution of performance score results seen in the rig of Figure 5. These results are grouped together by the “rank” score of the left-hand-side column (and this rank was generated using the statistical methods of §4.4).
In these results, the ifa_auc and precision scores were mostly uninformative. With the exception of ZeroR, there was very little difference in these scores.
As to ZeroR, we cannot recommend that approach. While ZeroR makes few mistakes (low ifas and low pfs), it scores badly on other measures (very low recalls and popt20).
Similarly, we cannot recommend the global approach. In this approach, quality predictors are learned from one data set that combines data from hundreds of projects. As seen in Figure 7, that approach generates an unacceptably large false alarm rate.
Another approach we would deprecate is the traditional bellwether approach. By all the measures of Figure 7, the bellwether0 results are in the middle of the pack. That is:
That approach is in no way outstanding.
So, compared to hierarchical bellwether, there is no evidence here of a performance benefit from using traditional bellwether.
Given this lackluster performance, and the RQ1 results (where traditional bellwether ran very slowly), we therefore deprecate the traditional bellwether approach.
As to GENERAL vs the local learning results of self, in many ways their performance in Figure 7 is indistinguishable:
As mentioned above, measured in terms of ifa_auc and precision, there are no significant differences.
In terms of recall, there is no statistical difference in the rank of local learning with self and GENERAL_level0 (a bellwether generated from the root of a BIRCH cluster tree).
In terms of pf (false alarms), some of the GENERAL results are ranked the same as self (and we will expand on this point, below).
Overall, we summarize the Figure 7 results as follows:
When two options have similar predictive performance, then other criteria can be used to select between them:
If the goal is to quickly generate conclusions about one project, then we would recommend local learning since (as seen above), local learning is five times faster than hierarchical bellwether.
But, as said at the start of §5, our goal is to quickly generalize across hundreds of projects.
RQ3: Does learning from too many projects have a detrimental effect?
Returning now to Figure 2, this research question asks if there is such a thing as learning from too much data. What we will see is that answers to this question are much more complex than the simplistic picture of Figure 2. While for some goals it is possible to learn from too much data, there are other goals where it seems more is always better.
To answer RQ3, we first note that when GENERAL calls the BIRCH clustering algorithm, it generates the tree of clusters shown in Figure 8. In that tree:
The bellwether found at level 0 of the tree (which we call GENERAL_level0) is learned from 627 projects.
The bellwethers found at level 1 of the tree (which we call GENERAL_level1) are learned from four sub-groups of our projects.
The bellwethers found at level 2 of the tree (which we call GENERAL_level2) are learned from 80 sub-sub-groups of our projects.
That is, to answer RQ3 we need only compare the predictive performance of models learned from these different levels. In that comparison, if the level_{i+1} bellwethers generated better predictions than the level_i bellwethers, then we would conclude that it is best to learn lessons from smaller groups of projects.
Figure 7 lets us compare the performance of the bellwethers learned from different levels:
The ifa and Popt20 and precision results for the different levels are all ranked the same. Hence we say that, measured in terms of those measures, we cannot distinguish the performance at different levels.
As to recall, the level2,1,0 bellwether results are respectively ranked worst, better, best.
Slightly different results are offered in the pf false alarm results. Here, the level2,1,0 bellwether results are respectively ranked best, best, worst.
That is, these results say that:
To put that another way, the answer to “is it possible to learn from too much data” is “it depends on what you value”:
For risk-averse development of mission or safety critical systems, it is best to use all data to learn the bellwether since that finds most defects.
On the other hand, for cost-averse development of non-critical systems (where cutting development cost is more important than removing bugs), there seems to be a “Goldilocks zone” where the bellwether is learned from just enough data (but not too much or too little).
RQ4: What exactly did we learn from all those projects?
Having demonstrated that we can quickly find bellwethers from hundreds of software projects, it is appropriate to ask what model was learned from all that data. This is an important question for this research since if we cannot show the lessons learned from our 627 projects, then all the above is wasted effort.
Table II shows the weights learned by logistic regression after feature selection using the bellwether project selected by GENERAL_level0. Note that:
Table II is sorted by the absolute value of the weights associated with those features. The last two features have near zero weights; i.e. they have negligible effect.
Apart from the negligible features, all that is left are NPRM, NPNM, RFC, and CBO. As shown in Table I, these features all relate to class interface concepts; specifically:
The number of public and private methods;
The average number of methods that respond to an incoming message;
Figure 9 shows what might be learned with and without the methods of this paper. Recall that the learners used in this research used feature selection and logistic regression.
Just to say the obvious: when learning local models from very many projects, there is a wide range of features used in the model. It is far easier to definitively learn lessons from a much smaller range of features, such as those listed in Table II. For example, based on these results we can say that for predicting defects, in this sample of features taken from 627 projects:
Issues of inter-class interface are paramount;
To say that another way,
Learning from many other projects can be better than learning just from your own local project data.
6 Threats to Validity
As with any large scale empirical study, biases can affect the final results. Therefore, any conclusions made from this work must be considered with the following issues in mind:
(a) Evaluation Bias: In RQ1, RQ2 and RQ3, we have shown the performance of the local model, the hierarchical bellwether models, and the default bellwether model, and compared them using statistical tests on their performance to draw conclusions about the presence of generality in SE datasets. While those results are true, that conclusion is scoped by the evaluation metrics we used to write this paper. It is possible that, using other measurements, there may well be a difference in these different kinds of projects. This is a matter that needs to be explored in future research.
(b) Construct Validity: At various places in this report, we made engineering decisions about (e.g.) the choice of machine learning models, the hierarchical clustering algorithm, and the selection of feature vectors for each project. While those decisions were made using advice from the literature, we acknowledge that other constructs might lead to different conclusions.
(c) External Validity: For this study we have relied on data collected by Zhang et al. [zhang15] for their studies. The metrics for each project were collected using a commercial tool called “Understand”. It is possible that calculating metrics, or labeling defective vs non-defective changes, using other tools or methods would produce different outcomes. That said, “Understand” is a commercial tool with detailed documentation about its metrics calculations, and Zhang et al. have shared their scripts and the process used to convert the metrics to a usable format, and have described their approach to labeling defects.
We have relied on issues marked as a ‘bug’ or ‘enhancement’ to count bugs or enhancements, and bug or enhancement resolution times. In GitHub, a bug or enhancement might not be marked in an issue but in commits. There is also a possibility that the team of that project might be using different tag identifiers for bugs and enhancements. To reduce the impact of this problem, we took the precautionary step of (e.g.) including various tag identifiers from Cabot et al. [cabot2015exploring]. We also took the precaution of removing any pull merge requests from the commits, to remove any extra contributions added to the hero programmer.
(d) Statistical Validity: To increase the validity of our results, we applied two statistical tests, bootstrap and the a12. Hence, anytime in this paper we reported that “X was different from Y” then that report was based on both an effect size and a statistical significance test.
(e) Sampling Bias: Our conclusions are based on the 697 projects collected by Zhang et al. [zhang15] for their studies. It is possible that different initial projects would have led to different conclusions. That said, this sample is very large so we have some confidence that this sample represents an interesting range of projects. As evidence of that, we note that our sampling bias is less pronounced than other “Bellwether” studies since we explored.
In this paper, we have proposed a new transfer learning bellwether method called GENERAL. While GENERAL only reflects on a small percent of the projects, its hierarchical methods find projects which yield models whose performance is comparable to anything else we studied in this analysis. Using GENERAL, we have shown that issues of class interface design were the most critical issue within a sample of 627 projects.
One reason we recommend GENERAL is its scalability. Pre-existing bellwether methods are very slow. Here, we show that a new method based on hierarchical reasoning is both much faster (empirically) and can scale to much larger sets of projects (theoretically). Such scalability is vital to our research since, now that we have shown we can reach general conclusions from 100s of projects, our next goal is to analyze 1000s to 10,000s of projects.
Finally, we warn that much of the prior work on homogeneous transfer learning may have burdened the transfer learning process with needlessly complicated methods. We strongly recommend that when building increasingly complex and expensive methods, researchers should pause and compare their supposedly more sophisticated method against simpler alternatives. Going forward from this paper, we would recommend that the transfer learning community uses GENERAL as a baseline method against which they can test more complex methods.
This work was partially funded by NSF Grant #1908762.