Data mining tools have been applied to many applications in SE (e.g., [1, 2, 3, 4, 5, 6, 7]). Despite these successes, current software analytics tools have certain drawbacks. At a workshop on “Actionable Analytics” at the 2015 IEEE conference on Automated Software Engineering, business users were vocal in their complaints about analytics. “Those tools tell us what is,” said one business user, “but they don’t tell us what to do.” Hence, we seek new tools that offer guidance on “what to do” within a specific project.
We seek such new tools since current analytics tools are mostly prediction algorithms such as support vector machines [10] and logistic regression. For example, defect prediction tools report what combinations of software project features predict for some dependent variable (such as the number of defects). Note that this is a different task from planning, which answers the question: what to change in order to improve quality.
More specifically, we seek plans that offer the least changes to the software while most improving its quality; here:
Quality = defects reported by the development team;
Improvement = lowered likelihood of future defects.
“…When a community of programmers work on a set of projects, then within that community there exists one exemplary project, called the bellwether¹, which can best define quality predictors for the other projects…”

¹ According to the Oxford English Dictionary, the bellwether is the leading sheep of a flock, with a bell on its neck.
Utilizing the bellwether effect, we propose a cross-project variant of our XTREE contrast set learner called BELLTREE (a portmanteau: BELLTREE = Bellwether + XTREE).
BELLTREE searches for an exemplar project, or bellwether, and uses it to construct plans for other projects. As shown by the experiments of this paper, these plans can be remarkably effective. In 10 open-source JAVA systems, hundreds of defects could potentially be reduced in sections of the code that followed the plans generated by our planners. Further, we show that planning is possible across projects, which is particularly useful when there are no historical logs available for a particular project to generate plans from.
The structure of this paper is as follows: the rest of this section introduces the four research questions asked in this paper, our contributions (§ 1.1), and relationships between this and our prior work (§ 1.2). In § 2 we discuss the notion of planning. There, in § 2.1, we discuss the planners studied here. § 3 contains the research methods, datasets, and evaluation strategy. In § 4 we answer the research questions. In § 5 we discuss the implications of our findings. Finally, § 6 and § 7 present threats to validity and conclusions respectively.
RQ1: Is within-project planning with XTREE comparatively more effective?
In this research question, we explore what happens when XTREE is trained on past data from within a project. XTREE uses historical logs from past releases of a project to recommend changes that might reduce defects in the next version of the software. Since such effects might not actually be causal, our first research question compares the effectiveness of XTREE’s recommendations against alternative planning methods. Recent work by Shatnawi, Alves et al., and Oliveira et al. assumes that unusually large measurements in source code metrics point to a larger likelihood of defects. Those planners recommend changing all such unusual code since they assume that, otherwise, it may lead to defect-prone code. When XTREE is compared to the methods of Shatnawi, Alves, and Oliveira et al., we find that:
Here, by “superior” we mean “larger defect reduction”.
RQ2: Is cross-project planning with BELLTREE effective?
This research question addresses cases where projects lack local data (perhaps due to the project being relatively new). We ask if we can move data across from other projects to generate actionable plans.
For prediction tasks, in domains like defect prediction, effort estimation, and code smell detection, it has been shown that the bellwether effect can be used to make cross-project predictions when a project lacks sufficient within-project data. We ask if a similar approach is possible for planning across projects. To assess this, we first discover a bellwether dataset from the available projects, and then we construct the BELLTREE planner.
As with RQ1, we note that in a cross-project setting, BELLTREE was effective in generating actionable plans. Also, when comparing with plans from Shatnawi, Alves, and Oliveira et al., we find BELLTREE to be a significantly superior approach to cross-project planning.
RQ3: Are cross-project plans generated by BELLTREE as effective as within-project plans of XTREE?
In this research question, we compare the effectiveness of plans obtained with BELLTREE (cross-project) to plans obtained with XTREE (within-project).
This is an important result—when local data is missing, projects can use lessons learned from other projects.
RQ4: How many changes do the planners propose?
In the final research question, we ask how many attributes are recommended to be changed by different planners.
This result has practical significance since developers have a hard time following plans that recommend too many changes.
1. New kinds of software analytics techniques: This work combines planning with cross-project learning using bellwethers. This is a unique approach since our previous work on bellwethers [11, 17] explored prediction and not planning as described here. Also, our previous work on planning explored within-project problems but not the cross-project problems explored here.
2. Compelling results about planning: Our results show that planning is quite successful in producing actions that can reduce the number of defects. Further, we see that plans learned on one project can be translated to other projects.
3. More evidence of the generality of bellwethers: Bellwethers were originally used only in the context of prediction and have been shown to work for (i) defect prediction, (ii) effort estimation, (iii) issue close time, and (iv) detecting code smells. This paper extends those results with a new finding that bellwethers can also be used for cross-project planning. This is an important result since it suggests that general conclusions about SE can be easily found (with bellwethers).
4. An open source reproduction package containing all our scripts and data. For readers interested in replicating this work, kindly see https://git.io/fNcYY.
1.2 Relationship to Prior Work
As for the connections to prior research, this article significantly extends those results. As shown in Fig. 1, in 2007 we originally explored software quality prediction in the context of training and testing within the same software project. After that, in 2009, we found ways to train these predictors on some projects, then test them on others. Subsequent work in 2016 found that bellwethers were a simpler and effective way to implement transfer learning, and that they worked well for a wide range of software analytics tasks. Meanwhile, in the area of planning, we conducted some limited within-project planning in 2017 on recommending what to change in software. This current article now addresses a much harder question: can plans be generated from one project and applied to another? While answering this question, we have endeavored to avoid our mistakes from the past, e.g., the use of overly complex methodologies to achieve a relatively simple goal. Accordingly, this work experiments with bellwethers to see if this simple method works for planning as it does for prediction.
One assumption across much of our work is the homogeneity of the learning, i.e., although the training and testing data may belong to different projects, they share the same attributes [11, 12, 17, 18, 19]. Since that is not always the case, we have recently been exploring heterogeneous learning, where attribute names may change between the training and test sets. Heterogeneous planning is the primary focus of our future work.
This paper extends a short abstract presented at the IEEE ASE’17 Doctoral Symposium . Most of this paper, including all experiments, did not appear in that abstract.
2 About Planning
We distinguish planning from prediction for software quality as follows: Quality prediction points to the likelihood of defects. Predictors take the form:
out = f(in)

where in contains many independent features (such as OO metrics) and out contains some measure of how many defects are present. For software analytics, the function f is learned via mining static code attributes.
| Metric | Description |
|---|---|
| amc | average method complexity |
| avg cc | average McCabe |
| cam | cohesion amongst classes |
| cbm | coupling between methods |
| cbo | coupling between objects |
| dit | depth of inheritance tree |
| lcom (lcom3) | 2 measures of lack of cohesion in methods |
| loc | lines of code |
| max cc | maximum McCabe |
| noc | number of children |
| npm | number of public methods |
| rfc | response for a class |
| wmc | weighted methods per class |
| #defects | raw defect counts |
On the other hand, quality planning generates a concrete set of actions that can be taken (as precautionary measures) to significantly reduce the likelihood of future defects.
For a formal definition of plans, consider a defective test example; for that example, a planner proposes a plan to adjust the values of its attributes as follows:
The above plans are described in terms of ranges of numeric values. In this case, they represent an increase (or decrease) in some of the static code metrics of Fig. 2.
In order to operationalize such plans, developers need some guidance on what to change in order to achieve the desired effect. There are two ways to generate that guidance. One way is to use a technique used by Nayrolles et al. at MSR 2018. In that approach, we look through the developers’ own history to find old examples where they have made the kinds of changes recommended by the plan.
When that data is not accessible, another way to operationalize plans is via guidance from the literature. Many papers have discussed how code changes adjust static code metrics [22, 23, 24, 25, 26, 27]. Fig. 3(b) shows a summary of that research. For example, say a planner has recommended the changes shown in Fig. 3(a). Then, we use Fig. 3(b) to look up possible actions developers may take; there we see that performing an “extract method” operation may help alleviate certain defects (this is highlighted in gray). In Fig. 3(c) we show a simple example of a class where the above operation may be performed. In Fig. 3(d), we demonstrate how a developer may perform the “extract method”.
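To make the “extract method” operation concrete, here is a small illustrative sketch (the order-processing code is hypothetical, not from the studied systems): a long method with branching logic is split so that each resulting method is simpler, which lowers complexity metrics such as max_cc.

```python
# Before: one long method carrying all the pricing logic,
# giving it a relatively high McCabe complexity.
def process_order_before(order):
    total = 0
    for item in order["items"]:
        price = item["price"]
        if item.get("discount"):
            price = price * (1 - item["discount"])
        total += price * item["qty"]
    if total > 100:
        total -= 10  # bulk discount
    return total

# After "extract method": the per-item pricing logic is moved into a
# helper, so each individual method is simpler (lower max_cc per method).
def item_price(item):
    price = item["price"]
    if item.get("discount"):
        price = price * (1 - item["discount"])
    return price

def process_order_after(order):
    total = sum(item_price(i) * i["qty"] for i in order["items"])
    if total > 100:
        total -= 10  # bulk discount
    return total
```

Both versions compute the same result; only the structure (and hence the static code metrics) changes.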
We say that Fig. 3 is an example of code-based planning where the goal is to change a code base in order to improve that code in some way. The rest of this section discusses other kinds of planning.
2.1 Other Kinds of Planning
Planning is extensively explored in artificial intelligence research. There, it usually refers to generating a sequence of actions that enables an agent to achieve a specific goal. This can be achieved by classical search-based problem solving approaches or logical planning agents. Such planning tasks now play a significant role in a variety of demanding applications, ranging from controlling space vehicles and robots to playing the game of bridge. Some of the most common planning paradigms include: (a) classical planning; (b) probabilistic planning [32, 33, 34]; and (c) preference-based planning [35, 36]. Each of these planning approaches requires the existence of a model. This is a limitation of all these planning approaches since not every domain has a reliable model.
Apart from code-based planning, we know of at least two other kinds of planning research in SE. Each kind is distinguishable by what is being changed.
In process-based planning some search-based optimizer is applied to a software process model to infer high-level business plans about software projects. Examples of that kind of work include our own prior studies combining simulated annealing with the COCOMO models or Ruhe et al.’s work on next release planning in requirements engineering [40, 41].
In software engineering, the planning problem translates to proposing changes to software artifacts. Such planning is usually a hybrid task combining probabilistic planning and preference-based planning using search-based software engineering techniques [42, 43]. These search-based techniques are evolutionary algorithms that propose actions guided by a fitness function derived from a well-established domain model. Examples of algorithms used here include GALE, NSGA-II, NSGA-III, SPEA2, IBEA, MOEA/D, etc. [44, 45, 46, 47, 48, 49, 50]. As with traditional planning, these planning tools all require access to some trustworthy model that can be used to explore highly novel examples. In some software engineering domains there is ready access to such models, which can offer assessments of newly generated plans. Examples of such domains within software engineering include automated program repair [51, 52, 53], software product line management [54, 55, 56], automated test generation [57, 58], etc.
However, not all domains come with ready-to-use models. For example, consider all the intricate issues that may lead to defects in a product. A model that includes all those potential issues would be very large and complex. Further, the empirical data required to validate any/all parts of that model can be hard to find. Worse yet, our experience has been that accessing and/or commissioning a model can be a labor-intensive process. For example, in previous work we used models developed by Boehm’s group at the University of Southern California. Those models took as inputs project descriptors to output predictions of development effort, project risk, and defects. Some of those models took decades to develop and mature (from 1981 to 2000). Lastly, even when there is an existing model, it can require constant maintenance lest it become outdated. Elsewhere, we have described our extensions to the USC models to enable reasoning about agile software developments. It took many months to implement and certify those extensions [62, 63]. The problem of model maintenance is another motivation to look for alternate methods that can be quickly and automatically updated whenever new data becomes available.
In summary, for domains with readily accessible models, we recommend the kinds of tools that are widely used in the search-based software engineering community such as GALE, NSGA-II, NSGA-III, SPEA2, IBEA, particle swarm optimization, MOEA/D, etc. In other cases where this is not an option, we propose the use of data mining approaches to create a quasi-model of the domain and make use of observable states from this data to generate an estimation of the model. Examples of such data mining approaches are described below. These include the five methods described in the rest of this section: XTREE, BELLTREE, and the approaches of Alves et al., Shatnawi, and Oliveira et al.
Fig. 4.A: Using XTREE
Using the training data, construct a decision tree. For each test item, find the current leaf: take each test instance and run it down to a leaf in the decision tree. After that, find the desired leaf:
- Starting at current, ascend the tree levels;
- Identify sibling leaves, i.e., leaf clusters that can be reached from that level that are not the same as current;
- Find the better siblings, i.e., those with a score (#defects) less than some fraction of the score of the current branch. If none are found, then repeat one level higher. Also, return no plan if the search rises above the root;
- Return the better sibling closest to current.
Also, find the delta, i.e., the set difference between the conditions in the decision-tree branches to desired and current. To find that delta: (1) for discrete attributes, delta is the value from desired; (2) for numerics, delta is the numeric difference; (3) for numerics discretized into ranges, delta is a random number selected from the low and high boundaries of that range. Finally, return the delta as the plan for improving the test instance.

Fig. 4.B: A sample decision tree.
2.1.1 Within-Project Planning With XTREE
XTREE builds a decision tree, then generates plans by contrasting the differences between two branches: (1) the current branch and (2) the desired branch. XTREE uses a supervised regression tree algorithm that is constructed on discretized values of the OO metrics (we use the Fayyad-Irani discretizer). Next, XTREE builds plans from the branches of the tree as follows: for every test instance, we ask:
Which current branch matches the test instance?
Which desired branch would the test want to emulate?
What are the deltas between current and desired?
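The current/desired/delta walk above can be sketched as follows. This is a minimal illustration, not XTREE's exact implementation: the tiny hand-built tree (represented only by its leaves), the `gamma` threshold, and the branch-distance tie-break are all illustrative assumptions.

```python
# Each leaf: (branch conditions, defect score). Conditions map a metric
# name to a discretized (lo, hi) range; real XTREE learns these from data.
LEAVES = [
    ({"loc": (100, 500), "lcom": (10, 50)}, 0.9),   # defect-prone branch
    ({"loc": (100, 500), "lcom": (0, 10)},  0.2),   # safer sibling
    ({"loc": (0, 100)},                     0.1),
]

def matches(conditions, instance):
    return all(lo <= instance[m] < hi for m, (lo, hi) in conditions.items())

def xtree_plan(instance, leaves, gamma=0.5):
    """Find the current leaf, pick the closest better sibling (score
    below gamma * current score), and return the delta between branches."""
    cond, score = next((c, s) for c, s in leaves if matches(c, instance))
    better = [(c, s) for c, s in leaves if s < gamma * score]
    if not better:
        return None  # no plan
    def distance(c):  # how many branch conditions differ from current
        keys = set(c) | set(cond)
        return sum(c.get(k) != cond.get(k) for k in keys)
    desired, _ = min(better, key=lambda cs: (distance(cs[0]), cs[1]))
    # Delta: conditions on the desired branch that differ from current.
    return {m: rng for m, rng in desired.items() if cond.get(m) != rng}
```

For an instance that lands on the defect-prone branch, the plan is the set of metric ranges it should move into, e.g. `xtree_plan({"loc": 250, "lcom": 30}, LEAVES)` recommends moving lcom into the (0, 10) range.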
2.1.2 Cross-project Planning with BELLTREE
Many methods have been proposed for transferring data or lessons learned from one project to another; for examples, see [65, 66, 67, 68, 69, 19, 70]. Of all these, the bellwether method described here is one of the simplest. Transfer learning with bellwethers is just a matter of calling existing learners inside a for-loop. Given training data from different projects, a bellwether learner conducts a round-robin experiment where a model is learned from each project, then applied to all the others. The bellwether is the project which generates the best-performing model. The bellwether effect states that models learned from this bellwether perform as well as, or better than, other transfer learning algorithms.
For the purposes of prediction, we have shown previously that bellwethers are remarkably effective for many different kinds of SE tasks such as (i) defect prediction, (ii) effort estimation, and (iii) detecting code smells. This paper is the first to check the value of bellwethers for the purposes of planning. Note also that this paper’s use of bellwethers enables us to generate plans from datasets across different projects. This represents a novel and significant extension to our previous work, which was limited to the use of datasets from within a few projects.
BELLTREE extends the three bellwether operators defined in our previous work  on bellwethers: DISCOVER, PLAN, VALIDATE. That is:
DISCOVER: Check if a community has a bellwether. This step is similar to our previous technique used to discover bellwethers. We see if standard data miners can predict the number of defects, given the static code attributes. This is done as follows:
For a community, obtain all pairs of data (P_i, P_j) from projects such that i ≠ j;
Predict for defects in P_j using a quality predictor learned from data taken from P_i;
Report a bellwether if one project generates consistently high predictions in a majority of the other projects.
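The DISCOVER round-robin above can be sketched as follows. Here `train` and `score` are stand-ins for a real learner and a real performance measure (e.g., recall or AUC); the median aggregation is one reasonable reading of “consistently high predictions in a majority”.

```python
from statistics import median

def discover_bellwether(projects, train, score):
    """projects: dict name -> dataset. Train on each project in turn,
    score the model on every other project, and return the project
    whose model scores best on median across the others."""
    performance = {}
    for src, src_data in projects.items():
        model = train(src_data)
        others = [score(model, data)
                  for tgt, data in projects.items() if tgt != src]
        performance[src] = median(others)
    return max(performance, key=performance.get)
```

With a toy learner (predict the mean) and a toy score (negative error), the project most representative of the others is reported as the bellwether.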
PLAN: Using the bellwether, we generate plans that can improve a new project. That is, having learned the bellwether on past data, we now construct a decision tree similar to within-project XTREE. We then use the same methodology to generate the plans.
VALIDATE: Go back to step 1 if the performance statistics seen during PLAN fail to generate useful actions.
Looking through the SE literature, we can see researchers have proposed three other methods analogous to XTREE planning. Those other methods are proposed by Alves et al. (described in this section) plus those of Shatnawi and Oliveira et al. described below.
Alves et al. proposed an unsupervised approach that uses the underlying statistical distribution and scale of the OO metrics. It works by first weighting each metric value according to the source lines of code (SLOC) of the class it belongs to. All the weighted metrics are then normalized by the sum of all weights for the system. The normalized metric values are ordered in an ascending fashion (this is equivalent to a density function, where the x-axis represents the weight ratio (0-100%), and the y-axis the metric scale).
Alves et al. then select a percentage value (they suggest 70%) which represents the “normal” values for metrics. The metric threshold, then, is the metric value for which 70% of the classes fall below. The intuition is that the worst code has outliers beyond 70% of the normal code measurements i.e., they state that the risk of there existing a defect is moderate to high when the threshold value of 70% is exceeded.
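The threshold derivation above can be sketched as follows; the linear scan over SLOC-weighted values is an illustrative simplification of Alves et al.'s procedure, and the data in the usage example is made up.

```python
def alves_threshold(metric_values, sloc, percent=70):
    """Weight each class's metric value by its SLOC, then return the
    metric value at which the cumulative weight ratio first reaches
    the chosen percentile (Alves et al. suggest 70%)."""
    total = sum(sloc)
    # (metric value, weight ratio) pairs, ordered by metric value
    pairs = sorted(zip(metric_values, (s / total for s in sloc)))
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative * 100 >= percent:
            return value
    return pairs[-1][0]
```

For instance, if most of the code (by SLOC) sits in classes with high metric values, the threshold is pulled upward accordingly.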
Here, we explore the correlation between the code metrics and the defect counts with a univariate logistic regression and reject code metrics that are poor predictors of defects. For the remaining metrics, we obtain the threshold ranges. The plans then involve reducing these metric ranges to lie within the thresholds discovered above.
Shatnawi offers a different alternative to Alves et al. by using VARL (Value of an Acceptable Risk Level). This method was initially proposed by Bender for his epidemiology studies. The approach uses two constants to compute the thresholds. Then, using a univariate binary logistic regression, three coefficients are learned: the intercept constant, the coefficient for maximizing log-likelihood, and a measure of how well this model predicts for defects. (Note: the univariate logistic regression was conducted comparing metrics to defect counts. Any code metric that is a poor defect predictor is ignored.)
Thresholds are learned from the surviving metrics using the risk equation proposed by Bender:
In a similar fashion to Alves et al., we deduce the threshold ranges for each selected metric. The plans again involve reducing these metric ranges to lie within the thresholds discovered above.
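Bender's VARL can be sketched as follows, assuming the usual univariate logistic model P(defect | x) = 1 / (1 + exp(-(α + βx))): the threshold is the metric value at which the predicted defect probability equals the acceptable risk level p0. The coefficient values in the test are made up for illustration.

```python
import math

def varl(alpha, beta, p0=0.05):
    """Value of an Acceptable Risk Level: the metric value x at which
    the logistic model's defect probability equals p0. Derived by
    inverting p0 = 1 / (1 + exp(-(alpha + beta * x)))."""
    return (math.log(p0 / (1 - p0)) - alpha) / beta
```

A sanity check on the derivation: plugging the returned threshold back into the logistic model should recover exactly p0.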
Oliveira et al., in their 2014 paper, offer yet another alternative to the absolute threshold methods discussed above. Their method is still unsupervised, but they propose complementing the threshold with a second piece of information called the relative threshold. This measure denotes the percentage of entities the upper limit should be applied to. These have the following format:
p% of the entities must have M ≤ k. Here, M is an OO metric, k is the upper limit of the metric value, and p (expressed as a percentage) is the minimum percentage of entities required to follow this upper limit. As an example, Oliveira et al. state that when 85% of the methods satisfy the upper limit, “this threshold expresses that high-risk methods may impact the quality of a system when they represent more than 15% of the whole population”.
The procedure attempts to derive these values of (p, k) for each metric M. They define a function ComplianceRate(p, k) that returns the percentage of systems that follow the rule defined by the relative threshold pair (p, k). They then define two penalty functions: (1) penalty1(p, k), which penalizes pairs whose compliance rate falls below a preset constant, and (2) penalty2(k), which measures the distance between k and the median of a preset percentile tail. (Note: according to Oliveira et al., the median of the tail is an idealized upper value for the metric, i.e., a value representing classes that, although present in most systems, have very high values of M.) They then compute the total penalty as penalty = penalty1(p, k) + penalty2(k). Finally, the relative threshold is identified as the pair of values (p, k) with the lowest total penalty, obtained for each OO metric. As in the above two methods, the plan involves ensuring that the required percentage of entities have metric values that lie within the threshold.
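The penalty minimization above can be sketched as follows. The constants (a minimum compliance of 90% and a 90th-percentile tail), the brute-force search over integer (p, k) pairs, and the tie-break preferring larger p are all illustrative assumptions, not the paper's exact settings.

```python
from statistics import median

def compliance_rate(systems, p, k):
    """% of systems in which at least p% of entities have metric <= k."""
    ok = sum(1 for vals in systems
             if 100 * sum(v <= k for v in vals) / len(vals) >= p)
    return 100 * ok / len(systems)

def relative_threshold(systems, min_compliance=90, tail_percentile=90):
    """Return the (p, k) pair with the lowest total penalty."""
    # Median of the per-system tail values: an idealized upper limit k.
    tail = median(sorted(vals)[int(len(vals) * tail_percentile / 100) - 1]
                  for vals in systems)
    best, best_penalty = None, float("inf")
    for p in range(0, 101):
        for k in range(0, int(max(max(v) for v in systems)) + 1):
            penalty1 = max(0, min_compliance - compliance_rate(systems, p, k))
            penalty2 = abs(k - tail)
            pen = penalty1 + penalty2
            # prefer the largest p among equally penalized pairs
            if pen < best_penalty or (pen == best_penalty and p > best[0]):
                best, best_penalty = (p, k), pen
    return best
```

On a toy corpus where 75% of entities in every system have metric values at most 3, the search settles on the rule “75% of entities must have M ≤ 3”.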
3 Research Methods
The rest of this paper compares XTREE and BELLTREE against the methods of Alves, Shatnawi, and Oliveira et al.

3.1 Datasets
The defect dataset used in this study comprises a total of 38 datasets from 10 different projects taken from previous transfer learning studies. This group of data was gathered by Jureczko et al. They recorded the number of known defects for each class using a post-release bug tracking system. The classes are described in terms of 20 OO metrics, including the CK metrics and McCabe’s complexity metrics; see Fig. 2 for descriptions. Since we attempt to learn plans from within the project and across projects, we explore homogeneous transfer of plans. Homogeneity requires that the attributes (static code metrics) are the same for all the datasets and all the projects. We obtained the dataset from the SEACRAFT repository (https://zenodo.org/communities/seacraft/), formerly the PROMISE repository.
3.2 A Strategy for Evaluating Planners
It can be somewhat difficult to judge the effects of applying plans to software projects. These plans cannot be assessed just by a rerun of the test suite, for three reasons: (1) the defects were recorded by a post-release bug tracking system, so it is entirely possible they escaped detection by the existing test suite; (2) rewriting test cases to enable coverage of all possible scenarios presents a significant challenge; and (3) it may take a significant amount of effort to write new test cases that identify these changes as they are made.
To resolve this problem, SE researchers such as Cheng et al., O’Keefe et al. [75, 76], Moghadam, and Mkaouer et al. use a verification oracle learned separately from the primary oracle. This oracle assesses how defective the code is before and after some code changes. For their oracle, Cheng, O’Keefe, Moghadam, and Mkaouer et al. use the QMOOD quality model.
A shortcoming of QMOOD is that quality models learned from other projects may perform poorly when applied to new projects. Hence, we eschew older quality models like QMOOD and propose a verification oracle based on the overlap between two sets: (1) the changes that developers made, perhaps in response to the issues raised in a post-release issue tracking system; and (2) plans recommended by an automated planning tool such as XTREE/BELLTREE. Using these two sources of changes, it is possible to compute the extent to which a developer’s action matches that of the actions recommended by planners. This is measured using overlap:
That is, overlap = 100 · |A ∩ B| / |A ∪ B|, where A represents the changes made by the developers and B represents the changes recommended by the planner. In other words, we measure overlap using the size of the intersection divided by the size of the union of the changes. Accordingly, the larger the intersection between the changes made by the developers and the changes recommended by the planner, the greater the overlap.
As an example, consider Fig. 6; there we have two sets of changes: (1) changes made by developers, and (2) changes recommended by the planner. In each case we have three possible actions for every metric: (1) make no change, (2) increase, and (3) decrease. The intersection of the changes counts the number of times the action taken by the developers is the same as the action recommended by the planner. The overlap is then this intersection divided by the total number of possible actions.
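The overlap computation can be sketched as follows. The `'+'`/`'-'`/`'0'` encoding of the increase/decrease/no-change actions is an assumed representation; per the description above, per-metric agreements form the intersection and all considered metrics form the union.

```python
def overlap(developer_actions, planner_actions):
    """Both arguments map metric name -> one of '+', '-', '0'
    (increase, decrease, no change). Returns overlap as a percentage."""
    metrics = set(developer_actions) | set(planner_actions)
    agree = sum(developer_actions.get(m) == planner_actions.get(m)
                for m in metrics)
    return 100 * agree / len(metrics)
```

For example, if developers and the planner agree on two out of three metrics, the overlap is about 67%.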
3.2.1 The -test
Measuring overlap as described in the previous section only measures the extent to which the recommendations made by planning tools match those undertaken by the developers. Note that overlap does not measure the quality of the recommendation. Specifically, it does not tell us what the impact of making those changes would be to future releases of a project. Therefore, it is necessary to augment the overlap with a measure of how many defects are reduced as a result of making the changes.
For this purpose, we propose the -test. Given a project with versions ordered chronologically in terms of release dates, we divide the project data into three sets, train, test, and validation, which are used as follows:
First, train the planner on an earlier version. Note: this could either be data from a previous release (version i), or it could be data from the bellwether dataset.
Next, use the planner to generate plans to reduce defects for version i+1.
Finally, on version i+2, we measure the OO metrics for each class; then we (a) measure the overlap between plans recommended by the planner and the developers’ actions, and (b) count the number of defects reduced/increased, when compared to the previous release, as a result of implementing these plans.
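The train/test/validate split above can be sketched as a sliding window over the chronologically ordered releases:

```python
def version_splits(versions):
    """versions: chronologically ordered release datasets. Yields
    (train, test, validate) triples: train on release i, plan for
    release i+1, validate the plans on release i+2."""
    return [(versions[i], versions[i + 1], versions[i + 2])
            for i in range(len(versions) - 2)]
```

A project with four releases therefore yields two evaluation triples.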
As the outcome of the -test we obtain the number of defects (increased or decreased) and the extent of overlap (from 0% to 100%). These two measures enable us to plot the operating characteristic curve for the planners (referred to henceforth as planner effectiveness curve). The operating characteristic (OC) curve depicts the effectiveness of a planner with respect to its ability to reduce defects. The OC curve plots the overlap of developer changes with the planner’s recommendations versus the number of defects reduced. A sample curve for one of our datasets is shown in Fig. 7.
For each of the datasets with multiple versions we (1) train the planner on version i; (2) deploy the planner to recommend plans for version i+1; and (3) validate the plans on version i+2. Following this, we plot the “planner effectiveness curve”. Finally, as in Fig. 7, we compute the area under the planner effectiveness curve (AUPEC) using Simpson’s rule.
Here, the x-axis represents the overlap between plans and developer changes, and the y-axis represents the number of defects reduced as a result of that overlap.
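The AUPEC computation can be sketched with a minimal composite Simpson's rule over evenly spaced overlap values; the sample curve values below are made up for illustration.

```python
def simpson(ys, h):
    """Composite Simpson's rule for evenly spaced samples ys with
    spacing h. Requires an odd number of points (an even number of
    intervals)."""
    n = len(ys) - 1
    s = ys[0] + ys[-1]
    s += 4 * sum(ys[i] for i in range(1, n, 2))
    s += 2 * sum(ys[i] for i in range(2, n, 2))
    return s * h / 3

# x: overlap sampled at 0, 25, 50, 75, 100%; y: defects reduced there.
defects_reduced = [0, 3, 4, 6, 7]
aupec = simpson(defects_reduced, h=25)
```

As a sanity check, Simpson's rule is exact for quadratics, e.g. integrating y = x² over [0, 2] from the samples [0, 1, 4] gives exactly 8/3.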
We should interpret AUPEC as follows:
4 Experimental Results
RQ1: Is within-project planning with XTREE comparatively more effective?
We answer this question in two parts: (a) first, we assess the effectiveness of XTREE (using the area under the planner effectiveness curve); (b) next, we compare XTREE with the other threshold-based planners. In each case, we split the available data into training, testing, and validation. That is, given versions i, i+1, and i+2, we train the planners on version i, then generate plans using the planners for version i+1, then validate the effectiveness of those plans on version i+2 using the -test. Then, we repeat the process by training on version i+1, testing on version i+2, and validating on version i+3, and so on.
For each of these sets, we generate performance statistics as per Fig. 7; i.e. plot the planner effectiveness curve to measure the number of defects reduced (and increased) as a function of extent of overlap. Then, we measure the Area-Under the Planner Effectiveness Curve (AUPEC).
Fig. 8(a) shows the results of planning with XTREE (see the column labeled XTREE). Each column consists of two parts, (a) and (b), where:
(a) represents AUPEC for the number of defects reduced, and larger values are better;
(b) indicates the number of defects increased in response to overlap with XTREE’s plans, and smaller values are better.
We observe that, in 14 out of 18 cases, the AUPEC of defects reduced is much larger than the AUPEC of defects increased. This indicates that within-project XTREE is very effective in generating plans that reduce the number of defects. Further, we note that in terms of the number of defects reduced, in all 18 datasets, XTREE significantly outperforms the other threshold-based planners.
It is worth noting that, in the case of AUPEC for defects increased, XTREE’s plans do sometimes result in a larger increase in the number of defects when compared to the other threshold-based learners. But the number of defects reduced as a result of XTREE’s plans is much larger than the occasional increase in defects. Thus, in summary,
[Table: per-metric scores for the planners XTREE, Alves, Shatnawi, and Oliveira across the Xalan, Poi, Log4j, and Jedit datasets, covering the OO metrics of Fig. 2 (wmc, dit, noc, cbo, rfc, lcom, ca, ce, npm, lcom3, loc, dam, moa, mfa, cam, ic, cbm, amc, max_cc, avg_cc).]
RQ2: Is cross-project planning with BELLTREE effective?
In the previous research question, we constructed XTREE using historical logs of previous releases of a project. When such logs are not available, we may instead generate plans using data from other software projects. To do this we offer BELLTREE, a planner that uses the bellwether effect in conjunction with XTREE to perform cross-project planning. In this research question, we assess the effectiveness of BELLTREE. For details of the construction of BELLTREE, see § 2.1.2.
Our experimental methodology for answering this research question is as follows:
First, we identify the bellwether project for the community (see § 2.1.2). Next, we construct XTREE using the bellwether dataset; we call this variant BELLTREE.
For each of the other projects, we use BELLTREE constructed above to recommend plans.
Then, we use the subsequent releases of the above projects to validate the effectiveness of those plans.
Finally, we generate performance statistics as per Fig. 7; i.e., we plot the planner effectiveness curve to measure the number of defects reduced (and increased) as a function of the extent of overlap, then measure the Area Under the Planner Effectiveness Curve (AUPEC). Fig. 8 (cross-project results) shows the AUPEC scores that were the outcome of cross-project planning with BELLTREE (see column labeled BELLTREE). Similar to our findings in RQ1, we note that, in 15 out of 18 cases, the AUPEC of defects reduced is much larger than the AUPEC of defects increased. This indicates that cross-project planning with BELLTREE is also very effective at generating plans that reduce the number of defects. Further, when we train each of the other planners on the bellwether dataset and compare them with BELLTREE, we note that, as with RQ1, BELLTREE outperforms the other threshold-based planners for cross-project planning.
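The AUPEC computation in the final step above can be sketched as follows. This is an illustrative sketch rather than our exact implementation; in particular, the max-normalization of the defect counts is an assumption made here so that scores fall in [0, 1].

```python
# Illustrative sketch of AUPEC: integrate defects reduced (normalized)
# against the fraction of a plan's recommendations that were followed.
def aupec(overlap_pct, defects_reduced):
    x = [v / 100.0 for v in overlap_pct]            # overlap as a fraction of the plan
    peak = max(defects_reduced)
    # Normalization to [0, 1] is an assumption of this sketch.
    y = [v / peak for v in defects_reduced] if peak > 0 else list(defects_reduced)
    area = 0.0
    for i in range(1, len(x)):                      # trapezoidal rule
        area += (x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2.0
    return area

# Toy curve: defects reduced grows as more of the plan is followed.
score = aupec([0, 25, 50, 75, 100], [0, 10, 25, 45, 60])
```

Under this scheme, a planner whose plans reduce many defects even at low overlap levels scores close to 1, while one that only helps when followed completely scores much lower.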
RQ3: Are cross-project plans generated by BELLTREE as effective as within-project plans of XTREE?
The third research question compares within-project planning with XTREE to cross-project planning with BELLTREE. To answer it, we train XTREE on within-project data and generate plans for reducing the number of defects. We then compare this with plans derived from the bellwether data and BELLTREE. We hypothesized that, since bellwethers have been demonstrated to be effective for prediction tasks, learning from the bellwether of a specific community of projects would produce performance scores comparable to within-project data. We found that this was indeed the case; i.e., in terms of AUPEC, XTREE and BELLTREE perform similarly.
Fig. 8 tabulates the AUPEC scores comparing within-project XTREE (left) and cross-project BELLTREE (right) for reducing the number of defects. We note that, across the 18 datasets from 10 projects, the AUPEC scores are quite comparable: in 5 cases XTREE performs better than BELLTREE, in 7 cases BELLTREE outperforms XTREE, and in 6 cases the performance is the same. In summary, cross-project plans generated by BELLTREE are about as effective as within-project plans generated by XTREE.
RQ4: How many changes do the planners propose?
This question follows naturally from the previous research questions. Here, we ask how many changes each planner recommends. This matters because plans that recommend far too many changes would be challenging to use in practice.
Our findings for XTREE, tabulated in Fig. 9, show that XTREE (and BELLTREE) proposes far fewer changes than the other planners. (Space limitations prohibit showing the results for BELLTREE, which follow a very similar trend to XTREE; interested readers can use our replication package (https://git.io/fNcYY) to evaluate them.) This is because both XTREE and BELLTREE operate by supervised learning, incorporating two stages of data filtering and reasoning: (1) discretization of attributes based on information gain, and (2) plan generation based on contrast sets between adjacent branches. This differs from the other approaches, whose operating principle is that attribute values larger than a certain threshold must always be reduced. Hence, they usually propose plans that use all attributes in an unsupervised manner, without first filtering out the less important attributes based on how they affect software quality. This makes those planners far more verbose and, possibly, harder to operationalize.
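To make stage (1) concrete, here is a minimal sketch of information-gain (entropy-based) discretization for a single numeric attribute, with stage (2) summarized in the closing comment. This illustrates the general technique only; it is not the authors' exact discretizer, and the function names are our own:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Cut point on one numeric attribute minimizing weighted class entropy
    (i.e., maximizing information gain)."""
    best = (float("inf"), None)
    pairs = sorted(zip(values, labels))
    for i in range(1, len(pairs)):
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best[0]:
            best = (e, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best[1]

# Stage (2), conceptually: after growing a tree with such splits, a plan is
# the small set of attribute-range deltas (the contrast set) between a
# defective branch and an adjacent branch with a lower defect rate -- so
# only the few attributes that separate those branches are ever changed.
cut = best_split([1, 2, 3, 10, 11, 12], ["clean"] * 3 + ["buggy"] * 3)
```

Because only attributes that earn a split on information gain ever appear in a branch, attributes irrelevant to defects are filtered out before any plan is generated, which is why the resulting plans are so much shorter than threshold-based ones.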
When discussing these results with colleagues, we are often asked the following questions.
1. Why use automatic methods to find quality plans? Why not just use domain knowledge, e.g., human expert intuition? Recent research has documented the wide variety of conflicting opinions among software developers, even those working within the same project. According to Passos et al., developers often assume that the lessons they learn from a few past projects are general to all their future projects. They comment, "past experiences were taken into account without much consideration for their context". Jorgensen and Gruschke offer a similar warning. They report that supposed software engineering "gurus" rarely use lessons from past projects to improve their future reasoning, and that such poor past advice can be detrimental to new projects. Other studies have shown that some widely-held views are now questionable given new evidence. Devanbu et al. examined responses from 564 Microsoft software developers from around the world. They comment that programmer beliefs can vary with each project, but do not necessarily correspond with actual evidence in that project. Given the diversity of opinions seen among humans, it seems wise to explore automatic oracles for planning.
2. Does using BELLTREE guarantee that software managers will never have to change their plans? No. Software managers should evolve their policies when evolving circumstances require such an update. But how do we know when to retain current policies and when to switch to new ones? The bellwether method can answer this question.
Specifically, we advocate continually retesting the bellwether's status against the other datasets within the community. If a new bellwether is found, then it is time for the community to accept very different policies. Otherwise, it is valid for managers to ignore most of the new data arriving into that community.
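This retesting loop can be sketched as follows. The function names and the `train`/`score` interface are hypothetical placeholders for whatever learner and quality measure a community uses; the essential idea is just round-robin transfer evaluation with a median summary:

```python
from statistics import median

def find_bellwether(projects, train, score):
    """Return the name of the current bellwether for a community.

    projects: dict mapping project name -> its dataset
    train:    builds a model from one project's data
    score:    evaluates a model on another project's data (higher is better)
    """
    results = {}
    for name, data in projects.items():
        model = train(data)
        # Score this candidate on every *other* project in the community.
        others = [score(model, d) for n, d in projects.items() if n != name]
        results[name] = median(others)
    # The project whose model transfers best, on median, is the bellwether.
    return max(results, key=results.get)
```

Rerunning this whenever new data arrives implements the policy check above: if the returned name changes, the community's exemplar has shifted and plans should be regenerated; if not, the incumbent bellwether's plans remain valid.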
6 Threats to Validity
Sampling Bias: Sampling bias threatens any classification experiment; what matters in one case may or may not hold in another case. For example, data sets in this study come from several sources, but they were all supplied by individuals. Thus, we have documented our selection procedure for data and suggest that researchers try a broader range of data.
Evaluation Bias: This paper uses one measure of planner quality: AUPEC (see Fig. 7). Other quality measures may be used to quantify the effectiveness of the planners. A comprehensive analysis using such measures may be performed with our replication package, and other measures can easily be added to extend it.
Order Bias: Theoretically, with prediction tasks involving learners such as random forests, some degree of randomness is invariably introduced by the algorithm. To mitigate such biases, researchers (including ourselves, in other work) report the central tendency and variation over multiple runs together with statistical tests. However, in this case, all our approaches are deterministic; hence, there is no need to repeat the experiments or run statistical tests. Thus, we conclude that while order bias is theoretically a problem, it is not a major problem in the particular case of this study.
7 Conclusions and Future Work
Most software analytics tools in use today are prediction algorithms, and they are limited to making predictions. We extend this by offering "planning": a novel technology for prescriptive software analytics. Our planner offers users guidance on what actions to take in order to improve the quality of a software project. Our preferred planning tool is BELLTREE, which performs cross-project planning with encouraging results. With our BELLTREE planner, we show that it is possible to reduce several hundred defects in software projects.
It is also worth noting that BELLTREE is a novel extension of our prior work on (1) the bellwether effect and (2) within-project planning with XTREE. In this work, we show that it is possible to combine the bellwether effect with XTREE to perform cross-project planning using BELLTREE, without the need for more complex transfer learners. Our results in Fig. 8 show that BELLTREE is just as good as XTREE, and both are much better than the other planners.
Further, we can see from Fig. 9 that both BELLTREE and XTREE recommend changes to very few metrics, while the unsupervised planners (Shatnawi, Alves, and Oliveira) recommend changing most of the metrics, which is impractical in many real-world scenarios.
Hence our overall conclusion is to endorse the use of planners like XTREE (if local data is available) or BELLTREE (otherwise).
The work is partially funded by NSF awards #1506586 and #1302169.
-  J. Czerwonka, R. Das, N. Nagappan, A. Tarvo, and A. Teterev, “Crane: Failure prediction, change analysis and test prioritization in practice – experiences from windows,” in Software Testing, Verification and Validation (ICST), 2011 IEEE Fourth Intl. Conference on, march 2011, pp. 357 –366.
-  T. J. Ostrand, E. J. Weyuker, and R. M. Bell, “Where the bugs are,” in ISSTA ’04: Proc. the 2004 ACM SIGSOFT Intl. symposium on Software testing and analysis. New York, NY, USA: ACM, 2004, pp. 86–96.
-  T. Menzies, A. Dekhtyar, J. Distefano, and J. Greenwald, “Problems with Precision: A Response to ”Comments on ’Data Mining Static Code Attributes to Learn Defect Predictors’”,” IEEE Transactions on Software Engineering, vol. 33, no. 9, pp. 637–640, sep 2007.
-  B. Turhan, A. Tosun, and A. Bener, “Empirical evaluation of mixed-project defect prediction models,” in Software Engineering and Advanced Applications (SEAA), 2011 37th EUROMICRO Conf. IEEE, 2011, pp. 396–403.
-  E. Kocaguneli, T. Menzies, A. Bener, and J. Keung, “Exploiting the essential assumptions of analogy-based effort estimation,” IEEE Transactions on Software Engineering, vol. 28, pp. 425–438, 2012.
-  A. Begel and T. Zimmermann, “Analyze this! 145 questions for data scientists in software engineering,” in Proc. 36th Intl. Conf. Software Engineering (ICSE 2014). ACM, June 2014.
-  C. Theisen, K. Herzig, P. Morrison, B. Murphy, and L. Williams, “Approximating attack surfaces with stack traces,” in ICSE’15, 2015.
-  J. Hihn and T. Menzies, “Data mining methods and cost estimation models: Why is it so hard to infuse new ideas?” in 2015 30th IEEE/ACM Intl. Conf. Automated Software Engineering Workshop (ASEW), Nov 2015, pp. 5–9.
-  C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
-  S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, “Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings,” IEEE Trans. Softw. Eng., vol. 34, no. 4, pp. 485–496, jul 2008.
-  R. Krishna, T. Menzies, and W. Fu, “Too much automation? the bellwether effect and its implications for transfer learning,” in Proc. 31st IEEE/ACM Intl. Conf. Automated Software Engineering - ASE 2016. New York, New York, USA: ACM Press, 2016, pp. 122–131.
-  R. Krishna, T. Menzies, and L. Layman, “Less is more: Minimizing code reorganization using XTREE,” Information and Software Technology, mar 2017.
-  S. Mensah, J. Keung, S. G. MacDonell, M. F. Bosu, and K. E. Bennin, “Investigating the significance of the bellwether effect to improve software effort prediction: Further empirical study,” IEEE Transactions on Reliability, no. 99, pp. 1–23, 2018.
-  R. Shatnawi, “A quantitative investigation of the acceptable risk levels of object-oriented metrics in open-source systems,” IEEE Transactions on Software Engineering, vol. 36, no. 2, pp. 216–225, March 2010.
-  T. L. Alves, C. Ypma, and J. Visser, “Deriving metric thresholds from benchmark data,” in 2010 IEEE Int. Conf. Softw. Maint. IEEE, sep 2010, pp. 1–10.
-  P. Oliveira, M. T. Valente, and F. P. Lima, “Extracting relative thresholds for source code metrics,” in 2014 Software Evolution Week - IEEE Conf. Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE). IEEE, feb 2014, pp. 254–263.
-  R. Krishna and T. Menzies, “Bellwethers: A baseline method for transfer learning,” IEEE Transactions on Software Engineering, pp. 1–1, 2018.
-  T. Menzies, J. Greenwald, and A. Frank, “Data mining static code attributes to learn defect predictors,” IEEE transactions on software engineering, vol. 33, no. 1, pp. 2–13, 2007.
-  B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, “On the relative value of cross-company and within-company data for defect prediction,” Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
-  J. Nam, W. Fu, S. Kim, T. Menzies, and L. Tan, “Heterogeneous defect prediction,” IEEE Transactions on Software Engineering, vol. PP, no. 99, pp. 1–1, 2017.
-  R. Krishna, “Learning effective changes for software projects,” in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 2017, pp. 1002–1005.
-  K. Stroggylos and D. Spinellis, “Refactoring–does it improve software quality?” in Software Quality, 2007. WoSQ’07: ICSE Workshops 2007. Fifth Intl. Workshop on. IEEE, 2007, pp. 10–10.
-  B. Du Bois, “A study of quality improvements by refactoring,” 2006.
-  Y. Kataoka, T. Imai, H. Andou, and T. Fukaya, “A quantitative evaluation of maintainability enhancement by refactoring,” in Software Maintenance, 2002. Proceedings. Intl. Conf. IEEE, 2002, pp. 576–585.
-  S. Bryton and F. B. e Abreu, “Strengthening refactoring: towards software evolution with quantitative and experimental grounds,” in Software Engineering Advances, 2009. ICSEA’09. Fourth Intl. Conf. IEEE, 2009, pp. 570–575.
-  K. Elish and M. Alshayeb, “A classification of refactoring methods based on software quality attributes,” Arabian Journal for Science and Engineering, vol. 36, no. 7, pp. 1253–1267, 2011.
-  ——, “Using software quality attributes to classify refactoring to patterns.” JSW, vol. 7, no. 2, pp. 408–419, 2012.
-  M. Nayrolles and A. Hamou-Lhadj, “Clever: Combining code metrics with clone detection for just-in-time fault prevention and resolution in large industrial projects,” in Mining Software Repositories, 2018.
-  S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, 1995.
-  M. Ghallab, D. Nau, and P. Traverso, Automated Planning: theory and practice. Elsevier, 2004.
-  M. Wooldridge and N. R. Jennings, “Intelligent agents: Theory and practice,” The Knowledge Engineering Review, vol. 10, no. 2, pp. 115–152, 1995.
-  R. Bellman, “A markovian decision process,” Indiana Univ. Math. J., vol. 6, pp. 679–684, 1957.
-  E. Altman, Constrained Markov Decision Processes. CRC Press, 1999, vol. 7.
-  X. Guo and O. Hernández-Lerma, “Continuous-time markov decision processes,” Continuous-Time Markov Decision Processes, pp. 9–18, 2009.
-  T. C. Son and E. Pontelli, “Planning with preferences using logic programming,” Theory and Practice of Logic Programming, vol. 6, no. 5, pp. 559–607, 2006.
-  S. S. J. A. Baier and S. A. McIlraith, “Htn planning with preferences,” in 21st Int. Joint Conf. on Artificial Intelligence, 2009, pp. 1790–1797.
-  S. Tallam and N. Gupta, “A concept analysis inspired greedy algorithm for test suite minimization,” ACM SIGSOFT Software Engineering Notes, vol. 31, no. 1, pp. 35–42, 2006.
-  S. Yoo and M. Harman, “Regression testing minimization, selection and prioritization: a survey,” Software Testing, Verification and Reliability, vol. 22, no. 2, pp. 67–120, 2012.
-  D. Blue, I. Segall, R. Tzoref-Brill, and A. Zlotnick, “Interaction-based test-suite minimization,” in Proc. the 2013 Intl. Conf. Software Engineering. IEEE Press, 2013, pp. 182–191.
-  G. Ruhe and D. Greer, “Quantitative studies in software release planning under risk and resource constraints,” in Empirical Software Engineering, 2003. ISESE 2003. Proceedings. 2003 Intl. Symposium on. IEEE, 2003, pp. 262–270.
-  G. Ruhe, Product release planning: methods, tools and applications. CRC Press, 2010.
-  M. Harman, S. A. Mansouri, and Y. Zhang, “Search based software engineering: A comprehensive analysis and review of trends techniques and applications,” Department of Computer Science, King’s College London, Tech. Rep. TR-09-03, 2009.
-  M. Harman, P. McMinn, J. De Souza, and S. Yoo, “Search based software engineering: Techniques, taxonomy, tutorial,” Search, vol. 2012, pp. 1–59, 2011.
-  J. Krall, T. Menzies, and M. Davies, “Gale: Geometric active learning for search-based software engineering,” IEEE Transactions on Software Engineering, vol. 41, no. 10, pp. 1001–1018, 2015.
-  K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A Fast Elitist Multi-Objective Genetic Algorithm: NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 6, pp. 182–197, 2002.
-  E. Zitzler, M. Laumanns, and L. Thiele, “SPEA2: Improving the Strength Pareto Evolutionary Algorithm for Multiobjective Optimization,” in Evolutionary Methods for Design, Optimisation, and Control. CIMNE, Barcelona, Spain, 2002, pp. 95–100.
-  E. Zitzler and S. Künzli, “Indicator-Based Selection in Multiobjective Search,” in Parallel Problem Solving from Nature - PPSN VIII, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2004, vol. 3242, pp. 832–842.
-  K. Deb and H. Jain, “An Evolutionary Many-Objective Optimization Algorithm Using Reference-Point-Based Nondominated Sorting Approach, Part I: Solving Problems With Box Constraints,” Evolutionary Computation, IEEE Transactions on, vol. 18, no. 4, pp. 577–601, Aug. 2014.
-  X. Cui, T. Potok, and P. Palathingal, “Document clustering using particle swarm optimization,” in Proc. 2005 IEEE Swarm Intelligence Symposium, 2005.
-  Q. Zhang and H. Li, “MOEA/D: A Multiobjective Evolutionary Algorithm Based on Decomposition,” Evolutionary Computation, IEEE Transactions on, vol. 11, no. 6, pp. 712–731, Dec. 2007.
-  W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest, “Automatically finding patches using genetic programming,” in Proc. Intl. Conf. Software Engineering. IEEE, 2009, pp. 364–374.
-  C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer, “A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each,” in 2012 34th Intl. Conf. Software Engineering (ICSE). IEEE, jun 2012, pp. 3–13.
-  C. Le Goues, N. Holtschulte, E. K. Smith, Y. Brun, P. Devanbu, S. Forrest, and W. Weimer, “The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs,” IEEE Transactions on Software Engineering, vol. 41, no. 12, pp. 1236–1256, dec 2015.
-  A. S. Sayyad, J. Ingram, T. Menzies, and H. Ammar, “Scalable product line configuration: A straw to break the camel’s back,” in Automated Software Engineering (ASE), 2013 IEEE/ACM 28th Intl. Conf. IEEE, 2013, pp. 465–474.
-  A. Metzger and K. Pohl, “Software product line engineering and variability management: achievements and challenges,” in Proc. on Future of Software Engineering. ACM, 2014, pp. 70–84.
-  C. Henard, M. Papadakis, M. Harman, and Y. L. Traon, “Combining multi-objective search and constraint solving for configuring large software product lines,” in 2015 IEEE/ACM 37th IEEE Intl. Conf. Software Engineering, vol. 1, May 2015, pp. 517–528.
-  J. H. Andrews, F. C. H. Li, and T. Menzies, “Nighthawk: A Two-Level Genetic-Random Unit Test Data Generator,” in IEEE ASE’07, 2007.
-  J. H. Andrews, T. Menzies, and F. C. H. Li, “Genetic Algorithms for Randomized Unit Testing,” IEEE Transactions on Software Engineering, Mar. 2010.
-  T. Menzies, O. Elrawas, J. Hihn, M. Feather, R. Madachy, and B. Boehm, “The business case for automated software engineering,” in Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering. ACM, 2007, pp. 303–312.
-  B. Boehm, Software Engineering Economics. Prentice Hall, 1981.
-  B. Boehm, E. Horowitz, R. Madachy, D. Reifer, B. K. Clark, B. Steece, A. W. Brown, S. Chulani, and C. Abts, Software Cost Estimation with Cocomo II. Prentice Hall, 2000.
-  P. Green II, T. Menzies, S. Williams, and O. El-Rawas, “Understanding the value of software engineering technologies,” in Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society, 2009, pp. 52–61.
-  B. Lemon, A. Riesbeck, T. Menzies, J. Price, J. D’Alessandro, R. Carlsson, T. Prifiti, F. Peters, H. Lu, and D. Port, “Applications of Simulation and AI Search: Assessing the Relative Merits of Agile vs Traditional Software Development,” in IEEE ASE’09, 2009.
-  U. Fayyad and K. Irani, “Multi-interval discretization of continuous-valued attributes for classification learning,” NASA JPL Archives, 1993.
-  J. Nam, S. J. Pan, and S. Kim, “Transfer defect learning,” in Proc. Intl. Conf. Software Engineering, 2013, pp. 382–391.
-  J. Nam and S. Kim, “Heterogeneous defect prediction,” in Proc. 2015 10th Jt. Meet. Found. Softw. Eng. - ESEC/FSE 2015. New York, New York, USA: ACM Press, 2015, pp. 508–519.
-  X. Jing, G. Wu, X. Dong, F. Qi, and B. Xu, “Heterogeneous cross-company defect prediction by unified metric representation and cca-based transfer learning,” in FSE’15, 2015.
-  E. Kocaguneli and T. Menzies, “How to find relevant data for effort estimation?” in Empirical Software Engineering and Measurement (ESEM), 2011 Intl. Symposium on. IEEE, 2011, pp. 255–264.
-  E. Kocaguneli, T. Menzies, and E. Mendes, “Transfer learning in effort estimation,” Empirical Software Engineering, vol. 20, no. 3, pp. 813–843, jun 2015.
-  F. Peters, T. Menzies, and L. Layman, “LACE2: Better privacy-preserving data sharing for cross project defect prediction,” in Proc. Intl. Conf. Software Engineering, vol. 1, 2015, pp. 801–811.
-  R. Bender, “Quantitative risk assessment in epidemiological studies investigating threshold effects,” Biometrical Journal, vol. 41, no. 3, pp. 305–319, 1999.
-  M. Jureczko and L. Madeyski, “Towards identifying software project clusters with regard to defect prediction,” in Proc. 6th Int. Conf. Predict. Model. Softw. Eng. - PROMISE ’10. New York, New York, USA: ACM Press, 2010, p. 1.
-  T. Menzies, R. Krishna, and D. Pryor, “The PROMISE repository of empirical software engineering data,” North Carolina State University, Department of Computer Science, 2016.
-  B. Cheng and A. Jensen, “On the use of genetic programming for automated refactoring and the introduction of design patterns,” in Proc. 12th Annual Conf. Genetic and Evolutionary Computation, ser. GECCO ’10. New York, NY, USA: ACM, 2010, pp. 1341–1348.
-  M. O’Keeffe and M. O. Cinnéide, “Search-based refactoring: An empirical study,” J. Softw. Maint. Evol., vol. 20, no. 5, pp. 345–364, Sep. 2008.
-  M. K. O’Keeffe and M. O. Cinneide, “Getting the most from search-based refactoring,” in Proc. 9th Annual Conf. Genetic and Evolutionary Computation, ser. GECCO ’07. New York, NY, USA: ACM, 2007, pp. 1114–1120.
-  I. H. Moghadam, Search Based Software Engineering: Third Intl. Symposium, SSBSE 2011, Szeged, Hungary, September 10-12, 2011. Proceedings. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, ch. Multi-level Automated Refactoring Using Design Exploration, pp. 70–75.
-  M. W. Mkaouer, M. Kessentini, S. Bechikh, K. Deb, and M. Ó Cinnéide, “Recommendation system for software refactoring using innovization and interactive dynamic optimization,” in Proc. 29th ACM/IEEE Intl. Conf. Automated Software Engineering, ser. ASE ’14. New York, NY, USA: ACM, 2014, pp. 331–336.
-  J. Bansiya and C. G. Davis, “A hierarchical model for object-oriented design quality assessment,” IEEE Trans. Softw. Eng., vol. 28, no. 1, pp. 4–17, Jan. 2002.
-  T. Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F. Shull, B. Turhan, and T. Zimmermann, “Local versus Global Lessons for Defect Prediction and Effort Estimation,” IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 822–834, jun 2013.
-  R. L. Burden and J. D. Faires, Numerical Analysis: 4th Ed. Boston, MA, USA: PWS Publishing Co., 1989.
-  C. Passos, A. P. Braun, D. S. Cruzes, and M. Mendonca, “Analyzing the impact of beliefs in software project practices,” in ESEM’11, 2011.
-  M. Jørgensen and T. M. Gruschke, “The impact of lessons-learned sessions on effort estimation and uncertainty assessments,” Software Engineering, IEEE Transactions on, vol. 35, no. 3, pp. 368 –383, May-June 2009.
-  P. Devanbu, T. Zimmermann, and C. Bird, “Belief & evidence in empirical software engineering,” in Proc. 38th Intl. Conf. Software Engineering. ACM, 2016, pp. 108–119.