The field of fault prediction has produced many different metrics on which to base predictions. They range from simple size-based metrics like lines of code (LOC), complexity metrics like McCabe and Halstead, and object-oriented metrics such as coupling and cohesion, all the way to process metrics like code churn, socio-technical networks, and development history [11, 30]. In this paper, we focus on one specific category of temporal metrics, i.e., past faults [12, 15]. Rahman et al. showed that they perform similarly to the more complex BugCache algorithm and are easier to compute. Since then, multiple studies have found past faults to be good predictors of future faults [7, 8].
Based on the idea by Rahman et al., Lewis et al. developed the Bugspots algorithm, which added a weight decay for older faults. In this paper, we present Linespots, a novel fault prediction algorithm based on Bugspots. As the name hints, Linespots operates at line-level granularity instead of the file level used by Bugspots. This focus on lines follows from the insights of Kochhar et al., who showed that developers prefer finer granularities in results, and Hata et al., who showed a trend toward better performance when using finer granularities.
In this paper, we present an empirical study of Linespots’ predictive performance and runtime compared to Bugspots. We focus on predictive performance and runtime because they are considered the main benefits of Bugspots. To conduct such a comparative analysis, we have spent considerable effort on collecting representative data.
While there are pre-built datasets to benchmark fault prediction metrics and algorithms (e.g., PROMISE), many of them do not provide the necessary information for Linespots. Additionally, more involved datasets, like Defects4J, lack sample size. To avoid such limitations, this study uses an open-source dataset and self-built validation data. While self-built validation data can be unreliable, we used high-quality samples to vet the collected validation data.
As a novelty in the field of fault prediction, we use Bayesian data analysis to model the effects of different algorithms and data quality and find improvements in critical areas for Linespots over Bugspots. When applying Bayesian data analysis, we follow the recommendations of Wasserstein et al. to abandon the use of $p$-values and point estimates.
The key contributions of this paper are:
The proposal of Linespots, a novel fault prediction algorithm.
Presentation of the results from a comparative study analyzing performance and runtime using Bugspots and Linespots as artifacts.
Use of Bayesian data analysis and Directed Acyclic Graphs (DAGs) to model different algorithms’ effects on evaluation metrics.
Validation of a self-built dataset with a high-quality sub-sample.
Finally, to our knowledge, this study uses the largest dataset published in the field to date, measured by the revision count of the included projects.
The research questions we pose are:
RQ1: How does the predictive performance of Linespots compare to Bugspots?
RQ2: How does the runtime performance of Linespots compare to Bugspots?
In the following background section (Section 2), we will present the history and details of Bugspots and Linespots. Section 3 provides details concerning study design, including the research questions, our process for gathering and preparing the dataset, and, finally, a high-level explanation of our analysis.
We present our results in Section 4 and explain threats to the validity of the study in Section 5. Following that, we discuss the implications of these results in Section 6. In Related Work (Section 7), we explain the connection to other work. Finally, we present a summary and our conclusions, together with suggestions for future work, in Section 8.
Before going into details about the Bugspots and Linespots algorithms, we want to present our use of the terms metric, model, and algorithm to avoid confusion. As the terms have not been used with the same meaning consistently throughout past works in fault prediction and localization, these are the definitions for how we use the terms:
Fault prediction metric: These are metrics like LOC, complexity, class size, and past-faults, that can be used to predict faults. Linespots and Bugspots both output a version of the past-faults fault prediction metric.
Evaluation metric: Evaluation metrics are the metrics we calculate to compare different fault prediction metrics’ performance. The evaluation metrics we use are described in Section 3.1.
Model: These are the statistical models we built to analyze the effects of the fault prediction metrics on the evaluation metrics, as described in Section 3.4.
Fault prediction model: A combination of multiple fault prediction metrics that ideally achieves better predictive performance than the individual fault prediction metrics.
Using past faults to predict future faults was pioneered by Hassan et al. and Kim et al. Rahman et al. showed that ranking files by the number of past faults could perform as well as their more complex FixCache algorithm when predicting future faults, and D’Ambros et al. compared past-fault metrics with other fault prediction metrics and showed that they could perform well. To the basic idea of ranking files by the number of past faults, Lewis et al. added a weight decay so that more recent faults would impact the ranking more than faults further in the past. This was done to allow once-faulty files to move down in rank if they had been fixed with no new faults appearing. The algorithm by Lewis et al. has since been used in other studies [8, 39, 42, 43]. Finally, Grigorik wrote the Bugspots implementation, which is the namesake of our newly proposed algorithm, Linespots. A pseudocode representation of Bugspots is shown in Algorithm 1.
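The file-level scoring idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: it assumes the logistic weighting $1/(1+e^{-12t+12})$ commonly associated with Bugspots, where $t \in [0, 1]$ is the normalized age of a fix commit, and it derives $t$ from the commit index rather than a timestamp. The `FixCommit` record and all function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, NamedTuple


class FixCommit(NamedTuple):
    """Hypothetical record for one fault-fixing commit."""
    index: int        # position in the inspected history, 0 = oldest
    files: List[str]  # files touched by the commit


def decay_weight(t: float) -> float:
    """Logistic weight for a normalized age t in [0, 1] (1 = most recent).

    Assumed weighting function; older fixes contribute almost nothing.
    """
    return 1.0 / (1.0 + math.exp(-12.0 * t + 12.0))


def bugspots_scores(fixes: List[FixCommit], depth: int) -> Dict[str, float]:
    """Sum the decayed weights of all fix commits per file."""
    scores: Dict[str, float] = defaultdict(float)
    for fix in fixes:
        # Index-based age: oldest commit maps to 0, newest to 1.
        t = fix.index / (depth - 1) if depth > 1 else 1.0
        for path in fix.files:
            scores[path] += decay_weight(t)
    return dict(scores)


history = [
    FixCommit(index=0, files=["a.py"]),
    FixCommit(index=9, files=["b.py"]),
]
scores = bugspots_scores(history, depth=10)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy history, the recently fixed file ranks above the file whose only fix lies at the start of the inspected window, which is the effect the weight decay is meant to achieve.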
The motivation behind Linespots was that the results of Bugspots have too coarse a resolution. At the time of Bugspots’ creation, file-level granularity was still commonly used for fault prediction. Achieving more fine-grained results became popular after Hata et al. showed that finer granularity gave better prediction results. One benefit of Bugspots is that it is language-agnostic and only requires version history to work. We wanted to keep that aspect.
Each Git commit contains some metadata and the file diffs, which describe the changes that the commit applies per file. Each file diff is further split into hunks, i.e., sections of changed code with some padding before and after for context. Linespots treats all lines that are part of a hunk in a fault-fixing commit as being part of the fault. This decision was made out of convenience. To improve the reliability of our hunk-based approach, we use the Histogram diff algorithm, following Nugroho et al., who found that it better reflects the intention of developers for code changes and recommend it over the default Myers diff algorithm when mining repositories for code changes.
To keep track of the scores, Linespots has to track line movements across commits. This posed new challenges, as Git does not offer support for modified lines; instead, lines in a hunk are either untouched, added, or removed. Removed lines are straightforward to handle by deleting the corresponding score. For added lines, Linespots uses the old hunk’s average score as the past score. The added lines then get the same weighted increase in score as the untouched lines, which Linespots treats as faulty.
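The per-hunk bookkeeping described above can be sketched as follows. This is an illustrative reading of the text, not the Linespots implementation: the function name, the one-character encoding of hunk lines, and the fallback score of 0.0 for a hunk without any old lines are assumptions.

```python
from typing import List


def apply_fix_hunk(scores: List[float], old_start: int, ops: str,
                   weight: float) -> List[float]:
    """Return updated per-line scores after one hunk of a fault-fixing commit.

    scores    : current score of every line in the old file version
    old_start : 0-based index of the hunk's first line in the old file
    ops       : one character per hunk line: ' ' untouched, '-' removed, '+' added
    weight    : decayed weight of the fault-fixing commit
    """
    old_len = sum(1 for op in ops if op in " -")
    old_hunk = scores[old_start:old_start + old_len]
    # Added lines inherit the old hunk's average score (assumed 0.0 if empty).
    average = sum(old_hunk) / len(old_hunk) if old_hunk else 0.0

    new_hunk: List[float] = []
    cursor = 0  # position inside old_hunk
    for op in ops:
        if op == " ":
            new_hunk.append(old_hunk[cursor] + weight)
            cursor += 1
        elif op == "-":
            cursor += 1  # removed line: its score is dropped
        elif op == "+":
            new_hunk.append(average + weight)
        else:
            raise ValueError(f"unknown hunk op: {op!r}")
    return scores[:old_start] + new_hunk + scores[old_start + old_len:]


before = [0.0, 1.0, 3.0, 0.0]
after = apply_fix_hunk(before, old_start=1, ops=" -+", weight=0.5)
```

Here the hunk covers the two old lines with scores 1.0 and 3.0. The untouched line keeps its score plus the weight, the removed line disappears, and the added line starts from the hunk average of 2.0 plus the weight. Lines outside the hunk are unchanged.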
For the identification of fix-inducing commits, Linespots relies on pattern matching of the commit messages. While D’Ambros et al. showed that simple string matching could be unreliable, it allows us to have a larger sample size. This sounds like a quantity/quality trade-off, but we try to improve the quality by choosing individual regular expressions (regexes) for each project and by using projects that follow very stringent commit message conventions (CMCs), as described in Section 3.2.
This study gathered samples from different projects for both Bugspots and Linespots and compared their predictive and runtime performance. This section describes the study design, our sampling strategy, and our modeling and analysis approach. Figure 1 shows an overview of the complete study.
3.1 Study design
The goal of this study is to evaluate and compare Bugspots and Linespots concerning predictive performance and runtime. To focus our efforts, we investigate the following research questions:
RQ1: How does the predictive performance of Linespots compare to Bugspots?
RQ2: How does the runtime performance of Linespots compare to Bugspots?
To evaluate the predictive performance, we use common evaluation metrics from the field of fault prediction:
3.1.1 Area under the receiver operating characteristic curve (AUROC).
The AUROC is a way to assess the performance of a ranking algorithm by calculating the true positive rate (recall) and the false positive rate for all possible cut-off points and plotting them against each other. An optimal ranking would result in an area of 1 under the curve, with only faulty elements at the top of the ranking. While we have not extensively tested this, we assume that it is not valid to compare AUROC values between studies that use different granularities. AUROC values are usually not normalized, so any differences in the artifacts might bias the comparison.
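As a small illustration of what the AUROC measures for a ranked list, the sketch below uses the pairwise formulation: the AUROC equals the probability that a randomly chosen faulty element is ranked above a randomly chosen non-faulty one, with ties counted as one half. The function name is hypothetical, and the quadratic loop is chosen for readability rather than speed.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC of a ranking given per-element scores and 0/1 fault labels."""
    faulty = [s for s, y in zip(scores, labels) if y]
    clean = [s for s, y in zip(scores, labels) if not y]
    if not faulty or not clean:
        raise ValueError("need at least one faulty and one non-faulty element")
    wins = 0.0
    for f in faulty:
        for c in clean:
            if f > c:
                wins += 1.0   # faulty element ranked above the clean one
            elif f == c:
                wins += 0.5   # tie: counted as half a win
    return wins / (len(faulty) * len(clean))


perfect = auroc([0.9, 0.8, 0.1], [1, 1, 0])
inverted = auroc([0.1, 0.9], [1, 0])
uninformative = auroc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

A ranking with all faulty elements on top yields 1, a fully inverted ranking yields 0, and a ranking that cannot separate the two groups yields 0.5.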
3.1.2 Area under the cost-effectiveness curve (AUCEC).
The AUCEC is a performance measure that accounts for effort. The cost-effectiveness curve results from going through the result list of a fault prediction algorithm and plotting the proportion of LOC on the $x$-axis and the proportion of found faults on the $y$-axis. Better-performing algorithms result in a curve with a steeper slope. To compare two curves, the area under each curve can be calculated as a summary, since a steeper curve also leads to a larger area under the curve. To improve the reliability when comparing different projects, Arisholm et al. proposed normalizing the AUCEC to the optimal curve achievable for the project as follows:
\[
\mathit{CE}_{\pi} = \frac{\mathit{AUC}_{\pi}(\text{model}) - \mathit{AUC}_{\pi}(\text{baseline})}{\mathit{AUC}_{\pi}(\text{optimal}) - \mathit{AUC}_{\pi}(\text{baseline})}
\]
where $\mathit{AUC}_{\pi}(c)$ is the area under the curve $c$ (baseline, model, or optimal) for a given percentage $\pi$ of LOC. The baseline curve represents faults randomly distributed throughout the results such that the cost-effectiveness curve matches $y = x$. The model curve is the one derived from the results of Bugspots and Linespots in this case.
The optimal curve has all faults at the top of the result list such that the faults with the fewest lines come first. With the high number of faultless lines compared to faulty ones, the optimal AUCEC approaches one. As the optimal curve depends on the granularity used, the normalized result depends on it as well. Hence, these normalized AUCEC values cannot be compared directly between studies using different granularities.
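The normalization can be sketched for a line-level ranking as follows. The sketch assumes the usual form of the normalization, (model − baseline) / (optimal − baseline), a trapezoidal area, one unit of effort per line, and a cut-off that is rounded down to whole lines; the function names are hypothetical.

```python
from typing import Sequence


def aucec(labels: Sequence[int], cutoff: float = 1.0) -> float:
    """Trapezoidal area under the cost-effectiveness curve up to `cutoff`.

    labels : 1 for a faulty line, 0 otherwise, in ranked order (best first)
    cutoff : fraction of LOC to inspect, e.g. 0.05 for AUCEC5
    """
    n = len(labels)
    total = sum(labels)
    if n == 0 or total == 0:
        raise ValueError("need at least one line and one faulty line")
    area, found, prev_y = 0.0, 0, 0.0
    for label in labels[: int(cutoff * n)]:
        found += label
        y = found / total                 # proportion of faulty lines found
        area += (prev_y + y) / 2.0 / n    # one line = 1/n on the x-axis
        prev_y = y
    return area


def normalized_aucec(labels: Sequence[int], cutoff: float = 1.0) -> float:
    """Normalize the model area between the random baseline and the optimum."""
    n = len(labels)
    x_end = int(cutoff * n) / n
    baseline = x_end ** 2 / 2.0           # area under y = x up to x_end
    optimal = aucec(sorted(labels, reverse=True), cutoff)
    if optimal == baseline:
        raise ValueError("optimal and baseline curves coincide")
    return (aucec(labels, cutoff) - baseline) / (optimal - baseline)


best = normalized_aucec([1, 0, 0, 0])
worst = normalized_aucec([0, 0, 0, 1])
```

With this normalization, an optimal ranking scores 1, a ranking equivalent to the random baseline scores 0, and rankings worse than random become negative.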
As mentioned, AUCEC values are often calculated for only parts of the $x$-axis, e.g., AUCEC1 is the area up to 1% of LOC and AUCEC5 up to 5% of LOC. The reasoning behind focusing on early parts of result lists relies on results from Parnin and Orso, as well as Long and Rinard. They found that developers would only inspect the first few elements of a result list, and even automated tools performed best with result numbers in the low hundreds.
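3.1.3 $E_{inspect}@n$.
$E_{inspect}$ measures the expected rank of the first faulty element in a ranked list. Assuming a group of $t$ tied elements starting at position $P_{start}$ that contains $t_f$ faulty elements, and that there is no faulty element before $P_{start}$, $E_{inspect}$ is defined as:
\[
E_{inspect} = P_{start} + \sum_{k=1}^{t - t_f} k \, \frac{\binom{t-k-1}{t_f-1}}{\binom{t}{t_f}}
\]
Based on $E_{inspect}$ and the acc@n measure by Le et al., $E_{inspect}@n$ counts the number of faults that were within the top $n$ positions of the ranked list. A small value of $n$ is reasonable for users, as most users will only inspect the first few entries. For automated tools, Long and Rinard propose a larger value of $n$, which they found to work well.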
We propose the $E_{inspect}F$ evaluation metric as the lowest $E_{inspect}$ value of all faults, i.e., the absolute number of LOC that has to be inspected to encounter the first fault. If a developer inspects the first $n$ elements of the result list, that is the threshold to meet. The same holds for automated tools. While proportional evaluation metrics can be useful to compare different fault prediction metrics and models across multiple projects, we believe the absolute numbers are more relevant for most use cases.
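The expected rank of the first faulty element within a group of tied elements can be computed directly. The sketch below assumes that the tied elements are inspected in uniformly random order, so that the probability of seeing exactly $k$ non-faulty elements first is $\binom{t-k-1}{t_f-1} / \binom{t}{t_f}$. The helper names are hypothetical, and the two derived helpers mirror the top-$n$ count and the first-fault measure described above.

```python
from math import comb
from typing import Sequence


def expected_first_faulty_rank(p_start: int, t: int, t_f: int) -> float:
    """Expected rank of the first faulty element in a tie group.

    p_start : rank of the first element of the tie group
    t       : number of tied elements
    t_f     : number of faulty elements among them (at least 1)
    """
    if not 1 <= t_f <= t:
        raise ValueError("need 1 <= t_f <= t")
    offset = sum(
        k * comb(t - k - 1, t_f - 1) / comb(t, t_f)
        for k in range(1, t - t_f + 1)
    )
    return p_start + offset


def faults_within_top_n(expected_ranks: Sequence[float], n: int) -> int:
    """Number of faults whose expected rank lies within the top n positions."""
    return sum(1 for rank in expected_ranks if rank <= n)


def first_fault_rank(expected_ranks: Sequence[float]) -> float:
    """Lowest expected rank over all faults."""
    return min(expected_ranks)


ranks = [
    expected_first_faulty_rank(1, 3, 1),
    expected_first_faulty_rank(50, 4, 2),
]
top_10 = faults_within_top_n(ranks, 10)
first = first_fault_rank(ranks)
```

A useful sanity check is that the sum reduces to the closed form $(t - t_f) / (t_f + 1)$, the expected number of non-faulty elements before the first faulty one.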
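3.1.4 EXAM.
The EXAM score is defined as the proportion of a project that has to be examined to find the first line of a fault, averaged across all faults. Thus, an algorithm with a lower EXAM score finds faults faster, on average, than one with a higher EXAM score. In terms of $E_{inspect}$, the EXAM score can be defined as
\[
\mathrm{EXAM} = \frac{1}{|F|} \sum_{f \in F} \frac{E_{inspect}(f)}{N}
\]
where $F$ are the faults and $N$ is the total number of lines in the project.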
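As a brief numeric illustration of the verbal definition, the sketch below averages, over all faults, the proportion of the project that has to be examined before the first line of each fault is reached. The function name and the input values are hypothetical.

```python
from typing import Sequence


def exam_score(first_line_ranks: Sequence[float], total_lines: int) -> float:
    """Mean proportion of lines examined to reach the first line of each fault.

    first_line_ranks : (expected) rank of the first faulty line, one per fault
    total_lines      : number of lines in the project
    """
    if not first_line_ranks or total_lines <= 0:
        raise ValueError("need at least one fault and a positive line count")
    return sum(rank / total_lines for rank in first_line_ranks) / len(first_line_ranks)


# Two faults whose first lines sit at ranks 10 and 30 in a 100-line project.
example = exam_score([10, 30], total_lines=100)
```

In this example, 10% and 30% of the project have to be examined for the two faults, giving an average of 20%.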
From our experience, the EXAM score is a good approximation of the AUCEC over the entire result list (AUCEC100) in our dataset. For this reason, we only report AUCEC1 and AUCEC5, and use the EXAM score instead of AUCEC100.
3.1.5 Research question mapping.
While the AUROC and EXAM values average the performance across the entire result list, in realistic scenarios the performance in the early parts of the results is more important. This is why we also use the AUCEC5, AUCEC1, the two $E_{inspect}@n$, and the $E_{inspect}F$ measures. These metrics allow us to answer the first research question in a more nuanced way and with more certainty.
For the second research question, we measure the runtime of just Bugspots and Linespots, excluding the remaining parts of the evaluation code. (The evaluation suite’s code, including metric calculation, is available together with the reference implementation.)
3.1.6 Comparing granularities.
As Bugspots and Linespots report their results on different granularity levels, we have to be careful when comparing them. One way to compare them is to transform the file-based results of Bugspots into line-based results, as proposed by Zou et al. We do this by setting each line’s score to the corresponding file’s score. This results in lists of ranked lines for both Bugspots and Linespots and, thus, all metrics can be calculated. Depending on exactly how the metrics are calculated, this kind of transformation can impact the results for Bugspots. We argue that this does not put Bugspots at a disadvantage, as we found that past research had calculated results so that files could only be inspected as a whole. After transforming the results to line granularity, metrics like EXAM will show better performance for Bugspots due to the way blocks with the same score are handled. While we do not think this will influence the conclusions, Bugspots might perform better with this kind of transformation.
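The transformation from file-level to line-level results can be sketched as follows. The function name and the tuple layout are hypothetical; the only step taken from the text is that every line receives the score of its file.

```python
from typing import Dict, List, Tuple


def expand_file_scores(
    file_scores: Dict[str, float], line_counts: Dict[str, int]
) -> List[Tuple[str, int, float]]:
    """Turn per-file scores into a ranked list of (path, line number, score)."""
    lines = [
        (path, line_number, score)
        for path, score in file_scores.items()
        for line_number in range(1, line_counts[path] + 1)
    ]
    # Stable sort: lines of the same file keep their order and form a tie block.
    lines.sort(key=lambda item: item[2], reverse=True)
    return lines


ranked_lines = expand_file_scores(
    {"a.py": 0.5, "b.py": 0.1}, {"a.py": 2, "b.py": 1}
)
```

Every file becomes one block of tied lines, which is why the handling of ties in the evaluation metrics matters for the transformed Bugspots results.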
There are two options for what kind of dataset to use in the field of fault prediction: either a released repository of artifacts and faults or a self-built dataset based on repository mining. Examples of released sets are PROMISE or Defects4J. The benefit of using such repositories is the ease of use, reliability, and the comparability they offer between studies using the same repositories. The drawback, however, is the limitation of the content. Usually, only a limited number of projects, revisions, and information is available, be it the limitation to a single programming language, a specific granularity, or a lack of history and process information. Moreover, while faults in those repositories usually are verified, that can lead to a false sense of security regarding artifacts marked as non-faulty. This could lead to an inflation of wrongly classified false positives, as algorithms find real faults that are not marked as such in the repository.
The repository mining approach comes with a different set of trade-offs. While the dataset’s size can be almost arbitrarily big, it is usually limited to open-source projects. The quantity of data is usually higher than for a pre-built dataset, but the quality might suffer. For this study, we wanted to build a representative sample for, at least, open-source projects. While there have been alternatives growing in recent years, GitHub remains the biggest collection of open-source projects, which is why we collected our sample of projects from there. (GitHub hosted more repositories than GitLab at the time of writing.) Ultimately, we applied a combination of different sampling strategies, as described by Baltes and Ralph.
First, we applied cluster sampling by randomly choosing projects from the top-starred projects on GitHub. While limited to the top-starred projects, we assume that this improves our generalizability.
We then used purposive sampling by collecting a number of projects from related studies such as the works of Rahman et al., Tóth et al., D’Ambros et al., and Zou et al. These projects allow us to better connect this study’s results with the broader field of fault prediction and localization. To those, we added further projects to obtain a more diverse sample, e.g., from the web-commerce domain.
All of the projects were filtered to meet our requirements concerning commit count and the reliability of commit messages. As the reliability of commit messages is a threat to our validity, we then searched for projects that followed strict CMCs and also enforced them. In the end, the projects shown in Table 1 were added to the sample.
3.2.1 Identification of fix-inducing commits.
To identify fix-inducing commits, we performed pattern matching on commit messages. While D’Ambros et al. showed that simple string matching could be unreliable, we believe that this is at least partially due to poorly chosen regular expressions (regexes), since we see evidence of other studies catching unnecessary false positives this way [39, 42]. To improve reliability, we used separate regexes for each project, derived from studying past commits (also shown in Table 1). The table also shows that while the three CMCs differ in detail, they all use the same format to denote fix-inducing commits.
| Project | Language | Fix keywords (regex alternatives) | CMC | Samples | Source |
|---|---|---|---|---|---|
| broadleafcommerce | Java | fix, fixes, fixed | None | 3 | Author |
| ceylon-ide-eclipse | Java | fix, fixed | None | 3 | Past Work |
| closure-compiler | Java | fix, fixed | None | 3 | Past Work |
| coala | Python | Fixes | None | 2 | Author |
| evolution | C | fix, fixes | None | 3 | Past Work |
| httpd | C | fix | None | 3 | Random |
| jfreechart | Java | fix, fixed | None | 1 | Past Work |
| server | C++ | fix, fixed, fixing | None | 3 | Author |

Notes: coala: the main author is a maintainer of coala. httpd: chosen randomly, but used in past work as well. server: MariaDB Server.
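Per-project pattern matching on commit messages can be sketched as follows. The patterns are hypothetical and only modeled on the keywords listed in Table 1; in particular, the word boundaries and the choice of case sensitivity are assumptions made for this illustration, not the regexes used in the study.

```python
import re
from typing import List

# Hypothetical per-project patterns, modeled on the keywords in Table 1.
PROJECT_PATTERNS = {
    # Assumed case-insensitive, with word boundaries to avoid "prefix".
    "jfreechart": re.compile(r"\b(fix|fixed)\b", re.IGNORECASE),
    # Assumed case-sensitive keyword.
    "coala": re.compile(r"\bFixes\b"),
}


def is_fix_commit(project: str, message: str) -> bool:
    """True if the commit message matches the project's fix pattern."""
    return PROJECT_PATTERNS[project].search(message) is not None


def select_fix_commits(project: str, messages: List[str]) -> List[str]:
    """Keep only the messages that are identified as fix commits."""
    return [m for m in messages if is_fix_commit(project, m)]


selected = select_fix_commits(
    "jfreechart",
    ["Fixed NPE in renderer", "Add prefix option", "Update docs"],
)
```

The second message shows why a bare substring search is fragile: "prefix" contains "fix" but does not describe a fix, which is the kind of false positive a carefully chosen pattern avoids.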
We use the same commit depth for both algorithms, namely the default of the Bugspots reference implementation used by other studies. A preliminary investigation showed no improved performance for increased depths. While Lewis et al. suggest tuning the weighting function for each project, we kept the default values to allow for better comparability with other studies using them. However, we calculate the ages of commits based on their index instead of their Unix timestamps, as the use of timestamps resulted in some corner-case problems with rewritten or merged histories. (We did not see a significant difference in results due to this change.) As our origins, i.e., the commits from which to run the algorithms, we randomly chose up to three commits from the entire project history, under the condition that neither the depth commits nor the pseudo-future commits used for validation overlapped between samples (i.e., stratified random sub-sampling).
3.3 Validation data
As we do not have a pre-built repository of faults to validate our predictions, we had to develop our own. We used the commits after the origin commit as a pseudo future. We then used the same pattern matching and line tracking as Linespots uses to identify lines that already existed at the origin commit and were later removed during a fix-inducing commit. This follows Pearson et al. and Le et al., in that faulty lines are those that get removed; recall that Git does not track modified lines, only added and removed ones. In case only new lines are added, Pearson et al. propose to tag the line immediately following the newly added lines as faulty. Based on this collection of faulty lines, we could calculate the evaluation metrics.
3.3.1 Determining when a fault was predicted.
The last missing piece for evaluation is to decide at what point a fault counts as predicted. With coarser granularities like modules, it is rather straightforward: a proposed module can either be faulty or not, although faults involving multiple modules would pose problems.
It becomes even more complicated with lines, as most faults will consist of multiple lines, and it is not clear if evaluating on line- or fault-level is the right choice, if such a thing even exists. Following Zou et al., we evaluate on the fault level, determining how well the algorithms predicted faults instead of individual faulty lines. Both Pearson et al. and Meng et al. argue that a fault is predicted if any individual line is predicted. An alternative was proposed by Rahman et al. in the form of partial credit, where for each line, a partial prediction credit is given. This might be a more honest representation of performance. Similarly, Pearson et al. present their results for the best-case, average-case, and worst-case scenarios. The best case is similar to what we use, where a fault counts as predicted with just a single line. The worst case requires all lines for a fault to count as predicted, while the average case requires half of the lines.
While the idea of giving partial credit, or presenting performance with different thresholds, could paint a more balanced picture of performance, it is not compatible with all fault prediction metrics that are commonly used. Furthermore, this can add substantial computational requirements, especially when working on line-level granularity.
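The three thresholds discussed above (a single line, half of the lines, all lines) can be expressed with one parameterized check. This is a hedged sketch: the function name is hypothetical, and rounding the required number of lines up is an assumption.

```python
import math
from typing import Set


def fault_predicted(fault_lines: Set[int], inspected: Set[int],
                    required_fraction: float = 0.0) -> bool:
    """Decide whether a fault counts as predicted.

    fault_lines       : line identifiers that belong to the fault
    inspected         : line identifiers within the inspected part of the ranking
    required_fraction : 0.0 = best case (any line), 0.5 = average case,
                        1.0 = worst case (all lines)
    """
    if not fault_lines:
        raise ValueError("a fault needs at least one line")
    # At least one line is always required; fractions are rounded up (assumed).
    needed = max(1, math.ceil(required_fraction * len(fault_lines)))
    return len(fault_lines & inspected) >= needed


fault = {10, 11, 12}
best_case = fault_predicted(fault, {1, 2, 10}, 0.0)
average_case = fault_predicted(fault, {1, 2, 10}, 0.5)
worst_case = fault_predicted(fault, {10, 11, 12}, 1.0)
```

With one of three faulty lines inspected, the fault counts as predicted only under the best-case criterion, which is the criterion used in this study.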
To analyze Bugspots’ and Linespots’ effects on the evaluation metrics, we use Bayesian data analysis as outlined by Schad et al. By doing so, we follow the recommendation to move away from $p$-value-based reporting, as called for by Wasserstein et al. In this paper, we only give a summary of the process. Documenting the entire process in detail would exceed this paper’s scope; thus, we instead provide a replication package.
Based on the work of Pearl [28, 27], we started by building a Directed Acyclic Graph (DAG) to represent the causal relationships of our measured variables, as shown in Figure 2. We can then query the graph and use do-calculus to check if the query can be answered. Below is the query for a statistical model that tries to estimate the algorithms’ causal effect on the evaluation metrics while controlling for project, language, LOC, and fix count. More precisely, it is the probability of the evaluation metric EM given treatment A, controlling for P, L, LOC, and FC:
\[
P(EM \mid do(A), P, L, LOC, FC)
\]
The do-operator indicates an intervention or treatment; in our case, the use of either Bugspots or Linespots. The goal of applying the do-calculus is to eliminate the do-operator, so that we are left with classical probability expressions. In our case, we apply the second rule, which states that
\[
P(Y \mid do(X), Z) = P(Y \mid X, Z)
\]
if Z satisfies the back-door criterion. The back-door criterion is satisfied if a set Z of variables blocks all back-door paths from X to Y. Back-door paths are paths that start with an arrow pointing into X (the algorithm in our case). Since the algorithm node has only a single, outgoing arrow, there are no possible back-door paths in the graph. Hence, the back-door criterion is satisfied for any combination of the four control variables. If we apply the rule to our query, we get the following:
\[
P(EM \mid do(A), P, L, LOC, FC) = P(EM \mid A, P, L, LOC, FC)
\]
With no do-operator left, we confirm that we can measure the causal effect of the algorithm on the evaluation metrics controlling for project, language, LOC, and fix count. (Disregarding omission bias, which we will discuss later.)
This technique contrasts with the practice of just adding all available predictors to a model, and reduces the risk of confounding and bias, based on the assumptions. For a good primer on the use of DAGs, we recommend .
With the DAG’s possible models, we then chose the likelihood according to the assumed underlying data generation process. Finally, we sampled from the models using Hamiltonian Monte Carlo. Concerning priors, which we set on the model, we aimed to follow recommended practices. We verified that the combinations of priors were uniform on the outcome scale by conducting prior predictive checks.
After compiling and sampling multiple models per evaluation metric using brms and Stan, we ensured all necessary model diagnostics passed and then compared the models’ relative out-of-sample prediction capabilities (for details, please see the replication package). The results we report are from the models that showed the best relative out-of-sample prediction capabilities.
Below is the definition of one of the models used in our analysis. We will next use it to explain our approach.
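\begin{align}
\mathrm{EXAM}_i &\sim \mathrm{Beta}(\mu_i, \phi) \\
\mathrm{logit}(\mu_i) &= \alpha + \beta_A \, \mathrm{Algorithm}_i + \beta_L \, \mathrm{LOC}_i \\
&\quad + \beta_F \, \mathrm{FixCount}_i + \alpha_{\mathrm{PROJECT}[i]} \\
\alpha &\sim \mathrm{Normal}(-1, 1) \\
\beta_A, \beta_L, \beta_F &\sim \mathrm{Normal}(0, 0.15) \\
\alpha_{\mathrm{PROJECT}} &\sim \mathrm{Weibull}(2, 1) \quad \text{for } \mathrm{PROJECT} = 1, \dots, 32 \\
\log(\phi) &\sim \mathrm{Normal}(50, 20)
\end{align}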
In the first line, we define the use of a Beta distribution to model the individual EXAM scores. We use a Beta distribution, as EXAM scores can be any real number in the interval $(0, 1)$. We further use a parameterization of the Beta distribution with a mean $\mu$ and a precision parameter $\phi$. The next two lines are the linear model that we use to approximate $\mu$. As the Beta distribution’s mean has to lie in $(0, 1)$, we use a logit link function to transform the results of the linear model to that interval.
In the linear model, we then use a global intercept $\alpha$ and slopes for algorithm ($\beta_A$), LOC ($\beta_L$), and the fix count ($\beta_F$). Finally, we have varying intercepts $\alpha_{\mathrm{PROJECT}}$ per project. The idea is that we model a deviation from the global intercept for each project. Some projects we will be able to estimate quite precisely, and other projects will then learn something from these projects (a concept called partial pooling, used mainly to avoid overfitting, i.e., learning too much from the data).
The fourth line is the prior for the global intercept $\alpha$, in this case a normal distribution with a mean of $-1$ and a standard deviation of $1$. We derived the mean from experience with our past work, where the mean EXAM value corresponded to approximately $-1$ on the logit scale.
The priors on the fifth and sixth lines were chosen iteratively using prior predictive checks to allow the model to explore the outcome space uniformly. The prior for $\phi$ on the last line uses a log link by default, and we set a wide prior, as we assume most of the values to be small and somewhat concentrated, which leads to higher values on the log scale.
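To make the mean-precision parameterization concrete, the sketch below converts a mean $\mu$ and a precision $\phi$ into the two shape parameters of a standard Beta distribution ($\mu\phi$ and $(1-\mu)\phi$) and draws simulated scores for a given linear predictor. It only uses the Python standard library and is not the brms/Stan model; the function names, the fixed precision of 50, and the number of draws are arbitrary choices for illustration.

```python
import math
import random
from typing import List


def inv_logit(x: float) -> float:
    """Map a value from the logit scale to the (0, 1) outcome scale."""
    return 1.0 / (1.0 + math.exp(-x))


def draw_scores(linear_predictor: float, phi: float, n: int,
                rng: random.Random) -> List[float]:
    """Draw n scores from Beta(mu * phi, (1 - mu) * phi)."""
    mu = inv_logit(linear_predictor)
    shape1 = mu * phi          # first shape parameter
    shape2 = (1.0 - mu) * phi  # second shape parameter
    return [rng.betavariate(shape1, shape2) for _ in range(n)]


rng = random.Random(1)  # fixed seed for reproducibility
# Linear predictor of -1 on the logit scale, i.e., a mean of about 0.27.
draws = draw_scores(linear_predictor=-1.0, phi=50.0, n=20000, rng=rng)
sample_mean = sum(draws) / len(draws)
```

The sample mean of the draws lies close to `inv_logit(-1.0)`, and a larger $\phi$ concentrates the draws more tightly around that mean. Repeating such draws with parameters sampled from the priors is the basic idea behind a prior predictive check.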
Before presenting the results, we want to provide some details on how to interpret the results, as the use of Bayesian data analysis is a novelty in the field of fault prediction, as far as we know. We present interesting effects in two ways: 1. the effects of a model as-is and, 2. as conditional effects.
While one can usually present model effects as-is, some link functions increase the possibility of misinterpretation. For all models where the outcomes range between 0 and 1, we use Beta likelihoods and a logit link function to map the result of the linear model to the $(0, 1)$ scale. However, the effect a parameter has on the outcome scale depends on where the model lies on that scale, as the mapping is not linear.
To illustrate this, assume a value near the middle of the logit scale. Adding an effect of a given size there changes the value on the outcome scale noticeably. However, if the starting value lies further from the middle of the logit scale and we add the same effect, the resulting difference on the outcome scale is considerably smaller.
Figure 3 shows a logistic curve with the example points. It shows how the differences on the $y$-axis become smaller with a larger distance from the middle. So while effects on the logit scale can still be used to make statements such as “Algorithm $A$ has a positive effect on metric $M$”, the size of that effect on the outcome scale is not as intuitive as when designing models employing an identity link. (Some of the models use a different link function instead of the logit, and while that link does not behave exactly like the logit, it shares the nonlinearity.)
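The nonlinearity can be checked numerically. The values below are illustrative assumptions and not taken from the paper: a starting point of 0 or 3 on the logit scale and an effect of 1.

```python
import math


def inv_logit(x: float) -> float:
    """Map a value from the logit scale to the (0, 1) outcome scale."""
    return 1.0 / (1.0 + math.exp(-x))


def outcome_difference(start: float, effect: float) -> float:
    """Change on the outcome scale caused by `effect` on the logit scale."""
    return inv_logit(start + effect) - inv_logit(start)


# Same effect on the logit scale, different starting points (assumed values).
near_middle = outcome_difference(start=0.0, effect=1.0)
far_from_middle = outcome_difference(start=3.0, effect=1.0)
```

Starting in the middle of the scale, the effect of 1 moves the outcome from 0.5 to about 0.73, a difference of about 0.23. Starting at 3, the same effect moves the outcome from about 0.95 to about 0.98, a difference of only about 0.03.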
One pragmatic approach to solve this is to use conditional effects. Conditional effects set all model predictors, besides the one we are interested in, to their mean value or, for factors, to their reference category. This allows us to see the effect of the predictor of interest on the outcome scale for the average sample, i.e., in our case, this tells us the effect of the algorithm on the average project. Worth mentioning in this context is that the effect of interest very often has broad uncertainty connected to its interpretation. This is due to the uncertainty propagated by the model, and, generally speaking, we consider it to be a good thing. The conditional effects plots in this paper show the median and 95% credible intervals. A summary of the conditional effects for all evaluation metrics is shown in Table 2.
4.1 RQ1: Predictive performance comparison
To answer the first research question, we collected seven evaluation metrics. We start with AUROC and EXAM as higher-level, averaging evaluation metrics. Figure 6 shows the effect of Bugspots compared to Linespots on the logit scale and the conditional effects of both algorithms with medians and 95% credible intervals. While there is overlap between the two algorithms on the outcome scale (b), the effect on the logit scale (a) shows that Bugspots produces lower AUROC values than Linespots. The uncertainty in the conditional effects is caused by the uncertainty propagated by the model, and not necessarily by the algorithm.
The EXAM score shows a similar picture in Figure 9. Again, the difference between Linespots and Bugspots on the logit scale is entirely positive, while there is overlap on the outcome scale. As in the previous example, this points towards Linespots producing lower EXAM scores than Bugspots.
AUCEC1 and AUCEC5 show clear negative effects for Bugspots on the link scale, with no overlap between the algorithms on the outcome scale. This is a clear indicator that Linespots consistently produces higher AUCEC1 and AUCEC5 values than Bugspots.
Next, we look at the very first parts of the result lists with the two $E_{inspect}@n$ evaluation metrics. The results for the first of them, see Figure 21, show some overlap with zero for the effect of Bugspots, while the outcome scale shows much overlap. This is probably caused by the zero-inflation that both algorithms produce, i.e., their overall low performance. However, Linespots can produce higher values than Bugspots and does so on average. The results for the second, see Figure 18, are similar, albeit more distinct. On the link scale, there is just barely some overlap in the tail, and the overlap on the outcome scale is less pronounced. There is still some uncertainty here, but the trend seems to be for Linespots to outperform Bugspots for this metric as well.
Finally, we compared the position of the first fault in the result list using the $E_{inspect}F$ scores. Again, the results in Figure 24 indicate that the effect of Bugspots on the link scale is entirely positive, and there is no overlap on the outcome scale. Linespots consistently produces lower $E_{inspect}F$ scores than Bugspots.
4.2 RQ2: Runtime comparison
The results of the runtime model need some additional explanation. Initially, we opted for designing relatively simple generalized linear models (GLMs). However, when we conducted posterior predictive checks, they indicated a poor fit. The main reason was that the posterior consisted of three modes (i.e., three distinct peaks in the posterior probability distribution).
Based on the results of the simple GLMs and our experience when developing and testing the algorithms, we assumed that two different phenomena caused the three modes. The first two modes would be the well-behaved samples that execute quickly for both algorithms. The third mode would be caused by samples that include big commits, likely with formatting changes or generated code across many files, for which generating and parsing the diffs takes substantially longer. Based on these assumptions, we built a mixture model for the runtime using two shifted-lognormal likelihoods. Looking at the effects in Figure 27, both components estimate Bugspots to have reduced runtime compared to Linespots, with no overlap. (The runtime effect sizes are the only ones with an identity link, meaning the effects are on the outcome scale.) The conditional effects show the same, with Linespots consistently having longer runtimes than Bugspots. This matches our expectation that Linespots should have a longer execution time. After all, Linespots does everything Bugspots does but adds diff generation and parsing on top of it.
Table 2 shows the summary of the model results concerning runtime performance. In the table, one will find a summary of the estimated conditional effects for all eight evaluation metrics for both algorithms. We were able to show that Linespots outperforms Bugspots consistently for different ranges of the result list while taking substantially longer to compute.
5 Threats to validity
As the results of this study are based on the confidence we have in the dataset, and especially the validation data we collected ourselves, it is necessary to ensure their accuracy. Recall that we added projects to our dataset that used stringent commit message guidelines and consistently enforced them. We call the group of samples from those projects ‘good’, while we call the group of samples from projects using no CMC ‘base’.
To ensure that our conclusions are not based on a subpar data collection procedure, we can compare the results of the good samples with those of the base samples. We also report the individual CMCs to show how each of them compares to the good and base groups.
Figure 30 shows a simple comparison for the EXAM score between the two CMC groups and the three individual CMCs. There are differences between the individual CMCs and the base group, especially for the discourse samples. But when combined into the good group, the differences average out and the result looks very similar to the base group. This is probably caused by the relatively small number of samples per CMC compared to the base group.
The similarity between the base and the good group does support our method of data collection. While we would expect more uncertainty in the good group compared to the base group based on the smaller sample size, compared to , respectively, the higher quality could cause the reduction. Nevertheless, we also built a model to investigate the effect of the CMC on the result, and the effect of the algorithm on the result.
Figure (a) shows the conditional effects of the CMCs on the EXAM score for Bugspots and Linespots, and while there is variation in the means and tails, the overlap between CMCs is considerable. The higher uncertainty in the good-quality groups could simply be caused by their small sample size. It is also easy to see that Linespots produces smaller EXAM scores in every case. The scale effect of the CMCs on the algorithm effect in Figure (b) shows similar behavior. Only the conventional commits convention is close to having a significant effect.
Another possible threat lies in the assumptions we made when developing the DAG. The two critical assumptions are that there is no incoming arrow for algorithm and only a single outgoing one, towards the evaluation metrics. We are confident that there are no incoming arrows to algorithm, as we ran every sample with both Bugspots and Linespots, so nothing influenced which sample used which algorithm. The second assumption would not hold if there were an unknown variable that the entity algorithm influences. We argue that the only output of the algorithms is the result list, which directly leads to all the performance-related evaluation metrics and the runtime of the algorithm itself. While there could be an interaction with the memory use of the algorithm, we assume it to be negligible unless it leads to out-of-memory problems.
Finally, when we checked the bad smells of software analytics by Menzies and Shepperd, we found that all of the frequentist statistics problems are non-issues due to our use of Bayesian data analysis and reporting of posteriors and credible intervals. Furthermore, by publishing our entire code, data, and complete replication packages [19, 35, 34], we actively prevent any of the presentation-related problems listed by Menzies and Shepperd.
Starting with the performance, Linespots does perform better than Bugspots for all evaluation metrics in the average case scenario. The summary for conditional effect sizes is shown in Table 2. Starting with AUROC and EXAM, as both are averaging across the entire result list, there is much overlap for the two algorithms on the outcome scale. As the effects on the scale, shown in Figures 6–9, have no overlap in both cases, the overlap on the outcome scale is caused by uncertainty propagated by the model. Both algorithms still perform poorly, and the absolute differences are small, with a increase in AUROC and a decrease in EXAM for Linespots. However, the relative improvements of Linespots over Bugspots are substantial, with a % increase in AUROC and a % decrease in EXAM.
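The distinction between small absolute and substantial relative differences can be illustrated with a short calculation. The sketch below uses made-up scores, not values from this study, and the function name is hypothetical.

```python
def absolute_and_relative_change(baseline, candidate):
    """Return (candidate - baseline, percentage change relative to baseline)."""
    if baseline == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    diff = candidate - baseline
    return diff, 100.0 * diff / baseline


# Hypothetical scores, chosen only for illustration (not results of the study).
diff, pct = absolute_and_relative_change(baseline=0.02, candidate=0.03)
print(f"absolute change: {diff:.3f}, relative change: {pct:.1f}%")
```

When the baseline is close to zero, even a tiny absolute change translates into a large percentage, which is why both views are reported side by side.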
Moving to AUCEC5 and AUCEC1, the differences between the algorithms become even clearer as Linespots outperforms Bugspots consistently in both measures (both on the and outcome scale) as shown in Figures 12–15. Here a trend starts to show: Linespots’ performance relative to Bugspots improves the further up the result list one measures. With an advantage of or % for AUCEC1, or % for AUCEC5 and with both effects being far from on the scale.
When looking at the very first entries on the result list with the and evaluation metrics, the picture becomes less certain again. While Linespots still outperforms Bugspots on average, with or % improved and or % improved on average, there is now some zero overlap in the tails on the scale and wide overlap on the outcome scale as Figures 18–21 show. We assume that this is due to both algorithms just not performing well enough to make the evaluation metrics reliable with such small values. There are many cases where both algorithms predict zero faults in the first or results. However, Linespots still performs better on average than Bugspots. As the zero overlap on the scale is only in the tails, we still conclude that Linespots offers better and performance than Bugspots.
Finally, we look at , which again shows a definite advantage to Linespots in Figure 24. Bugspots has a clear negative effect on the log scale and Linespots predicts the first fault lines before Bugspots on average, or in just % of the lines of what Bugspots needs. While lines on average is still a lot, it moves a lot closer to the lines threshold Long and Rinard  found to be of value.
Together, all seven evaluation metrics show clear improvements for Linespots over Bugspots in predictive performance. This holds both when averaged across all faults and, especially, for the critical early parts of the result list.
The second aspect that made Bugspots an interesting algorithm was that it was simple and easy to compute. As Linespots does more computations than Bugspots, we expect the runtime to increase, which is reflected in the results. Bugspots is many times faster than Linespots, even with the more complex effects from the mixture model. Moreover, the results of Linespots are somewhat bloated by a small number of very slow-running samples, which is a drawback of Linespots. Ultimately, we believe that the runtime of Linespots can be improved by refactoring the code with a particular emphasis on performance; however, it will likely never be as fast as Bugspots, and perhaps that is not even important.
While fast runtimes are usually desirable, there is some leeway in how long an analysis can take before it becomes a burden to developers. This heavily depends on where in the workflow something is done. Some things, like syntax highlighting, auto-completion, or simple linting, are usually expected to give the developer real-time feedback. A simple unit test suite could take a few seconds or minutes, while more complex tests or analyses can take a lot longer. The usual scenario for fault prediction algorithms is to focus code review or testing efforts. In both of these cases, the algorithm would run once, and the next steps would then be decided based on the result. For this kind of work, which could easily be added to a continuous integration pipeline, even the longest runtimes of a few minutes for Linespots would be feasible. This assessment is supported by the findings of Kochhar et al.  who showed that runtimes of under one minute are acceptable to more than % of developers.
7 Related Work
Before presenting specific studies, we want to point out that a comparison of results between studies can be complicated. One of the main reasons is that the datasets differ between studies, and without standardized implementations, the calculation of both fault prediction and evaluation metrics can deviate.
This can range from slightly different implementations of the same algorithm to problems like differing granularities. When Bugspots predicts a fault by proposing a file that potentially contains a fault, it is not clear how to compare that to Linespots predicting a fault by pointing to an individual line in a file. We tried to reduce these problems by mapping Bugspots’ results to lines, following Zou et al. .
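One plausible way to perform such a mapping is sketched below. It assumes that every line simply inherits the score of its file and that ties are broken by file path and line number; whether this matches the exact procedure of Zou et al. or of our implementation is an assumption, and all names are hypothetical.

```python
def map_file_scores_to_lines(file_scores, file_line_counts):
    """Expand a file-level ranking into a line-level ranking.

    Every line inherits its file's score (an assumption of this sketch).
    Files are ordered by descending score, ties by path; lines keep file order.
    """
    ranked = []
    ordered_files = sorted(file_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    for path, score in ordered_files:
        for line_no in range(1, file_line_counts[path] + 1):
            ranked.append((path, line_no, score))
    return ranked


def first_fault_position(ranked, faulty_lines):
    """Return the 1-based rank of the first faulty line, or None if none is listed."""
    for position, (path, line_no, _score) in enumerate(ranked, start=1):
        if (path, line_no) in faulty_lines:
            return position
    return None


# Hypothetical toy project: two files with made-up scores and sizes.
scores = {"a.py": 0.9, "b.py": 0.4}
sizes = {"a.py": 3, "b.py": 2}
ranking = map_file_scores_to_lines(scores, sizes)
position = first_fault_position(ranking, {("b.py", 2)})
```

Under this scheme, a fault in a large, highly ranked file can still sit far down the list, because every line of that file is listed before any line of the next file.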
When comparing our results to Zou et al. , we can see that not all of the results match. Starting with the runtime, Bugspots performed a lot faster on our test system than it did for them. However, this could simply be down to the hardware or implementation used. In both cases, the runtime is under a second. When comparing results, the average that Bugspots achieved in our study is not far from the by them. The difference can easily come from the dataset. When we ran a model with the projects we have in common, Bugspots did not find a single fault either.
However, when comparing EXAM results, our values are roughly half of what Zou et al.  report. This persists when running a model with just the common projects. We assume that this is due to the differences in validation data. While we have not tested how similar our approach is to the Defects4J set, we expect there to be more defects in our validation data. While it is not clear if those are false positives due to ambiguous commit messages, or true faults that are not part of Defects4J, it could explain why our study showed better EXAM results. If we assume that there are no significant differences between the two Bugspots implementations, and that the difference between Bugspots and Linespots observed in our study carries over to their results, Linespots would still rank between Bugspots and BugLocator in performance.
D’Ambros et al.  compared different fault prediction metrics, including the number of past faults metric, which Bugspots is based on. We expect Bugspots to perform better than the NFIX-ONLY metric, due to the added weight decay. The mean result of , by D’Ambros et al. , is within the range of what we consider reasonable when compared to the result of the AUCEC100 model (not reported due to redundancy with EXAM results). While not all of the differences can be attributed to the weight decay, as the studies also differ in the projects and granularities used, it allows us to put Linespots in perspective. The best performing fault prediction metric LDHH has a mean of , which is close to Linespots’ result in our study.
It is important to remember that we cannot simply compare these values, as they were gathered with different methods. However, the results indicate that Linespots is competitive with some of the better performing fault prediction metrics in the field and should serve as part of fault prediction models.
In this study, we proposed Linespots, a novel variant of the Bugspots fault prediction algorithm, and evaluated it by investigating two research questions.
RQ1: How does the predictive performance of Linespots compare to Bugspots? We found that Linespots outperforms Bugspots on all evaluation metrics regarding predictive performance, especially when focusing on the result lists’ earlier parts. This is important as developers and tools will only look at the very early parts of a result list.
While the overall performance is not good enough to be useful in a code review scenario, in our opinion, Linespots can serve as an improved baseline when evaluating new techniques and as an essential part of fault prediction models. We also agree with Lewis et al.  in their assessment that fault prediction is more suitable for focusing testing efforts than for code reviews. In our opinion, this will remain true until fault prediction can offer actionable results to developers.
RQ2: How does the runtime performance of Linespots compare to Bugspots? We found that Linespots can take a lot longer to run than Bugspots, as was somewhat expected. While the runtime might be too long in some extreme cases, we assume that projects of that size will already have substantial testing suites and continuous integration pipelines, so even a runtime of a few minutes might not affect the possibility of including Linespots in their setup. However, it might be interesting to investigate what exactly causes the big spikes in runtime and how much the runtime can be reduced. When looking at the comparison by Zou et al. , Linespots would still be substantially faster than most fault prediction models, which is why we conclude that it does fill a similar role to Bugspots with regard to runtime.
Based on the related work comparison and our findings, we argue that Linespots is a well-performing fault prediction metric. Combined with the performance improvements over Bugspots, we recommend using Linespots over Bugspots in all cases where real-time performance is not needed.
We want to thank the Stan (https://discourse.mc-stan.org) community for their valuable feedback and support during the analysis.
- (2007-11) Data mining techniques for building fault-proneness models in telecom Java software. In 18th IEEE International Symposium on Software Reliability (ISSRE ’07), pp. 215–224.
- (2010-01) A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. Journal of Systems and Software 83 (1), pp. 2–17.
- (2016) A learning-to-rank based fault localization approach using likely invariants. In 25th International Symposium on Software Testing and Analysis (ISSTA), Saarbrücken, Germany, pp. 177–188.
- (2020-02) Sampling in software engineering research: A critical review and guidelines. arXiv:2002.07764 [cs].
- (2017-08) Brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software 80 (1), pp. 1–28.
- (2017-01) Stan: A probabilistic programming language. Journal of Statistical Software 76 (1).
- (2010-05) An extensive comparison of bug prediction approaches. In 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), Cape Town, South Africa, pp. 31–41.
- (2012-08) Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empirical Software Engineering 17 (4-5), pp. 531–577.
- (2004-01) ROC graphs: Notes and practical considerations for data mining researchers. ReCALL 31, pp. 1–38.
- Implementation of simple bug prediction hotspot heuristic: igrigorik/bugspots.
- (2012-11) A systematic literature review on fault prediction performance in software engineering. IEEE Transactions on Software Engineering 38 (6), pp. 1276–1304.
- (2005-09) The top ten list: dynamic fault prediction. In 21st IEEE International Conference on Software Maintenance (ICSM’05), pp. 263–272.
- (2012-06) Bug prediction based on fine-grained module histories. In 34th International Conference on Software Engineering (ICSE), pp. 200–210.
- (2014-07) Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In 2014 International Symposium on Software Testing and Analysis, ISSTA 2014, San Jose, CA, USA, pp. 437–440.
- (2007-05) Predicting faults from cached history. In 29th International Conference on Software Engineering (ICSE’07), pp. 489–498.
- (2016-07) Practitioners’ expectations on automated fault localization. In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, Saarbrücken, Germany, pp. 165–176.
- (2013-05) Does bug prediction support human developers? Findings from a Google case study. In 35th International Conference on Software Engineering (ICSE), pp. 372–381.
- (2019) Evaluating software defect prediction performance: An updated benchmarking study. SSRN Electronic Journal.
- Linespots Reference Implementation. https://gitlab.com/sims1253/linespots-lib
- (2016-05) An analysis of the search spaces for generate and validate patch generation systems. In 38th International Conference on Software Engineering (ICSE), pp. 702–713.
- (2020-03) Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.
- (2011) Systematic editing: generating program transformations from an example. ACM SIGPLAN Notices 46 (6), pp. 14.
- (2019) “Bad smells” in software analytics papers. Information and Software Technology 112, pp. 35–47.
- (1986-11) An O(ND) difference algorithm and its variations. Algorithmica 1 (1-4), pp. 251–266.
- (2020-01) How different are different diff algorithms in Git?. Empirical Software Engineering 25 (1), pp. 790–823.
- (2011) Are automated debugging techniques actually helping programmers?. In International Symposium on Software Testing and Analysis (ISSTA), Toronto, Ontario, Canada, pp. 199.
- (2009) Causality: Models, reasoning and inference. Second edition, Cambridge University Press, New York, NY, USA.
- The seven tools of causal inference, with reflections on machine learning. Communications of the ACM 62 (3), pp. 54–60.
- (2016) Evaluating & improving fault localization techniques. University of Washington Department of Computer Science and Engineering, Seattle, WA, USA, Tech. Rep. UW-CSE-16-08-03, pp. 27.
- (2013) Software fault prediction metrics: A systematic literature review. Information and Software Technology 55 (8), pp. 1397–1418.
- (2014) Comparing static bug finders and statistical prediction. In 36th International Conference on Software Engineering - ICSE 2014, Hyderabad, India, pp. 424–434.
- (2011) BugCache for inspections: Hit or miss?. In 19th ACM SIGSOFT Symposium and 13th European Conference on Foundations of Software Engineering, ESEC/FSE ’11, New York, NY, USA, pp. 322–331.
- (2020) Toward a principled Bayesian workflow in cognitive science. Psychological Methods.
- Analysis repository. https://github.com/sims1253/linespots-analysis
- Replication package. https://github.com/sims1253/linespots-docker
- (2005) The PROMISE repository of software engineering databases. School of Information Technology and Engineering, University of Ottawa, Canada 24.
- (2016) A public bug database of GitHub projects and its application in bug prediction. In Computational Science and Its Applications – ICCSA, O. Gervasi, B. Murgante, S. Misra, A. M. A.C. Rocha, C. M. Torre, D. Taniar, B. O. Apduhan, E. Stankova, and S. Wang (Eds.), Vol. 9789, pp. 625–638.
- (2017) Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27, pp. 1413–1432.
- (2014-06) Version history, similar report, and structure: putting them together for improved bug localization. In 22nd International Conference on Program Comprehension, ICPC 2014, Hyderabad, India, pp. 53–63.
- (2019-03) Moving to a world beyond “”. The American Statistician 73 (sup1), pp. 1–19.
- (2008-04) A crosstab-based statistical method for effective fault localization. In 1st International Conference on Software Testing, Verification, and Validation, pp. 42–51.
- (2017-02) Improved bug localization based on code change histories and bug reports. Information and Software Technology 82, pp. 177–192.
- (2019) An empirical study of fault localization families and their combinations. IEEE Transactions on Software Engineering, pp. 1–1.