Maintaining a software project is a substantial exercise in the software development process (zhang2012empirical). Due to its complexity, it is often time-consuming and, therefore, expensive (marks2011studying; barbosa2019bulner). For example, the Mozilla project receives hundreds of bug reports per day, each of which needs manual investigation (thung2014buglocalizer).
Bug localization is a tedious activity in which a software developer searches for the part of the source code that contains the bug described in a bug report (wang2016amalgamplus). This is a difficult and possibly time-consuming task, especially in large projects (wang2016amalgamplus; thung2014buglocalizer; saha2013improving). Since bugs only affect a few files (lucia2012faults), Wang et al. (wang2014version) compared the problem of localizing this small number of buggy files to searching for the proverbial needle in a haystack.
There is a family of bug localization approaches (e.g., (wong2014boosting; wang2016amalgamplus; saha2013improving)) that propose information retrieval techniques to support software developers. In these approaches, an algorithm computes a ranked list of files that potentially contain the bug described in the bug report (lee). The files are sorted by the likelihood that they contain the bug.
However, a replication study by Lee et al. (lee) reveals that the current approaches all exhibit a similar average performance on a larger data set. Furthermore, their results show that there is a large variance in the performance between different projects, which leads to unstable results. Lee et al. (lee) argue that despite recent efforts in bug localization, precision and recall values between 35% and 50% may not be acceptable for software developers. Additionally, recent research also identified major weaknesses in the way information retrieval techniques are evaluated, e.g., because of noise in the data (kochhar2014potential). However, Mills et al. (mills2020relationship) show that it is possible to locate bugs with adequate accuracy using only the bug report. They stress the importance of the query formulation for locating the bug.
An aspect that is currently ignored by the bug localization literature is off-the-shelf search engines. Such search engines have been successfully applied to different domains, for example, the Semantic Web (lei2006semsearch), bioinformatics (letunic2012smart), and 3D modeling (funkhouser2003search). Search engines use information retrieval to automatically optimize queries to achieve high-quality search results. They combine different information retrieval techniques and are able to analyze query-based data sets (bialecki2012apache). General search engines like Apache Lucene (bialecki2012apache) and Elasticsearch (https://www.elastic.co/) are not developed for a particular domain, which indicates that they may also be suitable for bug localization.
Considering the lack of performance and stability of current bug localization techniques on the one hand, and the successful application of search engines in different domains on the other hand, we propose our new bug localization algorithm Broccoli, which combines the commonly used techniques for bug localization with an off-the-shelf search engine. One important property of search engines is that they internally reformulate queries for efficiency, which may be valuable for bug localization following the results of Mills et al. (mills2020relationship). In order to explore the impact of search engines, our study answers the following research question:
RQ1: How good are general search engines for bug localization?
Based on the expectations from the literature, we derived the following hypothesis with respect to our research question:
HT1: Search engines are able to improve the mean performance of bug localization.
We derived HT1 from the successful application of search engines in many fields (e.g., (lei2006semsearch; letunic2012smart; funkhouser2003search)). Search engines were developed by a community that is dedicated to information retrieval and they include years of incremental improvements in their implementation. From our perspective, this is similar to relational databases: while other databases are better for certain use cases, the years of optimizations that went into relational databases are hard to outperform. We study HT1 through the definition of our novel bug localization approach called Broccoli, which combines existing bug localization techniques with a search engine. We accept HT1 if Broccoli significantly outperforms existing bug localization approaches.
Through our work, we identified an issue within the benchmarking strategy by Lee et al. (lee) that could affect the results of experiments. When replicating the benchmark, we found that the bug locations often could not possibly be found. A closer investigation revealed that this was due to the decision to use only the state of the software at major revisions to generate the ranking of potential bug locations. If file names changed between the major revision and the reporting of the bug, finding this bug is impossible. Thus, the benchmark is not aware of the state of the software at the time of reporting; we refer to an evaluation that uses this state as a time-aware approach in the following. This finding led to our second research question.
RQ2: How does the usage of major releases affect the evaluation of bug localization approaches in comparison to a time-aware approach?
Based on our expectations, we derived the following hypothesis regarding this research question.
HT2: A time-aware approach can reduce the number of impossible-to-locate bugs, which leads to a higher mean performance of bug localization approaches.
We derived HT2 from the expectation that a time-aware bug localization that uses the state of the repository at the time a bug is reported not only provides a more realistic use case, but should also be more reliable when it comes to matching the bug descriptions to the files where the bugs are fixed, as these files may have changed since a release. Only few cases should remain where localizing the bug is not possible, e.g., when the bug is not fixed on the main development branch and the other branch uses different file names, or when the file name changed between the reporting of the bug and the bug fix. If our assumption holds that more bugs can be located, this should directly translate into an improvement of the performance values. Consequently, we accept HT2 if we observe a significant difference when we compare the major release benchmark to the time-aware benchmark.
Through our study of these research questions, we provide the following contributions to the state-of-the-art.
We extend the existing bug localization techniques with a search engine and find that this improves the results and that the search engine is the most important component among all aspects from the literature that we considered for bug localization.
We found that a time-aware bug localization leads to significantly different results such that benchmarks that do not account for the actual time of the bug report underestimate the performance of bug localization approaches.
The remainder of this paper is structured as follows. In Section 2, we summarize prior work on bug localization. Then, we present our bug localization approach Broccoli in Section 3. In Section 4, we present the experimental evaluation of our approach. In Section 5, we discuss the implications regarding our research questions. Afterwards, we discuss the threats to the validity of our results in Section 6 before we conclude in Section 7.
2. Related work
TFIDF-DHbPd (sisman2012incorporating): The first approach for bug localization used a simple matching based on term frequencies. While this work pioneered bug localization, the performance of this simple approach was consistently the worst in the related work. Thus, we do not consider it in our experiments.
BugLocator (zhou2012should): Zhou et al. compute a vector space model from the source code in their approach BugLocator. The location of a bug is calculated from two scores: 1) the score from a query against the revised vector space model and 2) the similarity to previous bug reports.
BLUiR+ (saha2013improving): The approach uses an abstract syntax tree to extract source code terms such as classes, methods, and variable names. Then, it employs an open source information retrieval toolkit to find potential bug locations. The authors further extend this with the detection of similar bug reports from BugLocator.
BRTracer+ (wong2014boosting): Based on the idea that bug reports often contain stack traces, the authors of BRTracer+ divide each source code file into a series of segments and try to match these to the stack trace. This is combined with the vector space model and a parameterized version of BugLocator that accounts for the different lengths of bug reports.
AmaLgam+ (wang2016amalgamplus): Wang et al. combined the following five information retrieval techniques in their approach AmaLgam+: version history, similar reports, structure, stack traces, and reporter information (wang2016amalgamplus). The version history component is based on the BugCache algorithm by Rahman et al. (rahman2011bugcache). The similar report component is adapted from BugLocator (zhou2012should). Moreover, AmaLgam+ uses the structure component of BLUiR+ (saha2013improving) to find package names, class names, or method declarations. Similarly, AmaLgam+ detects stack traces in bug reports using the approach of BLUiR+ (saha2013improving).
BLIA (youm2015bug): Youm et al. build their approach BLIA on the vector space model of BugLocator (zhou2012should). Moreover, BLIA tries to find similar bug reports in the bug repository. For the calculation of the similarity, Youm et al. use the bug comments of older reports as well as the bug summary and description. The similarity score is then computed via the cosine similarity. Additionally, BLIA analyzes the bug report for stack traces with the algorithm of BRTracer+ (wong2014boosting). In the last step of the analytic process, Youm et al. compute a score from the commit logs of the version history. Finally, all scores of the previous steps are combined by a linear expression with predefined control parameters.
Locus (wen2016locus): Wen et al. proposed Locus, which offers a finer granularity in bug localization. Internally, Locus creates two corpora to predict potential bug locations: NL (Natural Language) and CE (Code Entities). The NL corpus contains the natural language tokens extracted from the selected hunks. In contrast, the CE corpus contains only package names, class names, and method names. To calculate the ranked list of source code files, Locus constructs two queries for the two corpora (NL and CE). The final score is then computed using a linear combination of the two query results.
Blizzard (rahman2018improving): Rahman et al. proposed Blizzard, a modified technique for bug localization that performs query reformulation. The goal is to mitigate the noise in bug reports that inhibits natural language processing. Blizzard determines whether there are excessive program entities in a bug report (query) or not, and then applies appropriate reformulations to the query for bug localization.
Within this section, we summarize the prior work on bug localization. Table 1 lists and describes the previously suggested bug localization approaches from the literature. A notable strength of the bug localization literature is that new approaches usually extend a prior approach with an additional aspect. BLUiR+ and BRTracer+ both extend BugLocator. AmaLgam+ reuses components from BugLocator and BLUiR+. BLIA uses components from BugLocator and BRTracer+. Only the approaches Locus and Blizzard are mostly independent from the other approaches. In general, the information used for locating the bugs can be categorized into the following components according to the literature:
Similar reports: using the similarity between the text of a bug report and code files to identify possible locations.
Version history: exploitation of hot spots within the code, i.e., the observation that files that contained bugs in the past often also contain bugs in the future.
Structure: the direct identification of file names that may be affected through exploiting the frequent patterns of file names.
Stack trace: exploiting information from stack traces as source for the bug localization.
Reporter information: exploiting the assumption that the same reporter will use the same functionality of the software to find the package of the current bug.
Bug report comments: finding earlier bug reports by comparing the current bug with prior bugs, including their discussion. The location of the prior bug fixes can then be used to locate the current bug.
Table 2 summarizes which of the past approaches use which of these techniques and also how our approach Broccoli fits into this picture. Similar to the related work, we did not develop a new approach from scratch. Instead, we defined an approach that re-uses components from the state of the art and includes a search engine as an additional component.
In addition to the bug localization approaches, there are also general studies on the feasibility of bug localization. Kochhar et al. (kochhar2014potential) studied potential issues that could affect bug localization studies. They found that the biggest risk is that studies may overestimate the usefulness of bug localization, because some bugs are trivial to locate, as the exact bug location is already part of the bug report. Most bug localization approaches directly harness this information through the structure component. Thus, while this may be relevant for the overall evaluation of the usefulness of bug localization approaches, it should not have a strong effect on benchmarks. Moreover, while such bugs may be simple to find, we note that developers are likely to reject a bug localization approach that misses the simple cases, which means that they should also be considered in the evaluation.
Another general study was conducted by Mills et al. (mills2018bug; mills2020relationship), who investigated whether optimally formulated queries are able to find the locations of bugs. Through this, they studied whether the information from the bug report is, in theory, sufficient to find a bug, assuming the optimal query could be found. Their results demonstrate that the text from the bug report contains a sufficient amount of information. However, since the approach optimizes the queries using knowledge of the bug location, their study does not provide an approach that is feasible for the practical application of bug localization, which is why we cannot directly utilize their work in our comparison regarding the usefulness of search engines.
3. Broccoli: Using text search engines for bug localization
In this section, we introduce our bug localization approach Broccoli, which can be considered as an extension of AmaLgam+ (wang2016amalgamplus) with a text search engine. Figure 1
shows the design of Broccoli. First, we preprocess the available data and prepare the index of the search engine. Second, we use different approaches to determine the possible location of a bug, such as structural analysis, the identification of stack traces, source code risk analysis, and the search engine. Each of these components computes a score for every source file of the software repository. Third, we combine these scores into a single score as the final result. We use a random forest regression to aggregate the separate scores. The final output is a ranked list of bug locations that can be shown to developers.
3.1. Text Preprocessing
The goal of the preprocessing steps is to prepare the textual data of the bug reports and source code files for the analysis. This is done once at the beginning, because some components require the same preprocessing, e.g., the generation of an AST of a source code file. We apply three different preprocessing methods to the data, which are then used by the components of Broccoli.
First, we create the AST of all source code files. This enables components to easily extract entity names, such as class names and method names. Second, we apply several text processing steps to generate a harmonized representation of the bug descriptions and the source code files.
We convert all characters to lower case.
We drop all non-alphanumeric characters.
We remove English stop words, i.e., words that occur very often and are, therefore, not suitable for localization.
We remove keywords of the Java programming language. These can be considered as stop words within code, as they occur too often to be specific enough for localization.
We reduce the words to their word stem with a Porter stemmer (porter1980algorithm).
Third, we feed the issue reports, the source code files, and the entities from the AST parsing into the search engine to prepare search indices.
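The text processing steps above can be sketched as follows. This is a minimal illustration, not the actual Broccoli implementation: the stop word and Java keyword lists are truncated examples, and the suffix-stripping stemmer is a simplified stand-in for the Porter stemmer used in the paper.

```python
import re

# Truncated example lists; the real implementation uses the full
# English stop word list and all Java keywords.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "in", "when"}
JAVA_KEYWORDS = {"public", "class", "void", "static", "new", "return"}

def porter_like_stem(word):
    # Simplified suffix stripping as a stand-in for the Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # 1) lower case, 2) drop non-alphanumeric characters,
    # 3) remove stop words and Java keywords, 4) stem.
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    tokens = [t for t in text.split()
              if t not in STOP_WORDS and t not in JAVA_KEYWORDS]
    return [porter_like_stem(t) for t in tokens]
```

For example, `preprocess("Crash when parsing the config file")` yields the stemmed tokens `["crash", "pars", "config", "file"]`.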
3.2. Scoring components
Overall, Broccoli has seven scoring components that we describe in the following.
3.2.1. File size
Lines Of Code (LOC) is a metric that represents the number of lines in a source code file. Since the literature on defect prediction (e.g., (nagappan2005static; shihab2013lines)) suggests that there is a correlation between the size of code files and the number of defects they contain, we can exploit this property for the localization of bugs. Thus, we define
$\mathit{score}_{\mathit{size}}(f) = |\mathit{loc}(f)|$,
where $f$ is a file and $\mathit{loc}(f)$ are the non-empty lines of the file.
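As an illustration, the non-empty line count can be computed as follows (a sketch; the function name is ours, not from the replication kit):

```python
def score_size(source: str) -> int:
    """Number of non-empty lines in a source file's content."""
    return sum(1 for line in source.splitlines() if line.strip())
```

For example, `score_size("int x;\n\nreturn x;\n")` counts two non-empty lines.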
3.2.2. Structure
Our analysis of the structure is based on the approach suggested by Saha et al. (saha2013improving) as part of BLUiR. Similar to BLUiR, we match based on file names, class names, and method names. We infer the identifiers from the source code through the generation of the AST. We then search the bug summary and description for these identifiers to create three scores, i.e., one for matched file names, one for the class names, and one for the method declarations:
$\mathit{score}_{\mathit{file}}(f, b) = \mathit{occ}(\mathit{name}(f), b)$,
$\mathit{score}_{\mathit{class}}(f, b) = \sum_{c \in \mathit{classes}(f)} \mathit{occ}(c, b)$,
$\mathit{score}_{\mathit{method}}(f, b) = \sum_{m \in \mathit{methods}(f)} \mathit{occ}(m, b)$,
where $f$ is a file, $b$ is a bug report, $\mathit{name}(f)$ is the name of the file, and $\mathit{classes}(f)$ and $\mathit{methods}(f)$ are the names of all classes and methods defined within $f$. The function $\mathit{occ}$ evaluates how often the file name (with or without ending), respectively the defined classes and methods, occur within the bug report.
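A sketch of the occurrence counting follows. The tokenization and the handling of the file ending are simplifying assumptions on our part; Broccoli extracts the identifiers from the AST.

```python
import re

def occ(identifier: str, report_text: str) -> int:
    # How often an identifier occurs in the bug report text.
    return len(re.findall(re.escape(identifier), report_text))

def structure_scores(file_name, class_names, method_names, report_text):
    # Three scores: matched file name (with or without the .java ending),
    # matched class names, and matched method names.
    stem = file_name[:-len(".java")] if file_name.endswith(".java") else file_name
    return {
        "file": max(occ(file_name, report_text), occ(stem, report_text)),
        "class": sum(occ(c, report_text) for c in class_names),
        "method": sum(occ(m, report_text) for m in method_names),
    }
```

For a report mentioning `Parser.java` and `Parser.parse`, the file and class scores pick up both mentions while the method score only counts the explicit `parse` call.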
3.2.3. Stack trace
Bug reports may contain a stack trace in case an application crashed as a result of an unhandled exception. The stack trace contains detailed data about the location in the source code where the unhandled exception was generated, including the method names, class names, and line numbers, as well as the complete call stack. Since the stack trace is automatically generated, it contains reliable data that can be exploited for bug localization. However, the stack trace information is mixed with natural language. To detect stack traces in bug reports, we create an index which contains all Java classes and file names. Then, we use a regular expression that matches phrases containing ".java". These phrases are checked against the file index. This removes all file names that are not in the code base (anymore). As a result, we have a list of files that may contain the bug. However, a stack trace reveals more information than just the Java classes, since the root cause of the bug may lie in some methods which are used by the classes (wong2014boosting). We follow Wong et al. (wong2014boosting) and add all source code files that are directly mentioned in the bug report to the set $S_{\mathit{direct}}$. Then, we add all source code files from the import statements as well as all files from the same package to the set $S_{\mathit{indirect}}$. However, the relevance for the bug localization differs between these sets (wong2014boosting). Schröter et al. (schroter2010stack) found that only the first 10 frames of a stack trace are relevant for bug localization. Therefore, the stack trace score is calculated as
$\mathit{score}_{\mathit{stack}}(f) = 1/\mathit{rank}(f)$ if $f \in S_{\mathit{direct}}$ and $\mathit{rank}(f) \leq 10$, $0.1$ if $f \in S_{\mathit{indirect}}$, and $0$ otherwise,
where $f$ is a file and $\mathit{rank}(f)$ is the rank of the file within the stack trace (wong2014boosting).
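The frame extraction and scoring can be sketched as follows. The regular expression and the constant weight for indirectly reached files follow our reading of the approach by Wong et al. and are assumptions rather than the exact implementation.

```python
import re

# Matches frames like "at com.example.Foo.bar(Foo.java:42)".
FRAME_RE = re.compile(r"at\s+[\w.$]+\(([\w$]+\.java):\d+\)")

def stack_trace_score(report_text, code_base_files, indirect_files):
    """Score files found in a stack trace: 1/rank for the first 10
    frames that exist in the code base, a small constant (0.1) for
    files reached via imports or the same package."""
    scores = {}
    rank = 0
    for file_name in FRAME_RE.findall(report_text):
        if file_name not in code_base_files:
            continue  # drop files that are not in the code base (anymore)
        rank += 1
        if rank > 10:
            break  # only the first 10 frames are relevant
        scores.setdefault(file_name, 1.0 / rank)
    for file_name in indirect_files:
        scores.setdefault(file_name, 0.1)
    return scores
```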
3.2.4. Version history
The version history component is similar to the version history component of AmaLgam (wang2014version), which is based on the FixCache algorithm by Rahman et al. (rahman2011bugcache). The algorithm takes the commit history of the source files as input and calculates a score for each file. Only commits that fulfill one of the following two criteria are considered.
The commit message matches a regular expression that identifies bug fixing commits, e.g., by matching terms like "fix" or "bug".
The commit was made in the last $k$ days before the reporting date of the issue.
AmaLgam+ uses a fixed value of $k=15$. Since this value may be too low for projects with a low activity, we modified this condition: if there are less than 15 commits within the last $k$ days, we increment $k$ until we have at least 15 commits.
The version history score of each source file is then calculated as
$\mathit{score}_{\mathit{hist}}(f) = \sum_{c \in C : f \in c} \frac{1}{1+e^{-12 (1 - t_c/k) + 12}}$,
where $C$ is the relevant set of commits and $t_c$ is the number of days that have elapsed between the commit $c$ and the bug report.
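A sketch of this scoring follows. The exponential weighting (recent fix commits contribute close to 1, commits near the cutoff close to 0) reflects our reading of the underlying bug-prediction algorithm; the exact constants are an assumption on our part.

```python
import math

def history_score(commit_ages_days, k):
    """Version history score for one file, given the ages (in days)
    of the relevant commits that touched it."""
    score = 0.0
    for t in commit_ages_days:
        normalized = 1.0 - t / k  # 1 = today, 0 = k days ago
        score += 1.0 / (1.0 + math.exp(-12.0 * normalized + 12.0))
    return score
```

A file touched by a fix commit today scores clearly higher than one touched 14 days ago.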
3.2.5. Bug report similarity
Since similar errors may be reported multiple times, it is useful to find similar and already fixed bug reports (wang2016amalgamplus). The similar bug report component assigns scores to files from fixed bug reports that are similar to the current bug report. We adopt the technique of BugLocator (zhou2012should) and AmaLgam+ (wang2016amalgamplus). The bug report similarity considers all fixed bug reports that have been closed before the new bug report is submitted. We use the bug summary and description to find similar bug reports. Youm et al. (youm2015bug) show that the comments of previous bug reports can improve the localization. Therefore, we add the comments to the corpus. Then, we compute a similarity score $\mathit{sim}(b, b')$ between each prior bug report $b'$ and the current report $b$.
The score for a source code file $f$ can then be computed as
$\mathit{score}_{\mathit{sim}}(f, b) = \sum_{b' \in B : f \in \mathit{fix}(b')} \frac{\mathit{sim}(b, b')}{|\mathit{fix}(b')|}$,
where $B$ is the set of previous bug reports from the database and $\mathit{fix}(b')$ is the set of files that are modified to fix the bug report $b'$. The bug report similarity can be computed in several ways. We use the following measures to define $\mathit{sim}$: cosine similarity, reporter similarity, and similarity using a search engine. The cosine similarity is defined as
$\mathit{sim}_{\mathit{cos}}(b, b') = \frac{a \cdot a'}{\|a\| \, \|a'\|}$,
where $a$ and $a'$ are the vectors with the term frequencies of the bug reports $b$ and $b'$. The reporter similarity assigns a higher similarity to prior bug reports that were submitted by the same reporter.
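The cosine similarity over term-frequency vectors can be sketched as follows (a minimal illustration on token lists):

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists via term frequencies."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two reports sharing one of two terms each yield a similarity of 0.5; identical reports yield 1.0.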
This may help to locate the bug, since a user may focus on a single component or part of a software system, especially if the software system is large and provides several functionalities (wang2016amalgamplus).
3.2.6. Search engine
Search engines use indices to find relevant documents based on a search query. Internally, the search engine uses information retrieval techniques, including but not limited to stemming, lemmatization, and word embeddings (bialecki2012apache). The main contribution of our bug localization approach is the use of such a search engine as part of the bug localization process.
We use the search engine in two ways. First, we utilize the search engine directly on the source code files. We create a document from each source code file. For each document, we generate three fields: the textual content of the source file, the path of the file, and the method names within the source file, which are extracted as described in Section 3.1. We then conduct three queries to calculate scores: we calculate $\mathit{score}_{\mathit{content}}$, $\mathit{score}_{\mathit{method}}$, and $\mathit{score}_{\mathit{path}}$ by querying the content, method, and path field, respectively, with the preprocessed summary and description of the bug (see Section 3.1). The $\mathit{score}_{\mathit{path}}$ is only calculated if the bug summary or description contain the substring ".java" or "/", i.e., if we find an indication for file names within the description. Otherwise, we set this score to 0.
The $\mathit{score}_{\mathit{content}}$ is the basic application of a search engine for bug localization, while the other two scores highlight semantic properties of the file that are often important, i.e., the contained methods and the path of the file.
The second way we use a search engine is to find similar bug reports. A search engine is able to present a more accurate result, since it goes beyond cosine similarity for text matching. We create a document for each bug report with three fields: the closing date, the summary, and the content. We query this search index twice: we use the bug summary to calculate a summary-based similarity and the bug description to calculate a description-based similarity. The closing date field is used to define a condition that excludes all documents for bugs that were not fixed at the time of the creation of the bug that should be localized. The query evaluates the summary and the content field at once, i.e., we let the search engine combine the results of the search on the two fields. We then use these similarities in the same way as described in Section 3.2.5 to calculate the score according to Equation (7).
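To make this concrete, such a request could look as follows in Elasticsearch's query DSL. This is a sketch: the field names (`summary`, `content`, `closing_date`) are our assumptions, and we only build the request body here rather than talk to a server.

```python
def build_similar_reports_query(summary, description, report_created_at):
    # Full-text match on the summary and content fields at once,
    # restricted to bug reports that were already closed when the
    # new report was created.
    return {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": summary + " " + description,
                        "fields": ["summary", "content"],
                    }
                },
                "filter": {
                    "range": {"closing_date": {"lt": report_created_at}}
                },
            }
        }
    }
```

The `filter` clause implements the closing-date condition without influencing the relevance score, which the search engine computes from the `multi_match` part.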
3.3. Scoring: Combination of components
We described in Section 3.2 how different approaches can be used to create scores for each file that indicate their likelihood to contain the bug. We now combine these scores into a single score. The files can then be sorted by the resulting score in descending order to provide possible bug locations to the software developer.
In comparison to the related work that used linear combinations to combine scores (wang2016amalgamplus; wong2014boosting), we use a random forest regression to determine the final scores. Since a random forest is an ensemble of decision trees, it can also capture non-linear relationships (breiman2001random). The training data can be collected from past closed bug reports and their linked commits. All files that are modified in the linked commits have a 1 in the result column, 0 otherwise.
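The construction of the training data can be sketched as follows; feeding the resulting matrix into a random forest regressor (e.g., scikit-learn's `RandomForestRegressor`) is then straightforward. The data layout is an illustrative assumption.

```python
def build_training_data(component_scores, fixed_files):
    """One row per (bug report, file) pair: the per-component scores
    as features, and a label of 1 if the file was modified by a commit
    linked to the report, 0 otherwise."""
    X, y = [], []
    for report_id, scores_per_file in component_scores.items():
        fixed = fixed_files.get(report_id, set())
        for file_name, scores in scores_per_file.items():
            X.append(scores)  # e.g. [size, structure, stack, history, ...]
            y.append(1 if file_name in fixed else 0)
    return X, y
```

At prediction time, the regressor's output per file is used directly as the final score for ranking.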
4. Experiments
We now describe the experiments we conducted to compare the bug localization approaches and to explore how the different scores affect the results of Broccoli. The experiments are based on the benchmarking tool and data of Bench4BL (lee). We extended this benchmark with a second data set and also evaluate how information leakage due to removed or renamed files affects the results. The complete code to conduct our experiments is provided as a replication package (https://github.com/benjaminLedel/broccoli_replicationkit). In the following, we describe the data, the baselines for a comparison with Broccoli, the performance criteria, the evaluation methodology, and the results.
4.1. Data
We use two data sources to prepare three data sets for our experiments. The first data set was published by Lee et al. (lee) as part of Bench4BL and consists of 51 open source projects written in Java from the Apache Foundation (http://www.apache.org), Spring (http://spring.io), and JBoss (http://www.jboss.org). Table 3 gives an overview of the data. We include all source files as well as the test files from the projects.
The second data set is extracted from the SmartSHARK database (trautsch2021smartshark). The data set consists of a convenience sample of projects from the Apache Software Foundation. We used all 38 projects with manually validated issue types to ensure that our data contains only bug reports (herbold2020issues); this excludes, e.g., feature requests. Thus, this second data set avoids problems due to mislabeled issue types described by Kochhar et al. (kochhar2014potential). Table 4 gives an overview of the SmartSHARK data. There is an overlap of seven projects with the Bench4BL data, which we highlight in the table. However, the data for these projects are not identical, due to the different times of data collection and the manual validation by Herbold et al. (herbold2020issues). We followed the approach from the Bench4BL data and used the latest release before the bug was reported to collect the files to which the bug reports could potentially be matched.
The third data set contains all projects from the SmartSHARK data source. In addition to the second data set, this data set also includes the projects marked in italics. In our experiments, we use this data set to investigate the influence of different matching strategies. Thus, we apply our time-aware matching strategy, because the major release matching as used in the state of the art has some disadvantages. In a practical application, the date of a release usually does not match the date of the bug report creation. Moreover, the bug could even refer to a yet unreleased version of a project. In case new behavior was added, e.g., new GUI components, the bug description may not make sense if the latest release is matched based on the reporting date. However, this is the case in the matching strategy of Lee et al. (lee), who suggest to use the code base at the time of the latest release. To mitigate this issue, we propose the following time-aware matching rules:
Check the availability of the fixed versions field of the bug report:
if the field is available, execute the bug localization on each of the fixed versions.
otherwise, execute the bug localization against all prior versions that contain the modified files.
Calculate the score as
$\mathit{score}(b) = \frac{1}{n} \sum_{i=1}^{n} s_i(b)$,
where $b$ is the bug report under investigation, $s_i(b)$ is the subscore of the bug localization algorithm for the $i$-th matched version, and $n$ is the total number of versions for the bug report.
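The two matching rules and the aggregation can be sketched as follows. The data layout and the mean aggregation are our reading of the elided details, not a definitive implementation.

```python
def time_aware_versions(report, all_versions):
    # Rule 1: if the fixed-versions field is set, use those versions.
    if report.get("fixed_versions"):
        return report["fixed_versions"]
    # Rule 2: otherwise use all prior versions that contain the
    # modified files (all_versions maps version -> contained files).
    return [v for v in all_versions
            if set(report["modified_files"]) <= set(all_versions[v])]

def aggregate_score(subscores):
    # Combine the per-version subscores into a single score (mean).
    return sum(subscores) / len(subscores) if subscores else 0.0
```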
4.2. Baselines
We use the state of the art of bug localization as baselines for the evaluation of Broccoli. Thus, we compare Broccoli to BugLocator (zhou2012should), BLUiR (saha2013improving), BRTracer+ (wong2014boosting), AmaLgam (wang2014version), BLIA (youm2015bug), Locus (wen2016locus), and Blizzard (rahman2018improving). Where possible, we re-use the implementations from Bench4BL to avoid potential issues due to implementation errors. For BLUiR+ and AmaLgam+, we extended the implementation of Bench4BL with the bug report similarity, as Bench4BL contained only BLUiR and AmaLgam, which do not use these components. Because the publications do not indicate how the scores from these components are weighted, we assume equal weighting. Since Blizzard is not available in Bench4BL, we re-use the implementation the authors provided as part of their replication kit (masud_rahman_2018_1297907). We use the parameters specified in the original work of the bug localization techniques.
4.3. Performance metrics
We use two performance metrics to evaluate the quality of the estimated bug locations, again following the Bench4BL benchmark (lee).
The Mean Average Precision (MAP) considers the ranks of all buggy files (lee). Therefore, MAP prioritizes recall over precision and is mostly relevant in scenarios where the software developer analyzes the whole ranked list to find many relevant results or buggy files (wang2016amalgamplus). The average precision of a single query is computed as
$\mathit{AP} = \frac{\sum_{k=1}^{M} P(k) \times \mathit{pos}(k)}{\sum_{k=1}^{M} \mathit{pos}(k)}$,
where $M$ is the number of ranked files and $k$ is the rank in the ranked file list. $\mathit{pos}(k)$ is 1 if the file at rank $k$ is included in the fixed file list, otherwise $\mathit{pos}(k)$ is 0. $P(k)$ is the precision at rank $k$ and computed as
$P(k) = \frac{\sum_{i=1}^{k} \mathit{pos}(i)}{k}$.
Using the average precision, MAP can be computed as
$\mathit{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathit{AP}(q)$,
where $Q$ is the set of queries, i.e., bug reports.
The Mean Reciprocal Rank (MRR) considers the rank of the first buggy file. The rank of this file is called the reciprocal rank for that query (wang2016amalgamplus). MRR is the mean of the reciprocal ranks over all queries and is computed as
$\mathit{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathit{rank}_q}$,
where $\mathit{rank}_q$ is the rank of the first buggy file for query $q$.
The MRR value is based on the principle that a software developer will look at each hit until the first relevant document appears (lee). The MRR value is identical to MAP in cases where each query has exactly one relevant buggy file (Craswell2009).
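The two metrics can be computed as follows (a sketch over binary relevance vectors, where entry k is 1 if the file at rank k+1 is buggy):

```python
def average_precision(relevance):
    """Average precision for one ranked list."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)  # P(k) at each buggy file
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(relevance_lists):
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)

def mean_reciprocal_rank(relevance_lists):
    """Mean of 1/rank of the first buggy file per query."""
    rr = []
    for relevance in relevance_lists:
        rank = next((k for k, rel in enumerate(relevance, 1) if rel), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)
```

For a single query with buggy files at ranks 1 and 3, the average precision is (1 + 2/3) / 2 = 5/6.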
4.4. Methodology
Our general methodology for the experiments consists of three phases. In Phase 1, we conduct a leave-one-project-out cross-validation experiment with the Bench4BL data. For this, we use one project as test data and all other projects as training data. We then run the bug localization with Broccoli and our baselines and measure the MAP and MRR. Thus, the first phase is a replication of the Bench4BL benchmark, with the addition of Broccoli.
In Phase 2, we use the Bench4BL data to train a prediction model and evaluate how well this model generalizes to other data on the SmartSHARK data without the projects that are already part of Bench4BL. This design allows us to evaluate the performance of bug localization in a realistic scenario, where a model trained by a vendor is deployed for bug localization within products without local re-training or data collection at the customer site.
In Phase 3, we use the time-aware variant of the SmartSHARK data. Through this, we evaluate the performance of the bug localization in a more realistic setting. The comparison of these results with the results from Phase 2 allows us to see if the additional effort to create the time-aware data is relevant for benchmarking bug localization approaches, or if the version matching is sufficient.
In all three phases, we train Broccoli's Random Forest with 1,000 trees. The number of trees is selected as a reasonable tradeoff between training time and prediction performance (oshiro2012many). We evaluate the results of all three phases following the guidelines for the comparison of classifiers by Benavoli et al. (benavoli2017time) as implemented in the Autorank package (Herbold2020). These guidelines are an updated version of the popular guidelines by Demsar (demvsar2006statistical). In comparison to the original guidelines, the authors suggest using a Bayesian approach for the statistical analysis instead of a frequentist approach. Following the guidelines by Benavoli et al. (benavoli2017time), we use the Bayesian signed rank test (benavoli2014bayesian) with a uniform prior. We use the Shapiro-Wilk test (razali2011power) to determine if the data is normal. For normal data, we follow Kruschke and Liddell (kruschke2018bayesian) and set the Region of Practical Equivalence (ROPE) as d = 0.1, where d is the effect size according to Cohen (Cohen's d) (cohen2013statistical). For two populations A and B, the effect size is defined as

d = (mean(A) − mean(B)) / s_pooled,

where

s_pooled = sqrt(((n_A − 1) · s_A² + (n_B − 1) · s_B²) / (n_A + n_B − 2))

is the pooled standard deviation, s_A and s_B are the standard deviations, and n_A and n_B the number of samples of the populations A and B. In our case, the populations are the results for a bug localization approach and the sample size is the number of projects to which we apply the approach. For brevity, we omit how Autorank treats non-normal data; as reported later, all our results are normal.
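The effect size computation above can be sketched directly from the formula; the sample values are hypothetical:

```python
import math

def cohens_d(a, b):
    """Cohen's d with the pooled standard deviation of two samples."""
    n_a, n_b = len(a), len(b)
    mean_a = sum(a) / n_a
    mean_b = sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    s_pooled = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b)
                         / (n_a + n_b - 2))
    return (mean_a - mean_b) / s_pooled

print(cohens_d([0.5, 0.6, 0.7], [0.4, 0.5, 0.6]))  # ≈ 1.0 for this toy data
```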
The ROPE defines a region around the mean value in which differences are considered so small that they have no practical impact. Setting the ROPE to d = 0.1 means that differences that are not even half as large as a small effect (d = 0.2) are considered to have no practical impact and are, therefore, considered as practically equivalent. For two approaches A and B, the Bayesian signed rank test determines the posterior probabilities p(A > B), p(A = B), and p(B > A); they describe the probability that A is larger, that the populations are practically equivalent, or that B is larger. Following Benavoli et al. (benavoli2017time), we accept the hypothesis that a population A is larger/equal/smaller than a population B if p(A > B) ≥ 95% / p(A = B) ≥ 95% / p(B > A) ≥ 95%. If none of the three probabilities is larger than 95%, the result of the comparison is inconclusive. We apply the Bayesian signed rank test to all pairs of approaches in order to determine a ranking. We report the mean value, standard deviation, and confidence interval of the mean value for all populations. We use Bonferroni correction (napierala2012bonferroni) for the calculation of the Shapiro-Wilk test and the confidence intervals to ensure a family-wise confidence level of 95%. Additionally, we rank the populations by the mean value and report the effect size in comparison to the best ranked population, as well as the probability of being practically equivalent to or smaller than the best ranked population, and the decision made based on the posterior probabilities.
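The decision rule on the posterior probabilities can be expressed as a small function; the probability values in the example are hypothetical:

```python
def decide(p_a_larger, p_equal, p_b_larger, threshold=0.95):
    """Decision rule on the posteriors of the Bayesian signed rank test:
    a hypothesis is accepted only if its posterior probability reaches
    the threshold; otherwise the comparison is inconclusive."""
    if p_a_larger >= threshold:
        return "A larger"
    if p_equal >= threshold:
        return "practically equivalent"
    if p_b_larger >= threshold:
        return "B larger"
    return "inconclusive"

print(decide(0.97, 0.02, 0.01))  # "A larger"
print(decide(0.60, 0.30, 0.10))  # "inconclusive"
```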
In this section, we present the results of our empirical study.
4.5.1. Results for Phase 1
[Table 5 — columns: Approach, Mean MAP, Best MAP, Mean MRR, Best MRR]
Table 5 and Figure 2 summarize the results of the first phase, i.e., our replication of the Bench4BL benchmark protocol. We observe that the overall performance is relatively uncertain, i.e., the confidence intervals of the mean performance are relatively large; the violin plots in Figure 2 further demonstrate the spread of the results. Regardless, Broccoli achieves the best mean MAP and MRR. The difference to Locus is not significant in either metric. However, the Bayesian signed rank test indicates that there is a probability of about 90% that the mean value of Broccoli is larger than that of Locus. Thus, while not (yet) significant, this provides a strong indication that the missing significance is only due to the large standard deviation and the small amount of data. For all other approaches, Broccoli achieves a significantly higher MAP. With MRR, the difference to BRTracer+, BugLocator, and BLIA is not significant. We observe that, with the exception of Blizzard, the effect sizes are relatively small. Additionally, we observe that Broccoli is the best approach for about half of the projects, while the other approaches only sporadically rank first. This indicates that while there is also a high uncertainty in the results of Broccoli, there seems to be a small but consistent mean improvement with Broccoli, and that Broccoli mostly fails to rank first only when another approach has a positive outlier. Detailed results for each project are in Appendix A.
4.5.2. Results for Phase 2
[Table 6 — columns: Approach, Mean MAP, Best MAP, Mean MRR, Best MRR]
Table 6 and Figure 3 summarize the results of the second phase, i.e., the generalization of a bug localization model that was created using Bench4BL applied to the SmartSHARK data. Similar to the first phase, all performance estimates have a high variance and Broccoli achieves the best mean MAP. With MAP, Broccoli dominates the other approaches with the best result for 16 of the 31 projects. The difference to the other approaches is significant. The results for MRR diverge from the results on the Bench4BL data due to a very good performance of BRTracer+, which has the best mean performance and yields the best overall result on 16 projects. BLIA is relatively similar to Broccoli, with a slightly lower mean performance but more first-ranked results. Both BRTracer+ and BLIA are not significantly different from Broccoli. Overall, this indicates that Broccoli is consistently and significantly better with a medium effect with respect to MAP and that there is no improvement over the state of the art with respect to MRR.
4.5.3. Results for Phase 3
In the third phase of the experiment, we evaluate the impact of time-awareness, i.e., we evaluate the possible noise in the data by determining the number of impossible matches. Table 7 summarizes how the data is affected. Overall, there are 18,690 files that are modified for the correction of 5,190 bugs. With the approach of using the state of the project at major releases, as suggested by Bench4BL and as we used so far, only 14,830 files are detectable; for the remaining 4,160 the localization is impossible, because the files that were corrected did not exist at the time of the last major release, when the data was collected. For 562 bug reports, none of the affected files can be detected. Thus, for about 10.8% of the bugs the localization is impossible with this approach.
This changes with the time-aware strategy: the number of undetectable files drops from 4,160 to 2,556, which means that about 39% of the previously undetectable files can now be found within the repository. Due to this improvement, there are only 130 bugs left for which none of the affected files can be detected. Hence, the time-aware approach reduces the share of impossible-to-locate bugs from 10.8% to 0.3%.
|       | Multiversion approach | Time-aware approach |
|-------|-----------------------|---------------------|
| Files | 4,160 (22.6%)         | 2,556 (13.7%)       |
| Bugs  | 562 (10.8%)           | 130 (0.3%)          |
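The counting of undetectable files and impossible-to-locate bugs can be sketched as follows; the bug records and the `search_space_for` lookup into the matched source code version are hypothetical stand-ins for the actual repository data:

```python
def count_undetectable(bugs, search_space_for):
    """Count fixed files missing from the matched source code version,
    and bugs for which no affected file is detectable at all."""
    missing_files = 0
    impossible_bugs = 0
    for bug in bugs:
        space = search_space_for(bug)  # files in the matched version
        missing = [f for f in bug["fixed_files"] if f not in space]
        missing_files += len(missing)
        if len(missing) == len(bug["fixed_files"]):
            impossible_bugs += 1       # no affected file is in the search space
    return missing_files, impossible_bugs
```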
[Table 8 — columns: Approach, Mean MAP, Best MAP, Mean MRR, Best MRR]
Table 8 summarizes how the evaluation changes with the more accurate time-aware data. The results show that the time-aware performance is significantly different: the performance is significantly higher with an effect size of about 0.56 for both MAP and MRR. Figure 4 provides a detailed view of the deviations. We observe that while the performance estimate is sometimes accurate (on the diagonal), there are also many cases where the release approach underestimates the performance, and there are no cases where the performance is overestimated by the release approach.
4.5.4. Results for the Feature Importance
Figure 5 illustrates the feature permutation importance of the Random Forest used by Broccoli. We observe that there is almost no difference between the two phases, i.e., the results are stable across data sets. The search engine score on the file content is the most relevant feature in both. The next two features are almost equally important. All other scores play a relatively minor role.
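A permutation importance analysis of this kind can be sketched with scikit-learn; the data here is synthetic, as Broccoli's actual component scores are not reproduced in this example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)  # only the first feature carries signal

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Shuffle each feature in turn and measure the drop in model score.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)   # the first feature dominates
```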
In this section, we discuss the results of our experiments and the practicality of Broccoli, and provide recommendations for researchers and practitioners.
5.1. Search engines for bug localization
In summary, we accept HT1, since Broccoli performed significantly better than the other approaches in 3 of 4 experiments and is in the leading group in the remaining case. Thus, the additional search engine reveals information from the bug report and the source code that is not provided by the other bug localization techniques. A general search engine can be used for locating potential bug locations. Using the search engine without common bug localization techniques like stack trace detection, the performance would be lower compared to the state-of-the-art techniques. However, the MRR can only be increased in one of two experiments. Considering the intuition behind the metrics, the search engine is able to provide the software developer with more relevant matches in the top-10 hits, while the first match is at least at the same position compared to the other approaches. Overall, adding a search engine can only improve the results of the bug localization.
We validated our findings by calculating the feature importance on both data sets. The search engine component that uses the content of our source code files is the most relevant feature. This underlines that the search engine is able to find bug locations and, furthermore, that it is the most relevant component in the Random Forest. Thus, the search engine can locate bugs more precisely than previous approaches. However, the feature importance of other components can be reduced by the search engine component if the search engine captures the same information. For example, the search engine finds class, method, and file names better than the regular expression. This indicates that the search engine is a better feature than the regular expression for class, method, and file name detection.
A potential reason for the different results in the first and second phase of our experiment can be a disparate data collection technique. Lee et al. (lee) emphasized the importance of the correct source code selection and the quality of the data set for benchmarking. In the second data set, we used manually validated data of the SmartSHARK data set, which raised the quality of the data. Other studies underline that wrongly classified bug reports or bloated ground truths do not statistically significantly impact the bug localization results (kochhar2014potential; mills2020relationship). Nevertheless, both data sets contain different projects. The way bug reports are managed could also influence the performance of the bug localization, for example, if the bug reports in one project contain many stack traces or class names that help to find the bug location.
RQ1 and HT1: We accept HT1 that a search engine is able to improve the mean performance of the bug localization. However, the improvement is only significant for MAP. For MRR, the performance is comparable.
5.2. Impact of Time-awareness
Considering the data and Figure 4, it is noticeable that for some projects in which the time-aware matching outperforms the major version matching strategy, the time-aware matching values are substantially greater. However, if the release matching strategy outperforms the time-aware matching, the differences between the two values are usually rather small. A possible explanation could be the number of bug reports that are corrected and how the fixed files of the bug reports are divided among the different source code versions.
In summary, the statistical analysis and the results of the two experiments indicate that we can accept HT2, because we can reduce the number of non-detectable source code files of bug reports. The results indicate that the usage of major releases is a lower bound for the performance of the bug localization approach in an application. Therefore, the strategy of using major releases is a good indicator for the practical performance of the bug localization approach. The exact number differs for each project and depends, for example, on the number of release versions and the bug distribution. It is noticeable that factors like massive refactoring or package renaming influence the benchmarking dramatically. The main disadvantage of the time-aware strategy is the runtime. In our experiment on the SmartSHARK data, the runtime increased by a factor of 3–4. However, this is only a problem for researchers, as there would be no need for similar considerations in an actual deployment of a bug localization model in practice.
RQ2 and HT2: We can accept HT2, since there is a significant increase in the performance metrics MAP and MRR.
5.3. Recommendations for Researchers
Our results show that there are disadvantages in the commonly used data for bug localization benchmarking. However, other studies emphasize that wrongly classified bug reports or bloated ground truths do not impact the bug localization results in a statistically significant way (kochhar2014potential; mills2020relationship). We recommend that potential future work should investigate:
Number of features: Figure 5 shows that the version history component rather overfits to a project than being a bug localization technique that should be used in general. A future bug localization approach could try to reduce the number of components in the Random Forest to only the most relevant components. This follows the principle of simplicity.
Relevance of MAP and MRR: The literature does not specify which metric is more relevant for the quality of the result when a software developer locates bugs. A user study would be helpful to investigate the relevance of the metrics and to rank future approaches by a more practically oriented metric.
Extend the time-aware strategy: Due to computational complexity, we were not able to use all revisions of the software project as a potential search space. Thus, the time-aware matching strategy was not able to find all files mentioned in the bug reports. However, to use a more realistic approach for benchmarking, we recommend using any revision prior to the bug report date as a potential bug location. Extending this idea, it would be helpful to track file name changes or at least provide unified identifiers for each file in a project.
5.4. Recommendations for Practitioners
Generally, the results of our experiment show that none of the algorithms can be selected without additional knowledge of the project in order to efficiently locate a bug. Each of the algorithms depends on the availability of historical data and on how a bug report is typically structured. In detail, the result of our second phase shows that the quality of a bug report is relevant for the software developer as well as for the bug localization approach. We recommend that practitioners consider the following ideas in their operationalizations of bug localization:
Bug report quality: As indicated by our results, a good structure and as much information as possible about the bug in the report directly correlate with the quality of the bug localization result. For example, stack traces or method names improved the result of the bug localization.
Project specific adjustment: The scoring part of the bug localization approach should be trained directly on the project data set. This helps to adjust the bug localization algorithm to the project by selecting relevant components. Additionally, we recommend selecting the relevant branches of the software project beforehand to reduce runtime issues while still providing a large, searchable corpus.
6. Threats to Validity
In this section, we evaluate the threats to validity of our experiment. We distinguish between construct, internal, and external validity.
6.1. Construct Validity
The main threat to construct validity lies in the implementation of the other algorithms and the metrics we used to compare the results of our experiments. We tried to reduce this threat by using a replication kit and the implementations from the benchmark study Bench4BL (lee). All implementations of the approaches were taken from the study, except Blizzard, for which the authors provided their own replication kit (masud_rahman_2018_1297907). We carefully reviewed the replication kits and only refactored the source code to fit in our benchmark tool, without changing the functionality of the algorithms. Moreover, we used two relevant information retrieval metrics, which have already been used by many past bug localization approaches, for example (lee; rahman2018improving; wen2016locus). Furthermore, the benchmark Bench4BL (lee) uses these metrics to compare the approaches. Hence, we believe to have reduced the threat to construct validity to a minimum.
6.2. Internal Validity
The main threat to internal validity lies in the power of the statistical procedures used to measure the differences between the approaches. However, this would not affect the general results of our work, but rather the ranking of the approaches. Moreover, there may be some noise in the data due to wrong links between bug fixing commits and bug issues. However, these links were manually validated for the SmartSHARK data, and the authors of Bench4BL designed their data collection to err on the side of caution and avoid wrong links. Similarly, not all issues reported as bugs are actually bugs (Herzig2013). While the types of issues in the SmartSHARK data were validated, no such manual validation was conducted for the Bench4BL data. Kochhar et al. (kochhar2014potential) showed that this may have a small but significant impact on our results for the Bench4BL data.
6.3. External Validity
The main threat to external validity lies in the two data sets that we used in our experiments. The data sets contain only open source projects and are restricted to the Java programming language. To reduce this influence, we selected a large number of open source projects. Overall, we executed our benchmark tool on 82 open source projects, which is currently the largest data set in the area of bug localization. Therefore, the empirical results should not be too specific. Moreover, Ma et al. (ma2013commercial) outlined that many issues in open source software projects are reported by commercial developers. Therefore, the performance of the bug localization approaches should be similar on closed source projects. The programming language restriction remains a threat, as our results may not generalize to other programming languages.
7. Conclusion
Our contribution consists of three parts: First, we presented Broccoli, a new bug localization approach that uses an off-the-shelf search engine to improve the overall performance. Second, we benchmarked eight different bug localization approaches using data from 82 open source projects. Third, we outlined a new strategy for correct source code selection, called time-aware matching, which enables a more realistic benchmarking, and validated the strategy with a data set of 38 projects.
Our bug localization approach, Broccoli, uses well-known techniques of information retrieval to locate buggy files. These include regular expressions for stack trace retrieval and natural language preprocessing like stemming or lemmatization. In addition, Broccoli uses advanced techniques like building an abstract syntax tree to extract information. On top of that, Broccoli uses a search engine to locate the source code files most likely to contain the bug. Finally, we trained our approach with a Random Forest algorithm using unrelated software projects to get a joint decision based on the scores of the different information retrieval components.
To evaluate our approach, we performed two experiments with state-of-the-art bug localization algorithms and compared their performance against Broccoli using the common metrics MAP and MRR. We performed a statistical analysis of the results to compare the approaches. For the first phase of the experiment, we used the data set of Lee et al. (lee), who already compared the performance of six bug localization algorithms on 51 open source software projects. We extended the study by adding Blizzard and Broccoli to the comparison. The results reveal that none of the approaches works best across all projects. However, the data indicates that Broccoli outperforms other approaches when the history of the software project is sufficient. Furthermore, the Random Forest of Broccoli can be adjusted to the software project statistics to achieve a higher performance, which makes this algorithm even more relevant in a practical setting. On average, Broccoli accomplished higher MRR and MAP values than state-of-the-art approaches. However, the indicated difference is not significant.
In the second phase of the experiment, we collected data on 31 additional open source software projects using the SmartSHARK platform. Then, we compared the eight bug localization approaches regarding their MAP and MRR values on this data using the Random Forest model trained in phase 1. Similar to our first experiment, the statistical analysis did not identify a single approach as statistically better than the others. However, the second experiment reveals different groups among the approaches, which are significantly different from each other. The leading group consists of Broccoli, Blizzard, and BRTracer+ for the MAP metric and Broccoli and BRTracer+ for the MRR metric. The results show that Broccoli dominates the results for the MAP value, since the approach outperforms the other approaches in 16 of 31 projects. Our results indicate a significant difference between Broccoli and all other approaches regarding the MAP metric. The performance of BRTracer+ regarding the MRR metric is superior to Broccoli.
The time-aware matching strategy helps to select a more accurate source code version for a bug report. Lee et al. (lee) already outlined the relevance of the correct selection for the performance of bug localization. With our experiment, we have shown that the time-aware matching significantly influences the MAP and MRR values of Broccoli. Generally, the experiment results show that approx. 23% of the source code files cannot be detected when using the common strategy, since the source files are not included in the search space. Moreover, 11% of the bug reports cannot be predicted, since none of the fixed files are included in the source code of the corresponding release version. Time-aware matching was able to decrease the non-detectable files to 14% and the non-detectable bug reports to 0.3%. This improvement comes at the price of an extended runtime to calculate the appropriate data. Depending on the data set, the runtime can increase three- to fourfold.
8. Outlook and future work
Broccoli can be extended, for example, by improving the scoring step. Our experiment results indicate that a second machine learning model could improve the localization even more. This model should be able to assess project specific properties and adjust the scoring to the actual project. The second model and the general Random Forest model are then combined by a new function. Furthermore, the search engine can be improved by using word embeddings that are trained on the programming language. The principle of query reformulation can also be used to improve the results of the search engine. Mills et al. (mills2020relationship) already examined the relevance of query reformulation.
Additionally, we would like to increase the granularity of Broccoli. Thus, we want to provide the software developer with information about the code range where the bug is located. In order to implement this, additional experiments regarding method and variable names as well as Java documentation are necessary. Moreover, the quality of the ground truth in the data set must be extended to the code line level (mills2020relationship). Hence, future work should provide a large data set of manually validated bug fixing changes.
Furthermore, the Broccoli approach can be extended to other programming languages, since we do not use special properties of the Java language in our approach. A suitable language should be object-oriented and have a package or folder structure. It would be helpful if the language produces crash reports or stack traces. These criteria are met by most popular programming languages. Future work could then also reveal differences between programming languages regarding the relevance of the components.
Regarding the benchmarking of bug localization approaches, different strategies to create a data set that reflects the real application should be analyzed. We already proposed the time-aware strategy to find better source code versions that correspond to the bug report. However, not all modified files of the bug reports are included in one of the source code versions. Therefore, future work can extend the search space of source code files to a finer granularity than the set of release versions. Additionally, the source code selection strategy needs further improvements regarding the runtime. In the future, it is necessary to evaluate and analyze a more sophisticated strategy, which does not depend on release versions but rather on the actual commit history. Moreover, we would like to extend the experiments with the time-aware strategy to the data set of Lee et al. (lee).
Furthermore, we would like to reduce the threats to external validity by extending the benchmarking to more bug reports and other software projects. Additionally, we would like to reduce the threats to construct validity by performing a user study on how well a bug localization approach helps to locate bugs in a real-world application.
Appendix A Detailed Results