During debugging, fault localization is one of the most difficult and time-consuming tasks, particularly for large-scale software systems. Therefore, there is a high demand for automatic fault localization techniques that can help software engineers effectively find the locations of faults with minimal human intervention. This demand has led to the proposal and implementation of many such techniques. Spectrum-Based Fault Localization (SBFL) is considered among the most prominent of them due to its efficiency and effectiveness. In SBFL, the probability of each program entity (e.g., statement, block, or method) being faulty is calculated from the test cases, their results, and their corresponding code coverage information.
Unfortunately, SBFL techniques are not yet widely adopted in industry [31, 10, 8] because they pose a number of issues and their performance is affected by several factors. One of these factors is the following. In SBFL, program statements are ranked by their suspiciousness, from most to least suspicious. To decide whether a statement is faulty, programmers examine the statements one by one, starting from the top of the ranking. To help developers discover the faulty statement early in the examination process and with minimal effort, the faulty statement should be placed as near to the top of the ranking as possible. However, ranking based only on suspiciousness scores inevitably involves a problem called rank ties. When different code elements (such as statements or methods) are tied, they have the same suspiciousness score, so they are indistinguishable from each other in this respect. If the faulty element falls within a tie (this is called a critical tie), then the overall performance of the SBFL method is reduced.
Probably none of the known SBFL formulae is guaranteed to produce different scores for all program elements, hence ties inevitably emerge between code elements. In fact, as we shall see in this paper, ties in SBFL are prevalent regardless of the underlying formula. In this paper, we propose a tie-breaking strategy that improves the performance of SBFL by utilizing contextual information extracted from method call chains (our strategy works at method-level granularity, meaning that the basic program element considered for fault localization is a method). Method call chains are the sequences of method calls found in the call stack during execution. Both call chains and call stack traces can provide valuable context about the fault being traced. For example, a method may fail when called from one place and succeed when called from another.
The proposed strategy is based on how often a method has been called, directly or indirectly, during the execution of failed test cases in different contexts. However, we do not count all occurrences of a method call but only those that occur in unique call contexts. Thus, repeated sequences of method calls caused by, e.g., loops are not considered. The intuition is that if a method is present in many different calling contexts during a failing test case, it is more suspicious and should get a higher rank position than other methods with the same score. The strategy can be applied to any underlying SBFL formula and, as we will see, in many cases it favourably breaks the ties that occur in the ranking.
We empirically evaluated the approach using 411 real faults from the Defects4J dataset and five well-known spectra formulae. The obtained results indicate that for all the selected formulae, the call frequency chain based tie-breaking strategy can improve the localization effectiveness in many ways. For example, it completely eliminated 72–73% of the critical ties over the full dataset. In other cases, it reduced their sizes significantly. Ranks of buggy elements improved by two positions on average, the approach achieved positive movement of bug ranks in most Top-3/Top-5/Top-10 rank categories, and in particular, the number of cases where the faulty method became the top ranked element increased by 23–30%.
The main contributions in the paper can be summarized as follows:
Analysis of rank tie prevalence in the benchmark programs.
A new tie-breaking algorithm that successfully breaks critical ties in many cases.
The analysis of the impact of tie-breaking on the overall SBFL effectiveness.
In terms of the concrete research goals, we defined the following Research Questions (RQs) for this paper:
RQ1: How prevalent are rank ties when applying a selection of different SBFL formulae? In particular:
(a) How common are rank ties in the Defects4J benchmark and what are their sizes?
(b) What would be the theoretically achievable maximum improvement if all critical ties were broken?
RQ2: What level of tie-breaking can we achieve using the call-frequency based strategy?
RQ3: What is the overall effect of the proposed tie-breaking on SBFL effectiveness in terms of global rank improvement?
The remainder of the paper is organized as follows. Section II presents an overview of the related work. Section III describes the tie problem in software fault localization. Section IV deals with RQ1, while Section V introduces our novel tie-breaking approach and answers RQ2. Section VI presents our empirical evaluation of RQ3. Section VII reports the potential threats to validity. Finally, we provide our conclusions in Section VIII.
II Related Work
Software fault localization is a significant research topic in software engineering. Although it started in the late 1950s, software fault localization research has gained more attention in the last couple of decades. This is reflected in the growing number of techniques, tools, and publications. The main reason for the increased attention is the dramatic growth in the size of software systems due to newly added functionalities and features, which has also increased the complexity of these systems. As a result, more faults have been reported. Software fault localization helps to reduce the number of faults and to ensure software quality. Many fault localization techniques, in addition to the ones used in this paper, have been proposed and discussed in the literature. Several surveys [4, 25, 8] and various empirical studies [18, 32] have compared the effectiveness of these techniques. However, systematic research on addressing ties in the context of fault localization is still scarce. The most closely related publications are presented here.
Yu et al. proposed a tie-breaking strategy that first sorts program statements by their suspiciousness and then breaks ties by sorting the tied statements according to a confidence metric. The metric is intended to assess the degree of certainty in a given suspiciousness value. For example, when two or more statements are assigned the same suspiciousness, the value assigned to the statements with a higher level of certainty is more reliable. As a result, the corresponding statements are more likely to be faulty.
Xu et al. presented the most systematic analysis of the problem associated with critical ties (ties that contain faulty statements), in which four tie-breaking strategies were considered and evaluated via experimental case studies. Their results indicated that some of the strategies can reduce ties without having an adverse impact on fault localization effectiveness. In addition, they proposed other tie-breaking techniques to be studied and evaluated in the future, such as a slicing-based approach to breaking ties.
Debroy et al. proposed a grouping-based strategy that employs another influential factor alongside the suspiciousness of statements. This strategy groups program statements based on the number of failed tests that execute each statement and then orders the groups so that those whose statements were executed by more failed tests come first. Afterwards, it ranks the statements within each group by their suspiciousness to generate the final ranking list. Thus, the statements are examined first in their group order and second by their suspiciousness. Their results show that ranking based on several factors can improve SBFL effectiveness. Thus, the grouping-based strategy could be effective in tie-breaking as well.
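The grouping idea summarized above can be illustrated with a two-level sort. The following is only a minimal sketch of the description given here, not the implementation of Debroy et al.; the function name `grouping_rank`, the statement names and the numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def grouping_rank(stats: Dict[str, Tuple[int, float]]) -> List[str]:
    """Order statements by group first, then by suspiciousness.

    stats maps a statement to (failed_tests_executing_it, suspiciousness).
    Statements executed by more failed tests form earlier groups; inside a
    group, higher suspiciousness comes first.
    """
    return sorted(stats, key=lambda s: (-stats[s][0], -stats[s][1]))


# Hypothetical data: s2 and s3 are executed by 3 failed tests, s1 and s4 by 2.
example = {"s1": (2, 0.5), "s2": (3, 0.4), "s3": (3, 0.9), "s4": (2, 0.7)}
order = grouping_rank(example)
print(order)
```

Note that s2 precedes s4 even though its suspiciousness is lower, because the group order takes priority over the score.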
Laghari et al. employed the idea of utilizing method calls to improve the performance of SBFL. In their approach, they combined method calls and their sequences with program slicing to extract spectra patterns from different contexts, which can locate faults more effectively than the standard SBFL formulae alone.
Utilizing method calls to improve the performance of SBFL is therefore not new. However, using the frequency of method calls for tie-breaking is a novel approach that has not been investigated by other researchers previously.
III Spectrum-Based Fault Localization and Ties
In this section, we present how SBFL techniques locate faults by ranking program elements based on their suspiciousness of being faulty, and which steps this involves. We also introduce the problem of ties among program elements in the rankings that these techniques produce. Both are illustrated with a simple code example.
III-A Fault Localization Formulae
SBFL is a dynamic program analysis technique, that is, it relies on program execution. In SBFL, code coverage information (also called spectra) obtained from executing a set of test cases, together with the test results, is used to calculate the probability of each program entity (e.g., statement, block, or method) being faulty. Code coverage records which program entities were executed during each test case, while test results classify the test cases as passed or failed. Passed test cases are executions of the program whose output is as expected, whereas failed test cases are executions whose output is not.
To illustrate how SBFL works, consider a simple Java program, adopted from Vancsics et al., that comprises four main methods and has four test cases, as shown in Figure 1.
Suppose that the tests have been executed on the program and the program spectra (the execution information of the four program methods in passed and failed test cases) have been recorded. Table I presents this information. An entry of 1 in the cell corresponding to a method and a test case means that the method was executed by that test case, and 0 means that it was not. This is also known as hit-based SBFL. An entry of 1 in the row labeled “results” means that the corresponding test case resulted in failure, and 0 otherwise. For example: test case calls the methods , and and it failed because its expected value is and not .
The program spectra are then used by a spectra formula to compute the suspiciousness of each program element of being faulty. Often, a spectra formula is expressed in terms of four counters that are calculated from the program spectra as follows:
ef: set of failed test cases that executed m.
ep: set of passed test cases that executed m.
nf: set of failed test cases that did not execute m.
np: set of passed test cases that did not execute m.
The last four columns of Table I represent these values. For example, of contains two tests because failed tests and are executed by , and of includes only one test () because it is not run by .
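As an illustration, the four counters can be derived mechanically from a hit-based coverage matrix and a result vector. The sketch below is a minimal example under stated assumptions: the names ef, ep, nf and np follow the usual SBFL convention, the function returns the sizes of the four sets rather than the sets themselves, and the spectrum shown is hypothetical (it is not the content of Table I).

```python
from typing import Dict, List, Tuple


def spectrum_counters(
    coverage: Dict[str, List[int]], results: List[int]
) -> Dict[str, Tuple[int, int, int, int]]:
    """Return (ef, ep, nf, np) for every method.

    coverage[m][i] is 1 if test i executed method m, else 0.
    results[i] is 1 if test i failed, else 0.
    """
    counters = {}
    for method, hits in coverage.items():
        pairs = list(zip(hits, results))
        ef = sum(1 for h, r in pairs if h == 1 and r == 1)   # executed, failed
        ep = sum(1 for h, r in pairs if h == 1 and r == 0)   # executed, passed
        nf = sum(1 for h, r in pairs if h == 0 and r == 1)   # not executed, failed
        np_ = sum(1 for h, r in pairs if h == 0 and r == 0)  # not executed, passed
        counters[method] = (ef, ep, nf, np_)
    return counters


# Hypothetical spectrum: three methods, four tests; tests 0 and 2 fail.
coverage = {"a": [1, 1, 1, 0], "b": [1, 0, 1, 1], "c": [0, 1, 0, 1]}
results = [1, 0, 1, 0]
counters = spectrum_counters(coverage, results)
print(counters)
```

Methods a and b are covered by different tests but end up with identical counters, which is exactly the situation in which any formula based only on these four values must assign them the same score.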
Most formulae use these four values to determine the location of bugs as accurately as possible. In this paper, we use the five popular formulae presented in Table II for quantitative evaluation. DStar, Ochiai and Tarantula can be seen as the most popular ones, while Confidence was used to give importance to suspicious program elements, especially for tie-breaking purposes. Finally, GP13 is a formula “generated” by a genetic algorithm and is one of the best performing formulae of this kind.
By applying these formulae on the coverage hit spectra of our Java program example in Table I, we can obtain the suspiciousness scores of each method as presented in Table III. It can be noted that in this example, each SBFL formula produces the same suspiciousness score for more than one method. In other words, SBFL formulae in this case cannot distinguish the methods from each other based on their pure scores. Hence, the tie problem among program elements affects the SBFL effectiveness in this case.
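To make the emergence of ties concrete, the sketch below scores methods from their four counters. Table II is not reproduced in this text, so the functions use the commonly cited definitions of Ochiai, Tarantula and DStar; the exponent 2 for DStar, the handling of zero denominators and the counter values are assumptions made for illustration only.

```python
import math


def ochiai(ef: int, ep: int, nf: int, np_: int) -> float:
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0


def tarantula(ef: int, ep: int, nf: int, np_: int) -> float:
    fail_ratio = ef / (ef + nf) if ef + nf else 0.0
    pass_ratio = ep / (ep + np_) if ep + np_ else 0.0
    total = fail_ratio + pass_ratio
    return fail_ratio / total if total else 0.0


def dstar(ef: int, ep: int, nf: int, np_: int, star: int = 2) -> float:
    denom = ep + nf
    # Assumption: an element executed only by failing tests gets an infinite score.
    return ef ** star / denom if denom else float("inf")


# Hypothetical counters (ef, ep, nf, np): a and b are indistinguishable.
counters = {"a": (2, 1, 0, 1), "b": (2, 1, 0, 1), "c": (0, 2, 2, 0)}
scores = {
    name: {m: f(*counters[m]) for m in counters}
    for name, f in (("ochiai", ochiai), ("tarantula", tarantula), ("dstar", dstar))
}
for name, per_method in scores.items():
    print(name, per_method)
```

Whatever the formula, a and b receive the same score because their counters coincide, so the tie cannot be resolved without additional information.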
III-B Rank Calculations and Ties
When different elements are assigned the same suspiciousness score, we say that these elements are score-tied with each other, and we call any such set of code elements a rank tie. Clearly, a rank tie has at least two elements. Since the output of an SBFL algorithm should be a (weakly) monotone list of code elements ranked according to their suspiciousness scores, there are various strategies for dealing with rank ties. This is especially important when evaluating the effectiveness of an SBFL method in terms of the position of the actually faulty element in the rank list. In this situation, all elements in a rank tie are assigned the same rank value, based on one of these approaches:
minimum (MIN): it refers to the topmost position of the elements sharing the same suspiciousness value (optimistic or best case),
maximum (MAX): it refers to the bottommost position (pessimistic or worst case), or
average (MID): it refers to the middle position of the elements sharing the same suspiciousness value (average case).
As a general way of assessing SBFL effectiveness, we will use the average rank approach (Equation 1), but we will use the other two options as well to examine tie properties in the sections that follow. If there are multiple bugs in a program version, the highest rank of the faulty elements is used.
$$\mathrm{MID} = S + \frac{E - 1}{2} \qquad (1)$$
where S is the tie starting position and E is the tie size.
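The three rank conventions can be sketched as follows. This is a minimal illustration that assumes 1-based positions and takes the average rank to be S + (E - 1)/2, i.e., the mean of the positions occupied by the tie; the function name `tie_ranks` and the scores are hypothetical.

```python
from typing import Dict


def tie_ranks(scores: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Compute MIN, MAX and MID ranks for every element."""
    ordered = sorted(scores.values(), reverse=True)
    ranks = {}
    for element, score in scores.items():
        start = ordered.index(score) + 1  # S: first (1-based) position of the tie
        size = ordered.count(score)       # E: number of elements sharing the score
        ranks[element] = {
            "MIN": start,
            "MAX": start + size - 1,
            "MID": start + (size - 1) / 2,
        }
    return ranks


# Hypothetical scores: a, b and c are tied at the top of the list.
ranks = tie_ranks({"a": 0.8, "b": 0.8, "c": 0.8, "d": 0.1})
print(ranks["a"], ranks["d"])
```

For an element that is not tied, E is 1 and all three values coincide with its position.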
Table IV presents the average ranks for the example program under several fault localization algorithms. Ranks that belong to a tie are marked in gray. Two of the algorithms cannot distinguish the methods from each other at all based on ranks, and the other three each result in a tie group that contains 3 methods.
Thus, such methods get tied in the ranking and cannot be differentiated from each other in terms of which one has to be examined first. Therefore, tie-breaking strategies are required to break these ties. Tie-breaking strategies are not only important to measure the effectiveness of a fault localization formula, they are also important for designing an efficient algorithm. For example, with an effective strategy, the buggy method can be moved up (i.e. to a better position) in the suspicious list.
SBFL formulae that do not deal with the issue of rank ties do not take into account other suspiciousness factors derived from the context. Ties that include faulty elements are quite frequent, and they are not limited to any particular localization technique or target program. When such a tie occurs, its elements share the same position in the ranking, which indicates that the technique used cannot distinguish between them in terms of their likelihood of being faulty. Thus, no guidance is provided to developers on what to examine first. In addition, the greater the number of ties involving faulty elements, the more difficult it is to predict at what point the fault will be found during the examination.
Rank ties can be divided into two important types: non-critical and critical. Non-critical ties refer to the case where only non-faulty elements are tied together for the same score in the ranking. Here, if the tied elements have a higher suspiciousness than the actual faulty element, then every element will be examined before finding the fault, regardless of the ties. On the other hand, if the tied elements have a lower suspiciousness than the actual faulty element, then the faulty element will be examined before the tied ones. Thus, there is no need to continue examining the ranking. In either case, the internal order in which the non-critical tied elements are examined does not affect the performance of fault localization in terms of the number of elements that must be examined before finding the fault.
Critical ties, on the other hand, refer to the case where a faulty element is tied with non-faulty elements. For this type, the internal order of examination affects the performance. Critical ties are therefore the main target of tie-breaking approaches, as breaking them can improve the efficiency of the SBFL algorithm. Unfortunately, we do not know in advance which code element is faulty, so the tie-breaking strategy has to deal with all ties.
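The distinction between the two types of ties can be expressed in a few lines. The sketch below is illustrative only: it assumes that the set of faulty elements is known (as in a benchmark evaluation, not in real debugging), and the function names and scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def find_ties(scores: Dict[str, float]) -> List[List[str]]:
    """Group elements by score and keep groups with at least two elements."""
    groups = defaultdict(list)
    for element, score in scores.items():
        groups[score].append(element)
    return [sorted(group) for group in groups.values() if len(group) >= 2]


def critical_ties(scores: Dict[str, float], faulty: Set[str]) -> List[List[str]]:
    """Return only the ties that contain at least one faulty element."""
    return [tie for tie in find_ties(scores) if faulty.intersection(tie)]


# Hypothetical scores with two ties; only the first one contains the fault.
scores = {"a": 0.8, "b": 0.8, "c": 0.3, "d": 0.3, "e": 0.1}
print(find_ties(scores))
print(critical_ties(scores, {"b"}))
```

Here the tie between c and d is non-critical: however it is ordered internally, the number of elements examined before reaching b does not change.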
IV Tie Statistics
In this section, we analyze the existence of rank ties in a set of benchmark programs. We present the subject programs used in our experiments alongside their properties, and the granularity of the data collected as program spectra for the selected SBFL formulae. Then, we present the properties of the ties observed before applying our tie-breaking strategy and the extent to which they can be reduced.
IV-A Subject Programs
An appropriate dataset is required to examine fault localization. One of the most popular bug datasets is Defects4J, a database of non-trivial real faults that is used to enable reproducible studies in software fault localization for Java programs. It is also the most frequently used benchmark in the fault localization literature, as it provides a high-level framework interface for easily accessing faulty and fixed program versions and their corresponding test suites. The version we used in this study is v1.5.0 (https://github.com/rjust/defects4j/tree/v1.5.0), which consists of 6 open-source Java programs and 438 real faults that were identified and extracted from the projects’ repositories. However, a few faults were excluded from this study due to instrumentation errors or unreliable test results. Thus, a total of 411 faults were included in the final dataset. Table V presents each program and its main properties.
IV-B Granularity of Data Collection
In this paper, method-level granularity was employed for the program spectra (coverage type). Compared to statement-level granularity, which is the most widely used level, it has several advantages: (a) it provides more comprehensive contextual information about the program entity under investigation, (b) it scales well to large programs and executions, and (c) some studies report that it is a better granularity level for users too [32, 5]. Nevertheless, there is no theoretical obstacle to investigating lower levels of granularity in terms of rank ties in the future.
IV-C Evaluation Baselines
In this paper, the five standard SBFL formulae presented in Table II were used as the baselines against which we evaluate and compare our proposed method. The reasons are: (a) no other proposed tie-breaking approach works at method-level granularity as ours does; (b) our goal was to use contextual information from program executions only to break ties, and not as the underlying SBFL formula for all program elements (this was done by Vancsics et al.). Earlier work used two confidence formulae, as well as data dependency among program statements, to break ties, but these approaches are not directly comparable to ours.
IV-D Basic Statistical Analysis
As mentioned earlier, there is no guarantee that SBFL formulae produce unique suspiciousness scores for all the elements of a program under test. As a result, many elements may share the same score and become tied with each other. Here, we present brief yet informative statistics on the number of ties that the five selected SBFL formulae produce when applied to the Defects4J dataset (see Table VI). All the selected SBFL formulae produce ties across all the target programs. This indicates several things: (a) ties are not rare in fault localization, (b) ties can form regardless of which subject program is under consideration, and (c) different SBFL formulae are affected alike.
Table VII presents the number of critical ties. An interesting observation is that the number of ties is not related to program size. For example, smaller programs may have more critical ties than larger programs as in the case of the Lang (22 KLOC) program having more critical ties compared to the Chart (96 KLOC) program. The average number of critical ties per bug is an important indicator, as it means in essence the probability that a buggy element will be tied (assuming a single-bug scenario).
We can conclude that more than half of the bugs (54–56%) are within critical ties, i.e., for most bugs there is at least one other method whose suspiciousness score is the same as that of the faulty method.
The size of the ties is another important factor when considering the potential improvements from tie-breaking. It can be investigated by looking at the differences between the MIN (best case) and the MID (average case) approaches described in the previous section. Table VIII shows the number of critical ties for which the MIN and MID values differ in columns 2 and 3 (essentially, the critical tie numbers shown above), the sum of the corresponding rank differences (column 4), and its average per critical tie (column 5). Put differently, twice the average difference approximates the average critical tie size in the benchmark, which is around 7 methods. The differences among the formulae are not notable.
It also follows that, ideally, the best improvement a tie-breaking technique could achieve equals these averages. Table VIII also shows in how many cases any improvement is possible, so we can use these numbers as a baseline for evaluating our tie-breaking approach in subsequent sections.
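The relation between the MIN/MID difference and the tie size can be made explicit with a short derivation. It assumes, as in Section III-B, that S is the starting position of a tie and E its size, and that the average rank is the mean of the positions the tie occupies.

```latex
\begin{align*}
\mathrm{MIN} &= S, \qquad \mathrm{MID} = S + \frac{E - 1}{2} \\
\mathrm{MID} - \mathrm{MIN} &= \frac{E - 1}{2} \\
E &= 2\,(\mathrm{MID} - \mathrm{MIN}) + 1
\end{align*}
```

Under this assumption, twice the rank difference is the number of elements tied with the faulty one (E - 1), so an average difference of about 3.5 corresponds to roughly 7 such elements. Because the relation is linear, it carries over from individual ties to the averages.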
We also examined the distribution of the critical tie sizes, which is shown in Figure 2. The X-axis represents the number of methods involved in critical ties and the Y-axis represents the percentage of method groups that have the same tie size. As expected, most ties are relatively small (2–4 elements): 67% of the critical ties contain 5 or fewer methods, and sizes above 15 are rare (the average is 7.8, the median is 3 and the maximum is 128). Interestingly, there are some outliers with very large ties, which explains the relatively large average.
RQ1: Overall, it can be said that the ties and critical ties are very common (for the bugs in our benchmark). Each of the examined SBFL algorithms created critical ties for more than half of bugs, and on average, the ranks could potentially be improved by around 3.5 positions by eliminating the ties.
V Call Frequency-Based Tie-Breaking
In this section, we present the concepts of our proposed tie-breaking strategy and how it works. Then, we present its effectiveness in reducing critical ties when applied on our bug benchmark.
V-A Frequency-based Tie Reduction
In Section III, we introduced the basic concept of hit-based SBFL. One disadvantage of this approach is that it does not take into account how frequently the program elements (in our case, methods) are executed; approaches that do so are known as count-based SBFL. There have been studies that used counts [14, 13], but recent results have shown that these are unable to improve the efficiency of the algorithms.
Vancsics et al. proposed a technique to replace the simple count-based approach, which proved to enhance hit-based spectra while eliminating the problems of naive counts. It is based on replacing the value of ef (the failed test cases that executed the method) in the SBFL formulae with the frequency of different call contexts in the call stack for failing tests. The basic intuition is that if a method participates in many different calling contexts (both as a caller and as a callee), it is more suspicious. In other words, the frequency with which methods occur in the unique call stacks belonging to failing test cases can effectively indicate the location of the bugs. In the present research, we employ this concept for the purpose of tie-breaking.
To illustrate the basic concept of frequency-based tie reduction, we first define the frequency-based SBFL matrix that replaces the traditional hit-based one. In it, each element is an integer instead of a binary value, indicating the number of occurrences of the particular code element in the unique call stacks (effectively, the different contexts) when executing the given test case. Table IX shows the frequency-based matrix for the example. For instance, the call stacks of are , and , so the frequency of will be 2 for test .
In the next step, we define the metric to be used as the discriminating factor for tie-breaking. It corresponds to a frequency-based counterpart of ef (the failed test cases that executed the method) and is calculated by summing the corresponding frequency-based values in the matrix over the failing test cases (see Equation 2). The values for our example are shown in the last row of Table IX.
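The computation of the frequency values can be sketched as follows. This is a minimal illustration under explicit assumptions: a call stack is represented as a tuple of method names, the frequency of a method in a test is taken to be the number of unique call stacks that contain it, and all names and data are hypothetical rather than taken from Table IX.

```python
from typing import Dict, Iterable, List, Tuple

CallStack = Tuple[str, ...]


def frequency_in_test(stacks: Iterable[CallStack], method: str) -> int:
    """Number of unique call stacks of one test that contain the method."""
    unique_stacks = set(stacks)  # repeated stacks (e.g., from loops) collapse to one
    return sum(1 for stack in unique_stacks if method in stack)


def failing_frequency(
    stacks_by_test: Dict[str, List[CallStack]],
    results: Dict[str, int],
    method: str,
) -> int:
    """Sum the per-test frequencies over the failing tests (result == 1)."""
    return sum(
        frequency_in_test(stacks, method)
        for test, stacks in stacks_by_test.items()
        if results[test] == 1
    )


# Hypothetical traces: the stack ("test", "a", "b") is recorded twice in t1.
stacks_by_test = {
    "t1": [("test", "a"), ("test", "a", "b"), ("test", "a", "b"), ("test", "c")],
    "t2": [("test", "a"), ("test", "c", "b")],
}
print(frequency_in_test(stacks_by_test["t1"], "a"))
print(failing_frequency(stacks_by_test, {"t1": 1, "t2": 0}, "a"))
```

Only failing tests contribute to the final value, so in the example the stacks recorded for the passing test t2 do not affect the frequency of a.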
Figure 3 shows our tie-breaking process, which consists of two stages. In the first stage, we compute the suspiciousness scores of the program methods and their ranks by applying different SBFL formulae to the program spectra (test coverage and test results). The output of this stage is an initial ranking list of program methods that includes critical and non-critical ties. In the second stage, we trace the execution of the program methods to obtain the frequency-based metric defined above. This metric is then used as a tie-breaker: the tied methods in the initial ranking list are re-ordered based on its value for each method. The output of this stage is a final ranking list, in which many critical ties are either eliminated completely or reduced in size.
Our proposed tie-breaking method uses the obtained call frequency values to separate the methods sharing the same score, by placing the methods with higher frequency values higher in the ranking. Thus, the most suspicious method is the one that was called in the largest number of different call stacks from failing test cases. The rationale behind using the frequency in failing tests rather than other contexts (such as the context of method calls in passing test cases) is the intuition that a method is more likely to contain a fault when it is executed by more failing test cases than passing ones, and less likely when it is mostly executed by passing tests. However, other contexts could be considered in the future, and their impact on breaking ties investigated.
The ranks for the example without tie-breaking (columns B) and with our tie-breaking approach (columns A) are presented in Table X. Ties marked in gray were eliminated by the use of call frequency. As a result, we were able to differentiate between the faulty method and the other suspicious ones under all of the SBFL formulae (the faulty method got the highest rank in all cases).
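The second stage of the process amounts to a sort with a secondary key. The following is a minimal sketch under the assumption that the suspiciousness scores and the failing-test call frequencies are already available; `break_ties` and the data are hypothetical.

```python
from typing import Dict, List


def break_ties(scores: Dict[str, float], freq: Dict[str, int]) -> List[str]:
    """Rank by suspiciousness; among equal scores, higher frequency first."""
    return sorted(scores, key=lambda m: (-scores[m], -freq.get(m, 0)))


# Hypothetical input: a, b and c are tied on score; b has the highest frequency.
scores = {"a": 0.8, "b": 0.8, "c": 0.8, "d": 0.1}
freq = {"a": 1, "b": 3, "c": 1, "d": 5}
ranking = break_ties(scores, freq)
print(ranking)
```

The frequency is only consulted inside a tie, so d stays last despite its high frequency. Methods a and c remain tied with each other because their frequencies are equal, which shows that a tie may be reduced in size without being eliminated.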
V-B Reduction of the Critical Ties
The metric called Tie-reduction was defined by Xu et al. to measure how much a critical tie has been reduced or broken in terms of size. Here, size simply means the number of code elements sharing the same score value, so the minimum tie size is 2. The goal of any tie-breaking strategy is to reduce the size of the tie or to eliminate it completely (when the resulting size is 1). We modified the original definition of this metric to better reflect the actual gain in terms of what portion of the “superfluous” elements in a tie can be eliminated (see Equation 3).
$$\text{Tie-reduction} = \left(1 - \frac{size_{after} - 1}{size_{before} - 1}\right) \cdot 100\% \qquad (3)$$
Here, $size_{after}$ is the size of a critical tie after applying a tie-breaking strategy and $size_{before}$ is its size before applying the strategy. In the ideal case, the critical tie is completely eliminated, and the value of the tie-reduction is 100%. If no reduction can be obtained, the metric is 0%, and in all other cases it shows the percentage of removed elements that shared the same score value as the faulty element.
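As a worked illustration, the metric can be computed as below. The sketch assumes that the modified metric measures the share of the “superfluous” elements (tie size minus one) that were removed, which is consistent with the boundary cases described above (full elimination gives 100%, no change gives 0%); the function name and the input checks are our own.

```python
def tie_reduction(size_before: int, size_after: int) -> float:
    """Percentage of superfluous tied elements removed by tie-breaking."""
    if size_before < 2:
        raise ValueError("a tie has at least two elements")
    if not 1 <= size_after <= size_before:
        raise ValueError("size_after must be between 1 and size_before")
    return (1 - (size_after - 1) / (size_before - 1)) * 100


print(tie_reduction(5, 1))  # tie completely eliminated
print(tie_reduction(5, 3))  # two of the four superfluous elements removed
print(tie_reduction(5, 5))  # no reduction
```

Subtracting one from both sizes ensures that a tie reduced to the faulty element alone counts as a full reduction rather than as a fraction of the original size.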
In Figure 4, we visualize the amount of critical tie reduction on our benchmark using the Tie-reduction metric. Each dot represents one bug in the dataset, and the violin plot offers a more general picture of the distribution of the data points. It can be seen from the shape of the plots that in several cases no reduction was possible, but the majority of the ties were completely eliminated. As with the number and size of critical ties, there was no significant difference in this respect among the SBFL formulae; we obtained very similar results.
Table XI shows some important statistical values for this dataset: the mean, the median and the first quartile (the value in the middle between the smallest and the median points). Since the median is 100%, we can state that the critical ties are eliminated by our method in more than half of the cases (72–73%, as detailed below); the reduction is at least 83.9–91.5%, depending on the formula, for three-quarters of the bugs, and the average rate of reduction is greater than 80% in all cases.
Table XII presents the number of remaining critical ties for each program and SBFL formula after applying the tie-reduction algorithm. The difference from the previous values (shown in Table VII) is also included. For example, 15 bugs of Chart were in critical ties with one of the formulae, but after applying the call frequency-based tie-reduction, 11 critical ties are eliminated, which is 73.3% of the initial ties. Overall, we achieved a 72–73% improvement in the number of critical ties over the full dataset, the best case being Mockito with over 84.6% and the worst result being 54.5% on Lang with two of the formulae.
The sizes of the critical ties determine the level of improvement achievable by a tie-breaking approach. However, it also matters in which direction the faulty element moves in the new ranking after tie-breaking. Using the terminology from the previous section, moving from the MID position towards MIN means improvement. In the previous section, in Table VIII, we presented the maximum potential improvement, which sets a theoretical limit on SBFL effectiveness after tie-reduction. Table XIII presents what we actually achieved using our proposed algorithm (the meaning of the data is the same as in Table VIII).
We examined whether our method was able to reduce the number of cases where the MIN (best case) and the MID (average case) approaches give different results. If there were no such cases that would mean that the obtained new ranking after tie-break would always be the best possible, MIN case. Table XIII shows that, after our approach, only around 15% of the bugs contained critical ties (column 3), compared to around 55% before tie-breaking.
Comparing this with the result of Table VIII, we find that in more than 160 cases we managed to achieve the ideal result with our method where the original algorithm was not able to do so. It means that for nearly three quarters of bugs in critical ties (72–73%), the non-optimal result was improved to optimal.
If we compare the sum of the rank differences (column 4) and their averages (column 5) in Tables VIII and XIII, it can be seen that our approach was able to reduce the sum significantly (by 89–93%), and the average by 59–74%.
Put differently, the overall distance of the rank positions from the ideal case improved from around 3.5 to 1 in the cases where we achieved the optimal result, which essentially means a rank improvement of between 59% and 74%.
RQ2: Using the call-frequency based tie-breaking strategy, we achieved a significant reduction in both size and number of critical ties in our benchmark. In 72-73% of the cases the ties were completely eliminated, the average reduction rate being more than 80%. In nearly three quarters of the cases (72–73%), the faulty element got the highest rank among the tie-broken code elements, and here it improved its position by 59–74% on average.
VI Effect of Tie-Breaking on SBFL Performance
In this section, we analyze the overall effect of the proposed tie-breaking strategy on SBFL effectiveness in terms of global ranks. For that purpose, we use several evaluation metrics that have been employed in the literature [15, 6, 28].
VI-A Achieved improvements and the average ranks
The average rank is used to rank program elements that share the same suspiciousness value: it is the average of their positions after all elements are sorted in descending order of suspiciousness. It is calculated using Equation 1. Table XIV presents the average ranks before (column 2) and after (column 3) applying our tie-breaking strategy, together with the difference between the two (column 4). A negative difference means that our proposed strategy achieved an improvement.
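To make the average-rank idea concrete, the following minimal sketch computes the best (MIN), average (MID), and worst (MAX) rank of one element within its tie. It is an illustration under assumptions, not the implementation used in the experiments: the function name `tie_ranks` and the sample scores are hypothetical, and Equation 1 is assumed to be the usual midpoint of the tied positions.

```python
from typing import Dict, Tuple


def tie_ranks(scores: Dict[str, float], target: str) -> Tuple[int, float, int]:
    """Return (MIN, MID, MAX) rank of `target` under descending-score ordering."""
    s = scores[target]
    # Elements with a strictly higher score always precede the target.
    higher = sum(1 for v in scores.values() if v > s)
    # Size of the tie, including the target itself.
    tied = sum(1 for v in scores.values() if v == s)
    min_rank = higher + 1               # target examined first within the tie
    max_rank = higher + tied            # target examined last within the tie
    mid_rank = (min_rank + max_rank) / 2  # assumed form of the average rank
    return min_rank, mid_rank, max_rank


# Hypothetical suspiciousness scores: b, c and d form a tie of size 3.
scores = {"a": 0.9, "b": 0.7, "c": 0.7, "d": 0.7, "e": 0.2}
print(tie_ranks(scores, "c"))  # (2, 3.0, 4)
```

If the faulty element is `c`, the tie is critical: tie-breaking can move it anywhere between rank 2 and rank 4, but never above `a` or below `e`.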
We can see that our strategy achieved improvements with all the selected SBFL formulae: the average rank decreased in all cases, which corresponds to 3.1–4.1% with respect to the total number of elements. Note that this average is similar to what we obtained for RQ2, but it is not the same, because for RQ2 we examined only the cases in which we achieved the optimal result, while in this section we are interested in the global results.
We also examined how many times our tie-breaking strategy changed the rank of bugs (in positive and negative directions) and what was the impact of the changes. Table XV presents the possible changes in several categories, as follows (B means before, A means after applying tie-breaking):
it has moved up in the rankings (column: better);
it remained in the same position (column: same);
we worsened the result (column: worse);
it slipped back to the worst place (column: worst).
In addition, column “improve” represents improvements in rank modifications (i.e., best+better), while “deteriorate” is worse+worst. The table also includes the average differences in rank positions for the given categories.
The results indicate that we achieved improvement in about 3–4 times more cases than deterioration of the ranking results. Moreover, the improvement differences are much larger than the deterioration differences (compare, for example, the better cases of around -7 to the worse cases of around 2). Another interesting insight is that in the case of best, the difference is relatively small because the ties broken in this category were small as well (they contained 3–4 methods). Looking at the overall result, the average rate of improvement ranged from -3.73 to -3.86, while the deterioration was only between 1.34 and 1.54 rank positions on average.
The overall rank position improvement might seem modest, but we must consider the fact that the improvement can be achieved only by rearranging the positions in the critical ties. Thus, the sizes of the critical ties serve as a hard constraint (as discussed in the previous section). However, there is a class of improvements which are probably more important than the general case: improvements in the Top-N rank positions, and here the benefits are more pronounced, as presented below.
VI-B Top-N categories
Several studies [19, 27] report that developers consider inspecting the first 5 program elements in the ranked list produced by a fault localization technique acceptable, and that the first 10 elements are the upper bound for inspection before the list is abandoned. Hence, we verified the results by focusing on these rank positions only (collectively called Top-N). We use five cases: the fault is ranked first (Top-1), its rank is less than or equal to three (Top-3), less than or equal to five (Top-5), less than or equal to ten (Top-10), or more than ten (Other).
We also used a special non-accumulating variant of the Top-N categories, in which we counted the cases where the bug fell into one of the non-overlapping rank intervals delimited by the same thresholds: rank 1, above 1 up to 3, above 3 up to 5, above 5 up to 10, or above 10. The goal of this part of the evaluation was to see in how many cases our approach “moves” a bug into a better or a worse group. In other words, in how many cases do the bugs get into a higher-rank group (this kind of improvement is also known as enabling improvement) and in how many cases do we downgrade the category.
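The two views of the Top-N categories can be sketched as follows. This is only an illustration: the names `cumulative_top_n`, `interval_of`, and `direction`, the interval labels, and the sample ranks are assumptions introduced here, with the thresholds taken from the Top-1, Top-3, Top-5, and Top-10 definitions above.

```python
from typing import Dict, Iterable

THRESHOLDS = (1, 3, 5, 10)  # Top-1, Top-3, Top-5, Top-10
LABELS = ("[1]", "(1, 3]", "(3, 5]", "(5, 10]", "(10, ...)")


def cumulative_top_n(ranks: Iterable[float]) -> Dict[str, int]:
    """Cumulative counts: a bug ranked first is counted in every Top-N."""
    ranks = list(ranks)
    counts = {f"Top-{n}": sum(1 for r in ranks if r <= n) for n in THRESHOLDS}
    counts["Other"] = sum(1 for r in ranks if r > THRESHOLDS[-1])
    return counts


def category_index(rank: float) -> int:
    """Index of the non-overlapping interval containing `rank` (0 is best)."""
    return sum(1 for n in THRESHOLDS if rank > n)


def interval_of(rank: float) -> str:
    """Label of the non-accumulating category of a rank."""
    return LABELS[category_index(rank)]


def direction(before: float, after: float) -> str:
    """Whether tie-breaking moved a bug to a better or worse category."""
    b, a = category_index(before), category_index(after)
    if a < b:
        return "improved"
    if a > b:
        return "worsened"
    return "unchanged"


print(cumulative_top_n([1, 2.5, 4, 7, 12]))
print(interval_of(2.5), direction(4, 1), direction(2, 4))
```

Note that a rank change inside one interval (for example, from 3 to 2) counts as "unchanged" in this view, which is why the non-accumulating variant isolates category-level moves.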
Table XVI presents the number of bugs belonging to the corresponding Top-N categories (cumulative) with their percentages, for the whole dataset, before and after applying our tie-breaking strategy, as well as the differences between them. A decrease in the number of cases of the Other category and increase in any Top-N means improvement.
It can be clearly seen that our proposed tie-breaking strategy achieves improvements in all categories by moving many bugs to higher ranked categories. On the lower end of the scale (Other category with rank above 10), 5–8 bugs were moved into one of the Top-N categories. This is important as it brings a “new hope” that a bug could be found by the user with the proposed strategy while it was not very probable without it. We can see a quite large number of improvements in higher categories as well: around 18 bugs moved to Top-1, for instance. Note that the percentages of bugs in each category before and after applying the strategy were calculated with respect to the total number of bugs in the dataset, while the difference percentage was calculated with respect to the number of bugs before applying the strategy.
To better understand the actual changes between the different Top-N categories, we should use the non-accumulating variant of these categories. This shows whether there has been a beneficial change in the rank category. These moves between the Top-N categories are presented in Table XVII. The sign ✗ indicates the number of changes in the negative direction (worsening result), and ✓ marks improvement. For example, for one of the formulae there were a total of 2 bugs with a rank greater than 1 but less than or equal to 3 before tie-reduction for which our method resulted in a rank value greater than 3 (this is a negative result). In contrast, our method gave a rank of 1 for the faulty method 15 times where the rank was previously greater than 1 but at most 3.
These numbers clearly show that improvement was dominant: degradation by the proposed method was observable only for 2–3 bugs in the dataset, while we observed positive changes for 36–44 bugs.
RQ3: The effectiveness of all investigated SBFL formulae could be improved by using the proposed tie-breaking strategy: the average improvement of rank values in the benchmark was about two positions, and we observed improvement about 3–4 times more frequently than deterioration, with the improvements also being much larger in magnitude. Considering the Top-N categories, notable improvements could be observed: all Top-N categories showed positive results (improvements in 36–44 cases), and at the same time, Top-N categories worsened in only a few (2–3) cases. We were able to increase the number of cases where the faulty method became the top ranked element by 23–30%.
VII Threats to validity
Various threats may affect the validity of experimental studies in software engineering. In our work, we considered the following actions to avoid or minimize such threats:
Selection of evaluation metrics: to ensure that our results and corresponding conclusions are valid, we selected several evaluation metrics that are also used by previous research to ensure multiple-dimension comparisons. Besides, all the evaluation metrics employed in this study were reported and described in detail.
Correctness of implementation: to ensure that our experiment implementation is correct and accurate, code review was conducted by the authors. Furthermore, we have run our proposed approach several times to ensure that it is implemented correctly.
Selection of subject programs: in our experiment, we evaluated the effectiveness of the proposed tie-breaking strategy on fault localization using only six Java subject programs. Thus, we cannot generalize our findings to other programs. However, we believe that the selected subject programs are representative of others, as they have real faults and vary in size and complexity, and the benchmark containing them, Defects4J, is commonly used in other studies on software fault localization.
Exclusion of faults: in our experiment, 27 faults (about 6% of the total faults) of the Defects4J dataset were excluded because we could not compute their call stack information due to technical limitations. The issue here is whether other researchers using the same dataset will be able to replicate our findings. This exclusion was in no way influenced by the results of the metrics used, and the excluded faults are distributed approximately evenly in the dataset, so we believe that this risk can be considered minimal.
Selection of SBFL formulae: to evaluate the effectiveness of our proposed tie-breaking strategy on fault localization, we selected a set of five SBFL formulae in our experiment, which is just a fraction of the techniques proposed in the literature. The obtained results show improvements with all of them. However, we cannot guarantee that the same improvements can be obtained by using other SBFL formulae. To mitigate the effect of this issue, we selected three SBFL formulae that are commonly used in other studies on software fault localization, which we extended with two special kinds, one of which was especially designed with tie-breaking in mind.
Rank ties in SBFL are very common regardless of the formula employed, and by breaking these ties, improvements to the localization effectiveness can be expected. This paper proposes the use of method call contexts for breaking critical ties in SBFL. We rely on instances of call stack traces, which are useful software artifacts during run-time and can often help developers in debugging. The frequency of the occurrence of methods in different call stack instances determines the position of the code elements within the set of other methods tied together by the same suspiciousness score.
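The core idea summarized above can be sketched in a few lines. This is a simplified illustration rather than the algorithm evaluated in the paper: the function name `rank_with_tie_break`, the method names, and the frequency values are hypothetical, and it is assumed here that a method occurring more frequently in the call stack instances is placed higher within its tie.

```python
from typing import Dict, List


def rank_with_tie_break(scores: Dict[str, float],
                        call_freq: Dict[str, int]) -> List[str]:
    """Order methods by suspiciousness; break ties by call frequency.

    `call_freq` maps a method to how often it occurs in call stack
    instances (assumption: a higher count ranks first within a tie).
    The method name is the last key so that the order is deterministic.
    """
    return sorted(
        scores,
        key=lambda m: (-scores[m], -call_freq.get(m, 0), m),
    )


# Hypothetical data: three methods tied at 0.8, one method below them.
scores = {"A.f": 0.8, "B.g": 0.8, "C.h": 0.8, "D.k": 0.5}
call_freq = {"A.f": 2, "B.g": 7, "C.h": 4}

print(rank_with_tie_break(scores, call_freq))
# ['B.g', 'C.h', 'A.f', 'D.k']
```

The secondary key only reorders methods that share a score, so `D.k` can never overtake the tied group; this mirrors the hard constraint that tie-breaking cannot move elements outside the tied positions.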
Experimental results on the Defects4J benchmark show that the proposed tie-breaking strategy (a) completely eliminated many critical ties and significantly reduced others, and (b) achieved improvements in average rank positions for all investigated SBFL formulae by moving many bugs to the highest Top-N rank positions. However, there are limits to how much improvement one can expect from tie-breaking alone (we analyzed this limit in the paper and compared it to the results achieved). This means that no matter how clever a tie-breaking method is, it cannot rearrange code elements outside of the tied ranking positions. Since ties seem to be prevalent, an interesting direction for further research would be to devise specific tie-aware approaches or modified formulae that minimize ties in the scores and/or break them automatically.
As other future work, we would like to measure the effectiveness of the proposed tie-breaking strategy on other levels of granularity, such as statements and branches. It would also be interesting to employ other SBFL formulae across a much broader range of programs (in terms of number, type, size, and programming language) in order to capture the characteristics of the ties problem and identify the factors that affect it. We would also like to tackle the ties problem by employing other contextual factors beyond method calls and to measure their impact on SBFL.
The results of our experimental study can be found at https://bit.ly/3qFQmof.
The research was supported by the Ministry of Innovation and Technology, NRDI Office, Hungary within the framework of the Artificial Intelligence National Laboratory Program, and by grant NKFIH-1279-2/2020 of the Ministry for Innovation and Technology. Qusay Idrees Sarhan was supported by the Stipendium Hungaricum scholarship programme.
- (2006) An evaluation of similarity coefficients for software fault localization. In 2006 12th Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 39–46. Cited by: §III-A.
- (2007) On the accuracy of spectrum-based fault localization. In Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION (TAICPART-MUTATION 2007), pp. 89–98. Cited by: §III-A.
- (2010) Exploiting count spectra for bayesian fault localization. In Proceedings of the 6th International Conference on Predictive Models in Software Engineering, pp. 1–10. Cited by: §V-A.
- (2014-09) Fault-localization techniques for software systems: A Literature Review. 39 (5), pp. 1–8. Cited by: §II.
- (2016) A learning-to-rank based fault localization approach using likely invariants. In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, New York, NY, USA, pp. 177–188. Cited by: §IV-B.
- (2020) Leveraging contextual information from function call chains to improve fault localization. In IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 468–479. Cited by: §VI-B, §VI.
- (1989) Evaluating the effectiveness of reliability-assurance techniques. Journal of Systems and Software 9 (3), pp. 191–195. Cited by: §I.
- (2016-07) Spectrum-based Software Fault Localization: A Survey of Techniques, Advances, and Challenges. pp. 1–46. Cited by: §I, §II.
- (2010) A Grouping-Based Strategy to Improve the Effectiveness of Fault Localization Techniques. In 10th International Conference on Quality Software, pp. 13–22. Cited by: §I, §II.
- (2017-03) Challenges of Operationalizing Spectrum-Based Fault Localization from a Data-Centric Perspective. In IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 379–381. Cited by: §I.
- (2013) Using html5 visualizations in software fault localization. In First IEEE Working Conference on Software Visualization (VISSOFT), pp. 1–10. Cited by: §I.
- An empirical investigation of the relationship between spectra differences and regression faults. Software Testing, Verification and Reliability 10 (3), pp. 171–194. Cited by: §III-A.
- (2000) An empirical investigation of the relationship between spectra differences and regression faults. 10 (3), pp. 171–194. Cited by: §V-A.
- (1998) An empirical investigation of program spectra. In Proceedings of the 1998 ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering, pp. 83–90. Cited by: §V-A.
- (2019) Combining spectrum-based fault localization and statistical debugging: an empirical study. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 502–514. Cited by: §VI.
- (2002) Visualization of test information to assist fault localization. In Proceedings of the 24th International Conference on Software Engineering (ICSE), pp. 467–477. Cited by: §III-A.
- (2014) Defects4J: a database of existing faults to enable controlled testing studies for Java programs. In International Symposium on Software Testing and Analysis (ISSTA), pp. 437–440. Cited by: §IV-A.
- (2014-02) Empirical evaluation of existing algorithms of spectrum based fault localization. In The International Conference on Information Networking (ICOIN), pp. 346–351. Cited by: §II.
- (2016) Practitioners’ expectations on automated fault localization. In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, New York, NY, USA, pp. 165–176. Cited by: §VI-B.
- (2016) Fine-tuning spectrum based fault localisation with frequent method item sets. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 274–285. Cited by: §II.
- (2012) Enhancing Contexts for Automated Debugging Techniques. In The Seventh International Conference on Software Engineering Advances (ICSEA), pp. 1–7. Cited by: §III-A.
- (2017-10) Transforming programs and tests in tandem for fault localization. Proc. ACM Program. Lang. 1 (OOPSLA). Cited by: §IV-A.
- (2013) MFL: method-level fault localization with causal inference. In 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, pp. 124–133. Cited by: §IV-B.
- (2021-03) Call frequency-based fault localization. In Proceedings of the 28th IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER’21), pp. 365–376. Cited by: §III-A, §IV-C, §V-A, §V-A.
- (2016-08) A Survey on Software Fault Localization. 42 (8), pp. 707–740. Cited by: §II.
- (2007) Effective fault localization using code coverage. In Proceedings - International Computer Software and Applications Conference, pp. 449–456. Cited by: §III-B.
- (2016) “Automated debugging considered harmful” considered harmful: a user study revisiting the usefulness of spectra-based fault localization techniques with professionals using real bugs from large systems. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 267–278. Cited by: §VI-B.
- Ties within fault localization rankings: Exposing and addressing the problem. International Journal of Software Engineering and Knowledge Engineering 21 (6), pp. 803–827. Cited by: §II, §III-A, §III-B, §III-B, §IV-C, §V-B, §VI.
- (2012) Evolving human competitive spectra-based fault localisation techniques. 7515 LNCS, pp. 244–258. Cited by: §III-A.
- (2008) An empirical study of the effects of test-suite reduction on fault localization. In Proceedings of the 13th international conference on Software engineering (ICSE), pp. 201–210. Cited by: §II.
- (2020-03) Spectrum-based Fault Localization Techniques Application on Multiple-Fault Programs: A Review. Global Journal of Computer Science and Technology 20, pp. 41–48. Cited by: §I.
- (2021) An empirical study of fault localization families and their combinations. 47 (2), pp. 332–347. Cited by: §II, §IV-B.