A Variability Fault Localization Approach for Software Product Lines

09/21/2021
by   Thu-Trang Nguyen, et al.
VNU

Software fault localization is one of the most expensive, tedious, and time-consuming activities in program debugging. This activity becomes even more challenging in Software Product Line (SPL) systems due to the variability of failures. These unexpected behaviors are induced by variability faults which can only be exposed under some combinations of system features. The interaction among these features causes the failures of the system. Although localizing bugs in single-system engineering has been studied in depth, variability fault localization in SPL systems still remains mostly unexplored. In this article, we present VarCop, a novel and effective variability fault localization approach. For an SPL system failed by variability bugs, VarCop isolates suspicious code statements by analyzing the overall test results of the sampled products and their source code. The isolated suspicious statements are the statements related to the interaction among the features which are necessary for the visibility of the bugs in the system. The suspiciousness of each isolated statement is assessed based on both the overall test results of the products containing the statement as well as the detailed results of the test cases executed by the statement in these products. On a large dataset of buggy SPL systems, empirical evaluation shows that VarCop significantly improves two state-of-the-art fault localization techniques, by 33% and 50% in ranking the incorrect statements in the systems containing a single bug each. In about two-thirds of the cases, VarCop ranks the buggy statements at the top-3 positions in the resulting lists. For multiple-bug cases, VarCop outperforms the state-of-the-art approaches by 2 times and 10 times in the proportion of bugs localized at the top-1 positions. In 22% and 65% of the cases, VarCop correctly ranks at least one bug in a system at the top-1 and top-5 positions.


1 Introduction

Many software projects enable developers to configure them for different environments and requirements. In practice, a project adopting the Software Product Line (SPL) methodology [SPLBook] can tailor its functional and non-functional properties to the requirements of users [SPLBook, apel2016feature]. This is done using a very large number of options which control different features [apel2016feature] in addition to the core software. A set of selections of all the features (a configuration) defines a program variant (product). For example, Linux Kernel supports thousands of features controlled by more than 12K compile-time options, which can be configured to generate specific kernel variants for billions of possible scenarios.

However, the variability that is inherent to SPL systems challenges quality assurance (QA) [garvin2011feature, apel2013strategies, 42bugs, sampling_comparision, ase19prioritization]. In comparison with single-system engineering, fault detection and localization through testing in SPL systems are more problematic, as a bug can be variable (a so-called variability bug), which can only be exposed under some combinations of the system features [garvin2011feature, interaction_complexity]. Specifically, there exists a set of features that must be selected on and off together to reveal the bug. Due to the presence/absence of the interaction among the features in such a set, the buggy statements behave differently in the products where these features are on and off together and in the products where they are not. Hence, the incorrect statements can only expose their bugginess in some particular products, yet cannot in the others. In particular, in an SPL system, variability bugs only cause failures in certain products, and the other products still pass all their tests. This variability property causes considerable difficulties for localizing this kind of bug in SPL systems. In the rest of this paper, variability bugs are our focus, and we simply call a system containing variability bugs a buggy (SPL) system.

Despite the importance of variability fault localization, the existing fault localization (FL) approaches [wong2016survey, arrieta2018spectrum, pearson2017evaluating] are not designed for this kind of bug. These techniques are specialized for finding bugs in a particular product. For instance, to isolate the bugs causing failures in multiple products of a single SPL system, the slice-based methods [static_slicing, dynamic_slicing, wong2016survey] could be used to identify all the failure-related slices for each product independently of the others. Consequently, there are multiple sets, each containing a large number of isolated statements, that need to be examined to find the bugs. This makes slice-based methods [wong2016survey] impractical for SPL systems.

In addition, the state-of-the-art technique, Spectrum-Based Fault Localization (SBFL) [pearson2017evaluating, keller2017critical, naish2011model, abreu2009spectrum, abreu2007accuracy], can be used to calculate the suspiciousness scores of code statements based on the test information (i.e., program spectra) of each product of the system separately. For each product, it produces a ranked list of suspicious statements. As a result, there might be multiple ranked lists produced for a single SPL system that fails due to variability bugs. From these multiple lists, developers cannot determine a starting point to diagnose the root causes of the failures. Hence, it is inefficient to find variability bugs by using SBFL to rank suspicious statements in multiple variants separately.

Another method to apply SBFL for localizing variability bugs in an SPL system is to treat the whole system as a single program [ourdataset]. This means that the mechanism controlling the presence/absence of the features in the system (e.g., the preprocessor directives #ifdef) is considered as the corresponding conditional if-then statements during the localization process. With this adaptation of SBFL, a single ranked list of statements can be produced for variability bugs according to the suspiciousness score of each statement. Note that we consider product-based testing [do2012strategies, thum2014classification]. Specifically, each product is tested individually with its own test set. Additionally, a test, which is designed to test a feature in domain engineering, is concretized into multiple test cases according to the products' requirements in application engineering [do2012strategies]. Using this adaptation, the suspiciousness score of a statement is measured based on the total numbers of passed and failed tests executed by it in all the tested products. Meanwhile, characteristics such as the interactions between system features and the variability of failures among products are also useful for isolating and localizing variability bugs in SPL systems. However, this important information is not utilized by the existing approaches.
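To make this adaptation concrete, the sketch below applies one SBFL formula (Ochiai, used here only as an example metric) to pass/fail counts aggregated over the test suites of all sampled products. It is a minimal illustration in Java under the assumptions stated in the comments; the class and parameter names are ours, not part of VarCop or the benchmark.

import java.util.*;

// Hedged sketch of the SBFL adaptation described above: the SPL system is treated as a
// single program, and a statement's pass/fail counts are summed over the tests of all
// sampled products before an SBFL formula (Ochiai here) is applied.
class AggregatedSbfl {
    // aggregated.get(stmtId) = {passedTestsExecutingStmt, failedTestsExecutingStmt},
    // summed over every sampled product; totalFailed = total number of failed tests.
    static Map<String, Double> score(Map<String, long[]> aggregated, long totalFailed) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, long[]> e : aggregated.entrySet()) {
            long ep = e.getValue()[0];   // passed tests executing the statement
            long ef = e.getValue()[1];   // failed tests executing the statement
            double denom = Math.sqrt((double) totalFailed * (ef + ep));
            scores.put(e.getKey(), denom == 0 ? 0.0 : ef / denom);   // Ochiai formula
        }
        return scores;                   // higher score = more suspicious
    }
}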

In this paper, we propose VarCop, a novel fault localization approach for variability bugs. Our key idea in VarCop is that variability bugs are localized based on (i) the interaction among the features which are necessary to reveal the bugs, and (ii) the bugginess exposure which is reflected via both the overall test results of the products and the detailed result of each test case in the products.

Particularly, for a buggy SPL system, VarCop detects sets of features which need to be selected on/off together to make the system fail, by analyzing the overall test results (i.e., the state of passing all tests or failing at least one test) of the products. We call each of these sets of feature selections a Buggy Partial Configuration (Buggy PC). Then, VarCop analyzes the interaction among the features in these Buggy PCs to isolate the suspicious statements. In VarCop, the suspiciousness of each isolated statement is assessed based on two criteria. The first criterion is based on the overall test results of the products containing the statement: the more failing products and the fewer passing products in which the statement appears, the more suspicious the statement is. Meanwhile, the second criterion is assessed based on the suspiciousness of the statement in the failing products which contain it. Specifically, in each failing product, the statement's suspiciousness is measured based on the detailed results of the product's test cases. The idea is that if the statement is more suspicious in the failing products based on their detailed test results, the statement is also more likely to be buggy in the whole system.

We conducted experiments to evaluate VarCop in both single-bug and multiple-bug settings on a dataset of 1,570 versions (cases) containing variability bug(s) [ourdataset]. We compared VarCop with the state-of-the-art approaches including SBFL [pearson2017evaluating, keller2017critical, naish2011model, abreu2009spectrum, abreu2007accuracy], the combination of the slicing method and SBFL (S-SBFL) [chaleshtari2020smbfl, li2020more], and Arrieta et al. [arrieta2018spectrum], using the 30 most popular SBFL ranking metrics [keller2017critical, naish2011model, pearson2017evaluating].

For the cases containing a single incorrect statement (single-bug), our results show that VarCop significantly outperformed S-SBFL, SBFL, and Arrieta et al. [arrieta2018spectrum] on all 30/30 metrics, by 33%, 50%, and 95% in Rank, respectively. Impressively, VarCop correctly ranked the bugs at the top-3 positions in more than 65% of the cases. In addition, VarCop effectively ranked the buggy statements first in about 30% of the cases, which doubles the corresponding figure of SBFL.

For localizing multiple incorrect statements (multiple-bug), after inspecting the first statement in the ranked list produced by VarCop, up to 10% of the bugs in a system can be found, which is 2 times and 10 times better than S-SBFL and SBFL, respectively. Especially, our results also show that in 22% and 65% of the cases, VarCop effectively localized at least one buggy statement of a system at the top-1 and top-5 positions, respectively. From that, developers can iterate the process of bug detection, bug fixing, and regression testing to quickly fix all the bugs and assure the quality of SPL systems.

In brief, this paper makes the following contributions:

  1. A formulation of Buggy Partial Configuration (Buggy PC) where the interaction among the features in the Buggy PC is the root cause of the failures caused by variability bugs in SPL systems.

  2. VarCop: A novel effective approach/tool to localize variability bugs in SPL systems.

  3. An extensive experimental evaluation showing the performance of VarCop over the state-of-the-art methods.

2 Motivating Example

In this section, we illustrate the challenges of localizing variability bugs and motivate our solution via an example.

2.1 An Example of Variability Bugs in SPL Systems

 1  int maxWeight = 2000, weight = 0;
 2  int maxPersons = 20;
 3  //#ifdef Empty
 4  void empty(){ persons.clear(); }
 5  //#endif
 6  void enter(Person p){
 7      persons.add(p);
 8      //#ifdef Weight
 9      weight += p.getWeight();
10      //#endif
11  }
12  void leave(Person p){
13      persons.remove(p);
14      //#ifdef Weight
15      weight -= p.getWeight();
16      //#endif
17  }
18  ElevState stopAtAFloor(int floorID){
19      ElevState state = Elev.openDoors;
20      boolean block = false;
21      for (Person p : new ArrayList<Person>(persons))
22          if (p.getDestination() == floorID)
23              leave(p);
24      for (Person p : waiting)    enter(p);
25      //#ifdef TwoThirdsFull
26      if ((weight == 0 && persons.size() >= maxPersons*2/3) || weight >= maxWeight*2/3)
27          block = true;
28      //#endif
29      //#ifdef Overloaded
30      if (block == false){
31          if ((weight == 0 && persons.size() >= maxPersons) || weight == maxWeight)
32              //Patch: weight >= maxWeight
33              block = true;
34      }
35      //#endif
36      if (block == true)
37          return Elev.blockDoors;
38      return Elev.closeDoors;
39  }
Fig. 1: An example of a variability bug in the Elevator System

Fig. 1 shows a simplified variability bug in the Elevator System [ourdataset]. The overall test results of the sampled products are shown in Table I. In Fig. 1, the bug (incorrect statement) at line 31 causes failures in the two highlighted products in Table I.

The Elevator System is expected to simulate an elevator and consists of 5 features: Base, Empty, Weight, TwoThirdsFull, and Overloaded. Specifically, Base is the mandatory feature implementing the basic functionalities of the system, while the others are optional. TwoThirdsFull is expected to limit the load to at most 2/3 of the elevator's capacity, while Overloaded ensures the maximum load is the elevator's capacity.

However, the implementation of Overloaded (lines 30–34) does not behave as specified. If the total loaded weight (weight) of the elevator is tracked, then instead of blocking the elevator when weight exceeds its capacity (weight >= maxWeight), the actual implementation blocks the elevator only when weight is exactly equal to maxWeight (line 31). Consequently, if Weight and Overloaded are on (and TwoThirdsFull is off), then even when the total loaded weight is greater than the elevator's capacity, block remains false and the elevator still dangerously works without blocking the doors (lines 37–39).

Base  Empty  Weight  TwoThirdsFull  Overloaded
T     F      T       F              F
T     T      T       F              F
T     T      F       F              F
T     F      T       T              F
T     F      T       T              T
T     T      T       F              T     (failing)
T     F      T       F              T     (failing)

  • Each row shows the configuration of one sampled product.

  • The last two products (highlighted in gray in the original table) fail at least one test (failing products). The other products pass all their tests (passing products).

TABLE I: The sampled products and their overall test results

This bug (line 31) is variable (a variability bug). It is revealed not in all the sampled products, but only in the two failing products (Table I), due to the interaction among Weight, Overloaded, and TwoThirdsFull. Specifically, the behavior of Overloaded, which sets the value of block at line 33, is interfered with by TwoThirdsFull when both of them are on (lines 27 and 30). Moreover, the incorrect condition at line 31 can be exposed only when Weight = T, TwoThirdsFull = F, and Overloaded = T, as in the two failing products. Hence, understanding the root cause of the failures to localize the variability bug could be very challenging.

2.2 Observations

For an SPL system containing variability bugs, there are certain features that are (ir)relevant to the failures [98bugs, mordahl2019empirical, garvin2011feature]. In Fig. 1, enabling or disabling the feature Empty does not affect the failures. Indeed, some products still fail, and others still pass their test cases, regardless of whether Empty is on or off. Meanwhile, there are several features which must be enabled/disabled together to make the bugs visible. In other words, for certain products, changing their current configurations by switching the current selection of any one of these relevant features makes the resulting products' overall test results change. For TwoThirdsFull, switching its current off-selection in a failing product makes the resulting product behave as expected (Table I). The reason is that, in the resulting product, the presence of Weight and TwoThirdsFull impacts Overloaded, and consequently Overloaded does not expose its incorrectness. In fact, this characteristic of variability bugs has been confirmed by the public datasets of real-world variability bugs [98bugs, mordahl2019empirical]. For example, in VBDb [98bugs], there are 41/98 bugs revealed by a single feature, and the remaining 57/98 bugs involve 2–5 features. The occurrence condition of these bugs is relevant to both enabled and disabled features. Particularly, 49 bugs occur only when all of the relevant features are enabled, while the other half of the bugs require that at least one relevant feature is disabled.

In fact, the impact of features on each other is their feature interaction [apel2013exploring, garvin2011feature]. The presence of a feature interaction makes certain statements behave differently from when the interaction is absent. For variability bugs, the presence of the interaction among the relevant features (without loss of generality, when there is only one relevant feature, the feature can impact and interact with itself) exposes/conceals the bugginess of the statements that cause the unexpected/expected behaviors of the failing/passing products [98bugs, garvin2011feature]. Thus, to localize variability bugs, it is necessary to identify such sets of relevant features as well as the interaction among the features in each set, which is the root cause of the failures in the failing products.

In an SPL system containing variability bugs, there might be multiple such sets of relevant features. Let us consider a particular set of relevant features.

O1. For the features which must be enabled together to make the bugs visible, the statements that implement the interaction among these features provide valuable suggestions to localize the bugs. For instance, in Fig. 1, the set of features which must be enabled together consists of Weight and Overloaded, and the implementation of the interaction between these features contains the buggy statement at line 31 in Overloaded. This statement uses the variable weight defined/updated by the feature Weight (lines 9 and 15). Hence, detecting the statements that implement the interaction among these features could provide valuable indications to localize the variability bugs in the system.

O2. Moreover, for the features which must be disabled together to reveal the bugs, the statements that would impact the interaction among the enabled relevant features (if all the relevant features were on) also provide useful indications to help us find the bugs. In Fig. 1, although the statements at lines 26–27 in TwoThirdsFull (being disabled) are not buggy, analyzing the impact of these statements on the interaction between Overloaded and Weight can provide suggestions to identify the buggy statement. The intuition is that the disabled relevant features "hide" the bugs when they are enabled. In this example, when Weight, TwoThirdsFull, and Overloaded are all on, if the loaded weight exceeds maxWeight*2/3 (i.e., the condition at line 26 is satisfied), then block = true, and the statements at lines 31–33 (in Overloaded) cannot be executed. As a result, the impact of the incorrect condition at line 31 is "hidden". Thus, we should consider the impact of the disabled relevant features (as if they were on) on the interaction among the enabled relevant features when localizing variability bugs.

O3. For a buggy SPL system, a statement can appear in both failing and passing products. Meanwhile, the passing or failing states of the products expose the bugginess of the contained buggy statements. Thus, the overall test results of the sampled products can be used to measure the suspiciousness of the statements. Furthermore, the bugginess of an incorrect statement can be exposed via the detailed (case-by-case) test results in every failing product containing the bug. In our example, the buggy statement at line 31 is contained in the two failing products, and its bugginess is expressed differently via the detailed test results in each of them. Thus, to holistically assess the suspiciousness of a statement, its score should also reflect the statement's suspiciousness via the detailed test results in every failing product to which it contributes.

Among these observations, O1 and O2 will be theoretically discussed in Section 4.2. We also empirically validated observation O3 in our experimental results (Section 9).

Fig. 2: VarCop’s Overview

2.3 VarCop Overview

Based on these observations, we propose VarCop, a novel variability fault localization approach. For a given buggy SPL system, the input of VarCop consists of a set of the tested products and their program spectra. VarCop outputs a ranked list of suspicious statements in three steps (Fig. 2):

  1. First, by analyzing the configurations and the overall test results of the sampled products, VarCop detects minimal sets of features whose selections (the states of being on/off) make the bugs (in)visible. Let us call such a set of selections a Buggy Partial Configuration (Buggy PC). In Fig. 1, {Weight = T, TwoThirdsFull = F, Overloaded = T} is a Buggy PC.

  2. Next, for each failing product, VarCop aims to isolate the suspicious statements which are responsible for implementing the interaction among the features in each detected Buggy PC. Specifically, the feature interaction implementation is the set of statements which these features use to impact each other. For example, in a failing product, VarCop analyzes its code to detect the implementation of the interaction among Weight, Overloaded, and TwoThirdsFull (O1 and O2); this interaction implementation includes the statements at lines 9, 15, and 31. Intuitively, all the statements in the product which have an impact on these statements or are impacted by them are also suspicious.

  3. Finally, the suspicious statements are ranked by examining how their suspiciousness is exposed in both the overall test results of the containing products (product-based suspiciousness assessment) and these products' detailed case-by-case test results (test case-based suspiciousness assessment). Particularly, for each isolated statement, the product-based assessment is calculated based on the numbers of the passing and failing products containing the statement. Meanwhile, the test case-based suspiciousness is assessed by aggregating the suspiciousness scores of the statement in the failing products, which are calculated based on the detailed results of the tests executed by the statement (O3).

3 Concepts and Problem Statement

A software product line is a product family that consists of a set of products sharing a common code base. These products are distinguished from one another in terms of their features [SPLBook].

Definition 3.1.

(Software Product Line System). A Software Product Line (SPL) system is a 3-tuple consisting of the following components:

  • a set of code statements that are used to implement the system;

  • a set of the features of the system, where a feature selection of a feature is the state of that feature being either enabled (on) or disabled (off);

  • a feature implementation function which, for each feature, refers to the set of statements implementing that feature; this implementation is included in the products where the feature is on.

A set of the selections of all the features defines a configuration. Any non-empty subset of a configuration is called a partial configuration. A configuration specifies a single product, which is the composition of the implementation of all its enabled features. For example, the configuration in the first row of Table I specifies the product composed of the implementation of Base and Weight.

We consider the set of all the possible valid configurations of the system and the set of all corresponding products. In practice, only a subset of the configurations (with the corresponding products) is sampled for testing and finding bugs. Unlike non-configurable code, bugs in SPL systems can be variable and only cause failures in certain products. In Table I, the bug at line 31 only causes failures in the two highlighted products. This bug is a variability bug. In the rest of this paper, we simply call an SPL system containing variability bugs a buggy system. In this work, we localize bugs at the statement level [ieee1990ieee], which is the granularity widely adopted in the existing studies [pearson2017evaluating, zhang2017boosting, abreu2007accuracy, wen2019historical].

Definition 3.2.

(Variability Bug). Given a buggy SPL system and a set of its products sampled for testing, a variability bug is an incorrect code statement of the system that causes unexpected behaviors (failures) in a non-empty strict subset of the sampled products.

In other words, the system contains variability bugs if and only if the sampled products are categorized into two separate non-empty sets based on their test results: the passing products and the failing products, corresponding to the passing configurations and the failing configurations, respectively. Every passing product passes all its tests, while each failing product fails at least one test. These two sets are disjoint and together cover all the sampled products. In Table I, the last two (highlighted) products are the failing products, and the remaining five are the passing ones. Besides the test results, the statements of every product executed by each test are also recorded to localize bugs. This execution information is called the program spectra [harrold1998empirical].
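For concreteness, the following Java sketch shows one possible in-memory representation of a product's program spectra (for each test, the executed statements and the verdict). The class and field names are assumptions for illustration, not the format used by the benchmark.

import java.util.*;

// Illustrative representation of the program spectra of a single product.
class ProductSpectra {
    static final class TestRecord {
        final Set<String> executedStatements;   // ids of statements covered by the test
        final boolean passed;                   // verdict of the test
        TestRecord(Set<String> executed, boolean passed) {
            this.executedStatements = executed;
            this.passed = passed;
        }
    }

    final Map<String, TestRecord> tests = new HashMap<>();   // test name -> record

    // A product is failing if at least one of its tests fails.
    boolean isFailingProduct() {
        return tests.values().stream().anyMatch(t -> !t.passed);
    }
}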

From the above basic concepts, the problem of variability fault localization is defined as follows:

Definition 3.3.

(Variability Bug Localization). Given a 3-tuple consisting of:

  • a system containing variability bugs,

  • the set of sampled products of the system, partitioned into the sets of passing and failing products, and

  • the set of the program spectra of the sampled products,

variability bug localization is to output the list of the statements of the system ranked based on their suspiciousness of being buggy.

4 Feature Interaction

We introduce our feature interaction formulation used in this work to analyze the root cause of variability bugs.

4.1 Feature Interaction Formulation

Different kinds of feature interactions have been discussed in the literature [apel2013exploring, garvin2011feature, ase19prioritization, soares2018varxplorer]. In this work, we formulate feature interaction based on the impacts of a feature on other features. Specifically, for a set of features in a product, a feature can interact with the others in two ways: (i) directly impacting the others' implementation and (ii) indirectly impacting the others' behaviors via the statements which are impacted by all of them. For (i), there is a control/data dependency between the implementations of these features. For example, since the statement at line 26 (in TwoThirdsFull) is data-dependent on the statements at lines 9 and 15 (in Weight), there is an interaction between TwoThirdsFull and Weight in a product where both are enabled. For (ii), there is at least one statement which is control/data-dependent on a statement (or statements) of every feature in the set. For instance, Weight and TwoThirdsFull also interact by both impacting the statement at line 31, via the variables weight and block, respectively. As a result, when these features are all on in a product, each of them will impact the others' behaviors.

Without loss of generality, a statement can be considered to be impacted by that statement itself. Thus, for a set of enabled features in a product, there exists an interaction among these features if there is a statement which is impacted by the implementation of all the features in the set, regardless of whether that statement is used to implement these features or not. Formally, we define an impact function to determine the impact of a statement in a product.

Definition 4.1.

(Impact Function). Given a system, we define the impact function which, for a statement and a product, returns the set of the statements of the system that are impacted by that statement in that product. A statement s' belongs to the impact set of a statement s in product p if s' satisfies one of the following conditions:

  • s' is s itself (a statement is considered to impact itself), or

  • s' is data/control-dependent, directly or transitively, on s.

For our example, in a product where Weight, TwoThirdsFull, and Overloaded are all enabled, the impact set of the statement at line 9 includes the statements at lines 26 and 31, which use the variable weight. Note that if statement s is not in product p, then its impact set in p is empty.
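The sketch below illustrates one way such an impact set could be computed, as a transitive closure over a precomputed data/control dependence graph. The graph representation and the names are assumptions for illustration; the actual dependence analysis used by VarCop is not shown here.

import java.util.*;

// Hedged sketch of the impact computation in Definition 4.1: the statements impacted by
// a statement are collected by a worklist-based transitive closure over dependences.
class ImpactAnalysis {
    // dependents.get(x) = statements that are directly data/control-dependent on x
    static Set<String> impactedBy(String s, Map<String, Set<String>> dependents) {
        Set<String> impacted = new LinkedHashSet<>();
        Deque<String> worklist = new ArrayDeque<>();
        impacted.add(s);                       // a statement impacts itself
        worklist.push(s);
        while (!worklist.isEmpty()) {
            String cur = worklist.pop();
            for (String next : dependents.getOrDefault(cur, Set.of())) {
                if (impacted.add(next)) {      // follow dependences transitively
                    worklist.push(next);
                }
            }
        }
        return impacted;
    }
}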

Definition 4.2.

(Feature Interaction). Given a system, a product, and a set of features which are enabled in that product, the interaction among the features in the set exists if and only if there is at least one statement in the product which is impacted by the implementation of every feature in the set; that is, the intersection of the features' impacted-statement sets is non-empty, where the impacted-statement set of a feature is the set of the statements in the product impacted by any statement in that feature's implementation. This intersection is called the implementation of the interaction among the features in the product.

For the example in Fig. 1, the features Weight and Overloaded interact with each other in a product where both of them are enabled, and the implementation of the interaction includes the statements at lines 31–33 and 36–38. Note that, without loss of generality, a feature can impact and interact with itself.
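Based on this definition, the interaction check reduces to intersecting impact sets, as in the sketch below (which reuses the impactedBy helper from the previous sketch; again, the names are illustrative assumptions rather than VarCop's API).

import java.util.*;

// Sketch of the interaction check in Definition 4.2: the features of a set interact in a
// product iff the intersection of their impacted-statement sets is non-empty; that
// intersection is the implementation of the interaction.
class InteractionCheck {
    static Set<String> interactionImplementation(
            List<Set<String>> featureImpls,            // the statements implementing each feature
            Map<String, Set<String>> dependents) {     // dependence graph of the product
        Set<String> result = null;
        for (Set<String> impl : featureImpls) {
            Set<String> impactedByFeature = new HashSet<>();
            for (String stmt : impl) {                 // statements impacted by any statement of the feature
                impactedByFeature.addAll(ImpactAnalysis.impactedBy(stmt, dependents));
            }
            if (result == null) result = impactedByFeature;
            else result.retainAll(impactedByFeature);  // intersect across the features
        }
        return result == null ? Set.of() : result;     // non-empty => the features interact
    }
}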

4.2 The Root Cause of Variability Failures

In this section, we analyze and discuss the relation between variability failures in SPL systems and the enabling/disabling of system features. In a buggy SPL system, the variability bugs can be revealed by set(s) of relevant features which must be enabled and disabled together to make the bugs visible. For each set of relevant features, their selections might affect the visibility of the bugs in the system. For simplicity, we first analyze a buggy system containing a single set of such relevant features. The cases where multiple sets of relevant features are involved in the variability failures are discussed in the later part of this section.

Let us consider the cases where the failures of a system are revealed by a single set of relevant features, which consists of a group of features that must be enabled together and a group that must be disabled together to make the bugs visible. Specifically, these two groups must be respectively on and off in all the failing products. From a failing product, once the current selection of any switchable relevant feature is switched (in a configuration, a feature is switchable if, after switching its selection, the obtained configuration is still valid with regard to the system's feature model), the resulting product will pass all its tests. In this case, the interaction of the enabled relevant features propagates the impact of the buggy statements to the actual outputs, causing the failures in the failing products. The relation between the variability bugs (buggy statements) causing these failures and the interaction between the relevant features is theoretically discussed as follows.

Consider a failing product and the set of its buggy statements. From this product, disabling any enabled relevant feature produces a passing product. In this case, every buggy statement is either present in the new product or not. First, if a buggy statement is no longer present after disabling the feature, then the statement is a part of that feature's implementation. In the second case, the buggy statement is still present; due to the absence of the disabled feature, it behaves differently from the way it does (incorrectly) in the failing product, and the new product passes all its tests. This means that, in the failing product, the disabled feature has an impact on the buggy statement and/or on the statements impacted by it. In the failing product, the buggy statement and the statements it impacts together propagate their impacts to the unexpected outputs. Thus, any change to these statements can affect the bugs' visibility.

These two cases show that every incorrect statement in the failing product only exposes its bugginess in the presence of all the enabled relevant features. This demonstrates that these features must interact with each other in the failing product. Indeed, if there existed a feature among them which did not interact with the others, the incorrect behaviors of a buggy statement would only be impacted by either that feature or the interaction among the remaining features; as a result, the buggy statement would not require the presence of all of them to reveal its bugginess. Moreover, since each of these features has an impact on the behaviors of all the buggy statements, the interaction among them also has an impact on the behaviors of every buggy statement. In other words, the enabled relevant features interact with each other in the failing product, and the interaction implementation impacts the visibility of the failures caused by every buggy statement. Hence, the statements which implement the interaction among these features are valuable suggestions to localize the buggy statements. This theoretically confirms observation O1.

Similarly, from a failing product, turning on any of the disabled relevant features also produces a product which passes all its tests. This illustrates that, in the resulting product, the behaviors of every buggy statement are impacted by the presence of the newly enabled feature, and thus the incorrect behaviors cannot be exposed. Intuitively, that feature has impacts on the interaction implementation of the enabled relevant features as well as on the impact of this interaction on the buggy statements. In other words, the interaction involving the disabled relevant features (if enabled) impacts the behaviors of the buggy statements. Hence, investigating the interaction implementation of both the enabled and disabled relevant features (as if they were all enabled in a product) can provide useful indications to find the incorrect statements. This explains observation O2.

Overall, the interaction among the relevant features reveals/hides the bugs by impacting the buggy statements and/or the statements impacted by the buggy ones in the products. Illustratively, this interaction implementation propagates the impact of all the buggy statements to the outputs of the failed tests in a failing product. This means that the buggy statements are contained in the set of statements which are impacted by, or have impacts on, the interaction implementation of the relevant features. Thus, identifying the sets of relevant features whose interaction can make the variability bugs visible/invisible, and the implementation of that interaction, is necessary to localize variability bugs in SPL systems.

In general, variability bugs in a system can be revealed by multiple sets of relevant features. In these cases, the visibility of the bugs might not be clearly observed by switching the selections of features in only one set of relevant features. For instance, suppose there are two sets of relevant features whose interactions cause the failures in the system. Once the selection of any feature in the first set is switched, the implementation of the interaction among the features of that set is no longer in the resulting product. Meanwhile, the resulting product can still contain the interaction among the features of the second set, and thus can still fail some tests. However, if we can identify one or both of these sets, or even their subsets, the bugs can be isolated by examining the interaction among the identified set(s) of features. More details will be described and proved in the next section.

In spite of the importance of the relation between variability failures and relevant features, this information is not utilized by existing approaches such as SBFL and S-SBFL. Consequently, their resulting suspicious spaces are often large. Meanwhile, Arrieta et al. [arrieta2018spectrum] adapt SBFL to localize bugs at the feature level. Particularly, each sampled product is considered as a test (i.e., passed tests are passing products, and failed tests are failing products), and the spectra record the feature selections in each product. However, in that approach SBFL is used to localize the buggy features. By this method, all the statements in the same feature have the same suspiciousness level. Thus, this approach could be ineffective for localizing variability bugs at the statement level. This will be empirically illustrated in our results (Section 9).

5 Buggy Partial Configuration Detection

In this section, we introduce the notions of Buggy Partial Configuration (Buggy PC) and Suspicious Partial Configuration (Suspicious PC). Specifically, Buggy PCs are the partial configurations whose interactions among the corresponding features are the root causes making variability bugs visible in a buggy system. In general, Buggy PCs can be detected only after testing all the possible products of the system. However, verifying all those products is nearly impossible in practice. Meanwhile, Suspicious PCs are the detected suspicious candidates for the Buggy PCs, which can be practically computed using the sampled products.

5.1 Buggy Partial Configuration

For a buggy system, where the set of all possible configurations is categorized into the non-empty sets of passing and failing configurations, a Buggy Partial Configuration (Buggy PC) is a minimal set of feature selections that makes the bugs visible in the products. In Fig. 1, the only Buggy PC is {Weight = T, TwoThirdsFull = F, Overloaded = T}.

Definition 5.1.

(Buggy Partial Configuration (Buggy PC)). Given a buggy system, a buggy partial configuration is a set of feature selections of the system that has both of the following properties:

  • Bug-Revelation. Any valid configuration containing the set corresponds to a failing product.

  • Minimality. No strict subset of the set satisfies the Bug-Revelation property.

Bug-Revelation. This property is equivalent to requiring that no passing configuration contains the set. Indeed, for a set of feature selections, if there exists a passing configuration containing it, the interaction among the corresponding features in that passing product cannot be the root cause of any variability bug, because no unexpected behavior is caused by this interaction in the passing product (assuming that the test suite of each product is effective in detecting bugs, i.e., the buggy products must fail). Hence, investigating the interaction between these features might not help us localize the bugs. For example, {Weight = T, Overloaded = T} is a subset of the failing configurations in Table I; however, it is not considered a Buggy PC, because it is also a subset of the fifth configuration, which is passing. Indeed, in every product, the interaction between Weight and Overloaded alone, which does not cause any failure, should not be investigated to find the bug. Thus, to guarantee that the interaction among the features in a Buggy PC is the root cause of variability bugs, the set of feature selections needs to have the Bug-Revelation property.

Minimality. If a set holds the Bug-Revelation property but is not minimal, then there exists a strict subset of it that also has the Bug-Revelation property. However, for any product whose configuration contains both the set and its subset, to detect all the bugs related to either of them, the smaller one should still be examined rather than the larger one. The reason is that, in such a product, the bugs related to both sets are all covered by the implementation of the interaction among the features in the smaller set.

Particularly, consider a set of feature selections and a strict subset of it, each consisting of a group of enabled and a group of disabled features. Since the subset's enabled features form a subset of the larger set's enabled features, the implementation of the interaction among the larger set's enabled features is included in the implementation of the interaction among the subset's enabled features (intersecting the impacted-statement sets of fewer features yields a larger, i.e., more inclusive, set of statements). As a result, the incorrect statements related to the larger set are all included in those related to the subset. Similarly, for the disabled features, the set of statements related to all the features of the subset also includes the statements related to all the features of the larger set. Consequently, by identifying the interaction implementation of the features in the smaller set, the bugs related to both sets can all be found.

Furthermore, if both sets are related to the same bug(s), the larger set could contain bug-irrelevant feature selections which can negatively affect the FL effectiveness. For example, both an entire failing configuration in Table I and its subset {Weight = T, TwoThirdsFull = F, Overloaded = T} have the Bug-Revelation property. Nevertheless, the interaction among Weight, TwoThirdsFull, and Overloaded, which indeed causes the failures, should be investigated instead of the interaction among the features in the entire configuration, which contains bug-irrelevant selections such as the selection of Empty. As a result, Buggy PCs need to be minimal.

Buggy PC Detection Requirement. All possible configurations of a system are very rarely available for the QA process. Thus, in a buggy system, the sampled set is used to detect the Buggy PCs. Assume that an FL approach detects a set of candidates for the Buggy PCs. For each candidate, all the statements which implement the interaction among its features and the statements which have an impact on that interaction implementation (Section 4.2) are suspicious. Thus, for a Buggy PC, to avoid missing the related buggy statements, the FL method must ensure that they are covered by the suspicious statements identified from at least one candidate: there must be at least one detected candidate which is a subset of the Buggy PC. Let us call this the effectiveness requirement.

Indeed, if there exists a detected candidate which is a subset of a Buggy PC, then in a product whose configuration contains the Buggy PC (and thus also contains the candidate), the interaction implementation of the candidate's features covers all the statements implementing the interaction of the Buggy PC's features. As a result, the suspicious statement set of the candidate contains both this interaction implementation and the statements which have an impact on it. In other words, the suspicious statement set of the candidate contains the buggy statements. Hence, to guarantee the effectiveness in localizing variability bugs, we aim to detect a set of candidates for the Buggy PCs which satisfies the effectiveness requirement.

5.2 Important Properties to Detect Buggy PC

In practice, a system may have a huge number of possible configurations. Consequently, only a subset of them is sampled for testing and debugging. A set of feature selections which has both the Bug-Revelation and Minimality properties on the sampled set is intuitively suspicious to be a Buggy PC. Let us call these sets of selections Suspicious Partial Configurations (Suspicious PCs).

For example, in Fig. 1, {TwoThirdsFull = F, Overloaded = T} is a Suspicious PC. All the sampled configurations containing it (the two failing configurations in Table I) are failing. Additionally, its strict subsets do not hold Bug-Revelation on the sampled set: {Overloaded = T} is contained in the fifth configuration and {TwoThirdsFull = F} is contained in the first three configurations, which are passing. Thus, it is a minimal set which holds the Bug-Revelation property on the sampled configurations.

Theoretically, the Suspicious PCs of a failing configuration can be detected by examining all of its subsets to identify the sets satisfying both Bug-Revelation and Minimality; the number of sets that would need to be examined is exponential in the number of features. However, not every selection in a failing configuration participates in Suspicious PCs. Hence, to detect Suspicious PCs efficiently, we aim to identify the set of selections of the failing configuration (its Suspicious Feature Selections) which potentially participate in one or more Suspicious PCs. Then, instead of inefficiently examining all the possible subsets of the failing configuration, only the subsets of this (usually much smaller) set are inspected to identify Suspicious PCs.

Particularly, in a failing configuration, there exist selections such that switching their current states (from on to off, or vice versa) results in a passing configuration. In other words, the bugs visible in the failing product are invisible in the resulting product, which passes all its tests. Intuitively, each of these selections might be relevant to the visibility of the bugs and can be considered a Suspicious Feature Selection. Thus, a selection which is in a failing configuration yet not in a passing one is suspicious with respect to the visibility of the bugs.

Definition 5.2.

(Suspicious Feature Selection (SFS)). For a failing configuration, a feature selection of that configuration is suspicious if it is not present in at least one passing configuration.

For example, the SFSs of a failing configuration are the selections in the set differences between that configuration and the passing configurations; in Table I, the difference between the last failing configuration and the first passing one is {Overloaded = T}. Intuitively, each such set difference must contain a part of every Buggy PC of the failing configuration; otherwise, the passing configuration would contain a Buggy PC and fail some tests. Hence, for any failing configuration, we have the following property about the relation between its set of Buggy PCs and its difference from a passing configuration.
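A sketch of this computation is shown below: the SFSs of a failing configuration are obtained as the union of its set differences from the passing configurations. A configuration is modeled here as a map from feature name to its on/off state; this representation is an assumption for illustration.

import java.util.*;

// Hedged sketch: Suspicious Feature Selections of a failing configuration.
class SfsComputation {
    static Set<Map.Entry<String, Boolean>> sfs(Map<String, Boolean> failing,
                                               List<Map<String, Boolean>> passings) {
        Set<Map.Entry<String, Boolean>> result = new HashSet<>();
        for (Map<String, Boolean> passing : passings) {
            for (Map.Entry<String, Boolean> sel : failing.entrySet()) {
                // a selection is suspicious if at least one passing configuration
                // does not contain it (the feature has a different state there)
                if (!Objects.equals(passing.get(sel.getKey()), sel.getValue())) {
                    result.add(Map.entry(sel.getKey(), sel.getValue()));
                }
            }
        }
        return result;
    }
}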

Property 5.1.

Given a failing configuration, the difference between this configuration and any passing configuration contains a part of every Buggy PC of the failing configuration.

The intuition is that, for a failing configuration and a passing one, each Buggy PC of the failing configuration lies either entirely in their common part or at least partly in their difference. Since the second configuration passes, the difference must contain a part of every Buggy PC of the failing configuration. Otherwise, there would exist a Buggy PC contained in both configurations, and the passing configuration would have to be a failing one (by the Bug-Revelation property), which is impossible.

For a failing configuration, we consider the set of all its suspicious feature selections. To detect the Buggy PCs of the configuration, we identify all subsets of this set which satisfy the Bug-Revelation and Minimality properties with regard to the sampled configurations. The following property demonstrates that our method maintains the effectiveness requirement in detecting Buggy PCs (Section 5.1).

Property 5.2.

Given a failing configuration, for any of its Buggy PCs, there exists a subset of the configuration's suspicious feature selections which satisfies the Bug-Revelation condition on the sampled set and is also a subset of that Buggy PC.

Proof.

Consider a Buggy PC B of a failing configuration. By Property 5.1, for each passing configuration, the difference between the failing configuration and that passing configuration contains a non-empty part of B. Let B' be the union, over all the passing configurations, of these parts of B. Then B' is a subset of B and, since each of its selections is absent from at least one passing configuration, B' is also a subset of the configuration's suspicious feature selections. Moreover, for every passing configuration, the part of B taken from its difference is non-empty and, by construction, not contained in that passing configuration; hence no passing configuration contains B'. As a result, B', which is a subset of both B and the SFSs, satisfies the Bug-Revelation property on the sampled set.

As a result, for any Buggy PC of a failing configuration, there always exists a common subset of that Buggy PC and the configuration's SFSs which satisfies the Bug-Revelation property on the sampled set. Hence, according to the effectiveness requirement (Section 5.1), detecting Buggy PCs by examining the subsets of the SFSs is effective for localizing the variability bugs. Furthermore, as the SFS set only contains the differences between the failing configuration and the passing configurations, it is usually much smaller than the configuration itself. For example, in Table I, the SFSs of each failing configuration consist of only four of its five selections (all but the mandatory Base = T). Thus, the SFSs should be used to detect Buggy PCs rather than the whole configuration.

Note that both Suspicious PCs and Buggy PCs hold the Bug-Revelation property on the given sampled set of configurations. However, because some (passing) configurations of the whole configuration space are not in the sampled set, the sample may fail to show that certain selections must be in a Buggy PC. Hence, Buggy PCs might not hold the Minimality property on the sampled set. In Fig. 1, a Buggy PC is {Weight = T, TwoThirdsFull = F, Overloaded = T}. However, it does not satisfy Minimality on the set of the available configurations (Table I), because its subset {TwoThirdsFull = F, Overloaded = T} satisfies Bug-Revelation on this set. The reason is that Table I does not contain any product whose configuration has Weight = F together with TwoThirdsFull = F and Overloaded = T; such a (passing) product would show that Weight = T is a necessary part of the Buggy PC.

5.3 Buggy PC Detection Algorithm

Algorithm 1 describes our algorithm to detect Buggy PCs and return the Suspicious PCs of a buggy system, given the sets of the passing and failing configurations.

1 Procedure DetectBuggyPCs(, )
2       for  do
3             for  do
4                  
5             end for
6            for  do
7                   for ( do
8                         if   then
9                              
10                         end if
11                        
12                   end for
13                  
14             end for
15            
16       end for
17      return
18
Algorithm 1 Buggy PC Detection Algorithm

In Algorithm 1, all the Suspicious PCs of the system are collected from the Suspicious PCs identified in each failing configuration (lines 3–17). First, the set of the suspicious selections of the failing configuration is computed (lines 4–8): the differences between the failing configuration and all the passing configurations are gathered and stored (lines 5–8).

Next, the Suspicious PCs of the failing configuration are the subsets of its suspicious selections which have both Bug-Revelation and Minimality with respect to the sampled configurations (lines 12–14). Each candidate, a set of feature selections, is checked against these two properties (line 12). In Algorithm 1, the examined subsets have a maximum size of k (lines 9–10); in other words, the considered interactions are up to k-way. In practice, most bugs are caused by the interactions of fewer than 6 features [garvin2011feature, kuhn2004software], so one should set k to a small value (e.g., at most 6) to ensure efficiency. Specifically, the subset-enumeration function (line 10) returns all the subsets of a given size. Note that if a set is already a Suspicious PC, then any superset of it cannot be a Suspicious PC (it would violate Minimality). Thus, all the supersets of the identified Suspicious PCs can be eliminated early (line 11).
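The sketch below outlines this detection procedure. It follows the description above rather than reproducing Algorithm 1 line by line; all names are illustrative assumptions. A feature selection is modeled as a string such as "Overloaded=T", and a configuration as the set of its selections.

import java.util.*;

// Hedged sketch of Buggy PC detection: for each failing configuration, compute its SFSs,
// then enumerate subsets up to size k and keep those satisfying Bug-Revelation and
// Minimality on the sampled configurations.
class BuggyPcDetector {
    static Set<Set<String>> detect(List<Set<String>> failing, List<Set<String>> passing, int k) {
        Set<Set<String>> suspiciousPCs = new LinkedHashSet<>();
        for (Set<String> cf : failing) {
            Set<String> sfs = new HashSet<>();
            for (Set<String> cp : passing) {           // gather the differences cf \ cp
                Set<String> diff = new HashSet<>(cf);
                diff.removeAll(cp);
                sfs.addAll(diff);
            }
            List<String> selections = new ArrayList<>(sfs);
            for (int size = 1; size <= Math.min(k, selections.size()); size++) {
                for (Set<String> candidate : subsetsOfSize(selections, size)) {
                    // smaller subsets are enumerated first, so any candidate that is a
                    // superset of an already-found Suspicious PC is skipped (Minimality)
                    boolean superset = suspiciousPCs.stream().anyMatch(candidate::containsAll);
                    if (!superset && bugRevelation(candidate, passing)) {
                        suspiciousPCs.add(candidate);
                    }
                }
            }
        }
        return suspiciousPCs;
    }

    // Bug-Revelation on the sample: no passing configuration contains the candidate.
    static boolean bugRevelation(Set<String> candidate, List<Set<String>> passing) {
        return passing.stream().noneMatch(cp -> cp.containsAll(candidate));
    }

    static List<Set<String>> subsetsOfSize(List<String> items, int size) {
        List<Set<String>> result = new ArrayList<>();
        backtrack(items, size, 0, new ArrayDeque<>(), result);
        return result;
    }

    private static void backtrack(List<String> items, int size, int start,
                                  Deque<String> current, List<Set<String>> out) {
        if (current.size() == size) { out.add(new HashSet<>(current)); return; }
        for (int i = start; i < items.size(); i++) {
            current.push(items.get(i));
            backtrack(items, size, i + 1, current, out);
            current.pop();
        }
    }
}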

In our example, from Table I, the detected Suspicious PCs are the two sets {TwoThirdsFull = F, Overloaded = T} and {Empty = T, Overloaded = T}.

6 Suspicious Statements Identification

For a buggy SPL system, the incorrect statements can be found by examining the statements which implement the interactions of the Buggy PCs as well as the statements impacting that implementation, as discussed in Section 4.2. Thus, all the statements which implement the interactions of the Suspicious PCs, and the statements impacting them, are considered suspicious. Consider a product and a Suspicious PC with its sets of enabled and disabled features. In the product, the interaction implementation of the Suspicious PC includes the statements implementing the interaction of the enabled features which can be impacted by the disabled features (if the disabled features were on in the product).

In practice, the disabled features of a Suspicious PC can be mutually exclusive with other features enabled in the product, as constrained by the feature model [SPLBook]. Thus, their impact on the implementation of the interaction of the enabled features might not be easily identified via control/data dependencies in the product. In this work, the impacts of the disabled features on the statements of the product are approximately identified by using def-use relationships of the variables and methods that are shared between the disabled features and the product [ase19prioritization]. Formally, for a statement, we consider the sets of variables/methods that it defines and uses, respectively.

Definition 6.1.

(Def-Use Impact). Given an SPL system and its set of sampled products, we define the def-use impact function which, for a feature and a product, refers to the set of the statements in the product that are impacted by some statement in the implementation of that feature via the variables/methods shared between the feature and the product. Formally, a statement s in the product belongs to this set if one of the following conditions is satisfied:

  • s uses a variable/method which is defined by a statement s' in the feature's implementation, or

  • s is data/control-dependent on a statement s' of the product, and s' already belongs to this set.

In summary, in a product, for a Suspicious PC with its sets of enabled and disabled features, the suspicious statements satisfy the following conditions: (i) they implement the interaction of the enabled and disabled features, or impact this implementation; and (ii) they are executed by the failed tests of the product. For a buggy system, the suspicious space contains all the suspicious statements detected for all Suspicious PCs in every failing product.

7 Suspicious Statements Ranking

To rank the isolated suspicious statements of a buggy system, VarCop assigns a score to each of these statements based on the program spectra of the sampled products. In VarCop, the suspiciousness of each statement is assessed based on two criteria/dimensions: Product-based Assessment and Test Case-based Assessment.

7.1 Product-based Suspiciousness Assessment

This criterion is based on the overall test results of the products containing the statement. Specifically, in a buggy system, a suspicious statement could be executed not only in the failing products but also in the passing products. Hence, from the product-based perspective, the (dis)appearances of the statement in the failing and passing products could be taken into account to assess its suspiciousness in the whole system. In general, the product-based suspiciousness of a statement can be derived from the numbers of passing and failing products which do or do not contain it. Intuitively, the more failing products and the fewer passing products contain the statement, the more suspicious it tends to be. This is similar to the idea of SBFL when considering each product as a test. Thus, in this work, we adopt SBFL metrics to determine the product-based suspiciousness of a statement. Specifically, for a particular SBFL metric, the value of this assessment is obtained by applying the metric's formula with the numbers of passing and failing products containing and not containing the statement in place of the numbers of passed and failed tests executed or not executed by it.
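As a minimal illustration, the snippet below applies one SBFL formula (Ochiai, chosen here only as an example) at the product granularity, with products playing the role of tests; any other SBFL metric could be plugged in the same way. The names are ours, not VarCop's API.

// Hedged sketch of the product-based assessment.
class ProductBasedScore {
    // failedContaining: number of failing products containing the statement
    // passedContaining: number of passing products containing the statement
    // totalFailing:     total number of failing products in the sample
    static double ochiaiOverProducts(int failedContaining, int passedContaining, int totalFailing) {
        double denom = Math.sqrt((double) totalFailing * (failedContaining + passedContaining));
        return denom == 0 ? 0.0 : failedContaining / denom;
    }
}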

7.2 Test Case-based Suspiciousness Assessment

The test case-based suspiciousness of a statement is evaluated based on the detailed results of the tests executed by the statement. Particularly, in each failing product containing the statement, it is locally assessed based on the program spectra of that product. Then, the local scores of the statement in the failing products are aggregated to form a single value which reflects its test case-based suspiciousness in the whole system.

Particularly, the local test case-based suspiciousness of a statement can be calculated by using existing FL techniques such as SBFL [pearson2017evaluating, keller2017critical, naish2011model, abreu2009spectrum, abreu2007accuracy]. In this work, we use a ranking metric of SBFL, which is the state-of-the-art FL technique, to measure the local test case-based suspiciousness of the statement in a failing product. Next, for a given metric, the aggregated test case-based suspiciousness of the statement is calculated from its local scores in all the failing products containing it. In general, any aggregation formula [bian1999comparing], such as the arithmetic mean, geometric mean, maximum, minimum, or median, can be used to aggregate the local scores.

However, the local test case-based scores of a statement, which are measured in different products, should not be directly aggregated. The reason is that the scores of the statement in different products might be incomparable. Indeed, with some ranking metrics such as Op2 [ding2013fault] or Dstar [wong2013dstar], once the numbers of tests of the products differ, the local scores of the statements in these products might lie in significantly different ranges. Intuitively, if these local scores were directly aggregated, the products with larger score ranges would have more influence on the suspiciousness score of the statement in the whole system. Meanwhile, such larger-score-range products are not necessarily more important in measuring the overall test case-based suspiciousness of the statement. Directly aggregating these incomparable local scores can result in an inaccurate suspiciousness assessment. Thus, to avoid this problem, the local scores in each product should be normalized into a common scale, e.g., [0, 1], before being aggregated. We will show the impact of the normalization, as well as of the choice of aggregation function and ranking metric, on VarCop's performance in Section 9.2.
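The sketch below illustrates this step: local scores are min-max normalized into [0, 1] within each failing product and then aggregated (the arithmetic mean is used here; any of the aggregation functions listed above could be substituted). The class and method names are illustrative assumptions.

import java.util.*;

// Hedged sketch of normalization and aggregation of local test case-based scores.
class TestCaseBasedScore {
    // Min-max normalize the local scores of all statements within one failing product.
    static Map<String, Double> normalize(Map<String, Double> localScores) {
        double min = Collections.min(localScores.values());
        double max = Collections.max(localScores.values());
        Map<String, Double> normalized = new HashMap<>();
        for (Map.Entry<String, Double> e : localScores.entrySet()) {
            double v = (max == min) ? 0.0 : (e.getValue() - min) / (max - min);
            normalized.put(e.getKey(), v);              // score scaled into [0, 1]
        }
        return normalized;
    }

    // Aggregate one statement's normalized local scores across the failing products containing it.
    static double aggregate(List<Double> normalizedLocalScores) {
        return normalizedLocalScores.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }
}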

7.3 Assessment Combination

Finally, the product-based and test case-based assessment scores of a statement are combined with a combination weight w to form a single suspiciousness score of the statement, by interpolating the two scores with weights w and (1 - w). Note that, to avoid the bias caused by the range difference between the two criteria, the two scores should be normalized into a common range, e.g., [0, 1], before the interpolation. In the ranking process, the isolated suspicious statements are ranked according to their interpolated suspiciousness scores. The impact of the combination weight on VarCop's fault localization performance will be empirically shown in Section 9.2.
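A sketch of the combination is shown below; both inputs are assumed to be already normalized into a common range, and w is the combination weight discussed above.

// Hedged sketch of the final score interpolation.
class CombinedScore {
    static double combine(double productBased, double testCaseBased, double w) {
        // weighted interpolation of the two normalized assessment scores, w in [0, 1]
        return w * productBased + (1 - w) * testCaseBased;
    }
}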

8 Empirical Methodology

To evaluate our variability fault localization approach, we seek to answer the following research questions:

RQ1: Accuracy and Comparison. How accurate is VarCop in localizing variability bugs? And how does it compare to the state-of-the-art approaches [arrieta2018spectrum, keller2017critical, wong2016survey]?

RQ2: Intrinsic Analysis. How do the components including the suspicious statement isolation, the normalization, the suspiciousness aggregation function, and the combination weight contribute to VarCop’s performance?

RQ3: Sensitivity Analysis. How do various factors, including the size of the sampled product set and the size of the test suite in each sampled product, affect VarCop's performance?

RQ4: Performance in Localizing Multiple Bugs. How does VarCop perform on localizing multiple variability bugs?

RQ5: Time Complexity. What is VarCop’s running time?

8.1 Dataset

To evaluate VarCop, we conducted several experiments on a large public dataset of variability bugs [ourdataset]. This dataset includes 1,570 buggy versions, with their corresponding tests, of six Java SPL systems which are widely used in SPL studies. There are 338 single-bug cases and 1,232 multiple-bug cases. The details are shown in Table II.

In the benchmark [ourdataset], to generate a large number of variability bugs, the bug generation process includes three main steps: Product Sampling and Test Generating, Bug Seeding, and Variability Bug Verifying. First, for an SPL system, a set of products is systematically sampled by the existing techniques [sampling_comparision]. To inject a fault into the system, a random modification is applied to the system’s original source code by using a mutation operator. Finally, each generated bug is verified against the condition in Definition 3.2 to ensure that the fault is a variability bug and caught by the tests. The detailed design decisions can be found in [ourdataset].

To the best of our knowledge, this is the only public dataset containing versions of SPL systems that fail due to variability bugs found through testing [ourdataset]. Indeed, Arrieta et al. [arrieta2018spectrum] also constructed a set of artificial bugs to evaluate their approach to localizing bugs in SPL systems. However, that dataset has not been made public. Besides, Abal et al. [98bugs] and Mordahl et al. [mordahl2019empirical] also published their datasets of real-world variability bugs. However, these datasets do not fit our evaluation well because the bugs are not provided along with corresponding test suites, and in fact, most of them are compile-time bugs. Specifically, these variability bugs were collected by (manually) investigating bug reports [98bugs] or applying static analysis tools [mordahl2019empirical].

Before running VarCop on the dataset proposed by Ngo et al. [ourdataset], we performed a simple inspection of each case to check whether the failures of the system could be caused by non-variability bugs. Naturally, some bugs may be classified as “variability” only because low-quality test suites fail to reveal them in some buggy products. We found 53/1,570 cases (19 single-bug cases and 34 multiple-bug cases) in which, among the sampled products, the product containing only the base feature and disabling all of the optional features fails several tests. These cases possibly contain non-variability bugs. We will discuss them in Section 9.2.

System            #LOC   #F   #P   Cov (%)   #SB   #MB
ZipMe             3460   13   25     42.9     55   249
GPL               1944   27   99     99.4    105   267
Elevator-FH-JML    854    6   18     92.9     20   102
ExamDB             513    8    8     99.5     49   214
Email-FH-JML       439    9   27     97.7     36    90
BankAccountTP      143    8   34     99.9     73   310

  • #F and #P: Numbers of features and sampled products.

  • Cov: Statement coverage (%).

  • #SB and #MB: Numbers of single-bug and multiple-bug cases.

TABLE II: Dataset Statistics

8.2 Evaluation Setup, Procedure, and Metrics

8.2.1 Empirical Procedure

Comparative Study. For each buggy version, we compare the performance in ranking buggy statements of VarCop, Arrieta et al. [arrieta2018spectrum], SBFL [pearson2017evaluating, keller2017critical, naish2011model, abreu2009spectrum, abreu2007accuracy], and the combination of a slicing method and SBFL (S-SBFL) [chaleshtari2020smbfl, li2020more]. For SBFL, each SPL system is treated as non-configurable code, and SBFL ranks all the statements executed by failed tests. For S-SBFL, to improve SBFL, all the executed failure-related statements in every failing product are isolated by slicing techniques [static_slicing] before ranking. A failure-related statement is a statement included in at least one program slice backward-sliced from a failure point in a failing product. In this experiment, we used the 30 most popular SBFL metrics [keller2017critical, naish2011model, pearson2017evaluating]. For each metric, we compared the performance of all four techniques using that metric.
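For reference, the sketch below computes two of the SBFL ranking metrics used in this comparison, Tarantula [taratula] and Ochiai [ochiai1957zoogeographic], from the standard per-statement coverage counters. It is a minimal sketch; the counter names (ef, ep, nf, np) and the class name are illustrative conventions, not identifiers from the paper.

public class SbflMetrics {
    // ef/ep: failed/passed tests that execute the statement;
    // nf/np: failed/passed tests that do not execute it.
    static double tarantula(int ef, int ep, int nf, int np) {
        double failRatio = (ef + nf) == 0 ? 0.0 : (double) ef / (ef + nf);
        double passRatio = (ep + np) == 0 ? 0.0 : (double) ep / (ep + np);
        double denom = failRatio + passRatio;
        return denom == 0 ? 0.0 : failRatio / denom;
    }

    static double ochiai(int ef, int ep, int nf) {
        double denom = Math.sqrt((double) (ef + nf) * (ef + ep));
        return denom == 0 ? 0.0 : ef / denom;
    }
}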

Intrinsic Analysis. We studied the impacts of the following components: Suspicious Statement Isolation, Ranking Metric, Normalization, Aggregation Function, and Combination Weight. We created different variants of VarCop with different combinations and measured their performance.

Sensitivity Analysis. We studied the impacts of the following factors on the performance of VarCop: sample size and test set size. To systematically vary these factors, the sample size is varied based on t-wise coverage [kwise2012], and tests are gradually added to the products' test suites.
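As an illustration of the sampling criterion, the sketch below checks whether a set of sampled products covers every value combination of every pair of features (the t = 2 case). It is a minimal sketch with illustrative names; real t-wise sampling tools additionally respect feature-model constraints, which are omitted here.

import java.util.List;

public class TwiseCoverage {
    // products: each boolean[] encodes one sampled configuration
    // (true = feature enabled). Returns true if every on/off combination
    // of every feature pair appears in at least one sampled product.
    static boolean coversPairwise(List<boolean[]> products, int numFeatures) {
        for (int i = 0; i < numFeatures; i++) {
            for (int j = i + 1; j < numFeatures; j++) {
                boolean[][] seen = new boolean[2][2];
                for (boolean[] p : products) {
                    seen[p[i] ? 1 : 0][p[j] ? 1 : 0] = true;
                }
                if (!(seen[0][0] && seen[0][1] && seen[1][0] && seen[1][1])) {
                    return false;
                }
            }
        }
        return true;
    }
}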

8.2.2 Metrics

We adopted Rank, EXAM [wong2008crosstab], and Hit@X [lo2014fusion] which are widely used in evaluating FL techniques [keller2017critical, pearson2017evaluating, arrieta2018spectrum]. We additionally applied Proportion of Bugs Localized (PBL) [keller2017critical] for the cases of multiple variability bugs.

Rank. The lower the rank of the buggy statements, the more effective the approach. If multiple statements have the same score, the buggy statements are ranked last among them. Moreover, for the cases of multiple bugs, we measured the Rank of the first buggy statement (best rank) in the lists.

EXAM. EXAM [wong2008crosstab] is the proportion of the statements that must be examined until the first faulty statement is reached:

EXAM = (r / N) × 100%

where r is the position of the buggy statement in the ranked list and N is the total number of statements in the list. The lower the EXAM, the better the FL technique.

Hit@X. Hit@X [lo2014fusion] counts the number of bugs which can be found after investigating X ranked statements; e.g., Hit@1 counts the number of buggy statements correctly ranked first among the experimental cases. In practice, developers only investigate a small number of ranked statements before giving up [parnin2011automated]. Thus, we focus on Hit@X with X from 1 to 5.

Proportion of Bugs Localized (PBL). PBL [keller2017critical] is the proportion of the bugs detected after examining a certain number of the statements. The higher the PBL, the better the approach.
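The sketch below makes these metrics concrete. It is a minimal sketch with illustrative names; the pessimistic tie-breaking for Rank follows the description above, and the caller is assumed to pass the scores of all ranked statements, including the buggy one.

import java.util.Collection;

public class FlEvaluationMetrics {
    // Rank of a buggy statement: among equally scored statements,
    // the buggy one is counted last (pessimistic tie-breaking).
    static int rank(Collection<Double> allScores, double buggyScore) {
        int r = 0;
        for (double s : allScores) {
            if (s >= buggyScore) r++;
        }
        return r;
    }

    // EXAM: percentage of statements examined until the buggy one is reached.
    static double exam(int rank, int totalStatements) {
        return 100.0 * rank / totalStatements;
    }

    // Hit@X: does the bug appear within the first X ranked statements?
    static boolean hitAtX(int rank, int x) {
        return rank <= x;
    }

    // PBL: percentage of bugs localized within a given inspection budget.
    static double pbl(int bugsFoundWithinBudget, int totalBugs) {
        return 100.0 * bugsFoundWithinBudget / totalBugs;
    }
}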

9 Empirical Results

9.1 Performance Comparison (RQ1)

Table III shows the average performance of VarCop, SBFL, the combination of slicing method and SBFL (S-SBFL), and the feature-based approach proposed by Arrieta et al. [arrieta2018spectrum] (FB) on 338 buggy versions containing a single bug each [ourdataset] in Rank and EXAM. The detailed ranking results of each case can be found on our website [website].

 #  Ranking Metric       Rank:  VarCop  S-SBFL    SBFL      FB   EXAM:  VarCop  S-SBFL   SBFL     FB
 1  Barinel                       7.83    9.88   11.48  136.27           2.11    2.87    3.15  21.79
 2  Dstar                         6.16    7.20    8.09  108.78           1.77    1.94    2.02  15.88
 3  Ochiai                        6.19    7.25    8.14  109.91           1.77    1.95    2.03  16.10
 4  Op2                           5.86    6.07    6.74  106.99           1.71    1.75    1.80  15.36
 5  Tarantula                     6.96    9.88   11.48  136.27           1.98    2.87    3.15  21.79
 6  Kulczynski2                   5.61    6.36    7.08  108.23           1.67    1.77    1.83  15.59
 7  M2                            5.94    6.13    6.82  108.43           1.71    1.76    1.81  15.77
 8  Harmonic Mean                 5.95    6.52    7.28  149.70           1.72    1.80    1.86  21.37
 9  Zoltar                        6.00    6.12    6.78  107.57           1.68    1.75    1.80  15.45
10  Geometric Mean                6.05    7.37    8.29  149.70           1.76    1.99    2.09  21.37
11  Ample2                        6.15    6.16    6.86  149.58           1.75    1.77    1.82  21.30
12  Rogot2                        6.22    6.52    7.28  133.66           1.80    1.80    1.86  22.24
13  Sorensen Dice                 6.50    8.79   10.17  115.72           1.84    2.41    2.62  17.29
14  Goodman                       6.50    8.79   10.17  115.72           1.84    2.41    2.62  17.29
15  Jaccard                       6.63    8.79   10.17  115.72           1.83    2.41    2.62  17.29
16  Dice                          6.63    8.79   10.17  115.72           1.83    2.41    2.62  17.29
17  Anderberg                     6.68    8.79   10.17  115.72           1.84    2.41    2.62  17.29
18  Cohen                         6.81    8.93   10.33  152.04           1.87    2.47    2.70  21.61
19  Fleiss                        6.82   12.24   52.03  145.70           2.09    3.51    9.13  21.65
20  Simple Matching               6.88   28.00  242.70  158.19           2.11    6.67   30.68  21.96
21  Humman                        6.88   28.00  242.70  158.19           2.11    6.67   30.68  21.96
22  Wong2                         6.88   28.00  242.70  158.19           2.11    6.67   30.68  21.96
23  Hamming                       6.88   28.00  242.70  158.19           2.11    6.67   30.68  21.96
24  Sokal                         6.91   28.00  242.70  158.19           2.15    6.67   30.68  21.96
25  Euclid                        6.96   28.00  242.70  158.19           2.17    6.67   30.68  21.96
26  Rogers Tanimoto               7.05   28.00  242.70  158.19           2.20    6.67   30.68  21.96
27  Scott                         7.38   13.22   50.86  147.65           2.17    3.76    8.79  22.23
28  Rogot1                        7.38   13.22   50.86  147.65           2.17    3.76    8.79  22.23
29  Russell Rao                  14.06   17.87   24.00  309.62           3.58    5.05    6.39  39.27
30  Wong1                        14.06   17.87   24.00  309.62           3.58    5.05    6.39  39.27

Metrics 1–5 are the five most widely used SBFL metrics (shaded in the original table).

TABLE III: Performance of VarCop, SBFL, the combination of Slicing and SBFL (S-SBFL), and Arrieta et al. [arrieta2018spectrum] (FB)

Comparison with SBFL and S-SBFL. For both Rank and EXAM, VarCop outperformed S-SBFL and SBFL in all the studied metrics. On average, VarCop achieved a 33% better Rank than S-SBFL and nearly 50% better than SBFL. This means that, to find a variability bug, developers using VarCop have to investigate only about 5 statements instead of about 8 suspicious statements with S-SBFL and up to 10 with SBFL. In EXAM, the improvements of VarCop over S-SBFL and SBFL are also significant, 30% and 43%, respectively. In other words, the proportion of statements developers have to examine is reduced by about one-third and one-half by using VarCop instead of S-SBFL and SBFL, respectively. For the 5 most popular metrics [pearson2017evaluating, wen2019historical] (rows 1–5 in Table III), including Tarantula [taratula], Ochiai [ochiai1957zoogeographic], Op2 [ding2013fault], Dstar [wong2013dstar], and Barinel [abreu2009spectrum], VarCop achieved an improvement of more than 15%. Especially, for certain metrics such as Simple Matching [meyer2004comparison], the improvements by VarCop are remarkable: 4 times compared to S-SBFL and 35 times compared to SBFL.

1 public boolean consistent() {
2     for (int i = 0; i < students.length; i++) {
3         if (students[++i] != null && !students[i].backedOut && students[i].points < 0) {
4             // Patch: students[i] != null
5             return false;
6         }
7     }
8     return true;
9 }
Fig. 3: A variability bug in system ExamDB

There are two reasons for these improvements. First, the set of suspicious statements isolated by VarCop is much smaller than those of the other approaches. VarCop identifies the suspicious statements by analyzing the root causes of the failures. The suspicious space of VarCop is only about 70% of the space of S-SBFL and 10% of the space of SBFL: the average suspicious space isolated by VarCop contains only 66 statements, while the set isolated by S-SBFL contains 87 statements, and the suspicious set identified by SBFL is much larger, 660 statements. Second, the suspiciousness of statements computed by VarCop is not biased by the tests of any specific product. Unlike SBFL, for a suspicious statement, VarCop appropriately aggregates both its appearances in the passing and failing products and its local test case-based suspiciousness scores in all the failing products containing it. This suspiciousness measurement approach helps VarCop overcome the weakness of SBFL in computing suspiciousness for the statements of SPL systems.

Fig. 3 shows a variability bug (ID_298) in feature BackOut of ExamDB. In this code, each member in students must be visited. However, the students with even indexes are incorrectly ignored because i is increased twice in each iteration (line 2 and line 3). This bug is revealed only when both ExamDB and BackOut are enabled. For QA, 8 products are sampled for testing. There are 4 failing products and in total 168 statements executed during the failed tests. By using Tarantula [taratula], VarCop ranked the buggy statement (line 3) first, while SBFL ranked it considerably lower. Indeed, when ranking the buggy statement locally in each product, it is ranked near the top in 3 out of 4 failing products, whereas it is ranked at a low position in the remaining failing product. In that product, there are 10 correct statements which are executed by failed tests only, yet not executed by any passed test. By using Tarantula, SBFL assigned these statements the highest suspiciousness scores in that product. Thus, those statements have higher scores than the buggy statement, which is executed by both failed and passed tests. In the whole system, SBFL uses the test results of all the sampled products to measure the suspiciousness of all the 168 statements. Consequently, it misleadingly assigned higher scores to all of the 10 statements which are executed by only the failed tests in that product, yet not executed by any tests in the others. As a result, the ranking result of SBFL is considerably driven by the test results of that single product and mislocates the buggy statement. Meanwhile, VarCop ignored 101/168 failure-unrelated statements and measured the suspiciousness of only 67 statements by analyzing the root cause of the failures. Additionally, in VarCop, the test case-based suspiciousness scores of these statements are aggregated from the suspiciousness values measured in the failing products independently. Thus, the low-quality test suite of that single product cannot significantly affect the suspiciousness measurement, and the buggy statement is still ranked high thanks to the test suites of the other products.

Fig. 4: Hit@1–Hit@5 of VarCop, S-SBFL and SBFL

Furthermore, VarCop also surpasses S-SBFL and SBFL in Hit@X. In Fig. 4, after investigating X statements, for X from 1 to 5, more bugs are found by using VarCop than by S-SBFL and SBFL. On average, in 78% of the cases, VarCop correctly ranked the buggy statements at the top-5 positions, while S-SBFL and SBFL ranked them at the top-5 positions in only 70% and 61% of the cases, respectively. Moreover, in about two-thirds of the cases (+65%), the bug can be found by examining only the first 3 statements in the lists of VarCop. Meanwhile, to cover the same proportion of cases by using S-SBFL and SBFL, developers have to investigate up to 4 and 5 statements, respectively. Especially, for Hit@1, the number of bugs found by VarCop after investigating only the first ranked statement is about 101 (30%). This means that, in one-third of the cases, developers just need to examine the first statement in the ranked list to find the bug by using VarCop.

Comparison with Arrieta et al. [arrieta2018spectrum]. As illustrated in Table III, in all the studied metrics, VarCop outperformed Arrieta et al. [arrieta2018spectrum] by 21 times in Rank and 11 times in EXAM. Instead of ranking statements, this approach localizes variability bugs at the feature level. Consequently, all the statements in the same feature are assigned the same score. In Fig. 3, the buggy statement at line 3 is assigned the same suspiciousness level as 22 correct statements. Thus, even if the feature containing the fault, BackOut, were ranked first, the buggy statement would still be ranked low at the statement level. Unfortunately, BackOut is actually ranked even lower, and so is the buggy statement. This leads to the ineffectiveness of the feature-based approach proposed by Arrieta et al. [arrieta2018spectrum] in localizing variability bugs at the statement level.

Overall, our results show that VarCop significantly outperformed the state-of-the-art approaches, S-SBFL, SBFL, and Arrieta et al. [arrieta2018spectrum], in all 30/30 SBFL ranking metrics.

9.1.1 Performance by Bug Types

We further analyzed VarCop’s performance on localizing bugs in different types based on mutation operators [ma2006mujava] and kinds of code elements [sobreira2018dissection].

In Table IV, VarCop performs most effectively on the bugs created by Conditional Operators, which are ranked between the first and second positions on average. The reason is that these bugs are easier to detect (kill) with the original tests than other kinds of mutants [smith2009guiding]. This means that, if a bug of this kind can cause failures in some products, it is also easier for the products' tests to reveal it. Moreover, correctly determining the passing or failing state of the products affects the performance of FL techniques. As a result, this kind of bug is localized more effectively by VarCop. Meanwhile, VarCop did not localize well the bugs created by Arithmetic Operators, as they are more challenging to detect with the original test sets of the products [smith2009guiding]. Indeed, because of the ineffectiveness of the test suites in several products, even though these products contain the bug(s), their test suites cannot detect them, and the products still pass all their tests. In these cases, the performance is negatively affected.

Group        Mutation Operator                    #Bugs   Rank   EXAM
Conditional  COR, COI, COD                           32   1.63   0.61
Assignment   ASRS                                     7   2.14   0.89
Logical      LOI                                     17   2.47   2.21
Deletion     CDL, ODL                                18   3.56   1.09
Relational   ROR                                     52   5.13   1.29
Arithmetic   AODU, AOIU, AORB, AOIS, AORS, AODS     212   7.63   2.01

TABLE IV: Performance by Mutation Operators

Table V shows VarCop's performance on different kinds of bugs categorized by code elements [sobreira2018dissection]. As seen, VarCop works quite stably across these kinds of bugs. Particularly, the average Rank achieved by VarCop for bugs in different code elements is between 4.23 and 7.21, with a standard deviation of only 1.3. In addition, the average EXAM and its standard deviation are 1.53 and 0.73, respectively.

Code Element   #Bugs   Rank   EXAM
Method Call       22   4.23   0.37
Conditional      148   5.20   1.86
Loop              17   6.41   2.27
Assignment       108   7.18   1.82
Return            43   7.21   1.33

TABLE V: Performance by Code Elements of Bugs

9.1.2 Performance by the Number of Involved Features

We also analyzed the performance of VarCop by the number of features that are involved in the visibility of the bugs [ourdataset]. In our experiment, this number varies across the cases; in more than 76% of the cases, it is at most 7. VarCop's performance in Rank for different numbers of involved features does not differ significantly, ranging from 4.45 to 9.69 (Fig. 5). In fact, both the number of detected Suspicious PCs and the size of each Suspicious PC, which determine the size of the isolated suspicious space, are affected by the number of involved features. Specifically, for bugs with a smaller number of involved features, fewer Suspicious PCs are likely detected, but each of them is likely smaller. With a larger number of involved features, more Suspicious PCs tend to be detected, yet each of them is likely larger. For a bug, the isolated suspicious statement space is directly proportional to the number of detected Suspicious PCs, but inversely proportional to the size of each Suspicious PC. Thus, the number of involved features does not linearly affect the number of isolated suspicious statements or VarCop's performance.

Fig. 5: Performance by number of involving features of bugs

9.2 Intrinsic Analysis (RQ2)

9.2.1 Impact of Suspicious Statements Isolation on Performance

To study the impact of Suspicious Statements Isolation (Fig. 2), which includes the Buggy PC Detection and Suspicious Statements Identification components, on VarCop's performance, we built a variant of VarCop in which these two components are disabled. For a buggy system, this variant of VarCop ranks all the statements which are executed during the failed tests in the failing products. Fig. 6 shows the performance of VarCop using the 5 most popular SBFL metrics [pearson2017evaluating, wen2019historical] when Buggy PC Detection and Suspicious Statements Identification are enabled/disabled. As expected, when these components are enabled, the performance of VarCop is significantly better, by about 16% in Rank.

Fig. 6: Impact of Buggy PC Detection on performance

Interestingly, even when Suspicious Statements Isolation is disabled, this variant of VarCop is still better than S-SBFL and SBFL. Specifically, VarCop obtained a better Rank than S-SBFL and SBFL in 21/30 and 27/30 metrics, respectively. In these ranking metrics, the average improvements of VarCop over S-SBFL and SBFL are 34% and 45%. Meanwhile, for the remaining metrics, S-SBFL and SBFL perform better than VarCop by only 10% and 3%. For the bug in Fig. 3, thanks to the proposed suspiciousness measurement approach, this variant of VarCop still ranked the buggy statement much higher than S-SBFL and SBFL did.

Note that, for the 19/338 cases which possibly contain non-variability bugs (mentioned in Section 8.1), there might be no Buggy PC in these buggy systems to be detected. Moreover, the low-quality test suites in some passing (yet buggy) products might “fool” fault localization techniques [baudry2006improving]. These passing products might also make VarCop less effective in isolating suspicious statements. Hence, for these cases, we turn off VarCop's suspicious statement isolation component to guarantee its effectiveness.

9.2.2 Impact of Ranking Metric on Performance

We also studied the impact of the selection of the local ranking metric on VarCop's performance. To do that, we built different variants of VarCop with different metrics. As shown in the VarCop columns of Table III, the performance of VarCop is quite stable across the different ranking metrics. Particularly, for all the studied metrics, the average EXAM achieved by VarCop lies in a narrow range, from 1.71 to 3.58, with a standard deviation of 0.46. Additionally, the average Rank of the buggy statements assigned by VarCop varies only from 5.61 to 14.06. This stability is obtained thanks to the suspicious statement isolation and suspiciousness measurement components. Indeed, VarCop only considers the statements that are related to the interactions which are the root causes of the failures. Moreover, VarCop is not biased by the test suites of any specific product, so its performance is less affected by the low-quality test suites of any single product. Thus, selecting an inappropriate ranking metric, which is unknown beforehand in practice, does not significantly affect VarCop's performance. This demonstrates that VarCop is practical in localizing variability bugs.

In contrast, the performances of the S-SBFL and SBFL techniques are considerably impacted by the choice of ranking metric. With S-SBFL, the average Rank of the buggy statements fluctuates widely, from 6.07 to 28.00. With SBFL, the fluctuation of Rank is even more considerable, from 6.74 to 242.70. Consequently, the QA process would be extremely inefficient if developers used the SBFL technique with an inappropriate ranking metric.

9.2.3 Impact of Normalization on Performance

To study the impact of the normalization, we built variants of VarCop that enable and disable the normalization component. In this experiment, in both cases, the local test case-based scores are measured by the 30 popular SBFL metrics and aggregated by the arithmetic mean.

Fig. 7: Impact of Normalization on performance

In Fig. 7, when the normalization is enabled, VarCop's performance is better than when it is off. Particularly, the performance of VarCop is improved by 64% in Rank and 32% in EXAM when the normalization is enabled. One reason is that, for some SBFL metrics such as Fleiss [fleiss1965estimating] and Humman [lourencco2004binary], the ranges of the product-based and test case-based suspiciousness values are significantly different. Additionally, for these metrics, the ranges of the local test case-based suspiciousness scores in different products are also significantly different. For example, for a bug (ID_25) in the system Email, with Fleiss, the range of suspiciousness scores in one failing product is very different from the range in another failing product. Without normalization, a statement in the product with the larger score range is more likely to be assigned a higher final score. Meanwhile, with several metrics such as Ochiai [ochiai1957zoogeographic] and Tarantula [taratula], the performance of VarCop is only slightly different whether normalization is on or off. For these metrics, the local scores of the statements in the products are originally assigned in quite similar ranges, so they might not need to be additionally normalized. Overall, to ensure the best performance of VarCop, the normalization should be enabled.

9.2.4 Impact of Aggregation Function on Performance

To study the impact of the choice of aggregation function on performance, we varied the aggregation function of the test case-based suspiciousness assessment. In this experiment, Op2 [ding2013fault] was randomly chosen to measure the local scores of the statements. As seen in Fig. 8, the performance of VarCop is not significantly affected when the aggregation function is changed. Specifically, the average Rank of VarCop remains nearly unchanged across the aggregation functions, while the EXAM is about 1.76.
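The sketch below shows the kind of aggregation functions that can be swapped in for this experiment. It is a minimal sketch: the arithmetic mean is mentioned in the paper, while max and min are listed here purely as plausible alternatives, not as the specific functions VarCop was evaluated with.

import java.util.List;

public class AggregationFunctions {
    // Alternative ways to fold a statement's normalized local scores
    // from the failing products into one test case-based score.
    static double arithmeticMean(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    static double max(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
    }

    static double min(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).min().orElse(0.0);
    }
}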

Fig. 8: Impact of choosing the aggregation function on performance

9.2.5 Impact of Combination Weight on Performance

Fig. 9: Impact of choosing the combination weight on performance

We varied the combination weight (Section 7) when combining the product-based and test case-based suspiciousness assessments to form the final scores of the statements. Fig. 9 shows the average Rank and EXAM of the faults in 36 buggy versions of Email, with the combination weight varied from 0 to 1.

As seen, the performance of VarCop is better when both the product-based and test case-based suspiciousness scores are combined to measure the suspiciousness of the statements. At one extreme of the weight, the statements are ranked by only their product-based suspiciousness; in that case, all the statements in the same feature have the same suspiciousness score because they appear in the same numbers of passing and failing products. At the other extreme, a statement is ranked by only the score aggregated from its local scores in the failing products; consequently, the overall performance may be affected by the low-quality test suites of some products. For instance, a correct statement that appears in only one failing product may be misleadingly assigned the highest local score there. As a result, when only the test case-based score is used, that statement also has the highest score in the whole system, since its score is aggregated from that single failing product. Hence, both the product-based and test case-based suspiciousness assessments are necessary for measuring the suspiciousness of statements (as mentioned in O3).

9.3 Sensitivity Analysis (RQ3)

9.3.1 Impact of Sample Size on Performance

In this experiment, for each buggy version of GPL, which is a randomly selected system, we used t-wise coverage with different values of t to systematically vary the sample size. Then, we ran VarCop on each case with each set of sampled products.

Fig. 10 shows the average Rank and EXAM of VarCop on the buggy versions of GPL with the different sample sets. As expected, the larger the sample, the higher the performance obtained by VarCop in localizing bugs. However, once the ranking results reach a certain point, even if more products are tested, the results improve only slightly. Specifically, for the One-disabled sample set [42bugs], the average Rank and EXAM are about 4.31 and 0.45, respectively, while for the largest t-wise samples the ranking results are about 1.5 times better. The reason is that, for a case, the more products are tested, the more information VarCop has to detect Buggy PCs and rank suspicious statements. However, compared to 3-wise and 4-wise, even though many more products are sampled, which is much more costly in sampling and testing, the performance is only slightly improved. Hence, with VarCop, one might not need to use a very large sample to achieve a relatively good performance in localizing variability bugs.

Fig. 10: Impact of the sample size on performance

9.3.2 Impact of Test Suite’s Size on Performance

In this experiment, for every buggy version, we gradually increased the size of the test suite in each product to study the impact of tests on VarCop’s performance. The randomly selected system in this experiment is ExamDB.

In Fig. 11, VarCop's performance improves when the test suite size increases. Particularly, when the number of tests increased from 13 to 90 tests per product, both Rank and EXAM of VarCop improved by about a factor of two. After that, even when more tests are added, VarCop's performance is only slightly affected. The reason is that increasing the number of tests provides more information to distinguish the correct and incorrect statements, thus improving VarCop's performance. However, once the test suites reach a certain degree of effectiveness in detecting bugs, additional tests do not provide supplementary information for the FL process.

Fig. 11: Impact of the size of test set on performance

Overall, one should trade off the fault localization effectiveness against the cost of generating and running tests. Furthermore, as discussed in Section 9.2.1, instead of focusing on expanding the test suites of the products, developers should improve the effectiveness of the test suites in detecting bugs. This can be done by improving test coverage and applying mutation testing [mutation_testing].

9.4 Performance in Localizing Multiple Bugs (RQ4)

To evaluate VarCop on buggy systems that contain multiple variability bugs, we conducted an experiment on 1,232 buggy versions of the subject systems with 2,947 variability bugs in total. The components of VarCop were randomly configured: the ranking metric is Op2 [ding2013fault], and the aggregation function is the arithmetic mean. Other settings of VarCop can be found on our website [website].

Fig. 12: VarCop, S-SBFL and SBFL in localizing multiple bugs

Fig. 12 shows that the average percentage of buggy statements found in each case (PBL) by VarCop far surpasses the corresponding figures of S-SBFL and SBFL when the same number of statements is examined in their ranked lists. Specifically, after examining the first statement, VarCop can find nearly 10% of the incorrect statements in a buggy system, while only 5% and 1% of the bugs are found by S-SBFL and SBFL, respectively. Furthermore, about 35% of the bugs can be found by VarCop by checking only the first 5 statements in the ranked lists. Meanwhile, with S-SBFL and SBFL, developers have to investigate up to 7 and even 10 statements to achieve the same performance. In addition, the average best Rank that VarCop assigned for the buggy statements is about . Meanwhile, the corresponding figures of S-SBFL, SBFL, and Arrieta et al. [arrieta2018spectrum] are , , and