
Faster Multi-Goal Simulation-Based Testing Using DoLesS (Domination with Least Squares Approximation)

For cyber-physical systems, finding a set of test cases with the least cost by exploring multiple goals is a complex task. For example, Arrieta et al. reported that state-of-the-art optimizers struggle to find minimal test suites for this task. To better manage this task, we propose DoLesS (Domination with Least Squares Approximation), which uses a domination predicate to sort the space of possible goals down to a small number of representative examples. Multi-objective domination then divides these examples into a “best” set and the remaining “rest” set. After that, DoLesS applies an inverted least squares approximation approach to learn a minimal set of tests that can distinguish best from rest in the reduced example space. DoLesS has been tested on four cyber-physical models: a tank flow model; a model of electric car windows; a safety feature of an AC engine; and a continuous PID controller combined with a discrete state machine. Compared to the recent state-of-the-art paper that attempted the same task, DoLesS performs as well as or better than the state-of-the-art, while running 80-360 times faster on average (seconds instead of hours). Hence, we recommend DoLesS as a fast method to find minimal test suites for multi-goal cyber-physical systems. For replication purposes, all our code is on-line: https://github.com/hellonull123/Test_Selection_2021.


1. Introduction

Simulation models play an important role in many domains. Engineers build such models to simulate complex systems (Matinnejad et al., 2016). In the case of cyber-physical systems, these models are sometimes shipped along with the actual device, which means that analysts can now access high-fidelity simulations of their systems. Hence, much of the work on cyber-physical testing focuses on taking full advantage of high-fidelity simulators, prior to live testing (Arrieta et al., 2019a). For example, analysts can use the simulators for test suite minimization; i.e. they can explore many tests in the simulator in order to remove tests that do not need to be explored in the real world.

However, using these models for test case minimization can be a very difficult process (Arrieta et al., 2019a). These simulation models are built to simulate complex systems that combine electronic and physical components. Hence, executing these simulation models can be very time consuming (Arrieta et al., 2016). This problem gets even worse for multi-goal problems (e.g. minimizing runtime and maximizing the number of bugs found), since it is necessary to run the models multiple times for different subsets of the goals (Arrieta et al., 2016). For example, Arrieta et al. reported that testing a high-fidelity simulation model can take hours to days (Arrieta et al., 2019a; Sagardui et al., 2017; González et al., 2018). Further, they warned that state-of-the-art multi-goal optimizers (e.g. NSGA-III and MOEA/D) struggle to find minimal test suites for this task.

Recently, Chen et al. (Chen et al., 2018) and Agrawal et al. (Agrawal et al., 2020) reported successes with a variant of optimization called “DUO” (data mining using/used-by optimizers). In this approach, a data mining method first divides the problem space, and then an optimizer executes in each small division. Inspired by that DUO approach, in this work we apply a sorting method on the objective space to divide the multi-objective test suite minimization problem into several smaller partitions. Our DoLesS algorithm (Domination with Least Squares Approximation) applies a domination predicate to sort the example space down to a small number of representative data points. Multi-objective domination divides these data points into a “best” set and the remaining “rest” set. After all that, DoLesS applies an inverted least squares approach to learn a minimal set of tests that can distinguish the best from the rest in the reduced example space.

To evaluate our proposed test case selection approach for simulation models, we compare DoLesS with the most recent state-of-the-art approach, produced by Arrieta et al. (Arrieta et al., 2019a). In that comparison, we ask the following research questions.

RQ1: Can we verify that test case selection for multi-goal cyber-physical systems is a hard problem? Arrieta et al. (Arrieta et al., 2019a) reported that standard multi-goal optimizers such as NSGA-III and MOEA/D failed on this task. This is an important observation since, were it otherwise, there would be no clear motivation for this paper. Accordingly, as a first step, we replicate their results in our experiment.

RQ2: Can DoLesS find better ways to select test cases which result in test suites with higher effectiveness measure (objective) scores? Section 3.1 of this paper reviews five effectiveness measurement metrics which were used to select test cases for cyber-physical systems in the previous study (Arrieta et al., 2019a). Our results show that, on those five metrics, the overall performance of DoLesS is better than that of the previous state-of-the-art method.

RQ3: Do the test cases selected by DoLesS beat the prior state-of-the-art? Apart from the five effectiveness measurement metrics used by Arrieta et al. (Arrieta et al., 2019a), two other evaluation scores of interest are (a) the reduction in the number of test cases and (b) the fault detection performance of the reduced test suites. As shown in our results section (§5), DoLesS usually performs as well as, if not better than, the prior state-of-the-art.

RQ4: Is DoLesS far more efficient than prior state-of-the-art in terms of running time? For all the reasons stated above, we need methods that offer faster feedback from models of cyber-physical systems. In this regard, it is significant to note that DoLesS runs 80-360 times faster than the prior state-of-the-art.

Based on the above, we say our novel contributions are:

  1. We propose a novel test generation method (DoLesS).

  2. We verify that DoLesS solves a hard problem (test case selection for multi-goal cyber-physical systems). This is a problem that defeats state-of-the-art optimizers (NSGA-III and MOEA/D).

  3. We clearly document the value of doing DoLesS. When tested on four cyber-physical models, DoLesS finds test suites as good as, or even better than, those found by Arrieta et al.’s approach (Arrieta et al., 2019a). Further, DoLesS does so while running 80-360 times faster (seconds instead of hours, on average). Hence, we recommend DoLesS as a fast method to find minimal test suites for multi-goal cyber-physical systems.

The rest of this paper is structured as follows. Section 2 introduces the background and related work in test case selection for simulation-based testing. Section 3 introduces the effectiveness measurement metrics studied for cyber-physical systems and illustrates how they are calculated mathematically; the multi-objective optimizers and our proposed approach are also introduced in that section. Section 4 introduces the case studies, performance evaluation metrics, and statistical analysis method used in this study. Section 5 shows our experimental results. Section 6 explores threats to validity, and Section 7 summarizes our study and states possible future work.

Based on the above, we can conclude that DoLesS is faster, yet more effective, than the prior work because:

  • DoLesS can handle multiple goals (in our experiment, 5 goals) simultaneously. Hence it does not need to loop the algorithm $m$ times (where $m$ is the number of subsets of the goals) like the prior state-of-the-art method.

  • DoLesS’s sorting procedure uses continuous domination to very quickly divide candidates into a very small “best” set (that we can focus on) and a much larger “rest” set (that we can mostly ignore). Like much research before us, we argue that continuous domination is more informative than binary domination (Zitzler and Künzli, 2004; Sayyad et al., 2013; Wagner et al., 2007).

  • Chen et al. (Chen et al., 2018) argue that some SE optimization problems can be solved better by over-sampling than via evolutionary methods. For example, the evolutionary NSGA-II method mutates 100 individuals for 250 generations (these parameters were selected to ensure comparability with the prior study). On the other hand, our DoLesS over-sampling method explores 10,000 individuals for one generation. This result suggests that cyber-physical system testing might be another class of problem that is better solved via the over-sampling methods described by Chen et al. (Chen et al., 2018).

2. Background

A repeated result is that test suites can be minimized (i.e. we can run fewer tests) while still being as effective as (or better than) running the larger test suite (Ahmed, 2016; Di Nardo et al., 2015; Wong et al., 1997, 1998; Yoo and Harman, 2012). Note that “effective” can mean different things in different domains, depending on the goals of the testing. For example, at FSE’14, Elbaum et al. (Elbaum et al., 2014) reported that Google could find a similar number of bugs after executing far fewer tests. This was an important result since, at that time, the initial Google test suites were taking weeks to execute. Such long test suite runtimes are detrimental to many agile software practices.

Research has produced many test case selection techniques, such as DejaVu-based, firewall-based, dependency-based, and specification-based techniques (Engström et al., 2010). We note that different test suite minimization methods need different kinds of data. For example, in 1995, Binkley proposed a semantics-based method which uses the differences and similarities of two consecutive versions to select test cases (Binkley, 1995). In 2000, Rothermel et al. developed a test case selection technique for C++ software (Rothermel et al., 2000). In 2001, Chen et al. developed test case selection strategies based on boolean specifications (Chen and Lau, 2001). In 2005, Xu et al. developed a fuzzy expert system for test case selection (Xu et al., 2005). In 2006, Grindal et al. presented an empirical study evaluating five combination strategies for test case selection (Grindal et al., 2006). In 2011, Cartaxo et al. implemented a similarity function for test case selection in model-based testing (Cartaxo et al., 2011). Pradhan et al. (Pradhan et al., 2016) proposed a multi-objective optimization test case selection approach which can be used under limited time constraints. Arrieta et al. also used test case execution history to select test cases (Arrieta et al., 2016). In 2017, Lachmann et al. (Lachmann et al., 2017) performed an empirical study comparing the performance of several black-box metrics for selecting test cases in system testing.

Due to the data requirements, many of the above methods are unsuitable for cyber-physical systems, for two reasons.

Figure 1. Feedback controller (1904). Energy is released (as steam) when the weights spin faster and pull away from the shaft. Thus, increased speed leads to energy release, which slows the system. From Wikipedia.

Firstly, cyber-physical systems are embodied in their environment. Hence, it is not enough to explore static features of (e.g.) the code base; rather, it is required to test how that code base reacts to its surrounding environment. Hence, using just static information such as code coverage metrics is not recommended for testing cyber-physical systems. Secondly, at least for the systems studied here, cyber-physical systems make extensive use of process control theory. In that theory, a feedback controller compares the value or status of a process variable with a desired set-point. The controller then applies the difference as a control signal to bring the process variable output of the plant to the same value as the set-point (for example, see the steam governor of Figure 1). Hence, for test suite minimization of process control applications, what is required is data collected from the feedback loops inside the cyber-physical systems. Accordingly, here we use the input and output signals of the simulation models instead of execution history or coverage information.

Figure 2. Examples of anti-patterns seen for systems under feedback. The names of the anti-patterns, from left to right, are instability, discontinuity, and growth to negative infinity. The alarming patterns are shown with red marks.

In an IST’19 journal paper, Arrieta et al. (Arrieta et al., 2019a) explored issues associated with test suite minimization using data extracted from feedback loops. They noted that feedback loops have anti-patterns; i.e. undesirable features that appear in a time series trace of the output of the system. Figure 2 shows three such features (from left to right, in the red dashed rectangles): instability, discontinuity, and growth to infinity. Later in this paper we mathematically define these anti-patterns.

In all, Arrieta et al. (Arrieta et al., 2019a) explored seven goals for cyber-physical model testing: maximizing the three anti-patterns shown in Figure 2, maximizing three other measures of effectiveness, and minimizing total execution time. Arrieta et al. used mutation testing to check the validity of their minimized test suites. Mutation-based testing is a fault-based testing technique which uses a “mutation adequacy score” to assess test suite adequacy by creating mutants (Jia and Harman, 2010), and then pruning test cases which cannot distinguish the original model from the mutant. In our study, we use the mutants generated in the experiments of Arrieta et al. (Arrieta et al., 2019a). Those mutants were generated with Hanh et al.’s technique (Binh et al., 2016), and a mutant was removed if (a) it was not detected by any test case, (b) it was killed by all test cases, or (c) it was an equivalent mutant (Papadakis et al., 2015). Like Arrieta et al., we say a test suite is minimal when it retires as many mutants as a larger suite.

Mutation testing is the inner loop of Arrieta et al.’s process and, in their experiments, they found mutation testing to be an effective technique. The problem area in their work was the outer loop that optimized for seven goals. They found that standard optimizers such as NSGA-II (Deb et al., 2002) can be ineffective for more than three goals (a result that is echoed by prior work (Sayyad et al., 2013)). More recent optimizers like NSGA-III (Deb and Jain, 2013) and MOEA/D (Zhang and Li, 2007) also failed for this multi-goal task. Later in this paper, we replicate their experiment and strengthen that finding (see RQ1).

To address this optimization failure, they resorted to a “pairwise” approach based on NSGA-II. That is, they ran NSGA-II on all 21 subsets of “choose two or three from seven” goals, then returned the test suite associated with the run that had the best scores (where “best” here is measured on just a subset of the goals). While definitely an extension to the state-of-the-art, Arrieta et al.’s (Arrieta et al., 2019a) study had two drawbacks. Firstly, the test cases selected in this way were only the best as measured on a subset of the optimization goals. Secondly, the “pairwise” approach increased optimization time by an order of magnitude, which is a major issue for large simulators, especially when we run these algorithms 20 times (to check the generalizability of this stochastic process).

Hence, in this work, we seek to improve the mutation-based test suite minimization method from Arrieta et al.’s study (Arrieta et al., 2019a). Like them, we will optimize for the anti-patterns and effectiveness measures seen in process control systems. But unlike that prior work, we will offer methods that simultaneously succeed across many goals (without needing anything like the pairwise heuristic used by Arrieta et al.). Further, we show that all this can be achieved without additional runtime cost.

2.1. Testing Simulation models

Cyber-physical system developers often use simulation tools (e.g. Simulink) to build cyber-physical models (Chowdhury et al., 2018). For an example Simulink model, see Figure 3. This is a model with two hierarchical levels (Arrieta et al., 2019a). A complex model will have far more blocks and operators.

In Simulink models, the inputs and outputs are all signals (here, a signal means a time series function). This means that at each simulation time step there will be a value for each input and each output. For example, if we simulate a model for 5 seconds of real time with a time step of 0.05 seconds, then there will be $5/0.05 = 100$ simulation steps, which means each input or output signal is a vector of length 101 (counting the initial value).

Assuming an initial set of test cases for a simulation model, each test case simulates the model from a set of unique input signals to a set of output signals (Matinnejad et al., 2016; Arrieta et al., 2019a).

Our goal for this study is to select representative test cases from the initial test suite that minimize the test execution time without degrading the testing performance. Here we can define the test selection problem as follows: given an initial test suite $T$, we want to find a subset $T' \subseteq T$ (with $|T'| < |T|$) that tests the model as well as the initial test suite does. If we search the space containing all subsets of the initial test suite, then the search space is very large; for example, with only 100 test cases, there are $2^{100}$ possible subsets. Thus, cost-effectively selecting test cases is a significant problem.

Figure 3. Simple example of a Simulink model - Cruise Controller of a car (Arrieta et al., 2019a).

3. Experimental Methods

3.1. Simulation Effectiveness Metrics

In this study, we implement five out of the seven effectiveness measurement metrics which Arrieta et al. (Arrieta et al., 2019a) used in their study. The first four metrics are also widely used in previous studies (Wang et al., 2013; Matinnejad et al., 2015; Matinnejad et al., 2017), and the fifth metric was proposed by Arrieta et al. (Arrieta et al., 2019a).

Aside: We exclude two of the metrics explored by Arrieta et al. (the input- and output-based test similarity metrics) since these two metrics always yield similar normalized values (0.95-0.99) regardless of the test case selection. Such similar normalized values can hurt the performance of multi-objective optimization algorithms. We introduce the remaining five metrics in the rest of this section.

3.1.1. Test Execution Time

Total test execution time is the first metric we implement in our study. Wang et al. (Wang et al., 2013) stated that the number of selected test cases can be treated as a measurement for selecting representative test cases from the initial test suite. However, Arrieta et al. pointed out that each test case has a different execution time (Arrieta et al., 2019a). In our study, we use a calculation similar to the one Arrieta et al. (Arrieta et al., 2019a) used in their study for the test execution time. The test execution time is calculated as follows: let $T'$ be the set of test cases selected from the initial test suite $T$, let $time(tc_i)$ denote the execution time of test case $tc_i$ in $T'$, and let $time(tc_j)$ denote the execution time of test case $tc_j$ in $T$. The total (normalized) test execution time of a set of selected test cases is (Arrieta et al., 2019a)

(1)    $TET(T') = \dfrac{\sum_{tc_i \in T'} time(tc_i)}{\sum_{tc_j \in T} time(tc_j)}$

In our study, we want to minimize this metric because the goal of test case selection is to decrease the test execution time.
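To make the calculation concrete, here is a minimal sketch (our illustration, not the authors' code) of the normalized execution-time score of a selection, assuming `exec_times` holds per-test execution times and `selected` is a boolean mask over the initial suite:

```python
import numpy as np

def normalized_execution_time(exec_times, selected):
    """Sum of execution times of the selected tests, divided by the
    total execution time of the whole initial suite (cf. Eq. 1)."""
    exec_times = np.asarray(exec_times, dtype=float)
    selected = np.asarray(selected, dtype=bool)
    return exec_times[selected].sum() / exec_times.sum()

# Example: selecting tests 0 and 2 out of a four-test suite.
print(normalized_execution_time([2.0, 5.0, 1.0, 2.0], [1, 0, 1, 0]))  # 0.3
```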

3.1.2. Discontinuity in Output Signal

Discontinuity is the second metric we implement in our study. As Matinnejad et al. (Matinnejad et al., 2017) stated, a discontinuity in the output signal is a short-duration pulse: the output signal increases or decreases to a value in a very short time, and then recovers back to normal. If executing a test case causes discontinuity in the output signal, then that test case detects faulty behavior in the model. Assume we have output signals $sg_1, \ldots, sg_m$. For the discontinuity score of each output signal $sg_l$ where $1 \le l \le m$, Matinnejad et al. calculated it with (Matinnejad et al., 2015; Matinnejad et al., 2017)

(2)    $discontinuity(sg_l) = \max_{k} \; \min\left(|lc_k|, |rc_k|\right)$

where $lc_k$ is the left change rate at step $k$ and $rc_k$ is the right change rate at step $k$.

The discontinuity rate of a set of selected test cases is calculated as follows: with the same definitions of $T$ and $T'$ as above, let $\overline{D}(tc_i)$ denote the discontinuity score of test case $tc_i$ in $T'$, and $\overline{D}(tc_j)$ denote the discontinuity score of test case $tc_j$ in $T$. The total discontinuity score of a set of selected test cases is (Arrieta et al., 2019a)

(3)    $Discontinuity(T') = \dfrac{\sum_{tc_i \in T'} \overline{D}(tc_i)}{\sum_{tc_j \in T} \overline{D}(tc_j)}$

where $\overline{D}(tc)$ is the normalized discontinuity score of test case $tc$ over the output signals $sg_1, \ldots, sg_m$. In our study, we want to maximize this metric because the goal is to detect more discontinuity in the output signal.
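As a rough, hedged sketch of such a per-signal score (our simplification of Eq. (2), assuming the left and right change rates are the one-sided differences divided by the time step):

```python
import numpy as np

def discontinuity_score(signal, dt):
    """Rough sketch: score a pulse-like jump by the largest step where
    both the left and right change rates are high (min of the two)."""
    y = np.asarray(signal, dtype=float)
    rates = np.abs(np.diff(y)) / dt        # change rate between consecutive steps
    left, right = rates[:-1], rates[1:]    # left/right rates around each interior step
    return float(np.max(np.minimum(left, right))) if len(left) else 0.0
```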

3.1.3. Instability in Output Signal

Instability is the third metric we implement in our study. As Matinnejad et al. (Matinnejad et al., 2017) stated, instability in the output signal is a period of quick and frequent oscillations: the output signal increases and decreases repeatedly over a duration of time. If executing a test case causes instability in the output signal, then that test case detects an undesirable impact on the physical process (Matinnejad et al., 2017). Assume we have output signals $sg_1, \ldots, sg_m$. For the instability score of each output signal $sg_l$ where $1 \le l \le m$, Matinnejad et al. calculated it with (Matinnejad et al., 2017)

(4)    $instability(sg_l) = \sum_{k=1}^{n} \left| sg_l(t_k) - sg_l(t_{k-1}) \right|$

where $n$ is the total number of simulation steps and $t_k$ is the time stamp of simulation step $k$.

The instability score of a set of selected test cases is calculated as follows: with the same definitions as above, let $\overline{I}(tc_i)$ denote the instability score of test case $tc_i$ in $T'$, and $\overline{I}(tc_j)$ denote the instability score of test case $tc_j$ in $T$. The total instability score of a set of selected test cases is (Arrieta et al., 2019a)

(5)    $Instability(T') = \dfrac{\sum_{tc_i \in T'} \overline{I}(tc_i)}{\sum_{tc_j \in T} \overline{I}(tc_j)}$

where $\overline{I}(tc)$ is the normalized instability score of test case $tc$ over the output signals $sg_1, \ldots, sg_m$. In our study, we want to maximize this metric because the goal is to detect more instability in the output signal.
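A minimal sketch of a per-signal instability score in the spirit of Eq. (4), summing the absolute change of the output between consecutive simulation steps (our illustration):

```python
import numpy as np

def instability_score(signal):
    """Sum of absolute differences between consecutive output values:
    an oscillating signal accumulates a large total variation."""
    y = np.asarray(signal, dtype=float)
    return float(np.abs(np.diff(y)).sum())
```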

3.1.4. Growth to Infinity in Output Signal

This is the fourth metric we implement in our study. As Matinnejad et al. (Matinnejad et al., 2015) pointed out, growth to infinity in the output signal is the phenomenon in which the output signal increases or decreases towards an infinite value. If executing a test case causes growth to infinity in the output signal, then that test case detects faulty behavior in the model. Assume we have output signals $sg_1, \ldots, sg_m$. For the growth-to-infinity score of each output signal $sg_l$ where $1 \le l \le m$, Matinnejad et al. calculated it with (Matinnejad et al., 2015)

(6)    $infinity(sg_l) = \max_{0 \le k \le n} \left| sg_l(t_k) \right|$

where $n$ is the total number of simulation steps and $t_k$ is the time stamp of simulation step $k$.

The growth-to-infinity score of a set of selected test cases is calculated as follows: with the same definitions as above, let $\overline{F}(tc_i)$ denote the infinity score of test case $tc_i$ in $T'$, and $\overline{F}(tc_j)$ denote the infinity score of test case $tc_j$ in $T$. The total growth-to-infinity score of a set of selected test cases is (Arrieta et al., 2019a)

(7)    $Infinity(T') = \dfrac{\sum_{tc_i \in T'} \overline{F}(tc_i)}{\sum_{tc_j \in T} \overline{F}(tc_j)}$

where $\overline{F}(tc)$ is the normalized infinity score of test case $tc$ over the output signals $sg_1, \ldots, sg_m$. In our study, we want to maximize this metric because the goal is to detect more growth-to-infinity situations in the output signal.
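A minimal sketch of a per-signal growth-to-infinity score in the spirit of Eq. (6), taking the largest magnitude reached by the output (our illustration):

```python
import numpy as np

def growth_to_infinity_score(signal):
    """Largest absolute value reached by the output signal; a signal
    that diverges produces a very large score."""
    y = np.asarray(signal, dtype=float)
    return float(np.abs(y).max())
```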

3.1.5. Output Minimum and Maximum Difference in Output Signal

This is the last metric we implement in our study. Arrieta et al. proposed this metric in their work because the difference between the maximum and the minimum of an output signal can indicate how thoroughly a model is being exercised. If executing a test case results in a large minimum-maximum difference in the output signal, then that test case can exercise more parts of the simulation model. Assume we have output signals $sg_1, \ldots, sg_m$. For the output minimum-maximum difference score of each output signal $sg_l$ where $1 \le l \le m$, Arrieta et al. calculated it with (Arrieta et al., 2019a)

(8)    $minmax(sg_l) = \max_{0 \le k \le n} sg_l(t_k) - \min_{0 \le k \le n} sg_l(t_k)$

where $n$ is the total number of simulation steps and $t_k$ is the time stamp of simulation step $k$.

The output minimum-maximum difference of a set of selected test cases is calculated as follows: with the same definitions as above, let $\overline{M}(tc_i)$ denote the output minimum-maximum difference of test case $tc_i$ in $T'$, and $\overline{M}(tc_j)$ denote the output minimum-maximum difference of test case $tc_j$ in $T$. The total output minimum-maximum difference score of a set of selected test cases is (Arrieta et al., 2019a)

(9)    $MinMax(T') = \dfrac{\sum_{tc_i \in T'} \overline{M}(tc_i)}{\sum_{tc_j \in T} \overline{M}(tc_j)}$

where $\overline{M}(tc)$ is the normalized output minimum-maximum difference score of test case $tc$ over the output signals $sg_1, \ldots, sg_m$. In our study, we want to maximize this metric because the goal is to cover more parts of the model that can be tested.
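A minimal sketch of the per-signal minimum-maximum difference in the spirit of Eq. (8) (our illustration):

```python
import numpy as np

def min_max_difference(signal):
    """Range of the output signal: difference between its maximum
    and minimum values over the whole simulation."""
    y = np.asarray(signal, dtype=float)
    return float(y.max() - y.min())
```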

Figure 4. DoLesS Framework - Continuous domination selects representative goals from the large initial random population; data processing reads the data set in and forms the linear equation system; the linear least squares solver solves the least squares approximation; and evaluation evaluates the selected test cases.

3.2. Algorithms

3.2.1. Binary vs Continuous Domination

In the following, all the algorithms use binary domination, except for DoLesS, which uses continuous domination.

Binary domination judges one individual as better than another if it is better on at least one goal and worse on none. Numerous studies (Zitzler and Künzli, 2004; Sayyad et al., 2013; Wagner et al., 2007) warn that binary domination struggles to distinguish candidates once the number of goals grows to three or more.

For many-goal problems, Zitzler’s continuous domination predicate (Zitzler and Künzli, 2004) is useful (Zitzler and Künzli, 2004; Sayyad et al., 2013; Wagner et al., 2007). Continuous domination judges the domination status of a pair of individuals by running a “what-if” query which checks what is lost when we jump from one individual to the other, and back again. Specifically:

  • For the forward jump from individual $x$ to individual $y$, we compute $s_1 = \sum_{j=1}^{n} -e^{w_j (x_j - y_j)/n}$.

  • For the reverse jump, we compute $s_2 = \sum_{j=1}^{n} -e^{w_j (y_j - x_j)/n}$.

where $x_j$ and $y_j$ are the values of goal $j$ for the two individuals, $n$ is the number of goals (in our case $n = 5$), and $w_j \in \{-1, 1\}$ is the weight, depending on whether we are minimizing or maximizing that goal. According to Zitzler (Zitzler and Künzli, 2004), one example is preferred to another if we lose the least jumping to it; i.e. $x$ is preferred to $y$ if $s_1 / n < s_2 / n$.

Specifically, in this work, we use this predicate to select better goal sets that (a) minimize test execution time, (b) maximize discontinuity score, (c) maximize instability score, (d) maximize growth to infinity score, and (e) maximize output minimum & maximum difference.
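The following is a minimal Python sketch (our illustration, not the exact implementation) of the continuous-domination comparison above; the goal vectors and weights are illustrative and goal values are assumed to be normalized:

```python
import math

def cdom_loss(xs, ys, weights):
    """Mean exponential loss of jumping from goal vector xs to ys.
    weights are -1 (minimize) or +1 (maximize), one per goal."""
    n = len(xs)
    return sum(-math.exp(w * (x - y) / n) for x, y, w in zip(xs, ys, weights)) / n

def prefer(xs, ys, weights):
    """xs is preferred to ys if we lose less jumping xs -> ys than ys -> xs."""
    return cdom_loss(xs, ys, weights) < cdom_loss(ys, xs, weights)

# Illustrative example with five goals: minimize time, maximize the rest.
weights = [-1, +1, +1, +1, +1]
a = [0.30, 0.56, 0.67, 0.56, 0.67]
b = [0.39, 0.55, 0.34, 0.64, 0.34]
print(prefer(a, b, weights))   # True: a loses less jumping to b, so a is preferred
```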

3.2.2. NSGA-II

NSGA-II is a popular evolutionary genetic algorithm (Deb et al., 2002). Firstly, it generates an initial population to seed the algorithm. Secondly, these candidates evolve into offspring over a series of generations by applying crossover and mutation operators, each with its own probability. In our reproduction experiment, we use single-point crossover with probability 0.8 and bit-flip mutation with probability $1/n$ (where $n$ is the number of test cases). Thirdly, parents for the next generation are chosen by the selection operator, which uses a non-dominated sorting algorithm to select the top non-dominated solutions (Panichella et al., 2017). When a front needs to be split because it exceeds the population size, NSGA-II uses the crowding distance to rank the candidates in that front.
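For illustration only (a sketch of the genetic operators just described, not Arrieta et al.'s implementation), single-point crossover and bit-flip mutation over binary test-selection vectors might look like this:

```python
import random

def single_point_crossover(parent1, parent2, p_crossover=0.8):
    """With probability p_crossover, swap the tails of two binary
    selection vectors at a random cut point; otherwise copy the parents."""
    if random.random() < p_crossover and len(parent1) > 1:
        cut = random.randrange(1, len(parent1))
        return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]
    return parent1[:], parent2[:]

def bitflip_mutation(individual, p_mutation=None):
    """Flip each bit with probability 1/n (n = number of test cases)."""
    n = len(individual)
    p = p_mutation if p_mutation is not None else 1.0 / n
    return [1 - bit if random.random() < p else bit for bit in individual]

# Example: two candidate selections over a six-test suite.
c1, c2 = single_point_crossover([1, 0, 1, 1, 0, 0], [0, 1, 0, 0, 1, 1])
child = bitflip_mutation(c1)
```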

3.2.3. NSGA-III

NSGA-III is an improved version of NSGA-II (Deb and Jain, 2013). In NSGA-III, all procedures such as initial population generation, crossover, and mutation are similar to NSGA-II, except for the selection procedure. In NSGA-III, selection is based on a set of reference points. The reference points are uniformly distributed on the normalized hyper-plane using a given number of divisions (Deb and Jain, 2013). After that, each objective point is adaptively normalized and associated with a reference point by computing its distance to the corresponding reference line. A niche-preservation operation is then applied to select the candidates used in the next generation (Deb and Jain, 2013).

3.2.4. MOEA/D

MOEA/D is the first multi-objective optimization algorithm that utilizes a decomposition technique (Zhang and Li, 2007). More specifically, MOEA/D explicitly decomposes the problem into multiple sub-problems, with fewer objectives in each subgroup, and solves these sub-problems simultaneously (Zhang and Li, 2007). To do so, prior to inference, all examples are assigned random weights for their goals. Examples are then clustered by those weights, such that each example knows the space of other examples that are weighted in a similar direction. Next, during execution, if one example finds a way to improve itself, its local neighborhood moves in the same direction.
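As a hedged illustration of the decomposition idea only (a weighted sum is one of several scalarization schemes MOEA/D can use; the weight vectors here are illustrative):

```python
def weighted_sum(goals, weights):
    """Scalarize one candidate's goal vector into a single sub-problem score
    (goals are assumed normalized so that lower is better for every goal)."""
    return sum(g * w for g, w in zip(goals, weights))

# Three sub-problems over two goals, each with its own weight vector.
weight_vectors = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]
candidate = (0.2, 0.7)
scores = [weighted_sum(candidate, w) for w in weight_vectors]
print(scores)  # the candidate looks best to the sub-problem that weights goal 1 highly
```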

3.2.5. DoLesS

Figure 4 shows the entire framework of our approach. Unlike the above evolutionary algorithms, our proposed approach DoLesS (Domination with Least Squares Approximation):

  • Uses continuous domination (defined above; see the first block in Figure 4) to reduce the size of the initial, large, random set of goals and find a “best” group of representative samples.

  • For each data entry in the “best” group, DoLesS then uses a least squares approximation technique (the third block in Figure 4) to inversely predict the test selection outcomes which best fit the representative sets of goals. This least squares approximation technique is discussed below.

After sorting on the domination score, DoLesS divides data into:

  • The “best” items. In our case study, we randomly generate 10,000 initial candidates; hence the 100 candidates with the highest domination scores are grouped into the “best” group.

  • And the remaining “rest” items.

Here we select the size of the final population as 100 for two reasons: (a) 10,000 random initial candidates are enough to cover a wide range of possible outcomes, and (b) to make the comparison fair, we select the same number of final candidates as the previous work (Arrieta et al., 2019a).

In the data processing stage of Figure 4, we take the data from each model (the same data Arrieta et al. used in their study (Arrieta et al., 2019a)) and process it into a least squares approximation structure by combining it with the representative goals generated by continuous domination. Table 1(i) shows a simple example of the effectiveness measurement data collected from the models. Each test case has a single score for each effectiveness measure. The corresponding matrix equation system for this example is shown in Table 1(ii). This equation system expresses the linear relationship between the test selection outcomes and the final effectiveness measure scores (e.g. the final score of effectiveness measure 1 is a weighted sum of the per-test scores, where the weights $x_1, \ldots, x_n$ are the outcomes of the test selection). In this example, our goal is to find the best outcomes $x_1, \ldots, x_n$ that reproduce the target scores. To summarize, in our approach we collect effectiveness measurement data for the $n$ test cases (as in Table 1(i)) and seek the set of outcomes $x_1, \ldots, x_n$ that gets closest to the representative goals selected by continuous domination.

Table 1. An example of (i) collected effectiveness measurement data, with a score for each test case on each effectiveness measure (EM 1 to EM 5, where EM means effectiveness measure), and (ii) its corresponding matrix equation form.

Linear least squares approximation is a method which predicts the best values of a set of unknown variables to fit the relationship between expected and observed data. In general, solving a system of linear equations $Ax = b$ exactly may yield no solution or infinitely many solutions. This happens when (a) the number of constraints (equations) is greater than the number of variables (overdetermined) or (b) the number of variables is greater than the number of constraints (underdetermined). Finding the best approximate solution in these cases is called linear least squares approximation. As mentioned above, in our study we have 5 goals and $n$ test cases (where $n > 5$). Thus, our equation system contains 5 equations (one per goal) and $n$ variables. In this case, finding possible selections of test cases becomes an underdetermined least squares approximation. Problem: find $x$ that minimizes

(10)    $\| Ax - b \|_2^2 \quad \text{subject to} \quad 0 \le x_i \le 1$

In the formulation $Ax = b$, where $x$ is the outcome vector over the $n$ test cases, we want to predict a value (0 or 1) for each entry of $x$. The computed $x$ is a vector of floating point numbers constrained to the range 0 to 1, where a coefficient closer to 0 indicates a lower effect on the final score and a coefficient closer to 1 indicates a higher effect. Since a test selection outcome can only be 0 (discard that test) or 1 (select that test), we use a threshold of 0.5: a value < 0.5 means a higher chance of being 0 and a value > 0.5 means a higher chance of being 1. For each representative candidate found by continuous domination, DoLesS finds the test selection whose scores are closest to that candidate. Although there is some delta between the original ideal scores and the true scores of the predicted selection (because of the approximation), our results show that least squares approximation finds adequate test cases which perform as well as or better than the previous state-of-the-art. Our implementation of the above step uses scipy.optimize, a Python library, and its function lsq_linear, which solves the above problem using either a dense QR decomposition or a Singular Value Decomposition technique.
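As a hedged sketch of this step (the variable names and the random data here are our own; the paper only states that scipy's lsq_linear is used), the bounded least squares solve and the 0.5 threshold might look like:

```python
import numpy as np
from scipy.optimize import lsq_linear

# Hypothetical data: 5 effectiveness measures x 20 test cases (A), and one
# representative goal vector b chosen by continuous domination.
rng = np.random.default_rng(1)
A = rng.random((5, 20))
b = rng.random(5)

# Solve the underdetermined system with every coefficient bounded to [0, 1].
result = lsq_linear(A, b, bounds=(0.0, 1.0))

# Threshold at 0.5: coefficients above 0.5 are read as "select this test".
selection = (result.x > 0.5).astype(int)
print(selection.sum(), "tests selected out of", len(selection))
```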

Finally, in the evaluation stage of Figure 4, DoLesS selects the Pareto front set (in multi-objective optimization, the Pareto front of a set of individuals is the set of all examples not dominated by any other example) from the final population, as Arrieta et al. did in their study (Arrieta et al., 2019a). All evaluations are made over 20 repeats of each algorithm.

Project Name Two Tanks CW AC Engine EMB
# Input Signals 11 15 4 1
# Output Signals 7 4 1 1
# Test Cases 150 133 120 150
# Mutants 6 96 12 18
Table 2. Summary of number of I/O signals, number of test cases, and number of mutants in four case studies

4. Experimental Setup

4.1. Case Studies

We use four cyber-physical system (CPS) models to evaluate our proposed approach. These four models come from the previous state-of-the-art study (Arrieta et al., 2019a). We use the test cases and mutants that Arrieta et al. generated from these four models (https://github.com/aitorarrietamarcos/IST2019Paper). The numbers of initial test cases and mutants are summarized in Table 2. In that table:

  • the Two Tanks project is a model that simulates the incoming and outgoing flows of the tanks (Menghi et al., 2019);

  • the CW project is a model that simulates the electrics and mechanics of four car windows (Arrieta et al., 2019a);

  • the AC Engine project is a model that simulates some safety functionalities of an AC engine (Arrieta et al., 2019b, 2017);

  • and the EMB project simulates a software model controller which includes a continuous PID controller and a discrete state machine (Matinnejad et al., 2017).

At first glance, the case studies in Table 2 may appear very small. But appearances can be deceiving; e.g. the number of input signals is a poor measure of the internal complexity of a cyber-physical system. As shown in our RQ1 results, the systems of Table 2 are so complex that, for the purposes of test suite minimization, they defeated state-of-the-art optimizers (NSGA-III and MOEA/D).

4.2. Performance Criteria

To evaluate the selected test cases, we use two evaluation metrics from prior work (Arrieta et al., 2019a). These two evaluation metrics are (a) normalized test execution time and (b) mutant detection score. The previous study used these two metrics to calculate the hypervolume indicator and the average weighted sum of mutation score and normalized test execution time (Arrieta et al., 2019a), while in our study we directly compare the performance of the algorithms on these two metrics.

Normalized test execution time (TET-): Our goal in selecting test cases from the initial test suite is to speed up the testing process. Thus, test execution time is a very important indicator of whether the selected test cases significantly reduce the cost of testing. In this study we want to minimize this value since, as discussed in our introduction, the whole point of this paper is to reduce the time required for testing cyber-physical systems.

Mutant detection score (MS+): If a set of selected test cases significantly reduces the test execution time but cannot detect most of the mutants, then that selection is a bad choice. Our goal in selecting test cases is to detect as many mutants as possible while minimizing the test execution time. Therefore, the mutant detection score is another important evaluation metric. In this study we want to maximize this value, since a higher value means the test suite is better because it can detect more mutants.

That is, a good test case selection approach can both (a) minimize the test execution time and (b) maximize the mutant detection score.

Project Approach TET- MS+
Twotanks NSGA-II 0.30 1
NSGA-III 0.49 1
MOEA/D 0.54 1
CW NSGA-II 0.39 0.99
NSGA-III 0.61 0.98
MOEA/D 0.68 0.99
ACEngine NSGA-II 0.38 0.73
NSGA-III 0.61 0.72
MOEA/D 0.65 0.73
EMB NSGA-II 0.37 1
NSGA-III 0.54 1
MOEA/D 0.63 1
Table 3. RQ1 results: Reproduction results of Arrieta et al.’s study (Arrieta et al., 2019a). The metric with ”-” means less is better while ”+” means more is better. The light gray cell in each project means that approach wins others significantly (as computed by the statistical method in §4.3).
Project Approach Best Combination Time- Discontinuity+ Infinity+ Instability+ MinMax+ Wins
Twotanks NSGA-II Time, Infinite, Minmax 0.30 0.54 0.65 0.55 0.65 1
DoLesS - 0.30 0.56 0.67 0.56 0.67 5
CW NSGA-II Time, Instability 0.39 0.55 0.34 0.64 0.34 2
DoLesS - 0.36 0.50 0.58 0.44 0.58 3
ACEngine NSGA-II Time, Discontinuity, Instability 0.38 0.53 0.49 0.49 0.49 4
DoLesS - 0.30 0.47 0.44 0.41 0.44 1
EMB NSGA-II Time, Instability 0.37 0.50 0.49 0.57 0.50 1
DoLesS - 0.36 0.61 0.61 0.48 0.61 4
Table 4. RQ2 results: Scores of five effectiveness measurement metrics calculated by sets of selected test cases. All entries report the median score of 20 repeats. In the title row, the metric with “-” means we want to minimize that metric and the metric with “+” means we want to maximize that metric. The light gray cell in each project means that approach wins over another approach significantly (as computed by the statistical method of §4.3) in that metric. Last column counts the number of wins for each approach.

4.3. Statistical Analysis

In our study, we record the values of the above two evaluation metrics over 20 repeats. To compare the overall performance of the different algorithms, we use a Scott-Knott analysis (Mittas and Angelis, 2012). The Scott-Knott analysis sorts the candidates by their values, and assigns candidates to different ranks if the values of the candidate at position $i$ are significantly different (by more than a small effect size) from the values of the candidate at position $i+1$ (Ling et al., 2021).

More precisely, Scott-Knott sorts the candidates by their median scores (in our study, the candidates are the test case selection approaches). The Scott-Knott method then splits the sorted candidates into two sub-lists that maximize the expected difference in the observed performances before and after the division (Tu et al., 2020). That is, Scott-Knott declares as the best split the one that maximizes the difference in the expected mean value before and after the split (Xia et al., 2018; Tu et al., 2021):

(11)    $E(\Delta) = \frac{|l_1|}{|l|} \left| \overline{l_1} - \overline{l} \right|^2 + \frac{|l_2|}{|l|} \left| \overline{l_2} - \overline{l} \right|^2$

where $|l|$, $|l_1|$, and $|l_2|$ are the sizes of the list $l$ and its sub-lists $l_1$ and $l_2$, and $\overline{l}$, $\overline{l_1}$, and $\overline{l_2}$ are the mean values of $l$, $l_1$, and $l_2$.

After the best split, Scott-Knott applies a statistical hypothesis test to check the division. If the two sub-lists $l_1$ and $l_2$ produced by the division differ significantly according to the hypothesis test, then the division is deemed a “useful” division. Scott-Knott then runs recursively on each half of the best division until no further division can be made. In our study, we use the Cliff's delta non-parametric effect size measure as the hypothesis test. Cliff's delta quantifies the amount of difference between two lists of observations beyond p-value interpretation (Xia et al., 2018). A division passes the hypothesis test if its effect is not “small” (i.e. $|\delta| \ge 0.147$). The Cliff's delta non-parametric effect size test compares two lists $A$ and $B$ of sizes $m$ and $n$:

(12)    $\delta = \dfrac{\#(a_i > b_j) - \#(a_i < b_j)}{m \times n}$

In the above formula, Cliff's delta estimates the probability that a value in list $A$ is greater than a value in list $B$, minus the reverse probability (Macbeth et al., 2011). This hypothesis test and its effect size thresholds are supported by Hess and Kromrey (Hess and Kromrey, 2004).
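For reference, a minimal sketch of the Cliff's delta computation of Eq. (12) and the "small effect" check (our own illustration, not the exact implementation used in the experiments):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: fraction of (x, y) pairs with x > y minus the
    fraction with x < y, over all m*n cross pairs."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def significantly_different(xs, ys, small_effect=0.147):
    """Treat two result lists as distinct only if the effect is not 'small'."""
    return abs(cliffs_delta(xs, ys)) >= small_effect

# Example: TET- scores from repeated runs of two approaches.
print(significantly_different([0.30, 0.31, 0.29], [0.39, 0.40, 0.38]))  # True
```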

5. Results

Returning now to the research questions offered in the introduction, we offer the following results.

RQ1: Can We Verify that Test Case Selection for Multi-goal Cyber-physical Systems is a Hard Problem? To answer RQ1, we replicate Arrieta et al.'s experiment (Arrieta et al., 2019a). Table 3 shows our replication results over 20 repeats (with different random number seeds). Two algorithms differ significantly if they are placed in different ranks by the Scott-Knott analysis.

As seen in Table 3, the NSGA-II-based approach that Arrieta et al. proposed (Arrieta et al., 2019a) performs better than the other multi-goal optimizers (MOEA/D and NSGA-III). Specifically, NSGA-II has higher performance on both test execution time and mutation score in all four case studies. However, even though NSGA-II beats the other methods, we cannot sanction its use. NSGA-II has all the problems discussed in §2. Specifically, NSGA-II can only handle pairs (or triples) of goals. Hence, it has to be re-run multiple times to explore all the small subsets of the five goals. As shown below, we can achieve better results, orders of magnitude faster. Therefore, we answer RQ1 as follows: Test case selection for multi-goal cyber-physical models is a hard problem that cannot be solved by merely applying, off-the-shelf, the latest optimizer technology.

These RQ1 results motivate the rest of this paper, where we develop a fast approach that can handle multiple goals simultaneously for cyber-physical systems.

RQ2: Can DoLesS Find Better Ways to Select Test Cases which Result in Test Suites with Higher Effectiveness Measure (Objective) Scores? To answer RQ2, we calculate the scores of effectiveness measures (test execution time, discontinuity, growth to infinity, instability, minimum and maximum difference) after we generate the subsets of selected test cases.

Table 4 shows our simulation results. For each project, we use Scott-Knott statistical analysis to compare the performance across 20 repeats. Table 4 reports the median scores for each effectiveness measurement metric. To visualize the final results, we mark the winning approach in each metric by light gray, and count the number of wins in the last column.

As seen in Table 4, DoLesS wins in three out of four projects (Twotanks, CW, and EMB). Moreover, in Twotanks and EMB, our proposed approach achieves higher scores on most of the effectiveness measurement metrics while the previous approach wins on only one (DoLesS wins on all effectiveness measurement metrics in Twotanks and on 4 out of 5 in EMB). These outcomes indicate that, in most cases, our proposed approach improves significantly over the previous approach on these five metrics. Moreover, in half of the projects, the state-of-the-art approach concentrates on optimizing the Time and Instability metrics (because of its algorithm design). With DoLesS, by contrast, most of the goals are optimized equally. Hence, DoLesS handles multiple goals all the time, while the state-of-the-art method can only handle one or two goals in some cases.

Summarizing the above findings, we answer RQ2 as follows: DoLesS can find better test selections than the state-of-the-art approach in terms of the five effectiveness measurement metrics.

Project Approach TET- MS+ Wins
Twotanks NSGA-II 0.30 1 2
DoLesS 0.30 1 2
CW NSGA-II 0.39 0.98 1
DoLesS 0.36 0.95 1
ACEngine NSGA-II 0.39 0.72 1
DoLesS 0.30 0.72 2
EMB NSGA-II 0.37 1 1
DoLesS 0.35 1 2
Table 5. RQ3 results: Scores of the two evaluation metrics calculated from the sets of selected test cases. All entries report the median score of 20 repeats. In the title row, a metric with “-” means less is better while “+” means more is better. The light gray cells mark the winning approach (as computed by the statistical method in §4.3) on that metric. The last column counts the number of wins for each approach.

RQ3: Do the test cases selected by DoLesS beat the prior state-of-the-art? To answer RQ3, we compare the scores of TET- (normalized test execution time) and MS+ (mutant detection score) between the prior state-of-the-art and DoLesS. It is important to have high performance on these two evaluation metrics since they directly indicate whether a test case selection is good or not. Table 5 shows our simulation results. For each method on each project, we repeat the experiment 20 times and calculate the value of the two evaluation metrics for each repeat. To reach the final conclusion, we use the Scott-Knott statistical method to check whether our approach differs significantly from the state-of-the-art approach on each metric. The light gray cells mark the winning method (the first rank) from the Scott-Knott test. Moreover, we count the number of wins on these two evaluation metrics for each algorithm and record that number in the last column.

As seen in Table 5, DoLesS performs better on both evaluation metrics in two out of four projects (ACEngine & EMB). Moreover, in Twotanks, DoLesS and the state-of-the-art method achieve the same performance (both evaluation metrics are tied in the first rank). In the CW project, DoLesS has higher performance in minimizing test execution time, while the state-of-the-art method gets a slightly higher mutation score than DoLesS. Taking the above comparisons together, we conclude that the test cases selected by DoLesS achieve similar or better performance in minimizing test execution time while detecting most of the mutants, across all projects.

Summarizing the above findings, we answer RQ3 as follows: In all projects, compared to the state-of-the-art, DoLesS achieves similar or better performance in minimizing the execution time of the selected test cases while still detecting most of the mutants.

RQ4: Is DoLesS far more efficient than the prior state-of-the-art in terms of running time? To answer RQ4, we measure the execution time of both algorithms during the experiment. To make the comparison fair, we run both algorithms on the same 64-bit Windows 10 machine with a 4.2 GHz 8-core Intel Core i7 processor and 16 GB of RAM. Moreover, when running the experiments, we make sure no other large process starts or ends on the machine.

Table 6 shows the recorded runtime for each project. For each method, we repeat experiments 20 times and record the total runtime. The light gray cells mark the fastest approach.

Project Approach RunTime (s) Speed Up (times faster)
Twotanks NSGA-II 11964.6 83
DoLesS 144.7
CW NSGA-II 15409.9 362
DoLesS 42.6
ACEngine NSGA-II 14042.1 179
DoLesS 78.6
EMB NSGA-II 12585.6 319
DoLesS 39.5
Table 6. RQ4 results: Runtime comparison of our proposed DoLesS and the state-of-the-art approach. The light gray cell marks the fastest approach in each project. The last column shows how many times faster DoLesS is than the state-of-the-art.

As seen in Table 6, in all four projects, DoLesS runs significantly faster (80-360 times) than the previous method. Analyzing our proposed algorithm and the state-of-the-art approach, we find that the previous approach uses NSGA-II as its multi-objective optimizer, which is designed for 2 or 3 objectives (Panichella et al., 2017). To handle this issue, Arrieta et al. group the objectives into 21 different combinations of two or three objectives each, and select one of the best combinations by repeating their approach on those 21 groups (Arrieta et al., 2019a). In our approach, however, we only need continuous domination to find “ideal goals” and then approximate the corresponding test cases inversely.

Summarizing the above findings, we answer RQ4 as follows: In all four projects, DoLesS runs significantly faster (80-360 times) than the previous method. In other words, DoLesS is far more efficient than the state-of-the-art approach.

Even though our current empirical results only show a speed-up of up to roughly 360 times, we can make a theoretical case that, as the number of goals increases, our technique becomes comparatively even faster (note that, in the following counts, test execution time is the metric that must appear in every combination; a small counting sketch follows this list):

  • 5 goals result in 10 different combinations of 2 or 3 objectives.

  • 7 goals result in 21 different combinations of 2 or 3 objectives.

  • 9 goals result in 36 different combinations of 2 or 3 objectives.

  • The above pattern shows that, as more goals are used, the number of repeats required by the state-of-the-art NSGA-II approach grows quadratically with the number of goals. Our approach, in contrast, can handle multiple goals simultaneously in a single run. This shows the efficiency of our approach.
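As a quick check of those counts (our own sketch, using the convention above that test execution time appears in every combination, so each combination adds 1 or 2 of the remaining goals):

```python
from math import comb

def n_combinations(total_goals):
    """Number of 2- or 3-objective combinations when execution time is
    always included, i.e. choose 1 or 2 of the remaining goals."""
    remaining = total_goals - 1
    return comb(remaining, 1) + comb(remaining, 2)

for goals in (5, 7, 9):
    print(goals, n_combinations(goals))   # 5 -> 10, 7 -> 21, 9 -> 36
```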

6. Threats to Validity

This section discusses issues raised by Feldt et al. (Feldt and Magazinius, 2010).

Construct validity: The construct validity threat mainly lies in the parameter settings of the algorithms. For example, in our replication experiment, we use one-point crossover with 0.8 crossover probability and bit-flip mutation with a mutation probability of $1/n$ (where $n$ is the number of variables), as prior studies did, in order to stay consistent. As another example, in the least squares approximation, we use a 0.5 threshold to decide whether a test case has a high probability of being chosen. Moreover, we use the default settings of the Python least squares solver in our algorithm. Changing these parameters can result in different selected test cases. Therefore, our observations may differ when different parameters are used. We will consider hyper-parameter tuning (Tu et al., 2021; Tu and Nair, 2018) in future work to mitigate this threat.

Conclusion validity: The conclusion validity threat in this study relates to the random variations of our algorithm. To reduce the effect of this threat, we repeat all experiments 20 times on the same machine. Moreover, we apply the Scott-Knott statistical test to check whether the outcomes of our proposed approach and the previous methods differ significantly.

Internal validity: Internal validity concerns whether the treatment actually caused the outcome. In this study, we constrain our simulations to the same data set, and we evaluate our approach and the previous approach in the same workflow. Another internal validity threat relates to the mutants generated from the projects. To mitigate this threat, we use the same mutants that Arrieta et al. (Arrieta et al., 2019a) used in their study, from which they had already removed duplicated mutants.

External validity: External validity concerns the application of our algorithm to other problems. In this study, we draw our conclusions from four real-world Simulink cyber-physical systems. When applying our method to other case studies, the following concerns may arise: (a) DoLesS may not be applicable to projects where the effectiveness measurement data cannot be obtained; (b) DoLesS may need modification for projects in which the effectiveness measurement data and the test cases are not linearly related (for those projects, we will consider non-linear least squares approximation as future work to mitigate this threat); (c) other algorithms may perform better if other information is utilized.

7. Conclusion & Future Work

Finding representative test cases in an initial test suite is an important task in simulation-based testing. Better test case selection methods can not only reduce the test execution time of future testing, but also maintain the same testing performance. In other words, a good test case selection approach can (a) minimize the test execution time and (b) maximize the mutation score.

Previous work by Arrieta et al. (Arrieta et al., 2019a) has shown great success using NSGA-II as the multi-objective optimization method to select representative test cases. However, their design has a deficiency in that it must evaluate 21 combinations of goals in order to select the best subset. Moreover, since NSGA-II is a randomized algorithm, repeats are necessary during the experiment. Therefore, their approach executes NSGA-II 420 times over 20 repeats.

In this study, we address this deficiency by selecting test cases using all the effectiveness measurement metrics at once. To do that, we use a very fast approach, continuous domination, to select representative goals. Moreover, we exploit the linear relationship between test cases and goals to find the best test selections corresponding to the representative goals (by linear least squares approximation). Our experimental results show that our proposed approach, DoLesS, can (a) achieve better scores on the five effectiveness measurement metrics, (b) achieve better scores on the two evaluation metrics, and (c) solve the test selection problem very quickly (80-360 times faster than the state-of-the-art method) for cyber-physical models.

We conjecture that our method is a better candidate for scaling to large systems than the method proposed by Arrieta et al. (Arrieta et al., 2019a). To see this, consider the following scenario. To perform test case selection on the selected cyber-physical case studies, Arrieta et al.'s approach required 3-5 hours of execution time. Now imagine higher-complexity simulation models (e.g. drone simulation models) with dozens more test cases in the initial test suite and more signal processing criteria on the I/O signals; both the evaluation time and the number of objectives increase. In such a scenario, the execution time of running NSGA-II on all subsets of the objectives grows rapidly, as discussed at the end of RQ4. Moreover, such models (e.g. drone simulation models) require far faster feedback than the usual cyber-physical models. For these reasons, the ideal test case selection approach for complex simulation models must handle multiple goals (more than 4 goals) at the same time and perform the selection in a very short time to give fast feedback. For that task, we recommend DoLesS.

As for future work, apart from extending this exploration of feedback-loop anti-patterns, we conjecture that our methods could be useful for other multi-objective reasoning tasks. Standard practice in this area is to mutate large populations across a Pareto frontier. This has certainly been a fruitful research agenda (Maia et al., 2009; Yoo and Harman, 2010; Panichella et al., 2014; Zheng et al., 2016). But perhaps the testing community could reason about more goals, faster, if it used our domination and least squares methods to “reason backwards” from goal space to decision space. Hence, future work can include:


  • Finding more simulation projects that can strengthen the evidence for our approach.

  • Developing more effectiveness measurement metrics that better indicate representative test cases.

  • Adjusting our approach to the testing scenarios of different projects; in some cases, an entirely new test case selection approach may be needed.

Acknowledgments

This research was partially funded by blinded for review.

References

  • Agrawal et al. (2020) Amritanshu Agrawal, Tim Menzies, Leandro L Minku, Markus Wagner, and Zhe Yu. 2020. Better software analytics via “DUO”: Data mining algorithms using/used-by optimizers. Empirical Software Engineering 25, 3 (2020), 2099–2136.
  • Ahmed (2016) Bestoun S Ahmed. 2016. Test case minimization approach using fault detection and combinatorial optimization techniques for configuration-aware structural testing. Engineering Science and Technology, an International Journal 19, 2 (2016), 737–753.
  • Arrieta et al. (2019a) Aitor Arrieta, Shuai Wang, Urtzi Markiegi, Ainhoa Arruabarrena, Leire Etxeberria, and Goiuria Sagardui. 2019a. Pareto efficient multi-objective black-box test case selection for simulation-based testing. Information and Software Technology 114 (2019), 137–154.
  • Arrieta et al. (2017) Aitor Arrieta, Shuai Wang, Urtzi Markiegi, Goiuria Sagardui, and Leire Etxeberria. 2017. Employing multi-objective search to enhance reactive test case generation and prioritization for testing industrial cyber-physical systems. IEEE Transactions on Industrial Informatics 14, 3 (2017), 1055–1066.
  • Arrieta et al. (2016) Aitor Arrieta, Shuai Wang, Goiuria Sagardui, and Leire Etxeberria. 2016. Search-based test case selection of cyber-physical system product lines for simulation-based validation. In Proceedings of the 20th International Systems and Software Product Line Conference. 297–306.
  • Arrieta et al. (2019b) Aitor Arrieta, Shuai Wang, Goiuria Sagardui, and Leire Etxeberria. 2019b. Search-based test case prioritization for simulation-based testing of cyber-physical system product lines. Journal of Systems and Software 149 (2019), 1–34.
  • Binh et al. (2016) Nguyen Thanh Binh, Khuat Thanh Tung, et al. 2016. A novel fitness function of metaheuristic algorithms for test data generation for simulink models based on mutation analysis. Journal of Systems and Software 120 (2016), 17–30.
  • Binkley (1995) David Binkley. 1995. Reducing the cost of regression testing by semantics guided test case selection. In Proceedings of International Conference on Software Maintenance. IEEE, 251–260.
  • Cartaxo et al. (2011) Emanuela G Cartaxo, Patrícia DL Machado, and Francisco G Oliveira Neto. 2011. On the use of a similarity function for test case selection in the context of model-based testing. Software Testing, Verification and Reliability 21, 2 (2011), 75–100.
  • Chen et al. (2018) Jianfeng Chen, Vivek Nair, Rahul Krishna, and Tim Menzies. 2018. “Sampling” as a baseline optimizer for search-based software engineering. IEEE Transactions on Software Engineering 45, 6 (2018), 597–614.
  • Chen and Lau (2001) Tsong Yueh Chen and Man Fai Lau. 2001. Test case selection strategies based on boolean specifications. Software Testing, Verification and Reliability 11, 3 (2001), 165–180.
  • Chowdhury et al. (2018) Shafiul Azam Chowdhury, Soumik Mohian, Sidharth Mehra, Siddhant Gawsane, Taylor T Johnson, and Christoph Csallner. 2018. Automatically finding bugs in a commercial cyber-physical system development tool chain with SLforge. In Proceedings of the 40th International Conference on Software Engineering. 981–992.
  • Deb and Jain (2013) Kalyanmoy Deb and Himanshu Jain. 2013. An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: solving problems with box constraints. IEEE transactions on evolutionary computation 18, 4 (2013), 577–601.
  • Deb et al. (2002) Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE transactions on evolutionary computation 6, 2 (2002), 182–197.
  • Di Nardo et al. (2015) Daniel Di Nardo, Nadia Alshahwan, Lionel Briand, and Yvan Labiche. 2015. Coverage-based regression test case selection, minimization and prioritization: A case study on an industrial system. Software Testing, Verification and Reliability 25, 4 (2015), 371–396.
  • Elbaum et al. (2014) Sebastian Elbaum, Gregg Rothermel, and John Penix. 2014. Techniques for improving regression testing in continuous integration development environments. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 235–245.
  • Engström et al. (2010) Emelie Engström, Per Runeson, and Mats Skoglund. 2010. A systematic review on regression test selection techniques. Information and Software Technology 52, 1 (2010), 14–30.
  • Feldt and Magazinius (2010) Robert Feldt and Ana Magazinius. 2010. Validity threats in empirical software engineering research-an initial survey.. In Seke. 374–379.
  • González et al. (2018) Carlos A González, Mojtaba Varmazyar, Shiva Nejati, Lionel C Briand, and Yago Isasi. 2018. Enabling model testing of cyber-physical systems. In Proceedings of the 21th ACM/IEEE international conference on model driven engineering languages and systems. 176–186.
  • Grindal et al. (2006) Mats Grindal, Birgitta Lindström, Jeff Offutt, and Sten F Andler. 2006. An evaluation of combination strategies for test case selection. Empirical Software Engineering 11, 4 (2006), 583–611.
  • Hess and Kromrey (2004) Melinda R Hess and Jeffrey D Kromrey. 2004. Robust confidence intervals for effect sizes: A comparative study of Cohen’s d and Cliff’s delta under non-normality and heterogeneous variances. In Annual Meeting of the American Educational Research Association. Citeseer, 1–30.
  • Jia and Harman (2010) Yue Jia and Mark Harman. 2010. An analysis and survey of the development of mutation testing. IEEE transactions on software engineering 37, 5 (2010), 649–678.
  • Lachmann et al. (2017) Remo Lachmann, Michael Felderer, Manuel Nieke, Sandro Schulze, Christoph Seidl, and Ina Schaefer. 2017. Multi-objective black-box test case selection for system testing. In Proceedings of the Genetic and Evolutionary Computation Conference. 1311–1318.
  • Ling et al. (2021) Xiao Ling, Rishabh Agrawal, and Tim Menzies. 2021. How Different is Test Case Prioritization for Open and Closed Source Projects. IEEE Transactions on Software Engineering (2021).
  • Macbeth et al. (2011) Guillermo Macbeth, Eugenia Razumiejczyk, and Rubén Daniel Ledesma. 2011. Cliff’s Delta Calculator: A non-parametric effect size program for two groups of observations. Universitas Psychologica 10, 2 (2011), 545–555.
  • Maia et al. (2009) Camila Loiola Brito Maia, Rafael Augusto Ferreira Do Carmo, Fabrıcio Gomes de Freitas, Gustavo Augusto Lima De Campos, and Jerffeson Teixeira De Souza. 2009. A multi-objective approach for the regression test case selection problem. In Proceedings of Anais do XLI Simposio Brasileiro de Pesquisa Operacional (SBPO 2009). 1824–1835.
  • Matinnejad et al. (2017) Reza Matinnejad, Shiva Nejati, and Lionel C Briand. 2017. Automated testing of hybrid Simulink/Stateflow controllers: industrial case studies. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 938–943.
  • Matinnejad et al. (2015) Reza Matinnejad, Shiva Nejati, Lionel C Briand, and Thomas Bruckmann. 2015. Effective test suites for mixed discrete-continuous stateflow controllers. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. 84–95.
  • Matinnejad et al. (2016) Reza Matinnejad, Shiva Nejati, Lionel C Briand, and Thomas Bruckmann. 2016. Automated test suite generation for time-continuous simulink models. In proceedings of the 38th International Conference on Software Engineering. 595–606.
  • Menghi et al. (2019) Claudio Menghi, Shiva Nejati, Khouloud Gaaloul, and Lionel C Briand. 2019. Generating automated and online test oracles for simulink models with continuous and uncertain behaviors. In Proceedings of the 2019 27th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering. 27–38.
  • Mittas and Angelis (2012) Nikolaos Mittas and Lefteris Angelis. 2012. Ranking and clustering software cost estimation models through a multiple comparisons algorithm. IEEE Transactions on software engineering 39, 4 (2012), 537–551.
  • Panichella et al. (2017) Annibale Panichella, Fitsum Meshesha Kifetew, and Paolo Tonella. 2017. Automated test case generation as a many-objective optimisation problem with dynamic selection of the targets. IEEE Transactions on Software Engineering 44, 2 (2017), 122–158.
  • Panichella et al. (2014) Annibale Panichella, Rocco Oliveto, Massimiliano Di Penta, and Andrea De Lucia. 2014. Improving multi-objective test case selection by injecting diversity in genetic algorithms. IEEE Transactions on Software Engineering 41, 4 (2014), 358–383.
  • Papadakis et al. (2015) Mike Papadakis, Yue Jia, Mark Harman, and Yves Le Traon. 2015. Trivial compiler equivalence: A large scale empirical study of a simple, fast and effective equivalent mutant detection technique. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. IEEE, 936–946.
  • Pradhan et al. (2016) Dipesh Pradhan, Shuai Wang, Shaukat Ali, and Tao Yue. 2016. Search-based cost-effective test case selection within a time budget: An empirical study. In Proceedings of the Genetic and Evolutionary Computation Conference 2016. 1085–1092.
  • Rothermel et al. (2000) Gregg Rothermel, Mary Jean Harrold, and Jeinay Dedhia. 2000. Regression test selection for C++ software. Software Testing, Verification and Reliability 10, 2 (2000), 77–109.
  • Sagardui et al. (2017) Goiuria Sagardui, Joseba Agirre, Urtzi Markiegi, Aitor Arrieta, Carlos Fernando Nicolás, and Jose María Martín. 2017. Multiplex: A co-simulation architecture for elevators validation. In 2017 IEEE International Workshop of Electronics, Control, Measurement, Signals and their Application to Mechatronics (ECMSM). IEEE, 1–6.
  • Sayyad et al. (2013) Abdel Salam Sayyad, Tim Menzies, and Hany Ammar. 2013. On the value of user preferences in search-based software engineering: A case study in software product lines. In 2013 35Th international conference on software engineering (ICSE). IEEE, 492–501.
  • Tu and Nair (2018) Huy Tu and Vivek Nair. 2018. While Tuning is Good, No Tuner is Best. In FSE SWAN.
  • Tu et al. (2021) Huy Tu, George Papadimitriou, Mariam Kiran, Cong Wang, Anirban Mandal, Ewa Deelman, and Tim Menzies. 2021. Mining Workflows for Anomalous Data Transfers. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR). 1–12. https://doi.org/10.1109/MSR52588.2021.00013
  • Tu et al. (2020) H. Tu, Z. Yu, and T. Menzies. 2020. Better Data Labelling with EMBLEM (and how that Impacts Defect Prediction). TSE (2020). https://doi.org/10.1109/TSE.2020.2986415
  • Wagner et al. (2007) Tobias Wagner, Nicola Beume, and Boris Naujoks. 2007. Pareto-, Aggregation-, and Indicator-Based Methods in Many-Objective Optimization. In Evolutionary Multi-Criterion Optimization, Shigeru Obayashi, Kalyanmoy Deb, Carlo Poloni, Tomoyuki Hiroyasu, and Tadahiko Murata (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 742–756.
  • Wang et al. (2013) Shuai Wang, Shaukat Ali, and Arnaud Gotlieb. 2013. Minimizing test suites in software product lines using weight-based genetic algorithms. In Proceedings of the 15th annual conference on Genetic and evolutionary computation. 1493–1500.
  • Wong et al. (1998) W Eric Wong, Joseph R Horgan, Saul London, and Aditya P Mathur. 1998. Effect of test set minimization on fault detection effectiveness. Software: Practice and Experience 28, 4 (1998), 347–369.
  • Wong et al. (1997) W Eric Wong, Joseph Robert Horgan, Aditya P Mathur, and Alberto Pasquini. 1997. Test set size minimization and fault detection effectiveness: A case study in a space application. In Proceedings Twenty-First Annual International Computer Software and Applications Conference (COMPSAC’97). IEEE, 522–528.
  • Xia et al. (2018) Tianpei Xia, Rahul Krishna, Jianfeng Chen, George Mathew, Xipeng Shen, and Tim Menzies. 2018. Hyperparameter optimization for effort estimation. arXiv preprint arXiv:1805.00336 (2018).
  • Xu et al. (2005) Zhiwei Xu, Kehan Gao, and Taghi M Khoshgoftaar. 2005. Application of fuzzy expert system in test case selection for system regression test. In IRI-2005 IEEE International Conference on Information Reuse and Integration, Conf, 2005. IEEE, 120–125.
  • Yoo and Harman (2010) Shin Yoo and Mark Harman. 2010. Using hybrid algorithm for pareto efficient multi-objective test suite minimisation. Journal of Systems and Software 83, 4 (2010), 689–701.
  • Yoo and Harman (2012) Shin Yoo and Mark Harman. 2012. Regression testing minimization, selection and prioritization: a survey. Software testing, verification and reliability 22, 2 (2012), 67–120.
  • Zhang and Li (2007) Qingfu Zhang and Hui Li. 2007. MOEA/D: A multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on evolutionary computation 11, 6 (2007), 712–731.
  • Zheng et al. (2016) Wei Zheng, Robert M Hierons, Miqing Li, XiaoHui Liu, and Veronica Vinciotti. 2016. Multi-objective optimisation for regression testing. Information Sciences 334 (2016), 1–16.
  • Zitzler and Künzli (2004) Eckart Zitzler and Simon Künzli. 2004. Indicator-Based Selection in Multiobjective Search. In Lecture Notes in Computer Science. Springer Berlin Heidelberg, 832–842. https://doi.org/10.1007/978-3-540-30217-9_84