Faster Creation of Smaller Test Suites (with SNAP)

05/14/2019 ∙ by Jianfeng Chen, et al. ∙ NC State University

State-of-the-art theorem provers, combined with smart sampling heuristics, can generate millions of test cases in just a few hours. But given the heuristic nature of those methods, not all of those tests may be valid. Also, test engineers may find it too burdensome to run all those tests. Within a large space of tests, there can be redundancies (duplicate entries or similar entries that do not contribute much to overall diversity). Our approach, called SNAP, uses specialized sub-sampling heuristics to avoid generating those repeated tests. By avoiding those repeated structures, SNAP explores a smaller space of options. Hence, it is possible for SNAP to verify all its tests. To evaluate SNAP, this paper applied 27 real-world case studies from a recent ICSE'18 paper. Compared to prior results, SNAP's test case generation was 10 to 3000 times faster (median to max). Also, while prior work showed that only around 70% of its tests were valid, all of SNAP's tests are valid. Further, test engineers would find it relatively easiest to apply SNAP's tests since our test suites are 10 to 750 times smaller (median to max) than those generated using prior work.

I Introduction

This paper replicates and improves QuickSampler [1], a recent ICSE'18 paper which built test suites by applying theorem provers to logical formulas extracted from procedural source code. We found that QuickSampler generated test suites with many repeated entries. After applying some redundancy-avoidance heuristics (defined below), our new algorithm (called Snap) runs much faster than QuickSampler and returns much smaller test suites. This is useful since smaller test suites are simpler to execute and maintain.

To generate tests from programs, they must first be converted into a logical formula. Fig. 1 shows how this might be done. Symbolic/dynamic execution techniques [2, 3] extract the possible execution branches of a procedural program. Each branch $C_i$ is a conjunction of conditions, so the whole program can be summarized as the disjunction $C_1 \vee C_2 \vee \ldots \vee C_n$. Using De Morgan's rules (i.e. $\neg(a \wedge b) \equiv \neg a \vee \neg b$ and $\neg(a \vee b) \equiv \neg a \wedge \neg b$), these clauses can be converted to conjunctive normal form (CNF) where:

  • The inputs to the program are the variables in the CNF;

  • A test is valid if it uses input settings that satisfy the CNF.

  • A test suite is a set of valid tests.

  • One test suite is more diverse than another if it uses more of the variables within the CNF disjunctions. Diverse test suites are better since they cover more parts of the code.

Theorem provers like Z3, PicoSAT, MathSAT, or νZ [4, 5, 6, 7] can use this CNF as follows:

  • Generation: tests can be generated by “solving” the CNF; i.e. finding settings for the variables such that all clauses in the CNF evaluate to “true”. Given multiple disjunctions inside a CNF, one CNF formula can generate multiple tests.

  • Verification: check if variable settings satisfy the CNF.

  • Repair: patch tests that fail verification by (a) removing “dubious” variable settings then (b) asking the theorem prover to appropriately complete the missing bits. Later in this paper, we show how to find the “dubious” settings.

In terms of runtimes:

  • Verification is fastest (since there are no options to search);

  • Repair is somewhat slower (some more options to search);

  • And generation is slowest, since the theorem prover must search across all the competing CNF constraints (a minimal sketch of all three kinds of calls follows this list).
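
To make these three kinds of calls concrete, the following is a minimal illustrative sketch using Z3's Python API (z3py) on a toy three-variable CNF. Snap itself is implemented in C++ (§IV-A), and the names and helper functions below are our own, not the authors' code.

from z3 import Solver, Bool, Or, Not, sat, is_true

# Toy CNF: (a or b) and (not a or c)
a, b, c = Bool('a'), Bool('b'), Bool('c')
cnf = [Or(a, b), Or(Not(a), c)]

def generate():
    """Generation: ask the solver to search for any satisfying assignment."""
    s = Solver()
    s.add(cnf)
    if s.check() == sat:
        m = s.model()
        return {v: is_true(m.evaluate(v, model_completion=True)) for v in (a, b, c)}

def verify(test):
    """Verification: every variable is pinned, so the solver has nothing to search."""
    s = Solver()
    s.add(cnf)
    s.add([v == val for v, val in test.items()])
    return s.check() == sat

def repair(partial):
    """Repair: some settings were dropped; the solver completes only the missing bits."""
    s = Solver()
    s.add(cnf)
    s.add([v == val for v, val in partial.items()])
    if s.check() == sat:
        m = s.model()
        return {v: is_true(m.evaluate(v, model_completion=True)) for v in (a, b, c)}

t = generate()
print(verify(t))            # True: a generated test always satisfies the CNF
print(repair({a: False}))   # Z3 fills in b and c around the pinned setting of a

Here verify pins every variable while repair pins only some of them, which is why verification is the cheapest call and generation the most expensive.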

In practice, generation can be very slow indeed. When translated to CNF, the case studies explored in this paper require 5000 to 500,000 variables within 17,000 to 2.6 million clauses (median to max). Even with state-of-the-art theorem provers like Z3, generating a single test from these clauses can take as much as 20 minutes. Worse still, this process must be repeated many times to find enough tests to build a diverse test suite.


1 int mid(int x, int y, int z) {
2  if (x < y) {
3    if (y < z) return y;
4    else if (x < z) return z;
5    else return x;
6  } else if (x < z) return x;
7  else if (y < z) return z;
8  else return y; }

The code above has the six branches shown below. Each branch can be modeled as a logical constraint $C_i$. A valid test selects x, y, z such that at least one of these constraints is satisfied.

path 1: [C1: x < y < z] L2->L3
path 2: [C2: x < z < y] L2->L3->L4
path 3: [C3: z < x < y] L2->L3->L4->L5
path 4: [C4: y < x < z] L2->L6
path 5: [C5: y < z < x] L2->L6->L7
path 6: [C6: z < y < x] L2->L6->L7->L8
By convention, the disjunction $C_1 \vee C_2 \vee \ldots \vee C_6$ is transformed into conjunctive normal form (CNF). A valid assignment to the CNF, i.e. an assignment that satisfies all clauses, corresponds to a test case covering some branch of the code. When translated into the input required for Z3, these conjunctions look like:
1 p cnf 11511 41411
2 ...
3 -11507 11510 0
4 -11510 11504 11507 11502 0
5 ...
Line 1 indicates that this CNF expression has 11511 variables shared between 41411 clauses. The remaining lines list those 41411 clauses (a zero at the end of a line denotes end-of-clause). For example, lines 3 and 4 can be read as $(\neg v_{11507} \vee v_{11510}) \wedge (\neg v_{11510} \vee v_{11504} \vee v_{11507} \vee v_{11502})$.
Fig. 1: From source code (top left) to constraint solver. Note one detail: as x, y and z are integers, we could use (say) 7 bits to represent them. If so, then SMT tools [8] could find subsets of continuous ranges that satisfy the branch conjunctions. That said, this paper compares its methods to those of the QuickSampler study. Accordingly, we use the same conventions as QuickSampler; i.e. our constraints are just boolean variables.
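
To make the DIMACS notation above concrete, here is a small illustrative sketch (in Python, using z3py) that turns clause lines like those shown into Z3 boolean constraints. The function name and the variable-naming scheme are ours, not part of Z3 or of the original tooling.

from z3 import Bool, Or, Not, Solver, sat

def load_dimacs(lines):
    """Read DIMACS clause lines into z3: literal k maps to Bool('v<k>'),
    -k to its negation; the trailing 0 ends a clause."""
    cache, clauses = {}, []
    for line in lines:
        if line.startswith(('c', 'p')):                 # comments and the header
            continue
        lits = [int(tok) for tok in line.split()[:-1]]  # drop the end-of-clause 0
        clause = []
        for lit in lits:
            v = cache.setdefault(abs(lit), Bool('v%d' % abs(lit)))
            clause.append(v if lit > 0 else Not(v))
        clauses.append(Or(clause))
    return cache, clauses

_, clauses = load_dimacs(["p cnf 11511 41411",
                          "-11507 11510 0",
                          "-11510 11504 11507 11502 0"])
s = Solver()
s.add(clauses)
print(s.check() == sat)   # any model of the full CNF is one candidate test input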

To address the runtime issue, many researchers try to minimize the calls to the theorem prover. In that approach, heuristics are used to generate most of the tests. For example, QuickSampler assumes that adding the delta between two valid tests $t_b$ and $t_c$ to another valid test $t_a$ will produce a new valid test $t_d$; i.e.

$t_d = t_a \oplus (t_b \oplus t_c)$   (1)

(where $\oplus$ means “exclusive or”). Using Eq. 1, QuickSampler generates thousands, or even millions, of test cases within hours. But there is a catch: heuristically generated tests may be invalid. According to Dutra et al., around 30% of the tests built by QuickSampler can be invalid.
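
As an illustration, Eq. 1 is just bitwise exclusive-or over test vectors. In the sketch below (ours, not QuickSampler's code), each test is stored as a Python integer whose bits are the variable settings.

def eq1(t_a: int, t_b: int, t_c: int) -> int:
    """Eq. 1: flip, in the valid test t_a, exactly the variables on which
    the valid tests t_b and t_c disagree. The result is only a candidate:
    it may or may not satisfy the CNF."""
    delta = t_b ^ t_c
    return t_a ^ delta

# Example with 8 CNF variables packed into the low bits of an int:
candidate = eq1(0b10011000, 0b10100010, 0b00001000)
print(bin(candidate))   # a candidate test; it still needs verification (see RQ1)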

The research of this paper began when we observed that many of the test cases generated by QuickSampler were duplicates. For example, in the blasted_case47 case study, QuickSampler generated a very large number of samples within 2 hours; of these, only about one in 40 was a unique solution.

Based on this observation, we conjectured:

If a slow test generator has some redundancy, then to build a faster generator, avoid that redundancy.

To test this conjecture, we built Snap. Like QuickSampler, Snap uses a combination of Eq. 1 (sometimes) and Z3 (at other times). Snap also includes some specialized sub-sampling heuristics that strive to avoid redundant tests. Those heuristics are described later in this paper. For now, all we need to say is that Snap explores and generates a much smaller sample of solutions than QuickSampler. Hence it runs faster and generates smaller test suites.

This paper evaluates Snap via four research questions.

RQ1: How reliable is the Eq. 1 heuristic? One reason we advocate Snap is that, unlike QuickSampler, our method verifies each test (with Z3). But is that necessary? How often does Eq. 1 produce invalid tests? In our experiments, we can confirm Dutra et al.'s estimate that the percent of invalid tests generated by Eq. 1 is usually 30% or less. However, that median result does not characterize the variability of the distribution. We found that in a third of the case studies, the percent of valid tests generated by Eq. 1 is only 25% to 50% (median to max). Hence we say:

Conclusion #1: Eq. 1 should not be used without verification of the resulting test.

One useful feature of Snap is that its test suites are so small that it is possible to quickly verify all final candidate tests. That is, unlike QuickSampler, all of Snap's tests are valid.

RQ2: How diverse are the Snap test cases? Since Snap explores far fewer tests than QuickSampler, its test suites could be less diverse. Nevertheless:

Conclusion #2: The diversity of Snap's test suites is not markedly worse than that of QuickSampler.

RQ3: How fast is Snap? Snap was motivated by the observation that QuickSampler built many similar test cases; i.e. much of its analysis seemed redundant. If so, we would expect Snap to generate test cases much faster than QuickSampler (since it avoids redundant analysis). This prediction turns out to be true. In our case studies:

Conclusion #3: Snap was 10 to 3000 times faster than QuickSampler (median to max).

RQ4: How easy is it to apply Snap's test cases? Finally, we end on a pragmatic note. The smaller a test suite, the easier it is for programmers to run those tests. Therefore it is important to ask which method produces fewer tests: QuickSampler or Snap? We find that:

Conclusion #4: Snap's test suites were 10 to 750 times smaller than those of QuickSampler (median to max). Hence, we argue that it would be easiest for an industrial practitioner to execute and maintain Snap's test suites.

Reference Year Citation Sampling methodology Case study size (maxvariables) Verifying samples Distribution/ diversity reported
[9] 1999 105 Binary Decision Diagram 1.3K
[10] 2003 50 Interval-propagation-based 200
[11] 2004 54 Binary Decision Diagram K
[12] 2004 141 Random Walk + WalkSat No experiment conducted
[13] 2011 88 Sampling via determinism 6k
[14] 2012 25 MaxSat + Search Tree Experiment details not reported
[15] 2014 29 Hashing based 400K
[16] 2015 28 Hashing based (parallelized) 400K
[17] 2016 29 Universal hashing 400K
[1] 2018 5 Z3 + Eq. 1 flipping 400K
Snap 2019 this paper Z3 + Eq. 1 + local sampling 400K
Legend: the final two columns mark the absence or presence of the corresponding item; for some rows, only partial case studies (the small case studies) were reported.
TABLE I: SNAP and its related work for solving theorem proving constraints via sampling.

In summary, the unique contributions of this paper are:

  • A novel mutation-based sampling algorithm named Snap;

  • Experiments on common case studies that compare Snap to a recent state-of-the-art sampling method (QuickSampler, from ICSE'18);

  • Based on that comparison, we show that Snap generates much smaller test suites that are nearly as diverse as those of the prior state-of-the-art, and does so far faster. Further, 100% of our tests are valid (while other methods may only generate 70% valid tests, or less).

  • A reproduction package for this paper and for Snap (source code at https://github.com/ai-se/SatSpaceExpo).

The rest of this paper is structured as follows: §II introduces related work on this problem. §III describes the core algorithm of Snap. §IV presents the details of the experiments and the case studies. §V discusses the experimental results. Following that, §VI and §VII offer further discussion and conclusions.

II Related Work

Using the methods of Fig. 1, software often generates CNF with three or more variables per clause. Since the 3-SAT problem is NP-complete [18], generating tests from these clauses is an inherently slow process.

This problem has been explored for decades. One way to solve the theorem-proving problem is to simplify or decompose the CNF formulas. A recent example in this arena is GreenTire, proposed by Jia et al. [19]. GreenTire supports constraint reuse based on the logical implication relation among constraints. One advantage of this approach is its efficiency guarantees. Like the analytical methods of linear programming, such methods are always applied to a specific class of problem. However, even with the improved theorem prover, such methods may be difficult to adopt for large models.

GreenTire was tested on 7 case studies. Each case study corresponded to a small code script with tens of lines of code, e.g. the BinTree example in [20]. For the larger models, such as those explored in this paper, the following methods might do better.

Another approach, which we will call sampling, combines theorem provers such as Z3 with stochastic sampling heuristics. For example, given random selections of valid tests $t_a$, $t_b$, $t_c$, Eq. 1 might be used to generate a new test suite without calling a theorem prover. Theorem proving might then be applied to some (small) subset of the newly generated tests, just to assess how well the heuristics are working. Table I lists some of this related work.

The earliest sampling tools were based on binary decision diagrams (BDDs) [21]. Yuan et al. [9, 11] built a BDD from the input constraint model and then weighted the branches of the vertices in the tree such that a stochastic walk from root to leaf generated samples with the desired distribution. In other work, Iyer proposed a technique named RACE which has been applied in multiple industrial solutions [10]. RACE (a) builds a high-level model to represent the constraints; then (b) implements a branch-and-bound algorithm for sampling diverse solutions. The advantage of RACE is its implementation simplicity. However, RACE, as well as the BDD-based approaches introduced above, returns highly biased samples, that is, highly non-uniform samples. For testing, this is not recommended since it means small parts of the code get explored at a much higher frequency than others.

Using the WalkSat SAT solver [22], Wei et al. [12] proposed SampleSAT. SampleSAT combines random walk steps with greedy steps from WalkSat. This method works well on small constraint models. However, due to the greedy nature of WalkSat, the performance of SampleSAT becomes highly skewed as the size of the constraint model increases.

For seeking diverse samples, universal hashing [23] techniques have been proposed. These algorithms were designed for strong guarantees of uniformity. Meel et al. [17] provided an overview of the key ingredients of integrating universal hashing with SAT solvers; e.g. with universal hashing, it is possible to guarantee uniform solutions to a constraint model. These hashing algorithms can be applied to extremely large models (with nearly 0.5M variables). More recently, several improved hashing-based techniques have been proposed to balance the scalability of the algorithm against the diversity (i.e. uniform distribution) requirements. For example, Chakraborty et al. proposed an algorithm named UniGen [15], followed by UniGen2 [16]. UniGen provides strong theoretical guarantees on the uniformity of generated solutions and has been applied to constraint models with hundreds of thousands of variables. However, UniGen suffered from large computational resource requirements. Later work explored a parallel version of this approach: UniGen2 achieved near-linear speedup in the number of CPU cores.


To the best of our knowledge, the state-of-the-art technique for generating test cases using theorem provers is QuickSampler [1]. QuickSampler was evaluated on large real-world case studies, some of which have more than 400K variables. At ICSE'18, it was shown that QuickSampler outperforms the aforementioned UniGen2 as well as another similar technique named SearchTreeSampler [14]. QuickSampler starts from a set of valid solutions generated by Z3. Next, it computes the differences between the solutions using Eq. 1. New test cases generated in this manner are not guaranteed to be valid. According to Dutra et al.'s experiments, the percent of valid tests found by QuickSampler can be higher than 70%. The percent of valid tests found by Snap, on the other hand, is 100%. Further, as shown below, Snap builds tests with comparable diversity much faster than QuickSampler.

III About Snap

As stated in the introduction, Snap uses the Z3 theorem prover combined with Eq. 1. Also, Snap uses specialized sub-sampling heuristics to avoid redundant tests. Just to say the obvious, we have no formal proofs that any of the following are useful. Instead, these heuristics are based on hunches we acquired while working with QuickSampler.

Heuristic #1: Instead of computing the deltas between many tests, Snap restricts mutation to many deltas between a few tests. Specifically, Snap builds a pool of 10,000 deltas from its initial valid tests (note that this process calls the theorem prover only for that small initial sample). Snap uses this pool as a set of candidate “mutators” for existing tests (and by “mutator”, we mean an operation that converts an existing test into a new one).

Heuristic #2: Snap builds new tests by applying Eq. 1 to old tests. To minimize redundancy, Snap uses old tests that are quite distant from each other. More specifically, Snap uses the centroids found after applying a k-means clustering algorithm to the initial samples.
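
A minimal sketch of this step, assuming the tests are stored as 0/1 vectors; we use scikit-learn's KMeans purely as a stand-in for the ALGLIB clusterer that Snap actually uses (§IV-A).

import numpy as np
from sklearn.cluster import KMeans   # stand-in for ALGLIB's k-means

def centroid_tests(samples, k):
    """Cluster the 0/1 test vectors and return one representative per cluster:
    the centroid, rounded back to a bit vector."""
    X = np.asarray(samples, dtype=float)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return (km.cluster_centers_ >= 0.5).astype(int)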

Heuristic #3: We have an intuition that the more frequently we see a delta, the more likely it is to represent a valid change to a test. Hence, when Snap mutates the centroids, it uses the deltas that are seen most frequently.
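
A sketch of Heuristics #1 and #3 together, again with each test packed into an integer (a representation we chose only for brevity):

from collections import Counter

def frequent_deltas(valid_tests, top_n):
    """Build the pool of deltas between pairs of valid tests and keep the
    deltas that occur most often (Heuristic #3)."""
    counts = Counter(a ^ b
                     for i, a in enumerate(valid_tests)
                     for b in valid_tests[i + 1:])
    return [delta for delta, _ in counts.most_common(top_n)]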

Heuristic #4: We have another intuition that test cases that pass verification are somehow less interesting than those that fail. Hence, when Snap finds a new failing test, it repairs it (using the process described below) and focuses the rest of the test generation on that harder case.

Heuristic #5: Z3 is much slower at generating new tests than at repairing invalid tests, which in turn is slower than verifying that a test is valid. As discussed in the introduction, the reason for this is that the search space of options is much larger for generation than for repair, and larger for repair than for verification. Hence, Snap verifies more often than it repairs (and repairs more often than it generates new tests).

Algorithm 1 shows how Snap uses all these heuristics. In this algorithm, each test is a vector of zeros and ones (false, true) covering all the variables in the CNF of our case studies.

Snap uses the Z3 theorem prover for steps 1a, 3biii, and 3biv. As required by Heuristic #5, Snap performs verification more often than repair, which in turn is performed far more often than generation:

  • The call to Z3 in step 1a can be the slowest (since this is a generate call that must navigate all the constraints of our CNF). Hence, we make this call only for the small initial sample;

  • The call to Z3 in step 3biii (verification) is much faster since all the variables are already set.

  • The call to Z3 in step 3biv (repair) is a little slower than step 3biii since (as discussed below) our repair operator introduces some open choices into the test. But note that we only need to repair the minority of new tests that fail verification. How small is that minority? Later in this paper, Fig. 3 shows that repairs are needed on only 30% (median) of all tests.

Algorithm 1 also requires a repair function for step 3biv and a termination function for step 4a. Those two functions are discussed below.
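
Since the pseudocode figure is not reproduced here, the sketch below reconstructs the overall loop from the step numbers referenced in the text (1a, 3a, 3bii-3biv, 4a). All helper functions, parameter names and the exact control flow are our assumptions, not the authors' implementation.

import time
from typing import Callable, List, Optional, Sequence

def snap_loop(z3_generate: Callable[[], int],
              z3_verify: Callable[[int], bool],
              z3_repair: Callable[[int, int], Optional[int]],
              centroids_of: Callable[[Sequence[int], int], List[int]],
              frequent_deltas: Callable[[Sequence[int]], List[int]],
              ncd_of: Callable[[Sequence[int]], float],
              n_init: int, n_clusters: int,
              eps: float, window_secs: float) -> List[int]:
    """Each test is an int whose bits are the CNF variable settings; the
    callables stand in for the machinery sketched elsewhere in this paper."""
    suite = [z3_generate() for _ in range(n_init)]     # step 1a: slow Z3 generation
    deltas = frequent_deltas(suite)                    # Heuristics #1/#3: delta pool
    best, t_last = ncd_of(suite), time.time()
    while True:
        for c in centroids_of(suite, n_clusters):      # step 3a + Heuristic #2
            for d in deltas:
                cand = c ^ d                           # step 3bii: Eq. 1 mutation
                if z3_verify(cand):                    # step 3biii: cheap verification
                    suite.append(cand)
                else:                                  # step 3biv: repair the failure
                    fixed = z3_repair(cand, d)         # (and focus on it, Heuristic #4)
                    if fixed is not None:
                        suite.append(fixed)
        now = ncd_of(suite)                            # step 4a: diversity-based stop
        if now - best > eps:
            best, t_last = now, time.time()
        elif time.time() - t_last > window_secs:
            return suite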

III-A Implementing “Repair”

When the new test (found in step 3bii) is invalid, Snap uses Z3 to repair that test. As mentioned in the introduction, Snap's repair function deletes the potentially “dubious” parts of a test case, then calls Z3 to fill in the missing details. In this way, when we repair a test, most of the bits are already set and Z3 only has to search a small space.

To find the “dubious” section, we reflect on how step 3bii operates. Recall that the new test is $t_d = t_a \oplus \delta$, where $t_a$ is a valid test taken from the samples and $\delta = t_b \oplus t_c$ is a delta between two other valid tests. Since $t_b$ and $t_c$ were valid tests, the “dubious” parts of $t_d$ are any settings that are not guaranteed to have appeared in $t_b$ or $t_c$. Hence, we preserve the bits of $t_d$ where the corresponding bit of $\delta$ was 1, while removing all other bits (where the $\delta$ bit was 0). For example:

  • Assume we are mutating the valid test $t_a$ = (1,0,0,1,1,0,0,0) using the mutator $\delta$ = (1,0,1,0,1,0,1,0).

  • If the result $t_d$ = (0,0,1,1,0,0,1,0) is invalid, then Snap deletes the “dubious” sections as follows.

  • Snap preserves the bits of $t_d$ at positions where $\delta$ has a “1” bit.

  • Snap deletes the other bits; i.e. the 2nd, 4th, 6th and 8th bits of (0,0,1,1,0,0,1,0).

  • Z3 is then called to fill out the missing bits of $t_d$.

Heuristic #5 (shown above) is based on the assumption that this last step (where Z3 repairs the vector) is usually faster than generating a completely new solution from scratch. A minimal sketch of the repair operation follows.
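
The sketch below illustrates the masking-plus-solve idea under the same assumptions as before (tests as lists of 0/1 bits, one z3 Bool per CNF variable); the helper is ours and is not Snap's C++ implementation.

from typing import List, Optional
from z3 import Solver, Not, sat, is_true

def repair(candidate: List[int], delta: List[int],
           cnf_vars, cnf_clauses) -> Optional[List[int]]:
    """Pin the candidate's bits wherever the mutator delta was 1, drop the
    rest, and ask Z3 to complete the missing (dubious) settings."""
    s = Solver()
    s.add(cnf_clauses)
    for bit, keep, v in zip(candidate, delta, cnf_vars):
        if keep == 1:                         # non-dubious bit: keep it fixed
            s.add(v if bit == 1 else Not(v))
    if s.check() != sat:                      # no completion exists for this mask
        return None
    m = s.model()
    return [1 if is_true(m.evaluate(v, model_completion=True)) else 0
            for v in cnf_vars]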

III-B Implementing “Termination”

To implement Snap's termination criterion (step 4a), we need a working measure of diversity. Recall from the introduction that one test suite is more diverse than another if it uses more of the variables within the CNF disjunctions. Diverse test suites are better since they cover more parts of the code.

To measure diversity, we used the normalized compression distance (NCD) proposed by Feldt et al. [24]. Feldt et al. showed that a test suite with high NCD implies higher code coverage during testing. (As an aside, we note that we did not adopt the diversity metric of [16, 1], i.e. the distribution of samples displayed as a histogram, since computing that metric is very time-consuming. For the case studies of this paper, that calculation required days of CPU time.)

NCD is based on information theory: the Kolmogorov complexity [25] of a binary string $x$ is the length of the shortest program that outputs $x$. NCD approximates this by the degree to which a string can be compressed by real-world compression programs (such as gzip or bzip2); Snap uses gzip to estimate Kolmogorov complexity. Let $C(x)$ be the compressed length of $x$ and $C(X)$ be the compressed length of the concatenation of the binary strings in the set $X$. The NCD of $X$ is defined as

$NCD(X) = \dfrac{C(X) - \min_{x \in X} C(x)}{\max_{x \in X} C(X \setminus \{x\})}$   (2)

Snap uses NCD as follows: the algorithm terminates when the improvement in NCD over a fixed time window falls below a small threshold (the specific settings appear in §III-C).
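
For illustration, here is a small sketch of a gzip-based NCD computation for a set of tests serialized as byte strings; the leave-one-out maximum in the denominator follows Eq. 2 above. The coding choices are ours, not Snap's C++ implementation.

import gzip

def clen(data: bytes) -> int:
    """Approximate Kolmogorov complexity by the gzip-compressed length."""
    return len(gzip.compress(data))

def ncd(tests):
    """NCD of a set of tests, each serialized as bytes (e.g. b'0110...')."""
    whole = clen(b"".join(tests))
    best_single = min(clen(t) for t in tests)
    worst_leave_one_out = max(clen(b"".join(tests[:i] + tests[i + 1:]))
                              for i in range(len(tests)))
    return (whole - best_single) / worst_leave_one_out

print(ncd([b"00110010", b"10011000", b"10100010"]))  # higher means more diverse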

III-C Engineering Choices

Our implementation used the following control parameters, all set via engineering judgment:

  • the threshold on NCD improvement used by the termination test;

  • the length (in minutes) of the termination time window;

  • the number of initial samples generated with Z3;

  • the number of k-means clusters.

In future work, it could be insightful to vary these values.

Another area that might bear further investigation is the clustering method used in step 3a. For this paper, we tried different clustering methods. Clustering ran so fast that we were not motivated to explore alternate algorithms. Also, we found that the details of the clustering were less important than pruning away most of the items within each cluster (so that we only mutate the centroid).

IV Experimental Set-up

IV-A Code

To explore the research questions shown in the introduction, the Snap system shown in Algorithm 1 was implemented in C++ using Z3 v4.8.4 (the latest release when the experiments were conducted). A k-means clusterer was added using the free edition of ALGLIB [26], a numerical analysis and data processing library delivered under a GPL or Personal/Academic license. QuickSampler does not integrate sample verification into its workflow; hence, in our experiments, we adjusted the workflow of QuickSampler so that all samples are verified before termination. Also, the outputs of QuickSampler are assignments to the independent support. The independent support is a subset of variables which completely determines all the assignments to a formula [1]. In practice, engineers need the complete test case input; consequently, for valid samples, we extended QuickSampler to derive full assignments of all variables from the independent support's assignment via propagation.
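
A sketch of that propagation step, again using z3py for illustration: pin the independent-support bits and let the solver derive the remaining variables. The function and argument names are ours, not part of QuickSampler.

from z3 import Solver, Not, sat, is_true

def complete_assignment(cnf_vars, cnf_clauses, support_assignment):
    """Extend a valid independent-support assignment (a dict of z3 Bools to
    0/1 bits) to a full assignment over all CNF variables."""
    s = Solver()
    s.add(cnf_clauses)
    for v, bit in support_assignment.items():
        s.add(v if bit == 1 else Not(v))
    assert s.check() == sat                  # the sample was already verified valid
    m = s.model()
    return {v: 1 if is_true(m.evaluate(v, model_completion=True)) else 0
            for v in cnf_vars}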

IV-B Experimental Rig

We compared Snap to the state-of-the-art QuickSampler technique proposed by Dutra et al. at ICSE'18. To ensure a repeatable result, we updated the Z3 solver inside QuickSampler to the latest version.

To reduce observation error and test the robustness of performance, we repeated all experiments 30 times with 30 different random seeds. To simulate real practice, these random seeds were used in the Z3 solver (for initial solution generation), in ALGLIB (for the k-means) and in other components. Due to space limitations, we cannot report results for all 30 repeats. Instead, we report the median or the IQR (75th-25th percentile variation) results.

All experiments were conducted on machines with a Xeon-E5@2GHz and 4GB memory, running CentOS. These were multi-core machines but, for systems reasons, we only used one core per machine.

IV-C Case Studies

Table II lists the attributes of all the case studies used in this work. The number of variables ranges from hundreds to more than 486K, and the large examples have far more than 50K clauses. For exposition purposes, we divided the case studies into three groups by their number of variables: small, medium, and large (see the row colors in Table II).

For the following reasons, our case studies are the same as those used in the QuickSampler paper:

  • We wanted to compare our method to QuickSampler over same case studies;

  • Their case studies were available online;

  • Their case studies are used in multiple papers [15, 16, 17, 1] etc.

These case studies are representative of scenarios engineers meet in software testing or in circuit testing for embedded system design. They include bit-blasted versions of SMTLib case studies, ISCAS89 circuits augmented with parity conditions on randomly chosen subsets of outputs and next-state variables, problems arising from automated program synthesis, and constraints arising in bounded theorem proving. For more details on these case studies, see [16, 1].

For pragmatic reasons, certain case studies were omitted from our study. For example, we do not report on diagStencilClean.sk_41_36, since the purpose of this paper is to sample a set of valid solutions that meets the diversity requirement, while there are only 13 valid solutions for this model. QuickSampler spent 20 minutes (on average) searching for one solution.

Also, we do not report on the case studies marked with a star (*) in Table II. In our experiments we found that, even though QuickSampler generates tens of millions of samples for these examples, all samples were assignments to the independent support (defined in §IV-A). The omission of these case studies is not a critical issue: solving or sampling these examples is not difficult, since they are all very small compared to the other case studies.

Size Case studies Vars Clauses
blasted_case47 118 328
blasted_case110 287 1263
s820a_7_4 616 1703
s820a_15_7 685 1987
s1238a_3_2 685 1850
Small s1196a_3_2 689 1805
s832a_15_7 693 2017
blasted_case_1_b12_2* 827 2725
blasted_squaring16* 1627 5835
blasted_squaring7* 1628 5837
70.sk_3_40 4669 15864
ProcessBean.sk_8_64 4767 14458
56.sk_6_38 4836 17828
35.sk_3_52 4894 10547
80.sk_2_48 4963 17060
7.sk_4_50 6674 24816
doublyLinkedList.sk_8_37 6889 26918
19.sk_3_48 6984 23867
29.sk_3_45 8857 31557
Medium isolateRightmost.sk_7_481 10024 35275
17.sk_3_45 10081 27056
81.sk_5_51 10764 38006
LoginService2.sk_23_36 11510 41411
sort.sk_8_52 12124 49611
parity.sk_11_11 13115 47506
77.sk_3_44 14524 27573
Large 20.sk_1_51 15465 60994
enqueueSeqSK.sk_10_42 16465 58515
karatsuba.sk_7_41 19593 82417
tutorial3.sk_4_31 486193 2598178
TABLE II: Case studies used in this paper, sorted by number of variables. Medium-sized problems are highlighted with blue rows; the large ones are in orange rows. Three items (marked with *) are not included in some further reports (see text).
Fig. 2: Number of identical deltas among the 100*100 deltas computed from pairs of valid solutions, for each case study. Same color scheme as Table II.
Fig. 3: RQ1 results: percentage of valid mutations found in step 3biii (computed separately for each case study).

V Results

The rest of this paper uses the machinery defined above to answer the four research questions posed in the introduction.

V-A RQ1: How Reliable is the Eq. 1 Heuristic?

QuickSampler ran so quickly since it assumed that tests generated using Eq. 1 did not need verification. This research question checks that assumption, as follows.

For each case study, we randomly generated 100 valid solutions using Z3. Next, we selected three of them ($t_a$, $t_b$, $t_c$) and built new test cases using Eq. 1; i.e. $t_d = t_a \oplus (t_b \oplus t_c)$.
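
One simple way to draw such a pool of distinct valid solutions is model enumeration with blocking clauses, sketched below. The paper does not state exactly how its 100 initial solutions were drawn, so treat this as an illustrative assumption rather than the authors' procedure.

from z3 import Solver, Or, Not, sat, is_true

def sample_valid_tests(cnf_vars, cnf_clauses, n=100):
    """Enumerate up to n distinct satisfying assignments by blocking each
    model after it is found."""
    s = Solver()
    s.add(cnf_clauses)
    tests = []
    while len(tests) < n and s.check() == sat:
        m = s.model()
        bits = {v: is_true(m.evaluate(v, model_completion=True)) for v in cnf_vars}
        tests.append(bits)
        s.add(Or([Not(v) if bits[v] else v for v in cnf_vars]))  # block this model
    return tests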

Fig. 2 reveals the number of identical deltas seen within these 100*100 deltas. Among all case studies, we rarely found large sets of unique deltas. This means that, among the 100 valid solutions given by Z3, many deltas were shared between different pairwise solutions. This is important since, if it were otherwise, the Eq. 1 heuristic would be dubious.

The percentage of these deltas that proved to be valid in step 3biii of Algorithm 1 is shown in Fig. 3. Dutra et al.'s estimate was that the percentage of valid tests generated by Eq. 1 is usually 70% or more. As shown by the median values of Fig. 3, this was indeed the case. However, we also see that in the lower third of those results, the percentage of valid tests generated by Eq. 1 is very low: 25% to 50% (median to max). This result alone would be enough to make us cautious about using QuickSampler since, when the Eq. 1 heuristic fails, it seems to fail very badly. We recommend:

Conclusion #1: Eq. 1 should not be used without verification of the resulting test.

By way of comparison, it is useful to add here that Snap verifies every test case it generates. This is practical for Snap, but impractical for QuickSampler, since QuickSampler processes orders of magnitude more candidate test cases. In any case, another reason to recommend Snap over QuickSampler is that the former delivers test suites where 100% of the tests are valid.

Fig. 4: RQ2 results: Normalized compression distance (NCD) observed when QuickSampler and Snap terminated on the same case studies. Median results over 30 runs (and the small black lines show the 75th-25th variations). Same color scheme as Table II.
Fig. 5: RQ3 results: time to termination (seconds). The y-axis is in log scale. The Snap sampling times for s1238a_3_2 and parity.sk_11_11 are not reported since their achieved NCD was much worse than QuickSampler's (see Fig. 4). Fig. 6 illustrates the corresponding speedups.

V-B RQ2: How Diverse are the Snap Test Cases?

As stated in our introduction, diverse test suites are better since they cover more parts of the code. A concern with Snap is that, since it explores fewer tests than QuickSampler, its test suites could be far less diverse.

Fig. 4 compares the diversity of the test suites generated by our two systems. These results are expressed as ratios of the observed NCD values. Results less than one indicate that Snap's test suites are less diverse than QuickSampler's.

In that figure, we see that, occasionally, Snap's faster test suite generation means that the resulting test suites are much less diverse (see s1238a_3_2 and parity.sk_11_11). That said, while QuickSampler's tests are more diverse, the overall difference is usually very small.

Also, RQ1 showed us that many of the QuickSampler tests are invalid. This means that the diversity numbers reported for QuickSampler are somewhat inflated, since invalid tests would not enter the branches they are meant to cover. Hence, overall, we say:

Conclusion #2: The diversity of Snap's test suites is not markedly worse than that of QuickSampler.

V-C RQ3: How Fast is Snap?

Fig. 5 shows the execution time required for Snap and QuickSampler. The y-axis of this plot is a log-scale and shows time in seconds. These results are shown in the same order as Table II. That is, from left to right, these case studies grow from around 300 to around 3,000,000 clauses.

For the smaller case studies, shown on the left, Snap is sometimes slower than QuickSampler. Moving left to right, from smaller to larger case studies, it can be seen that Snap often terminates much faster than QuickSampler.

Fig. 6 is a summary of Fig. 5 that divides QuickSampler's execution time by Snap's. From this figure it can be seen that:

Conclusion #3: Snap was 10 to 3000 times faster than QuickSampler (median to max).

There are some exceptions to this conclusion, where QuickSampler was faster than Snap (see the right-hand side of Fig. 6). We note that in most of those cases the models are small (17,000 clauses or less). For medium to larger models, with 20,000 to 2.5 million clauses, Snap is nearly always orders of magnitude faster than QuickSampler.

Fig. 6: RQ3 results: sorted ratios of time(QuickSampler) / time(Snap). Values greater than one imply that Snap terminated earlier than QuickSampler.
Case studies Snap QuickSampler QuickSampler/Snap
blasted_case47 2899 71 0.02
isolateRightmost 15480 7510 0.49
LoginService2 404 210 0.52
19.sk_3_48 204 200 0.98
70.sk_3_40 3050 4270 1.40
s820a_15_7 29065 70099 2.41
29.sk_3_45 225 660 2.93
s820a_7_4 37463 124457 3.32
s832a_15_7 27540 96764 3.51
s1196a_3_2 225 1890 8.40
enqueueSeqSK 338 2495 7.38
blasted_case110 274 2386 8.71
tutorial3.sk_4_31 336 2953 8.79
81.sk_5_51 227 2814 12.40
sort.sk_8_52 812 10184 12.54
karatsuba.sk_7_41 139 4210 30.29
20.sk_1_51 239 10039 42.00
doublyLinkedList 278 12042 43.32
17.sk_3_45 228 12780 56.05
ProcessBean 1193 75392 63.20
7.sk_4_50 258 18090 70.12
56.sk_6_38 1827 149031 81.57
80.sk_2_48 653 54440 83.37
77.sk_3_44 245 33858 138.20
35.sk_3_52 258 193920 751.63
TABLE III: RQ4 results: number of unique valid test cases in each test suite (the last column shows the QuickSampler/Snap ratio). Same color scheme as Table II.

V-D RQ4: How Easy is it to Apply Snap’s Test Cases?

Finally, we end on a pragmatic note. The smaller a test suite, the easier it is for programmers to run those tests. Therefore it is important to ask which method produces fewer tests: QuickSampler or Snap?

Table III compares the number of tests (distinct suggested inputs) generated by QuickSampler and Snap. As shown by the last column of that table:

Conclusion #4: Snap's test suites were 10 to 750 times smaller than those of QuickSampler (median to max).

Hence, we argue that it would be easiest for an industrial practitioner to execute and maintain the test suites generated by Snap.

VI Discussion

VI-A Why does Snap Work?

This section reflects on the success of Snap. Specifically, we ask: why can Snap generate diversity comparable to QuickSampler's? We conjecture that, for SE problems:

Combining solutions from local samples leads to diverse global solutions.

Snap is an example of such “local sampling”. Consider how it executes: each round of sampling (step 3 in Algorithm 1) focuses on a local sample (generated by the clustering method). All these local samples are then combined into a global test suite. The success of Snap's local sampling strategy may be a comment on the nature of software systems. Large SE systems are the result of much work by many teams. In such systems, small parts of software combine into some greater whole. For such systems, local sampling (as done in Snap) would be useful.

There is much other evidence of the benefits of local sampling in SE. Chen et al. proposed a local sampling technique called Sway [27, 28] that finds optimized configurations for agile requirements engineering [29]. Subsequently, it was shown that the same local sampling approach leads to a new high-watermark in optimizing cloud container deployments [30]. All of that research used the same strategy (local sampling) and had similar results (orders-of-magnitude improvements over the prior state-of-the-art).

Based on this, we make two observations. Firstly, local sampling is a strategy that may be very useful in many future SE applications. Secondly, if SE problems have such a structure (larger problems composed of numerous smaller ones), then it would not be insightful to assess new methods using randomly generated problems (e.g. to assess Snap using randomly generated CNF formulas). We say this since such randomly generated models may not contain the structures that have been found so useful by local sampling methods like Snap and elsewhere [27, 28, 30].

VI-B Threats to Validity

One threat to validity of this work is the baseline bias. Indeed, there are many other sampling techniques, or solvers, that Snap might be compared to. However, our goal here was to compare Snap to a recent state-of-the-art result from ICSE’18. In further work, we will compare Snap to other methods.

A second threat to validity is the internal bias that arises from the stochastic nature of sampling techniques. Snap requires many random operations. To mitigate this threat, we repeated the experiments 30 times and reported the median or IQR of those results.

A third threat is measurement bias. To determine the diversity of a test suite, our experiments use the normalized compression distance (NCD). Prior research has argued for the value of that measure [24]. However, there exist many other diversity measures for the theorem-proving problem, and changing the diversity measure might change the results. That said, in one research report it is impossible to explore all options. For the convenience of further exploration, we have released the source code of Snap in the hope that other researchers will assist us by evaluating it on a broader range of measures.

Another threat is hyperparameter bias. The hyperparameters are the set of configurations for the algorithm. For Snap, we need the set of control parameters shown in §III-C. There now exists a range of mature hyperparameter optimizers [31, 32, 33] which might be useful for finding better settings for Snap. This is a clear direction for future work.

VII Conclusion

The experiments of this paper suggest that Snap is a better test suite generation system than QuickSampler. Our system avoids much of the redundant reasoning that:

  • Slows down QuickSampler by a factor of 10 to 3000 (compared to Snap);

  • And which also results in test suites that are 10 to 750 times larger than they need to be (again, compared to Snap).

Another reason to prefer Snap is that, since its test suites are so small, we can run verification on 100% of Snap's tests. That is, unlike QuickSampler, all our tests are known to be valid.

Snap has its drawbacks. Specifically, the diversity of its test suites is not always as good as QuickSampler's. That said, this difference in diversity is so small (see Fig. 4) that, given all the other advantages of Snap, we can still recommend it.

As shown by the Algorithm 1 pseudocode, Snap is relatively simple to implement. Hence, if nothing else, we can recommend Snap as a baseline method against which more elaborate methods might be benchmarked.

References

  • [1] R. Dutra, K. Laeufer, J. Bachrach, and K. Sen, “Efficient sampling of sat solutions for testing,” in 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).   IEEE, 2018, pp. 549–559.
  • [2] R. Baldoni, E. Coppa, D. C. D’elia, C. Demetrescu, and I. Finocchi, “A survey of symbolic execution techniques,” ACM Computing Surveys (CSUR), vol. 51, no. 3, p. 50, 2018.
  • [3] M. Christakis, P. Müller, and V. Wüstholz, “Guiding dynamic symbolic execution toward unverified program executions,” in Proceedings of the 38th International Conference on Software Engineering.   ACM, 2016, pp. 144–155.
  • [4] L. De Moura and N. Bjørner, “Z3: An efficient smt solver,” in International conference on Tools and Algorithms for the Construction and Analysis of Systems.   Springer, 2008, pp. 337–340.
  • [5] A. Biere, “Picosat essentials,” Journal on Satisfiability, Boolean Modeling and Computation, vol. 4, pp. 75–97, 2008.
  • [6] R. Bruttomesso, A. Cimatti, A. Franzén, A. Griggio, and R. Sebastiani, “The mathsat 4 smt solver,” in International Conference on Computer Aided Verification.   Springer, 2008, pp. 299–303.
  • [7] N. Bjørner, A.-D. Phan, and L. Fleckenstein, “νz - an optimizing smt solver,” in International Conference on Tools and Algorithms for the Construction and Analysis of Systems.   Springer, 2015, pp. 194–199.
  • [8] M. Finke, “Equisatisfiable sat encodings of arithmetical operations,” [Online]. Available: http://www.martin-finke.de/documents/Masterarbeit_bitblast_Finke.pdf, 2015.
  • [9] J. Yuan, K. Shultz, C. Pixley, H. Miller, and A. Aziz, “Modeling design constraints and biasing in simulation using bdds,” in Proceedings of the 1999 IEEE/ACM international conference on Computer-aided design.   IEEE Press, 1999, pp. 584–590.
  • [10] M. A. Iyer, “Race: A word-level atpg-based constraints solver system for smart random simulation,” in null.   IEEE, 2003, p. 299.
  • [11] J. Yuan, A. Aziz, C. Pixley, and K. Albin, “Simplifying boolean constraint solving for random simulation-vector generation,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 23, no. 3, pp. 412–420, 2004.
  • [12] W. Wei, J. Erenrich, and B. Selman, “Towards efficient sampling: Exploiting random walk strategies,” in AAAI, vol. 4, 2004, pp. 670–676.
  • [13] V. Gogate and R. Dechter, “Samplesearch: Importance sampling in presence of determinism,” Artificial Intelligence, vol. 175, no. 2, pp. 694–729, 2011.
  • [14] S. Ermon, C. P. Gomes, and B. Selman, “Uniform solution sampling using a constraint solver as an oracle,” arXiv preprint arXiv:1210.4861, 2012.
  • [15] S. Chakraborty, K. S. Meel, and M. Y. Vardi, “Balancing scalability and uniformity in sat witness generator,” in Proceedings of the 51st Annual Design Automation Conference.   ACM, 2014, pp. 1–6.
  • [16] S. Chakraborty, D. J. Fremont, K. S. Meel, S. A. Seshia, and M. Y. Vardi, “On parallel scalable uniform sat witness generation,” in International Conference on Tools and Algorithms for the Construction and Analysis of Systems.   Springer, 2015, pp. 304–319.
  • [17] K. S. Meel, M. Y. Vardi, S. Chakraborty, D. J. Fremont, S. A. Seshia, D. Fried, A. Ivrii, and S. Malik, “Constrained sampling and counting: Universal hashing meets sat solving,” in Workshops at the thirtieth AAAI conference on artificial intelligence, 2016.
  • [18] C. A. Tovey, “A simplified np-complete satisfiability problem,” Discrete applied mathematics, vol. 8, no. 1, pp. 85–89, 1984.
  • [19] X. Jia, C. Ghezzi, and S. Ying, “Enhancing reuse of constraint solutions to improve symbolic execution,” in Proceedings of the 2015 International Symposium on Software Testing and Analysis.   ACM, 2015, pp. 177–187.
  • [20] W. Visser, C. S. Pǎsǎreanu, and R. Pelánek, “Test input generation for java containers using state matching,” in Proceedings of the 2006 international symposium on Software testing and analysis.   ACM, 2006, pp. 37–48.
  • [21] S. B. Akers, “Binary decision diagrams,” IEEE Transactions on computers, no. 6, pp. 509–516, 1978.
  • [22] B. Selman, H. A. Kautz, B. Cohen et al., “Local search strategies for satisfiability testing.” Cliques, coloring, and satisfiability, vol. 26, pp. 521–532, 1993.
  • [23] Y. Mansour, N. Nisan, and P. Tiwari, “The computational complexity of universal hashing,” Theoretical Computer Science, vol. 107, no. 1, pp. 121–133, 1993.
  • [24] R. Feldt, S. Poulding, D. Clark, and S. Yoo, “Test set diameter: Quantifying the diversity of sets of test cases,” in 2016 IEEE International Conference on Software Testing, Verification and Validation (ICST).   IEEE, 2016, pp. 223–233.
  • [25] M. Li and P. Vitányi, An introduction to Kolmogorov complexity and its applications.   Springer Science & Business Media, 2013.
  • [26] S. Bochkanov and V. Bystritsky, “Alglib,” Available from: www.alglib.net, vol. 59, 2013.
  • [27] J. Chen, V. Nair, R. Krishna, and T. Menzies, “‘Sampling’ as a baseline optimizer for search-based software engineering,” IEEE Transactions on Software Engineering, 2018.
  • [28] J. Chen, V. Nair, and T. Menzies, “Beyond evolutionary algorithms for search-based software engineering,” Information and Software Technology, vol. 95, pp. 281–294, 2018.
  • [29] T. Menzies and J. Richardson, “Xomo: Understanding development options for autonomy,” in COCOMO forum, vol. 2005, 2005.
  • [30] J. Chen and T. Menzies, “Riot: A stochastic-based method for workflow scheduling in the cloud,” in 2018 IEEE 11th International Conference on Cloud Computing (CLOUD).   IEEE, 2018, pp. 318–325.
  • [31] W. Fu and T. Menzies, “Revisiting unsupervised learning for defect prediction,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering.   ACM, 2017, pp. 72–83.
  • [32] T. Xia, R. Krishna, J. Chen, G. Mathew, X. Shen, and T. Menzies, “Hyperparameter optimization for effort estimation,” arXiv preprint arXiv:1805.00336, 2018.
  • [33] V. Nair, A. Agrawal, J. Chen, W. Fu, G. Mathew, T. Menzies, L. Minku, M. Wagner, and Z. Yu, “Data-driven search-based software engineering,” in 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR).   IEEE, 2018, pp. 341–352.