SAPIENZ^div @ SSBSE 2019: Does Diversity Improve the Test Suite Generation for Mobile Applications?
In search-based software engineering we often use popular heuristics with default configurations, which typically lead to suboptimal results, or we perform experiments to identify configurations on a trial-and-error basis, which may lead to better results for a specific problem. To obtain better results while avoiding trial-and-error experiments, a fitness landscape analysis is helpful in understanding the search problem, and making an informed decision about the heuristics. In this paper, we investigate the search problem of test suite generation for mobile applications (apps) using SAPIENZ whose heuristic is a default NSGA-II. We analyze the fitness landscape of SAPIENZ with respect to genotypic diversity and use the gained insights to adapt the heuristic of SAPIENZ. These adaptations result in SAPIENZ^div that aims for preserving the diversity of test suites during the search. To evaluate SAPIENZ^div, we perform a head-to-head comparison with SAPIENZ on 76 open-source apps.
In search-based software engineering, and particularly in search-based testing, popular heuristics with best-practice configurations in terms of operators and parameters are often used. As this out-of-the-box usage typically leads to suboptimal results, costly trial-and-error experiments are performed to find a suitable configuration for a given problem, which leads to better results. To obtain better results while avoiding trial-and-error experiments, fitness landscape analysis can be used [16, 23]. The goal is to analytically understand the search problem, determine difficulties of the problem, and identify suitable configurations of heuristics that can cope with these difficulties (cf. [16, 19]).
In this paper, we investigate the search problem of test suite generation for mobile applications (apps). We rely on Sapienz, which uses a default NSGA-II to generate test suites for apps. NSGA-II has been selected as it “is a widely-used multiobjective evolutionary search algorithm, popular in SBSE research” [17, p. 97], but without adapting it to the specific problem (instance). Thus, our goal is to analyze the fitness landscape of Sapienz and use the insights for adapting the heuristic of Sapienz. This should eventually yield better test results.
Our analysis focuses on the global topology of the landscape, especially how solutions (test suites) are spread in the search space and evolve over time. Thus, we are interested in the genotypic diversity of solutions, which is considered important for evolutionary search. According to our analysis, Sapienz lacks diversity of solutions, so we extend it to Sapienz^div, which integrates four diversity-promoting mechanisms. Therefore, our contributions are the descriptive study analyzing the fitness landscape of Sapienz (Section 3), Sapienz^div (Section 4), and the empirical study with 76 apps evaluating Sapienz^div (Section 5).
Sapienz is a multi-objective search-based testing approach. Using NSGA-II, it automatically generates test suites for end-to-end testing of Android apps. A test suite consists of several test cases, each of which is a sequence of GUI-level events, up to a maximum length, that exercise the app under test. The generation is guided by three objectives: (i) maximize fault revelation, (ii) maximize coverage, and (iii) minimize test sequence length. Having no oracle, Sapienz considers a crash of the app caused by a test as a fault. Coverage is measured at the code (statement coverage) or activity level (skin coverage). Given these objectives, the fitness function is the triple of the number of crashes found, coverage, and sequence length. To evaluate the fitness of a test suite, Sapienz executes the suite on the app under test deployed on an Android device or emulator.
A fitness landscape analysis can be used to better understand a search problem. A fitness landscape is defined by three elements: (1) A search space as a set of potential solutions. (2) A fitness function for each of the objectives. (3) A neighborhood relation that associates neighbor solutions to each solution (e.g., using basic operators, or distances of solutions). Based on these three elements, various metrics have been proposed to analyze the landscape [16, 23]. They characterize the landscape, for instance, in terms of the global topology (i.e., how solutions and the fitness are distributed), local structure (i.e., ruggedness and smoothness), and evolvability (i.e., the ability to produce fitter solutions). The goal of analyzing the landscape is to determine difficulties of a search problem and identify suitable configurations of search algorithms that can cope with these difficulties (cf. [16, 19]).
At first, we define the three elements of a fitness landscape (cf. Section 2) for Sapienz: (1) The search space is given by all possible test suites according to the representation of test suites in Section 2. (2) The fitness function is given by the triple of the number of crashes found, coverage, and test sequence length (cf. Section 2). (3) As the neighborhood relation we define a genotypic distance metric for two test suites (see Algorithm 1). The distance of two test suites is the sum of the distances between their ordered test sequences, which is obtained by comparing the sequences of both suites by index (lines 4–6). The distance of two sequences is the difference of their lengths (line 7) increased by 1 for each differing event at the same index (lines 8–11). Thus, the distance is based on the differences of ordered events between the ordered sequences of two test suites.
This metric is motivated by the basic mutation operator of Sapienz, which shuffles the order of test sequences within a suite and the order of events within a sequence. It is common that the neighborhood relation is based on operators that make small changes to solutions.
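The distance described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' Algorithm 1: it assumes that both suites hold the same number of sequences, that each differing event adds 1 to the distance (consistent with the maximum distance of 2500 for five sequences of up to 500 events), and the event names are hypothetical.

```python
def sequence_distance(seq_a, seq_b):
    """Length difference plus 1 for every index at which the events differ."""
    distance = abs(len(seq_a) - len(seq_b))
    for event_a, event_b in zip(seq_a, seq_b):  # compares the common prefix by index
        if event_a != event_b:
            distance += 1
    return distance


def suite_distance(suite_a, suite_b):
    """Sum of the distances between test sequences paired by index."""
    return sum(sequence_distance(a, b) for a, b in zip(suite_a, suite_b))


# Hypothetical suites with two sequences each.
suite_a = [["tap", "swipe", "back"], ["tap"]]
suite_b = [["tap", "back", "back", "home"], ["tap"]]
distance = suite_distance(suite_a, suite_b)
```

Here the first pair of sequences differs in length by one event and in one event at index 1, while the second pair is identical.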
To analyze the fitness landscape of Sapienz, we extended Sapienz with metrics that characterize the landscape. We then executed Sapienz on five apps, repeated each execution five times, and report mean values of the metrics for each app. (All experiments were run on a single 4.0 GHz quad-core PC with 16 GB RAM, using 5 Android emulators (KitKat 4.4.2, API level 19) in parallel to test one app.)
The five apps we selected for the descriptive study are part of the 68 F-Droid benchmark apps used to evaluate Sapienz. We selected aarddict, MunchLife, and passwordmanager since Sapienz did not find any fault for these apps, and hotdeath and k9mail, for which Sapienz did find faults. (We used ver. 5.207 of k9mail and not ver. 3.512 as in the 68 F-Droid apps benchmark.) Thus, we consider apps for which Sapienz did and did not reveal crashes to obtain potentially different landscape characteristics that may present difficulties to Sapienz.
We configured Sapienz as in [17]. The crossover and mutation rates are set to 0.7 and 0.3 respectively. The population and offspring size is 50. An individual (test suite) contains 5 test sequences, each constrained to 20–500 events. Instead of using 100 generations, we set the number of generations to 40 (stopping criterion), since we observed in initial experiments that the search stagnates earlier.
The results of our study provide an analysis of the fitness landscape of Sapienz with respect to the global topology, particularly the diversity of solutions, how the solutions are spread in the search space, and how they evolve over time. According to Smith et al. [27, p. 31], “No single measure or description can possibly characterize any high-dimensional heterogeneous search space”. Thus, we selected metrics from the literature and implemented them in Sapienz. These metrics characterize (1) the Pareto-optimal solutions, (2) the population, and (3) the connectedness of Pareto-optimal solutions, all with a focus on diversity. They are computed after every generation so that we can analyze their development over time. In the following, we discuss these metrics and the results of the fitness landscape analysis. The results are shown in Figure 1 where the metrics (y-axis) are plotted over the 40 generations of the search (x-axis) for each of the five apps.
(1) Metrics for Pareto-Optimal Solutions
Proportion of Pareto-optimal solutions (ppos). For a population $P$, ppos is the number of Pareto-optimal solutions $P_{opt}$ divided by the population size: $ppos(P) = |P_{opt}| / |P|$. A high and especially strongly increasing ppos may indicate that the search based on Pareto dominance stagnates due to missing selection pressure. A moderately increasing ppos may indicate a successful search.
For Sapienz and all apps (see Fig. 1(a)), ppos slightly fluctuates since a new solution can potentially dominate multiple previously non-dominated solutions. At the beginning of the search, ppos is low (0.0–0.1), shows no improvement in the first 15–20 generations, and then increases for all apps except for passwordmanager. Thus, the search seems to progress, while the sharply increasing ppos for MunchLife and hotdeath might indicate a stagnation of the search.
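As a rough illustration of ppos, the following sketch determines the non-dominated members of a population for the three Sapienz objectives and divides their number by the population size. The fitness triples are hypothetical, and "Pareto-optimal" is taken here to mean non-dominated within the current population.

```python
def dominates(a, b):
    """a and b are (crashes, coverage, length): maximize the first two, minimize length."""
    no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_optimal(population):
    """Members that no other member of the population dominates."""
    return [p for p in population if not any(dominates(q, p) for q in population)]


def ppos(population):
    return len(pareto_optimal(population)) / len(population)


# Hypothetical fitness triples of four test suites.
population = [(1, 40.0, 200), (0, 40.0, 300), (0, 50.0, 250), (0, 30.0, 400)]
proportion = ppos(population)
```

In this toy population the first suite dominates the second and the fourth, so two of the four suites remain non-dominated.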
Hypervolume (hv). To further investigate the search progress, we compute the hv after each generation. The hv is the volume in the objective space covered by the Pareto-optimal solutions [10, 31]. Thus, an increasing hv indicates that the search is able to find improved solutions, otherwise the hv and search stagnate.
Based on the objectives of Sapienz (max. crashes, max. coverage, and min. sequence length), we choose the nadir point (0 crashes, 0 coverage, and sequence length of 500) as the reference point for the hv. In Fig. 1(b), the evolution of the hv over time, rather than the absolute numbers, is relevant to analyze the search progress of Sapienz. While the hv increases during the first 25 generations, it stagnates afterwards for all apps; for k9mail already after 5 generations. For aarddict, MunchLife, and hotdeath the hv stagnates after the ppos drastically increases (cf. Fig. 1(a)), further indicating a stagnation of the search.
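To make the hv computation concrete, here is a small sketch that computes the exact hypervolume of a three-objective front by slicing along one objective. It is not the implementation used in the study; the reference point is passed in as a parameter, the front and reference values in the example are hypothetical, and every point is assumed to be no worse than the reference point in each objective.

```python
def hv2d(points, ref):
    """Area dominated by 2-D points (both maximized) relative to ref."""
    area, best_y = 0.0, ref[1]
    for x, y in sorted(points, reverse=True):  # sweep with x descending
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def hypervolume(front, ref):
    """front and ref use (crashes, coverage, length); length is minimized."""
    pts = [(c, v, -l) for c, v, l in front]  # negate length so all objectives are maximized
    r = (ref[0], ref[1], -ref[2])
    volume, prev_z = 0.0, r[2]
    for z in sorted({p[2] for p in pts}):
        # Points reaching at least height z span the cross-section of this slab.
        layer = [(p[0], p[1]) for p in pts if p[2] >= z]
        volume += hv2d(layer, r) * (z - prev_z)
        prev_z = z
    return volume


# Hypothetical front and reference point.
front = [(2, 1, 9), (1, 2, 8)]
reference = (0, 0, 10)
hv = hypervolume(front, reference)
```

The two boxes have volumes 2 and 4 and overlap in a unit cube, so the union has volume 5.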
(2) Population-Based Metrics
Population diameter (diam). The diam metrics measure the spread of all population members in the search space using a distance metric for individuals, in our case Algorithm 1. The maximum diameter computes the largest distance between any two individuals of the population $P$: $maxdiam(P) = \max_{x_i, x_j \in P} dist(x_i, x_j)$ [5, 20], showing the absolute spread of $P$. To account for outliers, we can compute the average diameter (avgdiam) as the average of all pairwise distances between all individuals of $P$ (Eq. 1). Additionally, we compute the minimum diameter to see how close individuals are in the search space, or even identical: $mindiam(P) = \min_{x_i, x_j \in P, i \neq j} dist(x_i, x_j)$.
Concerning each plot for Sapienz and all apps (see Fig. 1(c)), the upper, middle, and lower curves are maxdiam, avgdiam, and mindiam, respectively. For each curve, we see a clear trend that the metrics decrease over time, which is typical for genetic algorithms due to the crossover. However, the metrics decrease drastically for Sapienz in the first 25 generations. The avgdiam decreases substantially for each app. The maxdiam decreases similarly but stays higher for hotdeath and k9mail than for the other apps. The development of the avgdiam and maxdiam indicates that all individuals are continuously getting closer to each other in the search space, thus becoming more similar. The population even contains identical solutions, as indicated by mindiam reaching 0.
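The three diameter metrics can be sketched as follows. The sketch averages over unordered pairs of distinct individuals, which is an assumption: the exact normalization of the paper's Eq. 1 is not recoverable here and may differ. The distance function is a toy stand-in for Algorithm 1 on single sequences, and the individuals are hypothetical strings whose letters stand for events.

```python
from itertools import combinations


def seq_dist(a, b):
    """Toy stand-in for Algorithm 1: length difference plus differing positions."""
    return abs(len(a) - len(b)) + sum(x != y for x, y in zip(a, b))


def diameters(population, dist):
    """Return (maxdiam, avgdiam, mindiam) over all unordered pairs of distinct members."""
    pairwise = [dist(a, b) for a, b in combinations(population, 2)]
    return max(pairwise), sum(pairwise) / len(pairwise), min(pairwise)


population = ["abc", "abd", "abd", "xyz"]  # needs at least two individuals
maxdiam, avgdiam, mindiam = diameters(population, seq_dist)
```

Because the population contains the individual "abd" twice, mindiam is 0, which mirrors the duplicate detection discussed in the text.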
Relative population diameter (reldiam). Bachelet further proposes the relative population diameter, which is the avgdiam in proportion to the largest possible distance $d$: $reldiam(P) = avgdiam(P) / d$. This metric is indicative of the concentration of the population in the search space. A small reldiam indicates that the population members are grouped together in a region of the space.
For Sapienz, the largest possible distance between two test suites is 2500, in which case they differ in all events (up to 500 for a test sequence) for all of their five individual test sequences. For Sapienz and all apps (cf. Fig. 1(d)), reldiam starts at a high level of around 0.9, indicating that the solutions are spread in the search space. Then, it decreases in the first 25 generations to around 0.4 (aarddict, MunchLife, and passwordmanager) and below 0.3 (hotdeath and k9mail), indicating a grouping of the solutions in one or more regions of the search space.
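As a worked example of these numbers, and assuming the symbol $d$ for the largest possible distance, a reldiam of 0.4 corresponds to an average distance of 1000 differing events between two test suites:

```latex
d = 5 \times 500 = 2500, \qquad
reldiam(P) = \frac{avgdiam(P)}{d}, \qquad
reldiam(P) = 0.4 \;\Rightarrow\; avgdiam(P) = 0.4 \times 2500 = 1000 .
```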
(3) Metrics Based on the Connectedness of Pareto-Optimal Solutions
The following metrics analyze the connectedness and thus the clusters of Pareto-optimal solutions in the search space [9, 22]. For this purpose, we consider a graph in which Pareto-optimal solutions are vertices. The edges connecting the vertices are labeled with weights, which are the number of moves a neighborhood operator has to make to reach one vertex from another. This results in a graph of fully connected Pareto-optimal solutions. Introducing a limit $k$ on the weights and removing the edges whose weights are larger than $k$ leads to varying sizes of connected components (clusters) in the graph. This graph can be analyzed by metrics to characterize the Pareto-optimal solutions in the search space [22, 12].
In our case, the weights are determined by the distance metric for test suites based on the mutation operator of Sapienz (cf. Algorithm 1). We determined $k$ experimentally by investigating four candidate values. While a high value results in a single cluster of Pareto-optimal solutions, a low value results in a high number of singletons (i.e., clusters with one solution). Thus, two test suites (vertices) are connected (neighbors) in the graph if they differ in fewer than $k$ events across their test sequences as computed by Algorithm 1.
Proportion of Pareto-optimal solutions in clusters (pconnec). This metric divides the number of vertices (Pareto-optimal solutions) that are members of clusters (excl. singletons) by the total number of vertices in the graph. A high pconnec indicates a grouping of the Pareto-optimal solutions in the search space.
As shown in Fig. 1(e), pconnec is relatively low during the first generations before it increases for all apps. For MunchLife, passwordmanager, and hotdeath, pconnec reaches 1 meaning that all Pareto-optimal solutions are in clusters, while it converges around 0.7 and 0.8 for aarddict and k9mail respectively. This indicates that the Pareto-optimal solutions are grouped in the search space.
Number of clusters (nconnec). We further analyze in how many areas of the search space (clusters) the Pareto-optimal solutions are grouped. Thus, nconnec counts the number of clusters in the graph [22, 12]. A high (low) nconnec indicates that the Pareto-optimal solutions are spread in many (few) areas of the search space.
Fig. 1(f) plots nconnec for Sapienz and all apps. The y-axis of each plot denotes nconnec. Initially, the Pareto-optimal solutions are distributed in 2–4 clusters, then grouped in one cluster. An exception is k9mail, for which there always exists more than one cluster. Except for k9mail, this indicates that the Pareto-optimal solutions are grouped in one area of the search space.
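The cluster-based metrics can be illustrated with a small union-find sketch. It connects two solutions when their distance is smaller than a threshold k and then derives pconnec and nconnec from the connected components. Two points are assumptions: nconnec counts only components with more than one member (singletons excluded, inferred from the reported results), and the solutions are hypothetical integers compared by absolute difference rather than test suites compared by Algorithm 1.

```python
def clusters(front, dist, k):
    """Connected components of the graph linking solutions with distance < k."""
    parent = list(range(len(front)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(front)):
        for j in range(i + 1, len(front)):
            if dist(front[i], front[j]) < k:
                parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i, solution in enumerate(front):
        groups.setdefault(find(i), []).append(solution)
    return list(groups.values())


def pconnec(components):
    """Share of solutions that belong to a component with more than one member."""
    total = sum(len(c) for c in components)
    return sum(len(c) for c in components if len(c) > 1) / total


def nconnec(components):
    """Number of clusters, not counting singletons (assumption)."""
    return sum(1 for c in components if len(c) > 1)


front = [0, 1, 2, 10, 11, 30]  # hypothetical one-dimensional stand-ins
components = clusters(front, lambda a, b: abs(a - b), k=3)
```

With k = 3 the toy front splits into the clusters {0, 1, 2} and {10, 11} plus the singleton {30}.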
Minimum distance for a connected graph (kconnec). This metric identifies the minimum $k$ so that all Pareto-optimal solutions are members of one cluster [22, 12]. Thus, kconnec quantifies the spread of all Pareto-optimal solutions in the search space.
For Sapienz, Fig. 1(g) plots kconnec over the generations. Similarly to the diam metrics (cf. Fig. 1(c)), kconnec decreases, moderately for hotdeath (from initially 700 to 600) and k9mail (1000 → 800), and drastically for passwordmanager (1200 → 200), MunchLife (1000 → 200), and aarddict (600 → 100). This indicates that all Pareto-optimal solutions are getting closer in the search space as the spread of the cluster is decreasing.
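One way to compute such a minimum threshold is sketched below: the smallest limit that keeps the graph connected equals the largest edge weight on a minimum spanning tree. The sketch assumes the convention that edges with a weight of at most k are kept, uses Prim's algorithm, and again works on hypothetical integers with the absolute difference as distance.

```python
def kconnec(front, dist):
    """Largest edge weight on a minimum spanning tree of the complete distance graph."""
    n = len(front)
    if n < 2:
        return 0
    best = [dist(front[0], front[j]) for j in range(n)]  # cheapest link into the tree
    in_tree = [False] * n
    in_tree[0] = True
    bottleneck = 0
    for _ in range(n - 1):
        # Add the vertex that is closest to the tree built so far.
        j = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        bottleneck = max(bottleneck, best[j])
        in_tree[j] = True
        for i in range(n):
            if not in_tree[i]:
                best[i] = min(best[i], dist(front[j], front[i]))
    return bottleneck


front = [0, 1, 2, 10, 11, 30]  # hypothetical stand-ins for Pareto-optimal suites
threshold = kconnec(front, lambda a, b: abs(a - b))
```

For the toy front the spanning tree uses the gaps 1, 1, 8, 1, and 19, so a threshold of 19 is needed to reach the outlying solution.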
Number of Pareto-optimal solutions in the largest cluster (lconnec). It determines the size of the largest cluster by the number of members, showing how many Pareto-optimal solutions are in the densest area of the search space.
Fig. 1(h) plots lconnec over the generations, bounded by the population size of 50. lconnec increases during the search for aarddict and hotdeath, and even more for MunchLife. This indicates that the largest cluster is indeed large so that many Pareto-optimal solutions are grouped in one area of the search space. In contrast, lconnec always stays lower for passwordmanager and k9mail, indicating smaller largest clusters than for the other apps.
Proportion of hypervolume covered by the largest cluster (hvconnec). Besides lconnec, we compute the relative size of the largest cluster in terms of hypervolume (hv). Thus, hvconnec is the proportion of the overall hv covered by the Pareto-optimal solutions in the largest cluster. It quantifies how this cluster in the search space dominates in the objective space and contributes to the hv.
For Sapienz (cf. Fig. 1(i)), hvconnec varies a lot during the first generations, then stabilizes at a high level for all apps. For aarddict, MunchLife, and passwordmanager, the largest cluster accounts for the covered hv since there is only one cluster left (cf. nconnec in Fig. 1(f)). For hotdeath and k9mail, other clusters remain (cf. nconnec), but they cover only a small part of the hv. This indicates that the largest cluster covers the largest proportion of the hv, and thus contributes most to the Pareto front.
The results characterizing the fitness landscape of Sapienz reveal insights about how Sapienz manages the search problem of generating test suites for apps.
Firstly, the development of the proportion of Pareto-optimal solutions (cf. Fig. 1(a)) and hypervolume (cf. Fig. 1(b)) indicates a stagnation of the search after 25 generations. The drastically increasing proportion of Pareto-optimal solutions in some cases may indicate a problem of dominance resistance, i.e., the search cannot produce new solutions that dominate the current, poorly performing but locally non-dominated solutions. In other cases, the proportion remains low, i.e., the search cannot find many non-dominated solutions.
Secondly, the development of the population diameters (cf. Fig. 1(c)) indicates a decreasing diversity of all solutions during the search. The development of the relative population diameter (cf. Fig. 1(d)) supports this observation and indicates that the population members are concentrated in the search space. The minimum diameter (cf. Fig. 1(c)) even indicates that the population contains duplicates of solutions, which reduces the genetic variation in the population.
Thirdly, the development of the proportion of Pareto-optimal solutions in clusters (cf. Fig. 1(e)) indicates a grouping of these solutions in the search space, mostly in one cluster (cf. Fig. 1(f)). Another indicator for the decreasing diversity of the Pareto-optimal solutions is the decreasing minimum distance required to form one cluster of all these solutions (cf. Fig. 1(g)). Additionally, the largest cluster is often indeed large in terms of number of Pareto-optimal solutions (cf. Fig. 1(h)), and hypervolume covered by these solutions (cf. Fig. 1(i)). Even if there exist multiple clusters of Pareto-optimal solutions, the largest cluster still contributes most to the overall hypervolume and thus, to the Pareto front.
In summary, the fitness landscape analysis of Sapienz indicates a stagnation of the search while the diversity of all solutions decreases in the search space.
Given the fitness landscape analysis results, Sapienz suffers from a decreasing diversity of solutions in the search space over time. It is known that the performance of genetic algorithms is influenced by diversity [30, 21]. A low diversity may lead the search to a local optimum that cannot be escaped easily. Thus, diversity is important to address dominance resistance so that the search can produce new solutions that dominate poorly performing, locally non-dominated solutions. Moreover, Shir et al. [26, p. 95] report that promoting diversity in the search space does not hamper “the convergence to a precise and diverse Pareto front approximation in the objective space of the original algorithm”.
Therefore, we extended Sapienz to Sapienz^div by integrating mechanisms into the search algorithm that promote the diversity of the population in the search space. (Sapienz^div is available at: https://github.com/thomas-vogel/sapienzdiv-ssbse19.) We developed four mechanisms that extend the Sapienz algorithm at different steps: at the initialization, before and after the variation, and at the selection. Algorithm 2 shows the extended search algorithm of Sapienz^div and highlights the novel mechanisms in blue. We now discuss these mechanisms.
Diverse initial population. As the initial population may affect the results of the search, we assume that a diverse initial population could be a better start for the exploration. Thus, we extend the generation of the initial population to promote diversity. Instead of generating only as many solutions as the population size, we generate a larger set of solutions (line 9 in Algorithm 2). Then, we select those solutions from this set that are most distant from each other using Algorithm 1, to form the first population (line 10).
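How the most distant solutions are picked from the larger set is not spelled out in this text. One plausible way, shown here purely as an assumption, is a greedy max-min (farthest-first) selection: start from the candidate with the largest total distance to all others and repeatedly add the candidate whose minimum distance to the already chosen ones is largest. The candidates are hypothetical integers.

```python
def select_most_distant(candidates, size, dist):
    """Greedy farthest-first selection of `size` mutually distant candidates."""
    pool = list(candidates)
    # Seed with the candidate that is farthest from all others in total.
    seed = max(pool, key=lambda c: sum(dist(c, other) for other in pool))
    chosen = [seed]
    pool.remove(seed)
    while len(chosen) < size and pool:
        # Next pick: the candidate whose nearest chosen neighbor is farthest away.
        farthest = max(pool, key=lambda c: min(dist(c, s) for s in chosen))
        chosen.append(farthest)
        pool.remove(farthest)
    return chosen


candidates = [0, 1, 2, 10, 11, 30]  # hypothetical oversized initial set
initial_population = select_most_distant(candidates, 3, lambda a, b: abs(a - b))
```

The greedy choice is only a heuristic: it does not guarantee the subset with the maximum overall spread, but it avoids enumerating all subsets.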
Adaptive diversity control. This mechanism dynamically controls the diversity if the population members are becoming too close in the search space relative to the initial population. It further makes the algorithm adaptive as it uses feedback of the search to adapt the search.
To quantify the diversity of a population, we use the average population diameter (avgdiam) defined in Eq. 1. At the beginning of each generation, the avgdiam of the current population is calculated (line 15) and compared to the diversity of the initial population (line 16), which is calculated once in line 12. The comparison checks whether the current diversity has decreased to less than a configurable fraction (threshold) of the initial diversity.
In this case, the offspring is obtained by generating new solutions using the original Sapienz method to initialize a population (line 17). The next population is formed by selecting the most distant individuals from the current population and offspring (line 19). In the other case, the variation operators (crossover and mutation) of Sapienz are applied to obtain the offspring (line 21) followed by the selection. Thus, this mechanism promotes diversity by inserting new individuals into the population, having an effect of restarting the search.
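The branching logic of this mechanism can be sketched as a single function. All names and parameters are hypothetical: `avgdiam`, `new_random`, and `vary` are passed in as callables, the threshold of 0.5 in the example is illustrative rather than the value used in the study, and the sketch assumes that a restart generates as many new solutions as there are population members.

```python
def make_offspring(population, initial_diversity, threshold, avgdiam, new_random, vary):
    """Return (offspring, restarted) for one generation."""
    if avgdiam(population) < threshold * initial_diversity:
        # Diversity dropped too far: create fresh solutions instead of varying.
        return [new_random() for _ in population], True
    # Otherwise apply the usual crossover and mutation.
    return vary(population), False


# Stub callables for illustration only.
population = [1, 2, 3]
offspring, restarted = make_offspring(
    population,
    initial_diversity=1000.0,
    threshold=0.5,
    avgdiam=lambda pop: 100.0,  # pretend the current diversity is low
    new_random=lambda: 0,
    vary=lambda pop: list(pop),
)
```

In the restart case, the next population would then be formed by a distance-based selection over the current population and this offspring, as described above.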
Duplicate elimination. The fitness landscape analysis found duplicated test suites in the population. Eliminating duplicates is one technique to maintain diversity and improve search performance [30, 25]. Thus, we remove duplicates after reproduction and before selection in the current population and offspring (line 23). Duplicated test suites are identified by a distance of 0 computed by Algorithm 1.
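A minimal sketch of this step keeps an individual only if its distance to every individual kept so far is non-zero. The distance function is a parameter, and the integer individuals in the example are hypothetical stand-ins for test suites.

```python
def remove_duplicates(individuals, dist):
    """Keep the first occurrence of each individual; drop those at distance 0."""
    unique = []
    for individual in individuals:
        if all(dist(individual, kept) != 0 for kept in unique):
            unique.append(individual)
    return unique


merged = [0, 1, 1, 2, 0]  # current population plus offspring (hypothetical)
deduplicated = remove_duplicates(merged, lambda a, b: abs(a - b))
```

The pairwise check is quadratic in the number of individuals, which is unproblematic for populations of the size used here.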
Hybrid selection. To promote diversity in the search space, the selection is extended by dividing it in two parts: (1) The non-dominated sorting of NSGA-II is performed as in Sapienz (lines 24–31 in Algorithm 2) to obtain the solutions sorted by domination rank and crowding distance. (2) From this sorted list, the best solutions form the next population, leaving room for a configurable number of diverse solutions to be included (line 32). These diverse solutions are selected as the most distant solutions from the current population and offspring (line 33) using the distance metric of Algorithm 1. Finally, they are added to the next population (line 34).
While the NSGA-II sorting considers the diversity of solutions in the objective space (crowding distance), the selection of Sapienz^div also considers the diversity of solutions in the search space, which makes the selection hybrid.
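The two-part selection can be sketched as follows. The function assumes that the list `ranked` has already been sorted by NSGA-II (domination rank, then crowding distance) and fills the remaining slots greedily with the solutions farthest from those already selected. The parameter names, the greedy rule, and the choice not to pick a solution twice are assumptions for illustration.

```python
def hybrid_select(ranked, pool, size, n_div, dist):
    """Take the best size - n_div ranked solutions, then add n_div distant ones."""
    next_population = list(ranked[: size - n_div])
    remaining = [s for s in pool if s not in next_population]
    while len(next_population) < size and remaining:
        # Pick the solution whose nearest already-selected neighbor is farthest away.
        farthest = max(
            remaining,
            key=lambda c: min((dist(c, s) for s in next_population), default=0),
        )
        next_population.append(farthest)
        remaining.remove(farthest)
    return next_population


ranked = [5, 6, 7, 0, 30, 8]  # hypothetical solutions, best first
selected = hybrid_select(ranked, ranked, size=4, n_div=2, dist=lambda a, b: abs(a - b))
```

In the example, the two best-ranked solutions 5 and 6 are kept, and the two diverse slots go to 30 and 0, which lie farthest from them.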
We evaluate Sapienz^div in a head-to-head comparison with Sapienz to investigate the benefits of the diversity-promoting mechanisms. Our evaluation targets five research questions (RQ) with two empirical studies similarly to [17]:
RQ1 How does the coverage achieved by Sapienz^div compare to Sapienz?
RQ2 How do the faults found by Sapienz^div compare to Sapienz?
RQ3 How does Sapienz^div compare to Sapienz concerning the length of their fault-revealing test sequences?
RQ4 How does the runtime overhead of Sapienz^div compare to Sapienz?
RQ5 How does the performance of Sapienz^div compare to the performance of Sapienz with inferential statistical testing?
We conduct two empirical studies, Study 1 to answer RQ1-4, and Study 2 to answer RQ5. The execution of both studies was distributed on eight servers (each with 2× Intel(R) Xeon(R) CPU E5-2620 @ 2.00GHz and 64 GB RAM), while each server runs one approach to test one app at a time using Android emulators (Android KitKat version, API 19). We configured Sapienz and Sapienz^div as in the experiment for the fitness landscape analysis (cf. Section 3.2) and in [17]. The only difference is that we test each app for 10 generations, in contrast to Mao et al. [17] who test each app for one hour, since we were not in full control of the servers running in the cloud. However, we still report the execution times of both approaches (RQ4). Moreover, we configured the novel parameters of Sapienz^div as follows: , , and . For Study 1 we perform one run to test each app over 10 generations by each approach. For Study 2 we perform repetitions of such runs for each app and approach.
Study 1 In this study we use 66 of the 68 F-Droid benchmark apps provided by Choudhary et al. and used to evaluate Sapienz. (We exclude aGrep and frozenbubble as Sapienz/Sapienz^div cannot start these apps.) The results on each app are shown in Table 1 where S refers to Sapienz, Sd to Sapienz^div, Coverage to the final statement coverage achieved, #Crashes to the number of revealed unique crashes, Length to the average length of the minimal fault-revealing test sequences (or ‘–’ if no fault has been found), and Time (min) to the execution time in minutes of each approach to test the app over 10 generations.
RQ1 Sapienz^div achieves a higher final coverage for 15 apps, Sapienz for 24 apps, and both achieve the same coverage for 27 apps. Fig. 4 shows that a similar coverage is achieved on average by both approaches on the 66 apps, providing initial evidence that Sapienz^div and Sapienz perform similarly with respect to coverage.
RQ2 To report about the found faults, we count the total crashes, out of which we also identify the unique crashes (i.e., their stack traces are different from the traces of the other crashes of the app). Moreover, we exclude faults caused by the Android system (e.g., native crashes) and test harness (e.g., code instrumentation).
As shown in Table 4, Sapienz^div revealed more total (6941 vs 5974) and unique (141 vs 119) crashes, and found faults in more apps (46 vs 43) than Sapienz. Moreover, it found 51 unique crashes undetected by Sapienz, Sapienz found 29 unique crashes undetected by Sapienz^div, and both found the same 90 unique crashes. The results for the 66 apps provide initial evidence that Sapienz^div can outperform Sapienz in revealing crashes.
RQ3 Considering the minimal fault-revealing test sequences (i.e., the shortest of all sequences causing the same crash), their mean length is 244 for Sapienz^div and 209 for Sapienz on the 66 apps (cf. Table 4). This provides initial evidence that Sapienz^div produces longer fault-revealing sequences than Sapienz.
RQ4 Considering the mean execution time of testing one app over 10 generations, Sapienz^div takes 118 and Sapienz 101 min. for the 66 apps. Fig. 4 shows that the diversity-promoting mechanisms of Sapienz^div cause a noticeable runtime overhead compared to Sapienz. This provides initial evidence about the cost of promoting diversity at which an improved fault detection can be obtained.
Study 2 In this study we use the same 10 F-Droid apps as in the statistical analysis in [17]. We test for statistical significance (significance level 0.05) and use the Vargha-Delaney effect size to characterize small, medium, and large differences between Sapienz^div and Sapienz (thresholds 0.56, 0.64, and 0.71 respectively).
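For readers unfamiliar with the Vargha-Delaney effect size, the following sketch computes the A12 statistic, i.e., the probability that a value drawn from the first sample is larger than one drawn from the second, counting ties as one half. The classification uses the thresholds 0.56, 0.64, and 0.71 named in the text; applying them symmetrically around 0.5 and with "greater or equal" is an assumption, and the sample values are hypothetical.

```python
def a12(xs, ys):
    """Vargha-Delaney A12: P(x > y) + 0.5 * P(x == y) over all pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    equal = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * equal) / (len(xs) * len(ys))


def magnitude(a):
    """Classify the difference, treating values below 0.5 symmetrically."""
    scaled = max(a, 1 - a)
    if scaled >= 0.71:
        return "large"
    if scaled >= 0.64:
        return "medium"
    if scaled >= 0.56:
        return "small"
    return "negligible"


# Hypothetical numbers of crashes found in three repetitions per approach.
crashes_div = [3, 4, 5]
crashes_base = [1, 2, 3]
effect = a12(crashes_div, crashes_base)
```

An A12 of 0.5 means that neither approach tends to yield larger values, while values close to 0 or 1 indicate that one approach almost always does.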
RQ5 The results are presented by boxplots in Fig. 5 for each of the 10 apps and concern: coverage, #crashes, sequence length, and time (cf. Study 1). The effect sizes for these concerns are shown in Table 2, which compares Sapienz^div and Sapienz (Sd-S) and emphasizes statistically significant results in bold. Sapienz significantly outperforms Sapienz^div with large effect size on all apps for execution time. The remaining results are inconclusive. Sapienz^div significantly outperforms Sapienz with large effect size on only 3/10 apps for coverage, 2/10 for #crashes, and almost 1/10 for length. The remaining results are not statistically significant or do not indicate large differences.
Table 2: Effect sizes comparing Sapienz^div and Sapienz (Sd-S) per app. Columns: App | Ver. | Coverage | #Crashes | Length | Time.
Study 1 provided initial evidence that Sapienz^div can find more faults than Sapienz while achieving a similar coverage but using longer sequences. The fault revelation capabilities of Sapienz^div seemed especially promising; however, we could not confirm them by the statistical analysis in Study 2. The results of Study 2 are inconclusive in differentiating both approaches by their performance. Potentially, the diversity promotion of Sapienz^div does not result in the desired effect in the first 10 generations we considered in the studies. In contrast, it might show a stronger effect at later stages since we observed in the fitness landscape analysis that the search of Sapienz stagnates after 25 generations.
Internal validity. A threat to the internal validity is a bias in the selection of the apps we took from [6, 17] although the 10 apps for Study 2 were selected by an “unbiased random sampling” [17, p. 103]. We further use the default configuration of Sapienz and Sapienz^div without tuning the parameters to reduce the threat of overfitting to the given apps. Finally, the correctness of the diversity-promoting mechanisms is a threat that we addressed by computing the fitness landscape analysis metrics with Sapienz^div to confirm the improved diversity.
External validity. As we used 5 (for analyzing the fitness landscape) and 76 Android apps (for evaluating Sapienz^div) out of over 2,500 apps on F-Droid and millions on Google Play, we cannot generalize our findings although we rely on the well-accepted “68 F-Droid benchmark apps”.
Related work exists in two main areas: approaches on test case generation for apps, and approaches on diversity in search-based software testing (SBST).
Test case generation for apps. Such approaches use random, model-based, or systematic exploration strategies for the generation. Random strategies implement UI-guided test input generators where events on the GUI are selected randomly. Dynodroid extends the random selection using weights and frequencies of events. Model-based strategies such as PUMA, DroidBot, MobiGUITAR, and Stoat apply model-based testing to apps. Systematic exploration strategies range from full-scale symbolic execution to evolutionary algorithms [15, 17]. None of these approaches explicitly manages diversity, except for Stoat, which encodes the diversity of sequences into the objective function.
Diversity in SBST. Diversity of solutions has been researched for test case selection and generation. For the former, promoting diversity can significantly improve the performance of state-of-the-art multi-objective genetic algorithms. For the latter, promoting diversity results in increased lengths of tests without improved coverage, matching our observation. Both approaches witness that diversity promotion is crucial and its realization “requires some care” [24, p. 782].
In this paper, we reported on our descriptive study analyzing the fitness landscape of Sapienz, indicating a lack of diversity during the search. Therefore, we proposed Sapienz^div, which integrates four mechanisms to promote diversity. The results of the first empirical study on the 68 F-Droid benchmark apps were promising for Sapienz^div but they could not be confirmed statistically by the inconclusive results of the second study with 10 further apps. As future work, we plan to extend the evaluation to more generations to see the effect of Sapienz^div when the search of Sapienz stagnates. Moreover, we plan to identify diversity-promoting mechanisms that quickly yield benefits in the first few generations.
Acknowledgments This work has been developed in the FLASH project (GR 3634/6-1) funded by the German Science Foundation (DFG) and has been partially supported by the 2018 Facebook Testing and Verification research award.
Paquete, L., Stützle, T.: Clusters of non-dominated solutions in multiobjective combinatorial optimization: An experimental analysis. In: Multiobjective Programming and Goal Programming, pp. 69–77. Springer (2009)
Smith, T., Husbands, P., Layzell, P.J., O’Shea, M.: Fitness landscapes and evolvability. Evolutionary Computation 10(1), 1–34 (2002)