Fuzzing [Miller1990, Godefroid2020Fuzzing] has shown great success in finding vulnerabilities and erroneous behavior in a variety of different programs and software [noauthor_afl_2018, noauthor_libfuzzer_2018]. A fuzzer generates random input data and enhances or mutates it to trigger potential defects or software vulnerabilities. In general, fuzzing comes in various flavors: blackbox, whitebox, and greybox fuzzing [Godefroid2020Fuzzing]. While blackbox fuzzers have no knowledge about the internals of the application under test and apply random input generation, whitebox fuzzers can unleash the full power of program analysis techniques and use the retrieved context information to generate inputs. Greybox fuzzing strikes a balance between these two cases: it employs a lightweight instrumentation of the program to collect feedback about the generated inputs, which is used to guide the mutation process. This approach reduces the overhead significantly and makes greybox fuzzing an extremely successful vulnerability detection technique [Bohme2017AFLGo]. Nevertheless, greybox fuzzers still struggle to create semantically and syntactically correct inputs [pham_smart_2019]. This lack of structural input awareness is considered their main limitation. Since greybox fuzzers usually apply mutations to the bit-level representation of an input, it is hard to preserve a high-level, syntactically correct structure. Yet, to detect vulnerabilities deep inside programs, complex input files are needed.
In summary, this paper makes the following contributions:
We propose an evolutionary grammar-based fuzzing approach (EvoGFuzz) that combines the concept of probabilistic grammars and evolutionary algorithms to generate test inputs that trigger defects and unwanted behavior.
We implement EvoGFuzz as an extension of a probabilistic grammar-based fuzzer [pavese_inputs_2018] and the ANTLR parser generator.
We demonstrate the effectiveness of EvoGFuzz on ten real-world examples across multiple programming languages and input formats, and provide a comparison with the original approach [pavese_inputs_2018].
2 Related Work
EvoGFuzz focuses on the generation of test inputs to reveal defects and unwanted behavior. Existing approaches in this area can be separated into search-based, generative, learning-based, and combined techniques [Anand2013SurveyTesting, Orso2014SoftwareTesting].
Search-based input generation. Test input generation can be formulated as a search problem to be solved by meta-heuristic search [Harman2012SBSE, fuzzingbook2019]. A simple way is to randomly generate inputs, as employed in the original work on fuzzing by Miller et al. [Miller1990]. More sophisticated random testing strategies are directed by feedback [Pacheco2007Randoop]. Evolutionary search applies fitness functions to select promising inputs, while the inputs are generated by mutating an initial population. Recent advances in fuzzing show the strength of such search-based techniques [noauthor_afl_2018, Bohme2016AFLFast, Lemieux2018FairFuzz]. One of the most popular greybox fuzzers is AFL [noauthor_afl_2018], which applies a genetic algorithm guided by coverage information. While these techniques can successfully generate error-revealing inputs, they miss required information about the program to generate syntactically and semantically correct inputs [pham_smart_2019, Wang2019Superion].
Generative input generation. Hanford [Hanford1970] introduced grammar-based test generation with his syntax machine. Recent advances in grammar-based fuzzing pick up this idea and use a grammar to generate inputs that are syntactically correct [Godefroid2008GrammarWhiteBoxFuzzing, Holler2012FuzzingCodeFragments]. The main focus of grammar-based fuzzers is on parsers and compilers [Yang2011CSmith, Holler2012FuzzingCodeFragments]. Augmenting grammar production rules with probabilities (aka probabilistic grammars) allows generating inputs based on rule prioritization. Pavese et al. [pavese_inputs_2018] employ this notion: they take an input grammar, augment it with probabilities, and generate structured inputs that represent common or very uncommon inputs. In general, generative approaches require the input grammar or language specification, which might not always be available or accurate enough. Therefore, Höschele and Zeller [Hoschele2017AUTOGRAM2] proposed input grammar mining.
Learning-based input generation.
In addition to grammar mining, machine learning is increasingly applied for software testing [Godefroid2017LearnFuzz, Cummins2018, Liu2017]. These techniques learn input structures from seed inputs and use them to generate new test sequences. They target web browsers [Godefroid2017LearnFuzz], compilers [Cummins2018], and mobile apps [Liu2017].
Combined techniques. Recently, many research efforts have focused on combining grammar-based and coverage-based fuzzing (CGF), with the goal of using the grammar to generate valid inputs while using CGF to further explore the input space. Le et al. [Le2019Saffron] propose a fuzzer that generates inputs for worst-case analysis. While they leverage a grammar to generate seed inputs for a CGF, they continuously complement/refine the grammar. Atlidakis et al. [Atlidakis2020] propose Pythia to test REST APIs. It learns a statistical model that includes the structure of specific API requests as well as their dependencies on each other. While Pythia’s mutation strategies use this statistical model to generate valid inputs, coverage-guided feedback is used to select inputs that cover new behavior. Other fuzzing works aim to incorporate grammar knowledge within their mutation strategies [pham_smart_2019, Wang2019Superion]. Similar to Pythia, we use seed inputs to generate an initial probabilistic grammar. However, with every iteration we retrieve new probabilities for the grammar while also mutating these probabilities, which enables evolution of the grammar and a broad exploration of the input space.
3 Evolutionary Grammar-Based Fuzzing (EvoGFuzz)
In this section, we will present EvoGFuzz, a language-independent evolutionary grammar-based fuzzing approach to detect defects and unwanted behavior.
The key idea of EvoGFuzz is to combine probabilistic grammar-based fuzzing and evolutionary computing. This combination aims to direct the probabilistic generation of test inputs toward “interesting” and “complex” inputs. The motivation is that “interesting” and “complex” inputs are more likely to reveal defects in a software under test (SUT). For this purpose, we extend an existing probabilistic grammar-based fuzzer [pavese_inputs_2018] with an evolutionary algorithm (Figure 1).
Similarly to the original fuzzer, EvoGFuzz requires a correctly specified grammar for the target language, that is, the input language of the SUT. From this grammar, we create a so-called counting grammar (Activity 1 in Figure 1) that describes the same language but allows us to measure how frequently individual choices of production rules are used when parsing input files of the given language. Thus, the counting grammar allows us to learn a probabilistic grammar from a sampled initial corpus of inputs (Activity 2). Particularly, we learn the probabilities for selecting choices of production rules in the grammar, which correspond to the relative numbers of using these choices when parsing the initial corpus. Consequently, we use the probabilistic grammar to generate more input files that resemble features of the initial corpus, that is, “more of the same” [pavese_inputs_2018] is produced (Activity 3). Whereas this activity is the last step of the approach by Pavese et al. [pavese_inputs_2018], it is the starting point of the evolutionary process in EvoGFuzz as it generates an initial population of test inputs. An individual of the population therefore corresponds to a single input file for the SUT.
The evolutionary algorithm of EvoGFuzz starts a new iteration by analyzing each individual using a fitness function that combines feedback and structural scores (Activity 4). By executing the SUT with an individual as input, the feedback score determines whether the individual triggers an exception. If so, this input is considered an “interesting” input. The structural score quantifies the “complexity” of the individual. If the stopping criterion is fulfilled (e.g., a time budget has been used up), the exception-triggering input files are returned. Otherwise, the “interesting” and most “complex” individuals are selected for evolution (Activity 5). The selected individuals are used to learn a new probabilistic grammar, particularly the probability distribution for the production rules, similar to Activity 2, which, however, used a sampled initial corpus of inputs (Activity 6). Thus, the new probabilistic grammar supports generating “more of the same” interesting and complex inputs. To mitigate a genetic drift toward specific features of the selected individuals, we mutate the new probabilistic grammar by altering the probabilities of randomly chosen production rules (Activity 7). Finally, using the mutated probabilistic grammar, we again generate new input files (Activity 8), starting the next evolutionary iteration.
Assuming that inputs similar to “interesting” and “complex” inputs are more likely to reveal defects in the SUT, EvoGFuzz guides the generation of inputs toward “interesting” and “complex” inputs by iteratively generating, evaluating, and selecting such inputs, and by learning (updating) and mutating the probabilistic grammar. In contrast to typical evolutionary algorithms, EvoGFuzz does not directly evolve the individuals (test input files) by crossover or mutation but rather the probabilistic grammar, whose probabilities are iteratively learned and mutated. In the following, we discuss each activity of EvoGFuzz in detail.
3.1 Probabilistic Grammar-Based Fuzzing (Activities 1–3)
Pavese et al. [pavese_inputs_2018] have proposed probabilistic grammar-based fuzzing to generate test inputs that are similar to a set of given inputs. Using a (context-free) grammar results in syntactically correct test inputs being generated. However, the production rules of the grammar are typically chosen randomly to generate inputs, which does not support influencing the features of the generated inputs. To mitigate this situation, Pavese et al. use a probabilistic grammar, in which probabilities are assigned to choices of production rules. The distribution of these probabilities is learned from a sample of inputs. Test inputs produced by a learned probabilistic grammar are therefore similar to the sampled inputs. Pavese et al. call this idea “more of the same” [pavese_inputs_2018] because the produced inputs share the same features as the sampled inputs.
Technically, a given context-free grammar for the input language of the SUT is transformed to a so-called counting grammar by adding a variable to each choice of all production rules (Activity 1). Such a variable counts how often its associated choice of a production rule is executed when parsing a given sample of inputs. Knowing how often a production rule is executed in total during parsing, the probability distribution of all choices of the rule is determined. Thus, using the counting grammar to parse a sample of input files, the variables of the grammar are filled with values according to the executed production rules and their choices (Activity 2). This results in a probabilistic grammar, in which a probability is assigned to each choice of a production rule. Using this probabilistic grammar, we can generate new input files that resemble features of the sampled input files since both sets of input files have the same probability distribution for the production rules (Activity 3). Thus, EvoGFuzz uses the approach by Pavese et al. to initially learn a probabilistic grammar from input files (Activities 1 and 2), and to generate new input files for the (initial) population (Activity 3), which starts an evolutionary iteration.
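To illustrate Activity 2, the following sketch learns choice probabilities by counting how often each alternative of a production rule occurs in derivation trees of sample inputs. The toy grammar, the tree encoding, and the function name are illustrative assumptions of ours; the actual implementation parses inputs with an ANTLR-generated counting grammar.

```python
from collections import defaultdict

# Toy grammar: each nonterminal maps to its list of alternative expansions.
GRAMMAR = {
    "<value>": [["<number>"], ["<list>"]],
    "<number>": [["0"], ["1"]],
    "<list>": [["[", "<value>", "]"]],
}

def learn_probabilities(trees):
    """Count how often each choice of each rule occurs in the derivation
    trees, then normalize counts into per-rule probabilities (Activity 2)."""
    counts = defaultdict(lambda: defaultdict(int))
    def walk(node):
        symbol, choice_idx, children = node
        counts[symbol][choice_idx] += 1
        for child in children:
            walk(child)
    for tree in trees:
        walk(tree)
    probs = {}
    for symbol, choice_counts in counts.items():
        total = sum(choice_counts.values())
        probs[symbol] = {c: n / total for c, n in choice_counts.items()}
    return probs

# Derivation trees for the sample inputs "1" and "[0]", encoded as
# (symbol, index of the chosen alternative, child nonterminal nodes).
t1 = ("<value>", 0, [("<number>", 1, [])])
t2 = ("<value>", 1, [("<list>", 0, [("<value>", 0, [("<number>", 0, [])])])])

probs = learn_probabilities([t1, t2])
```

With these two samples, <value> expands to a number in two of three occurrences, so the learned grammar would favor plain numbers over nested lists when generating “more of the same”.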
3.2 Evolutionary Algorithm (Activities 4–8)
The evolutionary algorithm of EvoGFuzz evolves a population of test input files by iteratively (i) analyzing the fitness of each individual, (ii) selecting the fittest individuals, (iii) learning a new probabilistic grammar based on the selected individuals, (iv) mutating the learned grammar, and (v) generating new individuals by the mutated grammar that form the population for the next iteration.
Analyze individuals (Activity 4). Our goal is to evolve individuals toward “complex” and “interesting” test inputs, as such inputs are more likely to detect defects and unwanted behavior. To achieve this goal, we use a fitness function that quantifies both aspects: the complexity of an input and whether it is of interest.
Using this ratio, we score the structure of an input file by multiplying the ratio with the number of expansions, putting more weight on the expansions while a ratio close to 1 increases the score. Benefits of this score are its efficient computation and its independence of the input language, although controlling the weighting allows accommodating a specific language.
Concerning the “interesting” inputs, we rely on the feedback from executing the SUT with a concrete input. Being interested in revealing defects in the SUT, we observe whether the input triggers any exception during execution. If so, such an input is assigned the maximum fitness and favored over all other, non-exception-triggering inputs. This results in a binary feedback score for an input file.
Moreover, EvoGFuzz keeps track of all exception-triggering inputs throughout all iterations as it returns these inputs at the end of the evolutionary search.
Finally, we follow the idea of Veggalam et al. [veggalam_ifuzzer_2016] and combine the structural and feedback scores into a single-objective fitness function to be maximized.
Using this fitness function, all input files generated by the previous activity (Activity 3) are analyzed by executing them and computing their fitness.
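A minimal sketch of such a combined fitness function is shown below, with Python’s json.loads standing in for the Java parsers under test; the concrete weighting and maximum feedback score are assumptions for this sketch, not the paper’s actual values.

```python
import json

def feedback_score(sut, data, max_score=1.0):
    """Maximal score iff executing the SUT on `data` raises an exception;
    such inputs are favored over non-exception-triggering ones."""
    try:
        sut(data)
        return 0.0
    except Exception:
        return max_score

def fitness(sut, data, structural_score, weight=1.0):
    # Single-objective combination in the spirit of Veggalam et al.;
    # `weight` is a hypothetical parameter of this sketch.
    return structural_score + weight * feedback_score(sut, data)

# A malformed JSON document triggers an exception and thus gains the bonus.
malformed = '{"a": 1,'
score = fitness(json.loads, malformed, structural_score=3.0)
```

In the real setting, the structural score is computed from the expansions used to derive the input, and exception-triggering inputs dominate the ranking.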
Select individuals (Activity 5). Based on the fitness of the input files, a strategy is needed to select the most promising files among them that will be used for further evolution. To balance selection pressure, EvoGFuzz uses elitism [du_elitism_2018] and tournament selection [miller_genetic_1995]. By elitism, a fixed share of the top input files ranked by fitness is selected. Additionally, the winners of tournaments of a fixed size are selected. The participants of each tournament are randomly chosen from the remaining input files. In contrast to typical evolutionary algorithms, the selected individuals are not directly evolved by crossover or mutation, but are used to learn a new probabilistic grammar.
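The selection step might look as follows; the elite fraction, number of tournaments, and tournament size are illustrative placeholders, since the paper’s concrete settings are not reproduced here.

```python
import random

def select(population, fitness_of, elite_frac=0.1, tournaments=20, k=4, rng=None):
    """Elitism plus tournament selection: keep the top share of the
    population by fitness, then add winners of k-sized tournaments drawn
    from the remaining individuals."""
    rng = rng or random.Random()
    ranked = sorted(population, key=fitness_of, reverse=True)
    n_elite = max(1, int(len(ranked) * elite_frac))
    elites, rest = ranked[:n_elite], ranked[n_elite:]
    if not rest:
        return elites
    winners = [max(rng.sample(rest, min(k, len(rest))), key=fitness_of)
               for _ in range(tournaments)]
    return elites + winners
```

With a population of 100 and the defaults above, 10 elites and 20 tournament winners are selected; winners may repeat, which keeps the selection pressure moderate.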
Learn new probabilistic grammar (Activity 6). The selected input files are the most promising files of the population and they help in directing the further search toward “complex” and “interesting” inputs. Thus, these files are used to learn a new probabilistic grammar, particularly the probability distributions for all choices of production rules, by parsing them (cf. Activity 2 that learns a probabilistic grammar, however, from a given sample of input files). Thus, the learned probability distributions reflect features of the selected input files, and the corresponding probabilistic grammar can produce more input files that resemble these features. But beforehand, EvoGFuzz mutates the learned grammar.
Mutate grammar (Activity 7). We mutate the learned probabilistic grammar to avoid a genetic drift [wright_evolution_1929] toward specific features of the selected individuals. With such a drift, the grammar would generate only input files with specific features exhibited by the selected individuals from which the grammar has been learned. Thus, it would neglect other potentially promising, yet unexplored features. Moreover, mutating the grammar maintains the diversity of the generated input files, which could further prevent the search from getting stuck in local optima. In contrast to typical evolutionary algorithms, we do not mutate the individuals directly, for two reasons. First, mutating an input file may result in a syntactically invalid file (i.e., a file that does not conform to the given grammar). Second, the stochastic nature of the search is already achieved by using a probabilistic grammar to generate input files.
Therefore, we mutate the learned probabilistic grammar by altering the probabilities of individual production rules. The resulting mutated grammar produces syntactically valid input files whose features are similar to the selected individuals but that may also exhibit other, unexplored features. For instance, a mutation could enable choices of production rules in the grammar that have not been used yet to generate input files because they were so far tagged with a probability of 0, which is now mutated to a value larger than zero. This increases the genetic variation.
For a single mutation of a probabilistic grammar, we choose a random production rule with n choices for expansions from the grammar. For each choice c, we recalculate the probability by selecting a random value r_c from the interval (0, 1] (we exclude 0 to enable all choices by assigning a probability larger than zero) and normalizing with the sum of all n random values, so that the individual probabilities of all choices of a production rule sum up to 1. Thus, the probability p_c for one choice c is calculated as p_c = r_c / (r_1 + r_2 + … + r_n).
Finally, EvoGFuzz allows multiple such mutations of a probabilistic grammar in one iteration of the search, performing each mutation independently of the others.
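A sketch of one such mutation, assuming the probabilistic grammar is represented as a mapping from rules to per-choice probabilities (our encoding, not the authors’):

```python
import random

def mutate_rule_probabilities(probs, rng):
    """Pick one random production rule and redraw its choice probabilities:
    sample r_c uniformly from (0, 1] for each choice (0 is excluded so every
    choice remains reachable) and normalize so the new values sum to 1."""
    rule = rng.choice(sorted(probs))
    raw = {c: 1.0 - rng.random() for c in probs[rule]}  # 1 - [0, 1) lies in (0, 1]
    total = sum(raw.values())
    mutated = dict(probs)
    mutated[rule] = {c: r / total for c, r in raw.items()}
    return mutated
```

Note how a choice that previously had probability 0 can become reachable again after the mutation, which increases genetic variation.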
Generate input files (Activity 8). Using the learned and mutated grammar, EvoGFuzz generates new input files that resemble features of the recently selected input files but still diverge due to the grammar mutation. With the newly generated input files, the next iteration of the evolutionary process begins.
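Generation from the (mutated) probabilistic grammar can be sketched as a weighted random expansion. The toy grammar, the uniform default for rules without learned probabilities, and the depth cutoff that forces the first (terminating) alternative are simplifications of ours.

```python
import random

# Toy grammar: each nonterminal maps to a list of alternative expansions.
GRAMMAR = {
    "<value>": [["<number>"], ["[", "<value>", "]"]],
    "<number>": [["0"], ["1"], ["2"]],
}

def generate(symbol, probs, rng, depth=0, max_depth=10):
    """Expand `symbol` by sampling an alternative according to the learned
    probabilities; rules without learned probabilities default to uniform
    weights, and the depth limit falls back to the first alternative
    (which terminates in this toy grammar)."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol
    choices = GRAMMAR[symbol]
    if depth >= max_depth:
        expansion = choices[0]
    else:
        weights = [probs.get(symbol, {}).get(i, 1.0) for i in range(len(choices))]
        expansion = rng.choices(choices, weights=weights, k=1)[0]
    return "".join(generate(s, probs, rng, depth + 1, max_depth) for s in expansion)

rng = random.Random(3)
nested = generate("<value>", {"<value>": {0: 0.3, 1: 0.7}}, rng)  # biased toward nesting
```

Biasing the probability of the recursive alternative upward yields structurally more “complex” (deeper nested) inputs, which is exactly how the evolved grammar steers generation.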
4 Evaluation
In this section, we evaluate the effectiveness of EvoGFuzz by performing experiments on ten real-world applications. (Data and code artifacts are available at: https://doi.org/10.5281/zenodo.3961374) We compare EvoGFuzz to a baseline, namely the original approach by Pavese et al. [pavese_inputs_2018] (i.e., probabilistic grammar-based fuzzing), and ask the following research questions:
RQ1: Can evolutionary grammar-based fuzzing achieve a higher code coverage than the baseline?
RQ2: Can evolutionary grammar-based fuzzing trigger more exception types than the baseline?
4.1 Evaluation Setup
To answer the above research questions, we conducted an empirical study, in which we analyze the achieved line coverage and the triggered exception types. Line or code coverage [miller_systematic_1963] is a metric counting the unique lines of code of the targeted parser (i.e., the SUT) that have been executed during a test.
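For intuition, the line coverage of a single run can be recorded with a trace hook; this is a minimal Python stand-in for the JVM coverage tooling actually used in the evaluation.

```python
import sys

def covered_lines(fn, *args):
    """Collect the unique line numbers of `fn` executed during one run,
    mimicking how line coverage counts executed lines of the SUT."""
    lines = set()
    code = fn.__code__
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is code:
            lines.add(frame.f_lineno)
        return tracer
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return lines

def branchy(x):
    if x:
        return "then"
    return "else"
```

Two runs that take different branches cover different line sets; the union over all generated inputs gives the coverage reported per subject.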
4.2 Research Protocol
To give both approaches the same starting conditions, we used the same randomly selected input files from Pavese et al. to create the initial probabilistic grammar. The baseline uses this probabilistic grammar to generate “more of the same” inputs, whereas EvoGFuzz uses this grammar to generate the initial population and then executes its evolutionary algorithm. In our evaluation, we observe the performance of both approaches for all subjects over a time frame of 10 minutes, that is, each approach runs for 10 minutes to test one subject.
4.3 Experimental Results
Figure 2 shows the coverage results for the ten subjects. For each subject, we plot a chart comparing EvoGFuzz and the baseline with regard to the achieved line coverage. The vertical axis represents the achieved line coverage in percent, and the horizontal axis represents the time in seconds (up to 600 seconds = 10 minutes). The median runs of both approaches are highlighted, with all individual runs displayed in the background.
A detailed investigation of Figure 2 shows that for almost all JSON parsers both approaches eventually reach a plateau with regard to the achieved line coverage. The baseline reaches this plateau relatively early in the input generation process: there is no further improvement after only approximately 10 seconds. For EvoGFuzz, the point in time when the plateau is reached varies from parser to parser: between 10 seconds (Pojo, Fig. 2) and 450 seconds (json-simple, Fig. 2). In contrast, for Rhino (Fig. 2) neither approach reaches a plateau within 10 minutes, as both are able to continuously increase the line coverage over time.
Table 1 shows the maximum and median line coverage, the standard deviation, and the number of generated input files, along with the increase of the median line coverage of EvoGFuzz compared to the baseline. The improvement of the median line coverage ranges from 1.00% (Pojo) to 47.93% (Rhino). Additionally, the standard deviation (SD) values for the baseline in Table 1 indicate the existence of plateaus because all repetitions for each JSON parser show a very low (and often 0%) SD value.
To support the graphical evaluation, we perform a statistical analysis to increase the confidence in our conclusions. As we consider independent samples and cannot make any assumption about the distribution of the results, we perform a non-parametric Mann-Whitney U test [mann_test_1947, arcuri_hitchhikers_2014] to check whether the achieved median line coverage of the two approaches differs significantly for each subject. This statistical analysis confirms that EvoGFuzz produces a significantly higher line coverage than the baseline for all subjects (cf. last column of Table 1).
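The U statistic underlying this test can be computed from rank sums with the standard library alone; this sketch omits the significance lookup (critical values or a normal approximation) that a full implementation such as scipy.stats.mannwhitneyu provides.

```python
def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic for two independent samples via rank sums;
    tied values receive the average of the ranks they occupy."""
    combined = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # 1-based ranks i+1..j share their average
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    r1 = sum(rank for rank, (_, grp) in zip(ranks, combined) if grp == 0)
    return r1 - len(xs) * (len(xs) + 1) / 2
```

A U value near n1*n2 indicates the first sample stochastically dominates the second, which is the pattern the coverage comparison tests for.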
The #files columns in Table 1 denote the average number of input files generated by each approach when testing one subject for 10 minutes. For all subjects, the baseline is able to generate more files on average than EvoGFuzz. These differences indicate the costs of the evolutionary algorithm in EvoGFuzz, which are, however, outweighed by the improved line coverage that EvoGFuzz achieves.
Since both approaches managed to continuously increase the line coverage for the Rhino parser (Figure 2), we conducted an additional experiment with the time frame set to one hour and again repeated the experiment 30 times. The results can be seen in Figure 4. The chart shows that both approaches managed to further improve their (median) line coverage. EvoGFuzz was able to improve its previously achieved median line coverage of 13.95% to 16.10% with 18,500 generated input files, while the baseline improved from 9.43% to 10.23% with 22,700 generated input files, separating both approaches even further.
Based on our evaluation, we conclude that EvoGFuzz is able to achieve a significantly higher line coverage than the baseline.
| Language | Subject | Exception type | EvoGFuzz | Baseline |
|---|---|---|---|---|
| JSON | ARGO | argo.saj.InvalidSyntaxException | 30 / 30 | 0 / 30 |
| JSON | Genson | java.lang.NullPointerException | 30 / 30 | 30 / 30 |
| JSON | jsonToJava | org.json.JSONException | 30 / 30 | 30 / 30 |
| JSON | jsonToJava | java.lang.IllegalArgumentException | 30 / 30 | 30 / 30 |
| JSON | jsonToJava | java.lang.ArrayIndexOutOfBoundsException | 30 / 30 | 30 / 30 |
| JSON | jsonToJava | java.lang.NumberFormatException | 6 / 30 | 0 / 30 |
| JSON | Pojo | java.lang.StringIndexOutOfBoundsException | 30 / 30 | 30 / 30 |
| JSON | Pojo | java.lang.IllegalArgumentException | 30 / 30 | 30 / 30 |
| JSON | Pojo | java.lang.NumberFormatException | 22 / 30 | 0 / 30 |
| JavaScript | Rhino | java.util.concurrent.TimeoutException | 15 / 30 | 0 / 30 |
| CSS3 | | No exceptions triggered | | |
| | | Total exception types | 11 | 6 |
RQ2 - Exception Types. To answer RQ2, we compare the number of times a unique exception type has been triggered. Table 2 shows the thrown exception types per subject and input language. If neither approach was able to trigger an exception, the subject is not included in the table. For the Gson, JsonJava, simple-json, minimal-json, and cssValidator parsers, no defects or exceptions have been found by either approach. The ratios in the 4th and 5th columns denote the number of experiment repetitions in which EvoGFuzz and the baseline, respectively, were able to trigger the corresponding exception type.
Table 2 and Figure 4 show that during each experiment repetition, EvoGFuzz has been able to detect the same exception types as the baseline. Furthermore, EvoGFuzz was able to find five additional exception types that have not been triggered by the baseline. However, apart from the exception type argo.saj.InvalidSyntaxException, found in the ARGO parser, the other four exception types have not been identified by EvoGFuzz in all repetitions.
Overall, 11 different exception types in five subjects have been found in our evaluation, including just two custom types (org.json.JSONException and argo.saj.InvalidSyntaxException). Out of these 11 exception types, five have not been triggered by the baseline. Figure 4 shows that all six exception types triggered by the baseline were also found by EvoGFuzz.
4.4 Threats to Validity
Internal Validity. The main threats to internal validity of fuzzing evaluations are caused by the random nature of fuzzing [arcuri_hitchhikers_2014, Klees2018EvaluateFuzzing]. A meticulous statistical assessment is required to make sure that observed behaviors do not occur by chance. Therefore, we repeated all experiments 30 times and report the descriptive statistics of our results. To match the evaluation of Pavese et al. [pavese_inputs_2018], we used the same set of subjects and seed inputs. Furthermore, we automated the data collection and statistical evaluation. Finally, we did not tune the parameters of the baseline and EvoGFuzz, to reduce the threat of overfitting to the given grammars and subjects. Only for the fitness function of EvoGFuzz did we determine appropriate parameter values for the three input grammars experimentally.
5 Conclusion and Future Work
This paper presented EvoGFuzz, an evolutionary grammar-based fuzzing approach that combines the technique by Pavese et al. [pavese_inputs_2018] with evolutionary optimization to direct the generation of complex and interesting inputs via a fitness function. EvoGFuzz is able to generate structurally complex input files that trigger exceptions. The introduced mutation of grammars maintains genetic diversity and allows EvoGFuzz to discover features that have not been explored before. Our experimental evaluation shows improved coverage compared to the original approach [pavese_inputs_2018]. Additionally, EvoGFuzz is able to trigger exception types that remain undetected by the original approach. As future work, we want to investigate cases where no precise grammar of the input space is available (cf. [Le2019Saffron]) and to use semantic knowledge of the input language to tune the mutation operators. Finally, we want to compare EvoGFuzz with other state-of-the-art fuzzing techniques.