Evolutionary Grammar-Based Fuzzing

08/03/2020 ∙ by Martin Eberlein, et al. ∙ Humboldt-Universität zu Berlin

A fuzzer provides randomly generated inputs to a targeted software to expose erroneous behavior. To efficiently detect defects, generated inputs should conform to the structure of the input format and thus, grammars can be used to generate syntactically correct inputs. In this context, fuzzing can be guided by probabilities attached to competing rules in the grammar, leading to the idea of probabilistic grammar-based fuzzing. However, the optimal assignment of probabilities to individual grammar rules to effectively expose erroneous behavior for individual systems under test is an open research question. In this paper, we present EvoGFuzz, an evolutionary grammar-based fuzzing approach to optimize the probabilities to generate test inputs that may be more likely to trigger exceptional behavior. The evaluation shows the effectiveness of EvoGFuzz in detecting defects compared to probabilistic grammar-based fuzzing (baseline). Applied to ten real-world applications with common input formats (JSON, JavaScript, or CSS3), the evaluation shows that EvoGFuzz achieved a significantly larger median line coverage for all subjects, by up to 48% compared to the baseline. Moreover, EvoGFuzz managed to expose 11 unique defects, of which five have not been detected by the baseline.

1 Introduction

Software security vulnerabilities can be extremely costly [richardson_csi_2008]. Hunting down those issues has therefore been the subject of intense research [Godefroid2012Sage, Song2008Bitblaze, diffuzz]. Web browsers are a typical example: they combine a wide variety of interconnected components and use different interpreters and languages such as JavaScript, Java, CSS, or JSON. This makes web browsers extremely prone to attacks that exploit the growing set of embedded parsers and interpreters. Hallaraker et al. [hallaraker_detecting_2005] have shown that the JavaScript interpreter in particular, which is used to enhance the client-side display of web pages, is responsible for high-level security issues that allow attackers to steal users’ credentials and lure users into divulging sensitive information. Unfortunately, due to the steady increase in complexity, interpreters are becoming increasingly hard to test and verify.

Fuzzing [Miller1990, Godefroid2020Fuzzing] has shown great success in finding vulnerabilities and erroneous behavior in a variety of different programs and software [noauthor_afl_2018, noauthor_libfuzzer_2018]. A fuzzer generates random input data and enhances or mutates them to trigger potential defects or software vulnerabilities. In general, fuzzing comes in various flavors: blackbox, whitebox, and greybox fuzzing [Godefroid2020Fuzzing]. While blackbox fuzzers have no knowledge about the internals of the application under test and apply random input generation, whitebox fuzzers can unleash the full power of program analysis techniques to use the retrieved context information to generate inputs. Greybox fuzzing strikes a balance between these two cases: it employs a light-weight instrumentation of the program to collect some feedback about the generated inputs, which is used to guide the mutation process. This approach reduces the overhead significantly and makes greybox fuzzing an extremely successful vulnerability detection technique [Bohme2017AFLGo]. Nevertheless, greybox fuzzers still struggle to create semantically and syntactically correct inputs [pham_smart_2019]. This lack of structural input awareness is considered to be their main limitation. Since greybox fuzzers usually apply mutations on the bit-level representation of an input, it is hard to preserve a high-level, syntactically correct structure. Yet, to detect vulnerabilities deep inside programs, complex input files are needed.

Recently, Pavese et al. [pavese_inputs_2018] presented an approach to generate test inputs using a grammar and a set of input seeds. By using the input seeds to obtain a probabilistic grammar, Pavese et al. generate inputs that are similar to the seeds or, by inverting the probabilities of the grammar, inputs that are dissimilar to them. Similar input samples can be very useful, for instance, when learning from failure-inducing samples, while dissimilar inputs can be very useful for testing less common, and therefore less evaluated, parts of a program. We pick up this general idea of generating inputs based on a probabilistic grammar and propose evolutionary grammar-based fuzzing (EvoGFuzz), which combines the technique with an evolutionary optimization approach to detect defects and unwanted behavior in parsers and interpreters. By using a probabilistic grammar, the fuzzer is able to generate syntactically correct inputs. Furthermore, our concept of an evolutionary process is able to generate interesting (i.e., failure-inducing) and complex input files (e.g., nested loops in JavaScript or nested data structures in JSON). Utilizing the probabilistic grammar to generate new populations allows for good guiding properties. By selecting the most promising inputs of a population and by learning and evolving the probabilistic grammar accordingly, essentially favoring specific production rules from the previous population, this process allows the directed creation of inputs towards specific features. Additionally, EvoGFuzz aims to be language and grammar independent to appeal to a broader testing community.

To examine the effectiveness of our approach, we implemented EvoGFuzz as an extension of the tool by Pavese et al. [pavese_inputs_2018] and conducted experiments on several subjects for three common input languages and their parsers: JSON, JavaScript, and CSS3. We compared EvoGFuzz with the original approach and observed that within the same resource budget our technique can significantly increase the program coverage. Moreover, EvoGFuzz has been able to trigger more exception types (EvoGFuzz 11 vs. the original approach 6).

In summary, this paper makes the following contributions:

  • We propose an evolutionary grammar-based fuzzing approach (EvoGFuzz) that combines the concept of probabilistic grammars and evolutionary algorithms to generate test inputs that trigger defects and unwanted behavior.

  • We implement EvoGFuzz as an extension of a probabilistic grammar-based fuzzer [pavese_inputs_2018] and the ANTLR parser generator.

  • We demonstrate the effectiveness of EvoGFuzz on ten real-world examples across multiple programming languages and input formats, and provide a comparison with the original approach [pavese_inputs_2018].

2 Related Work

EvoGFuzz focuses on the generation of test inputs to reveal defects and unwanted behavior. Existing approaches in this area can be divided into search-based, generative, learning-based, and combined techniques [Anand2013SurveyTesting, Orso2014SoftwareTesting].

Search-based input generation. Test input generation can be formulated as a search problem to be solved by meta-heuristic search [Harman2012SBSE, fuzzingbook2019]. A simple way is to randomly generate inputs, as employed in the original work on fuzzing by Miller et al. [Miller1990]. More sophisticated random testing strategies are directed by feedback [Pacheco2007Randoop]. Evolutionary search applies fitness functions to select promising inputs, while the inputs are generated by mutating an initial population. Recent advances in fuzzing show the strength of such search-based techniques [noauthor_afl_2018, Bohme2016AFLFast, Lemieux2018FairFuzz]. One of the most popular greybox fuzzers is AFL [noauthor_afl_2018], which applies a genetic algorithm guided by coverage information. While these techniques can successfully generate error-revealing inputs, they lack the information about a program that is required to generate syntactically and semantically correct inputs [pham_smart_2019, Wang2019Superion].

Generative input generation. Hanford [Hanford1970] introduced grammar-based test generation with his syntax machine. Recent advances in grammar-based fuzzing pick up this idea and use a grammar to generate inputs that are syntactically correct [Godefroid2008GrammarWhiteBoxFuzzing, Holler2012FuzzingCodeFragments]. The main targets of grammar-based fuzzers are parsers and compilers [Yang2011CSmith, Holler2012FuzzingCodeFragments]. Augmenting the production rules of a grammar with probabilities (yielding a probabilistic grammar) makes it possible to generate inputs based on rule prioritization. Pavese et al. [pavese_inputs_2018] employ this notion: they take an input grammar, augment it with probabilities, and generate structured inputs that represent common or very uncommon inputs. In general, generative approaches require the input grammar or language specification, which might not always be available or accurate enough. Therefore, Höschele and Zeller [Hoschele2017AUTOGRAM2] proposed input grammar mining.

Learning-based input generation. In addition to grammar mining, machine learning is increasingly applied for software testing [Godefroid2017LearnFuzz, Cummins2018, Liu2017]. Those techniques learn input structures from seed inputs and use them to generate new testing sequences. They target web browsers [Godefroid2017LearnFuzz], compilers [Cummins2018], and mobile apps [Liu2017].

Combined techniques. Recently, many research efforts have focused on combining grammar-based and coverage-based fuzzing (CGF), with the goal of using the grammar to generate valid inputs and using CGF to further explore the input space. Le et al. [Le2019Saffron] propose a fuzzer that generates inputs for the worst-case analysis. While they leverage a grammar to generate seed inputs for a CGF, they continuously complement and refine the grammar. Atlidakis et al. [Atlidakis2020] propose Pythia to test REST APIs. They learn a statistical model that includes the structure of specific API requests as well as the dependencies between them. While Pythia’s mutation strategies use this statistical model to generate valid inputs, coverage-guided feedback is used to select inputs that cover new behavior. Other fuzzing works aim to incorporate grammar knowledge within their mutation strategies [pham_smart_2019, Wang2019Superion]. Similarly to Pythia, we use seed inputs to generate an initial probabilistic grammar. However, with every iteration we retrieve new probabilities for the grammar while also mutating these probabilities, which enables evolution of the grammar and a broad exploration of the input space.

3 Evolutionary Grammar-Based Fuzzing (EvoGFuzz)

In this section, we will present EvoGFuzz, a language-independent evolutionary grammar-based fuzzing approach to detect defects and unwanted behavior.

Figure 1: Overview of EvoGFuzz.

The key idea of EvoGFuzz is to combine probabilistic grammar-based fuzzing and evolutionary computing. This combination aims for directing the probabilistic generation of test inputs toward “interesting” and “complex” inputs. The motivation is that “interesting” and “complex” inputs more likely reveal defects in a software under test (SUT). For this purpose, we extend an existing probabilistic grammar-based fuzzer [pavese_inputs_2018] with an evolutionary algorithm (Figure 1).

Similarly to the original fuzzer, EvoGFuzz requires a correctly specified grammar for the target language, that is, the input language of the SUT. From this grammar, we create a so-called counting grammar (Activity 1 in Figure 1) that describes the same language but allows us to measure how frequently individual choices of production rules are used when parsing input files of the given language. Thus, the counting grammar allows us to learn a probabilistic grammar from a sampled initial corpus of inputs (Activity 2). Particularly, we learn the probabilities for selecting choices of production rules in the grammar, which correspond to the relative numbers of using these choices when parsing the initial corpus. Consequently, we use the probabilistic grammar to generate more input files that resemble features of the initial corpus, that is, “more of the same” [pavese_inputs_2018] is produced (Activity 3). Whereas this activity is the last step of the approach by Pavese et al. [pavese_inputs_2018], it is the starting point of the evolutionary process in EvoGFuzz as it generates an initial population of test inputs. An individual of the population therefore corresponds to a single input file for the SUT.

The evolutionary algorithm of EvoGFuzz starts a new iteration with analyzing each individual using a fitness function that combines feedback and structural scores (Activity 4). By executing the SUT with an individual as input, the feedback score determines whether the individual triggers an exception. If so, this input is considered as an “interesting” input. The structural score quantifies the “complexity” of the individual. If the stopping criterion is fulfilled (e.g., a time budget has been completely used), the exception-triggering input files are returned. Otherwise, the “interesting” and most “complex” individuals are selected for evolution (Activity 5). The selected individuals are used to learn a new probabilistic grammar, particularly the probability distribution for the production rules (Activity 6). This step is similar to Activity 2, which, however, used a sampled initial corpus of inputs. Thus, the new probabilistic grammar supports generating “more of the same” interesting and complex inputs. To mitigate a genetic drift toward specific features of the selected individuals, we mutate the new probabilistic grammar by altering the probabilities for randomly chosen production rules (Activity 7). Finally, using the mutated probabilistic grammar, we again generate new input files (Activity 8) starting the next evolutionary iteration.

Assuming that inputs similar to “interesting” and “complex” inputs more likely reveal defects in the SUT, EvoGFuzz guides the generation of inputs toward “interesting” and “complex” inputs by iteratively generating, evaluating, and selecting such inputs, and learning (updating) and mutating the probabilistic grammar. In contrast to typical evolutionary algorithms, EvoGFuzz does not directly evolve the individuals (test input files) by crossover or mutation but rather the probabilistic grammar whose probabilities are iteratively learned and mutated. In the following, we will discuss each activity of EvoGFuzz in detail.

3.1 Probabilistic Grammar-Based Fuzzing (Activities 1–3)

Pavese et al. [pavese_inputs_2018] have proposed probabilistic grammar-based fuzzing to generate test inputs that are similar to a set of given inputs. Using a (context-free) grammar results in syntactically correct test inputs being generated. However, the production rules of the grammar are typically chosen randomly to generate inputs, which does not support influencing the features of the generated inputs. To mitigate this situation, Pavese et al. use a probabilistic grammar, in which probabilities are assigned to choices of production rules. The distribution of these probabilities is learned from a sample of inputs. Consequently, test inputs produced by a learned probabilistic grammar are similar to the sampled inputs. Pavese et al. call this idea “more of the same” [pavese_inputs_2018] because the produced inputs share the same features as the sampled inputs.

Technically, a given context-free grammar for the input language of the SUT is transformed to a so-called counting grammar by adding a variable to each choice of all production rules (Activity 1). Such a variable counts how often its associated choice of a production rule is executed when parsing a given sample of inputs. Knowing how often a production rule is executed in total during parsing, the probability distribution of all choices of the rule is determined. Thus, using the counting grammar to parse a sample of input files, the variables of the grammar are filled with values according to the executed production rules and their choices (Activity 2). This results in a probabilistic grammar, in which a probability is assigned to each choice of a production rule. Using this probabilistic grammar, we can generate new input files that resemble features of the sampled input files since both sets of input files have the same probability distribution for the production rules (Activity 3). Thus, EvoGFuzz uses the approach by Pavese et al. to initially learn a probabilistic grammar from input files (Activities 1 and 2), and to generate new input files for the (initial) population (Activity 3), which starts an evolutionary iteration.
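To make Activities 1 to 3 more concrete, the following is a minimal Python sketch and not the authors' ANTLR-based implementation. The toy grammar `GRAMMAR`, the hard-coded `counts` (standing in for the counters a counting grammar would fill while parsing a sample), the uniform fallback for unseen rules, and the `max_depth` guard are all assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical toy grammar: each nonterminal maps to a list of choices,
# each choice is a tuple of symbols. Symbols not in GRAMMAR are terminals.
GRAMMAR = {
    "value": [("array",), ("number",)],
    "array": [("[", "value", "]"), ("[", "]")],
    "number": [("0",), ("1",)],
}


def learn_probabilities(counts):
    """Turn per-choice usage counts into one probability distribution per rule."""
    probabilities = {}
    for rule, choices in GRAMMAR.items():
        total = sum(counts[(rule, i)] for i in range(len(choices)))
        if total == 0:
            # Assumption: a rule never seen in the sample gets a uniform distribution.
            probabilities[rule] = [1 / len(choices)] * len(choices)
        else:
            probabilities[rule] = [counts[(rule, i)] / total for i in range(len(choices))]
    return probabilities


def generate(symbol, probabilities, rng, depth=0, max_depth=20):
    """Expand a symbol by sampling choices according to the learned probabilities."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol
    choices = GRAMMAR[symbol]
    if depth >= max_depth:
        # Assumption: force termination by taking the shortest choice.
        choice = min(choices, key=len)
    else:
        choice = rng.choices(choices, weights=probabilities[symbol], k=1)[0]
    return "".join(generate(s, probabilities, rng, depth + 1, max_depth) for s in choice)


# Stand-in for the counters filled while parsing a sample of input files.
counts = Counter({
    ("value", 0): 4, ("value", 1): 4,
    ("array", 0): 3, ("array", 1): 1,
    ("number", 0): 1, ("number", 1): 3,
})
probabilities = learn_probabilities(counts)
sample = generate("value", probabilities, random.Random(0))
print(probabilities["array"], repr(sample))
```

Because the generator samples every choice with the relative frequency observed in the sample, inputs produced this way tend to share the structural features of that sample.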

3.2 Evolutionary Algorithm (Activities 4–8)

The evolutionary algorithm of EvoGFuzz evolves a population of test input files by iteratively (i) analyzing the fitness of each individual, (ii) selecting the fittest individuals, (iii) learning a new probabilistic grammar based on the selected individuals, (iv) mutating the learned grammar, and (v) generating new individuals by the mutated grammar that form the population for the next iteration.

Analyze individuals (Activity 4). Our goal is to evolve individuals toward “complex” and “interesting” test inputs as such inputs more likely detect defects and unwanted behavior. To achieve this goal, we use a fitness function that quantifies both aspects, the complexity and whether an input is of interest.

Concerning complexity, we focus on the structure of test input files assuming complex structures (e.g., nested loops in JavaScript) have a higher tendency to reveal uncommon behavior in the SUT (e.g., a JavaScript parser) than simple ones. However, we can make only a few assumptions about the complexity of input files since EvoGFuzz is language independent and thus, it has no semantic knowledge about the language of test inputs besides the grammar. Thus, we can only rely on generic features of test input files and grammars to quantify the complexity of an input file. A straightforward and efficient metric to use would be the length of an input file in terms of the number of characters contained by the file. However, a fitness function maximizing file length would favor production rules that produce terminals being longer strings (e.g., “true” or “false” in JavaScript) over rules that produce more expansions through non-terminals to obtain complex structures (e.g., “if” branches or “for” loops). To mitigate this effect, we further use the number of expansions used to derive an input file because using more expansions to generate an input file makes the input file more complex. Thus, we build the ratio of the number of expansions to the length of an input file to favor input files that were produced by more expansions and to punish lengthy input files that contain long strings of characters. Depending on the language of the input files, this ratio can be controlled by the parameter $\Lambda$.

\[ \mathit{ratio}(x) = \frac{\mathit{expansions}(x)^{\Lambda}}{\mathit{length}(x)} \tag{1} \]

Using this ratio, we score the structure of an input file $x$ by multiplying the ratio and the number of expansions to put more weight on the expansions while a good ratio ($> 1$) increases the score.

\[ \mathit{score}_{\mathit{structure}}(x) = \mathit{ratio}(x) \times \mathit{expansions}(x) \tag{2} \]

Benefits of this score are its efficient computation and independence of the input language, although controlling $\Lambda$ allows accommodation of a specific language.

Concerning the “interesting” inputs, we rely on the feedback from executing the SUT with a concrete input $x$. Being interested in revealing defects in the SUT, we observe whether $x$ triggers any exception during execution. If so, such an input is assigned the maximum fitness and favored over all non-exception-triggering inputs. This results in a feedback score for an input file $x$:

\[ \mathit{score}_{\mathit{feedback}}(x) = \begin{cases} \infty & \text{if } x \text{ triggers an exception} \\ 0 & \text{otherwise} \end{cases} \tag{3} \]

Moreover, EvoGFuzz keeps track of all exception-triggering inputs throughout all iterations as it returns these inputs at the end of the evolutionary search.

Finally, we follow the idea by Veggalam et al. [veggalam_ifuzzer_2016] and combine the structural and feedback scores to a single-objective fitness function to be maximized:

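\[ \mathit{fitness}(x) = \mathit{score}_{\mathit{structure}}(x) + \mathit{score}_{\mathit{feedback}}(x) \tag{4} \]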

Using this fitness function, all input files generated by the previous activity (Activity 3) are analyzed by executing them and computing their fitness.
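The fitness computation described above can be sketched in a few lines of Python. This is an illustrative reading, not the authors' code: applying the language parameter (`lam` here) as an exponent on the number of expansions, using infinity as the "maximum fitness", and combining the two scores by addition are assumptions of this sketch, as are all function and parameter names.

```python
import math


def ratio(expansions, length, lam):
    """Expansions relative to file length; lam is the language-specific parameter.

    Assumption: lam is applied as an exponent on the number of expansions.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return expansions ** lam / length


def score_structure(expansions, length, lam):
    """Structural score: the ratio weighted once more by the number of expansions."""
    return ratio(expansions, length, lam) * expansions


def score_feedback(triggered_exception):
    """Feedback score: maximum fitness (here: infinity) for exception-triggering inputs."""
    return math.inf if triggered_exception else 0.0


def fitness(expansions, length, triggered_exception, lam):
    """Single-objective fitness; assumption: the two scores are summed."""
    return score_structure(expansions, length, lam) + score_feedback(triggered_exception)


# An input derived with 10 expansions that is 50 characters long.
print(fitness(10, 50, False, lam=1.0))  # ratio 0.2, so the score stays small
print(fitness(10, 50, False, lam=2.0))  # ratio 2.0, so the score grows
print(fitness(10, 50, True, lam=1.0))   # exception-triggering input dominates
```

With the infinite feedback score, any exception-triggering input outranks every input that does not trigger an exception, regardless of its structure.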

Select individuals (Activity 5). Based on the fitness of the input files, a strategy is needed to select the most promising files among them that will be used for further evolution. To balance selection pressure, EvoGFuzz uses elitism [du_elitism_2018] and tournament selection [miller_genetic_1995]. By elitism, the top $N$% of the input files ranked by fitness are selected. Additionally, the winners of $M$ tournaments of size $Q$ are selected. The participants of each tournament are randomly chosen from the remaining $(100 - N)$% of the input files. In contrast to typical evolutionary algorithms, the selected individuals are not directly evolved by crossover or mutation, but they are used to learn a new probabilistic grammar.
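A minimal sketch of this selection step in Python follows. The function name `select`, its parameters, and the concrete values in the usage example are hypothetical. The sketch also assumes that tournaments are drawn independently, so the same individual may win more than one tournament; the text does not say whether duplicates are removed.

```python
import random


def select(population, fitness, elite_fraction, num_tournaments, tournament_size, rng):
    """Select elites by rank, then add one winner per tournament among the rest."""
    ranked = sorted(population, key=fitness, reverse=True)
    num_elites = int(elite_fraction * len(ranked))
    elites, rest = ranked[:num_elites], ranked[num_elites:]

    winners = []
    if rest:
        for _ in range(num_tournaments):
            # Participants are drawn at random from the non-elite individuals.
            participants = rng.sample(rest, min(tournament_size, len(rest)))
            winners.append(max(participants, key=fitness))
    return elites + winners


# Toy population: the individuals are integers and serve as their own fitness.
population = list(range(100))
selected = select(
    population,
    fitness=lambda individual: individual,
    elite_fraction=0.1,   # arbitrary example value
    num_tournaments=3,    # arbitrary example value
    tournament_size=4,    # arbitrary example value
    rng=random.Random(1),
)
print(selected)
```

Elitism guarantees that the best individuals always survive, while the tournaments give weaker individuals a chance to contribute to the next grammar, which keeps the selection pressure moderate.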

Learn new probabilistic grammar (Activity 6). The selected input files are the most promising files of the population and they help in directing the further search toward “complex” and “interesting” inputs. Thus, these files are used to learn a new probabilistic grammar, particularly the probability distributions for all choices of production rules, by parsing them (cf. Activity 2 that learns a probabilistic grammar, however, from a given sample of input files). Thus, the learned probability distributions reflect features of the selected input files, and the corresponding probabilistic grammar can produce more input files that resemble these features. But beforehand, EvoGFuzz mutates the learned grammar.

Mutate grammar (Activity 7). We mutate the learned probabilistic grammar to avoid a genetic drift [wright_evolution_1929] toward specific features of the selected individuals. With such a drift, the grammar would generate only input files with specific features exhibited by the selected individuals from which the grammar has been learned. Thus, it would neglect other potentially promising, yet unexplored features. Moreover, mutating the grammar maintains the diversity of input files being generated, which further could prevent the search from being stuck in local optima. In contrast to typical evolutionary algorithms, we do not mutate the individuals directly for two reasons. First, mutating an input file may result in a syntactically invalid file (i.e., the file does not conform to the given grammar). Second, a stochastic nature of the search is already achieved by using a probabilistic grammar to generate input files.

Therefore, we mutate the learned probabilistic grammar by altering the probabilities of individual production rules. The resulting mutated grammar produces syntactically valid input files whose features are similar to the selected individuals but that may also exhibit other unexplored features. For instance, a mutation could enable choices of production rules in the grammar that have not been used yet to generate input files because of being tagged so far with a probability of 0 that is now mutated to a value larger than 0. This increases the genetic variation.

For a single mutation of a probabilistic grammar, we choose a random production rule with $n$ choices for expansions from the grammar. For each choice $i$, we recalculate the probability by selecting a random value $p_i$ from $(0, 1]$ and normalizing it with the sum of all selected values to ensure $\sum_{i=1}^{n} p(i) = 1$ (i.e., the individual probabilities of all choices of a production rule sum up to 1). We exclude 0 so that every choice is enabled by being assigned a probability larger than zero. Thus, the probability $p(i)$ for one choice is calculated as follows:

\[ p(i) = \frac{p_i}{\sum_{j=1}^{n} p_j} \tag{5} \]

Finally, EvoGFuzz allows multiple such mutations ($X$ many) of a probabilistic grammar in one iteration of the search by performing each mutation independently of the others.
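The mutation step can be sketched as follows in Python. The representation of a probabilistic grammar as a dictionary from rule names to probability lists, the function names, and the choice of the half-open interval (0, 1] for the random draws are assumptions of this sketch; the text only requires that zero is excluded.

```python
import random


def mutate_rule(num_choices, rng):
    """Draw one random value per choice and normalize so the values sum to 1."""
    # rng.random() lies in [0, 1), so 1.0 - rng.random() lies in (0, 1]:
    # zero is excluded and every choice keeps a probability larger than zero.
    raw = [1.0 - rng.random() for _ in range(num_choices)]
    total = sum(raw)
    return [value / total for value in raw]


def mutate_grammar(probabilities, num_mutations, rng):
    """Return a copy in which randomly chosen rules received fresh probabilities."""
    mutated = {rule: list(probs) for rule, probs in probabilities.items()}
    rules = sorted(mutated)  # fixed order keeps the random choice reproducible
    for _ in range(num_mutations):
        rule = rng.choice(rules)
        mutated[rule] = mutate_rule(len(mutated[rule]), rng)
    return mutated


# Hypothetical learned grammar: two choices of "value" were never used so far.
original = {"value": [1.0, 0.0, 0.0], "array": [0.5, 0.5]}
mutated = mutate_grammar(original, num_mutations=1, rng=random.Random(7))
print(mutated)
```

Because each mutation is applied independently, the same rule may be picked more than once when several mutations are performed in one iteration; in that case the last draw determines its probabilities.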

Generate input files (Activity 8). Using the learned and mutated grammar, EvoGFuzz generates new input files that resemble features of the recently selected input files but still diverge due to the grammar mutation. With the newly generated input files, the next iteration of the evolutionary process begins.

4 Evaluation

In this section, we evaluate the effectiveness of EvoGFuzz by performing experimentation on ten real-world applications (data and code artifacts are available at https://doi.org/10.5281/zenodo.3961374). We compare EvoGFuzz to a baseline being the original approach by Pavese et al. [pavese_inputs_2018] (i.e., probabilistic grammar-based fuzzing), and ask the following research questions:

  • RQ1: Can evolutionary grammar-based fuzzing achieve a higher code coverage than the baseline?

  • RQ2: Can evolutionary grammar-based fuzzing trigger more exception types than the baseline?

4.1 Evaluation Setup

To answer the above research questions, we conducted an empirical study, in which we analyze the achieved line coverage and the triggered exception types. Line or code coverage [miller_systematic_1963] is a metric counting the unique lines of code of the targeted parser (i.e., the SUT) that have been executed during a test.

To examine the effectiveness of EvoGFuzz, we evaluate our approach on the same test subjects that Pavese et al. originally covered with their probabilistic grammar-based fuzzing approach. These test subjects require three input formats of varying complexity, namely JSON, JavaScript, and CSS3. ARGO, Genson, Gson, JSONJava, JsonToJava, MinimalJson, Pojo, and json-simple serve as the JSON parsers, whereas Rhino and cssValidator serve as the JavaScript and CSS parser, respectively. All parsers are widely used in browsers and web applications. A further description of all subjects along with their grammars can be found in the work of Pavese et al. [pavese_inputs_2018]. All experiments have been performed on a virtual machine with Ubuntu 20.04 LTS featuring a Quad-Core 3GHz Intel(R) CPU with 16 GB RAM.

4.2 Research Protocol

Giving both approaches the same starting conditions, we considered the same randomly selected input files from Pavese et al. to create the initial probabilistic grammar. The baseline uses this probabilistic grammar to generate “more of the same” inputs, whereas EvoGFuzz uses this grammar to generate the initial population followed by executing its evolutionary algorithm. In our evaluation, we observe the performance of both approaches for all subjects over a time frame of 10 minutes, that is, each approach runs for 10 minutes to test one subject.

For EvoGFuzz, a population consists of 100 individuals () and one mutation of the grammar () is performed in each iteration of the search. The elitism rate is set to , and for each generation ten tournaments of size ten were held ( and ). In the fitness function, is set to for JSON and for JavaScript and CSS. Since the goal is to find exceptions, we configured the baseline to perform iterations of generating and executing 100 “more of the same” input files for 10 minutes. After 10 minutes the baseline and EvoGFuzz return all found exceptions and the exception-triggering test inputs. For each test subject and approach, we repeated these experiments 30 times.

4.3 Experimental Results

Figure 2 shows the coverage results for the ten subjects. For each subject, we plot a chart comparing EvoGFuzz and the baseline with regard to the achieved line coverage. The vertical axis represents the achieved line coverage in percent, and the horizontal axis represents the time in seconds (up to 600 seconds = 10 minutes). The median runs for both approaches are highlighted, with all individual runs being displayed in the background.

RQ1 - Line coverage. To answer RQ1, we compare the line coverage achieved by both approaches. In particular, we investigate whether EvoGFuzz achieves at least the same percentage of line coverage as the baseline. Figure 2 shows the results for the JSON parsers as well as for the JavaScript and CSS3 parsers.

Figure 2: Median and Raw Line Coverage Results for the Ten Subjects.

The results show that EvoGFuzz improves the coverage for all subjects and is able to increase the median line coverage for JSON by up to 18.43% (json-simple, Fig. 2), for JavaScript by up to 47.93% (Rhino, Fig. 2), and for CSS3 by up to 8.45% (cssValidator, Fig. 2). These numbers are also listed in the column “Median increase” of Table 1.

A detailed investigation of Figure 2 shows that for almost all JSON parsers both approaches eventually reach a plateau with regard to the achieved line coverage. The baseline reaches this plateau relatively early in the input generation process: there is no further improvement after only approximately 10 seconds. For EvoGFuzz, the point in time when reaching the plateau varies from parser to parser: between 10 seconds (Pojo, Fig. 2) and 450 seconds (json-simple, Fig. 2). In contrast, for Rhino (Fig. 2) neither approach reaches a plateau within 10 minutes, as both continuously increase the line coverage over time.

Table 1 shows the accumulated coverage results for each subject and approach over all 30 repetitions. For both approaches, Table 1 shows the maximum and median line coverage, the standard deviation, and the number of generated input files, along with the increase of the median line coverage of EvoGFuzz compared to the baseline. The improvement of the median line coverage ranges from 1.00% (Pojo) to 47.93% (Rhino). Additionally, the standard deviation (SD) values for the baseline in Table 1 indicate the existence of plateaus because all repetitions for each JSON parser show a very low (and often 0%) SD value.

                       EvoGFuzz                        Baseline                        Median
Subject       LOC      max     median  SD     #files   max     median  SD     #files   increase  p-value
ARGO          8,265    49.78%  48.48%  0.60%  11,900   43.19%  43.19%  0%     13,900   12.25%    0.000062
Genson        18,780   19.65%  19.09%  0.19%  8,100    16.17%  16.17%  0%     9,000    18.06%    0.000063
Gson          25,172   26.92%  26.67%  0.15%  9,800    24.08%  24.08%  0%     11,200   10.76%    0.000064
JSONJava      3,742    21.09%  18.47%  0.59%  12,700   16.72%  16.72%  0%     15,000   10.47%    0.000064
JsonToJava    5,131    18.58%  17.90%  0.39%  11,400   17.58%  17.45%  0.09%  13,400   2.58%     0.020699
minimalJSON   6,350    51.06%  50.83%  0.26%  14,000   46.38%  46.38%  0%     16,600   9.59%     0.000055
Pojo          18,492   32.33%  32.17%  0.07%  10,600   31.88%  31.88%  0.02%  12,100   1.00%     0.000061
json-simple   2,432    34.44%  33.80%  0.33%  14,200   28.54%  28.54%  0%     16,700   18.43%    0.000059
Rhino         100,234  15.42%  13.95%  0.43%  3,200    10.20%  9.43%   0.28%  3,800    47.93%    0.000183
cssValidator  120,838  7.53%   7.06%   0.21%  1,000    6.62%   6.51%   0.06%  2,500    8.45%     0.000183
Table 1: Coverage results for each subject and approach over 30 repetitions.
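The "Median increase" column appears to be the relative increase of the EvoGFuzz median over the baseline median. The short Python check below recomputes it for three rows of Table 1. The formula is inferred from the table values rather than stated in the text, and the function name `median_increase` is hypothetical.

```python
def median_increase(evo_median, baseline_median):
    """Relative increase of the EvoGFuzz median over the baseline median, in percent."""
    return (evo_median / baseline_median - 1.0) * 100.0


# (EvoGFuzz median, baseline median, listed increase), all in percent, from Table 1.
rows = {
    "ARGO": (48.48, 43.19, 12.25),
    "json-simple": (33.80, 28.54, 18.43),
    "Rhino": (13.95, 9.43, 47.93),
}

for subject, (evo, base, listed) in rows.items():
    computed = median_increase(evo, base)
    print(f"{subject}: computed {computed:.2f}%, listed {listed:.2f}%")
```

The increase is therefore relative, not a difference in percentage points: Rhino gains about 4.5 percentage points of line coverage, which corresponds to a relative increase of roughly 48%. Not every row reproduces exactly from the rounded medians (Pojo comes out near 0.91% instead of the listed 1.00%), so the formula should be read as an approximation of how the column was derived.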

To support the graphical evaluation, we perform a statistical analysis to increase the confidence in our conclusions. As we consider independent samples and cannot make any assumption about the distribution of the results, we perform a non-parametric Mann-Whitney U test [mann_test_1947, arcuri_hitchhikers_2014] to check whether the achieved median line coverage of both approaches differs significantly for each subject. This statistical analysis confirms that EvoGFuzz produces a significantly higher line coverage than the baseline for all subjects (cf. last column of Table 1).
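For readers unfamiliar with the test, the following self-contained Python sketch computes the Mann-Whitney U statistic and a two-sided p-value using the normal approximation with tie correction. It is a simplified illustration, not the procedure or tooling used in the evaluation, and the coverage values in the example are made up for demonstration; they are not the measured data.

```python
import math
from collections import Counter


def mann_whitney_u(xs, ys):
    """Return (U for xs, two-sided p-value) using the normal approximation."""
    n, m = len(xs), len(ys)
    # U counts the pairs in which a value of xs exceeds a value of ys; ties count half.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

    total = n + m
    # Tie correction: groups of equal values reduce the variance of U.
    tie_term = sum(t ** 3 - t for t in Counter(list(xs) + list(ys)).values())
    variance = n * m / 12.0 * ((total + 1) - tie_term / (total * (total - 1)))
    if variance <= 0:
        return u, 1.0  # all values identical: no evidence of a difference

    z = (u - n * m / 2.0) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_value


# Made-up line-coverage percentages of five runs per approach (illustration only).
evo_runs = [48.1, 48.5, 48.9, 49.2, 49.7]
baseline_runs = [43.0, 43.1, 43.2, 43.3, 43.4]
u, p = mann_whitney_u(evo_runs, baseline_runs)
print(u, p)
```

The test compares ranks instead of raw values, which is why it requires no assumption about the distribution of the coverage results. For small samples, an exact p-value would be preferable to the normal approximation used here.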

The #files columns in Table 1 denote the average number of input files generated by one approach when testing one subject for 10 minutes. For all subjects, the baseline generates more files on average than EvoGFuzz. This difference reflects the overhead of the evolutionary algorithm in EvoGFuzz, which is ultimately outweighed by the improved line coverage that EvoGFuzz achieves.

Since both approaches managed to continuously increase the line coverage for the Rhino parser (Figure 2), we conducted an additional experiment with the time frame set to one hour and again repeated the experiment 30 times. The results can be seen in Figure 3. The chart shows that both approaches managed to further improve their (median) line coverage. EvoGFuzz was able to improve its previously achieved median line coverage of 13.95% to 16.10% with 18,500 generated input files, while the baseline improved from 9.43% to 10.23% with 22,700 generated input files, separating both approaches even further.

Figure 3: Line coverage of Rhino.
Figure 4: Unique exceptions triggered by EvoGFuzz (11) and the baseline (6).

Based on our evaluation, we conclude that EvoGFuzz is able to achieve a significantly higher line coverage than the baseline.

Input language Subject Exception type EvoGFuzz Baseline
JSON ARGO argo.saj.InvalidSyntaxException 30 / 30 0 / 30
Genson java.lang.NullPointerException 30 / 30 30 / 30
jsonToJava org.json.JSONException 30 / 30 30 / 30
jsonToJava java.lang.IllegalArgumentException 30 / 30 30 / 30
jsonToJava java.lang.ArrayIndexOutOfBoundsException 30 / 30 30 / 30
jsonToJava java.lang.NumberFormatException 6 / 30 0 / 30
Pojo java.lang.StringIndexOutOfBoundsException 30 / 30 30 / 30
Pojo java.lang.IllegalArgumentException 30 / 30 30 / 30
Pojo java.lang.NumberFormatException 22 / 30 0 / 30
JavaScript Rhino java.lang.IllegalStateException 26 / 30 0 / 30
Rhino java.util.concurrent.TimeoutException 15 / 30 0 / 30
CSS3 No exceptions triggered
Total exception types 11 6
Table 2: Exception types that have been triggered by both approaches.

RQ2 - Exception Types. To answer RQ2, we compare the number of times a unique exception type has been triggered. Table 2 shows the thrown exception types per subject and input language. If neither approach was able to trigger an exception, the subject is not included in the table. For the Gson, JsonJava, json-simple, minimalJSON, and cssValidator parsers, neither approach found any defects or exceptions. The ratios in the 4th and 5th columns give the number of experiment repetitions in which EvoGFuzz and the baseline, respectively, were able to trigger the corresponding exception type.

Table 2 and Figure 4 show that during each experiment repetition, EvoGFuzz has been able to detect the same exception types as the baseline. Furthermore, EvoGFuzz was able to find five additional exception types that have not been triggered by the baseline. However, apart from the exception type argo.saj.InvalidSyntaxException, found in the ARGO parser, the other four exception types have not been identified by EvoGFuzz in all repetitions.

Overall, 11 different exception types in five subjects have been found in our evaluation, including just two custom types (org.json.JSONException and argo.saj.InvalidSyntaxException). Out of these 11 exception types, five have not been triggered by the baseline. Figure 4 shows that all six exception types triggered by the baseline were also found by EvoGFuzz.
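The counts reported above can be rechecked directly from Table 2 with a few set operations. The sketch below transcribes the table into a dictionary. The variable names are ours, and an exception type is identified by its (subject, exception) pair, as in the table.

```python
REPETITIONS = 30

# (subject, exception type) -> (repetitions with EvoGFuzz, repetitions with baseline)
table2 = {
    ("ARGO", "argo.saj.InvalidSyntaxException"): (30, 0),
    ("Genson", "java.lang.NullPointerException"): (30, 30),
    ("jsonToJava", "org.json.JSONException"): (30, 30),
    ("jsonToJava", "java.lang.IllegalArgumentException"): (30, 30),
    ("jsonToJava", "java.lang.ArrayIndexOutOfBoundsException"): (30, 30),
    ("jsonToJava", "java.lang.NumberFormatException"): (6, 0),
    ("Pojo", "java.lang.StringIndexOutOfBoundsException"): (30, 30),
    ("Pojo", "java.lang.IllegalArgumentException"): (30, 30),
    ("Pojo", "java.lang.NumberFormatException"): (22, 0),
    ("Rhino", "java.lang.IllegalStateException"): (26, 0),
    ("Rhino", "java.util.concurrent.TimeoutException"): (15, 0),
}

# Exception types triggered at least once by each approach.
evo_types = {key for key, (evo, _) in table2.items() if evo > 0}
base_types = {key for key, (_, base) in table2.items() if base > 0}

# Types found only by EvoGFuzz, and those among them found in every repetition.
only_evo = evo_types - base_types
always_found = {key for key in only_evo if table2[key][0] == REPETITIONS}

subjects = {subject for subject, _ in table2}

print(len(evo_types), len(base_types), len(only_evo), len(subjects))
print(base_types <= evo_types)  # is every baseline type also found by EvoGFuzz?
print(sorted(always_found))
```

The subset check `base_types <= evo_types` corresponds to the statement that every exception type triggered by the baseline was also found by EvoGFuzz, and `always_found` isolates the single EvoGFuzz-only type that appeared in all 30 repetitions.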

4.4 Threats to Validity

Internal Validity. The main threats to internal validity of fuzzing evaluations are caused by the random nature of fuzzing [arcuri_hitchhikers_2014, Klees2018EvaluateFuzzing]. It requires a meticulous statistical assessment to make sure that observed behaviors are not randomly occurring. Therefore, we repeated all experiments 30 times and reported the descriptive statistics of our results. To match the evaluation of Pavese et al. [pavese_inputs_2018], we used the same set of subjects and seed inputs. Furthermore, we automated the data collection and statistical evaluation. Finally, we did not tune the parameters of the baseline and EvoGFuzz to reduce the threat of overfitting to the given grammars and subjects. Only for the fitness function of EvoGFuzz, we determined appropriate values for the three input grammars by experiments.

External Validity. The main threat to external validity is the generalizability of the experimental results that are based on a limited number of input grammars and systems under test. However, similar to Pavese et al. [pavese_inputs_2018], practically relevant input grammars with different complexities (small-sized grammars like JSON, and rather complex grammars like JavaScript and CSS) and widely used subjects (e.g., ARGO and Rhino) have been selected. As a result, we are confident that our approach will also work on other grammars and subjects.

5 Conclusion and Future Work

This paper presented EvoGFuzz, an evolutionary grammar-based fuzzing approach that combines the technique by Pavese et al. [pavese_inputs_2018] with evolutionary optimization to direct the generation of complex and interesting inputs by a fitness function. EvoGFuzz is able to generate structurally complex input files that trigger exceptions. The introduced mutation of grammars maintains genetic diversity and allows EvoGFuzz to discover features that have previously not been explored. Our experimental evaluation shows improved coverage compared to the original approach [pavese_inputs_2018]. Additionally, EvoGFuzz is able to trigger additional exception types that the original approach did not detect. As future work, we want to investigate cases in which no precise grammar of the input space is available (cf. [Le2019Saffron]) and the use of semantic knowledge of the input language to tune mutation operators. Finally, we want to compare EvoGFuzz with other state-of-the-art fuzzing techniques.

References