Active Learning of Input Grammars

08/29/2017 ∙ by Matthias Höschele, et al. ∙ Universität Saarland

Knowing the precise format of a program's input is a necessary prerequisite for systematic testing. Given a program and a small set of sample inputs, we (1) track the data flow of inputs to aggregate input fragments that share the same data flow through program execution into lexical and syntactic entities; (2) assign these entities names that are based on the associated variable and function identifiers; and (3) systematically generalize production rules by means of membership queries. As a result, we need only a minimal set of sample inputs to obtain human-readable context-free grammars that reflect valid input structure. In our evaluation on inputs like URLs, spreadsheets, or configuration files, our AUTOGRAM prototype obtains input grammars that are both accurate and very readable - and that can be directly fed into test generators for comprehensive automated testing.


I Introduction

Systematic testing of any program requires knowledge of what makes a valid input for the program; formally, the language accepted by the program. To this end, computer science has introduced formal languages, including regular expressions and context-free grammars, which have long been part of the standard computer science education canon. The problem of automatically inferring a language from a set of samples is well known in computational linguistics as well as for compression algorithms. Inferring the input language for a given program, however, has only recently attracted the attention of researchers. The AUTOGRAM tool [11] observes the dynamic data flow of an input through the program to derive a matching input grammar. The GLADE tool [1] uses membership queries to refine a grammar from a set of inputs. The Learn&Fuzz approach by Godefroid et al. [8] uses machine learning to infer structural properties, also from a set of inputs. All these approaches are motivated by using the inferred grammars for automatic test generation: Given an inferred grammar, one can easily derive a producer that “fuzzes” the program with millions of grammar-conforming inputs.

One weakness of all these approaches is that they rely on a set of available input samples; and the variety and quality of these input samples determines the features of the resulting language model. If all input samples only contain positive integers, for instance, the resulting language model will never encompass negative integers, or floating-point numbers. This is a general problem of all approaches for language induction and invariant inference.

In this paper, we present an approach that combines an existing grammar learning technique, AUTOGRAM, with active membership queries to systematically generalize learned grammars. Our AUTOGRAM prototype starts with a program and a minimum of input samples, barely covering the essential features of the input language. Then, AUTOGRAM will systematically produce possible generalizations to check whether they would be accepted as well; and if so, extend the input grammar accordingly. The result is a grammar that is both general and accurate, and which can be directly fed into test generators for comprehensive automated testing.

Let us illustrate these abstract concepts with an example. Given the input in Figure 1, AUTOGRAM will first derive the initial concrete grammar in Figure 2. To do so, it follows the approach pioneered by the original AUTOGRAM tool [11] (Figure 3), namely following the flow of input fragments into individual functions and variables of the program, and adopting their identifiers to derive the names of nonterminals. The ’http’ characters, for instance, are stored in a variable named protocol; this results in the PROTOCOL grammar rule.

http://user:pass@www.google.com:80/command?foo=bar&lorem=ipsum#fragment
Fig. 1: Sample URL input

SPEC ::= STRING ’?’ QUERY ’#’ REF
STRING ::= PROTOCOL ’://’ AUTHORITY PATH
AUTHORITY ::= USERINFO ’@’ HOST ’:’ PORT
PROTOCOL ::= ’http’
USERINFO ::= ’user:pass’
HOST ::= ’www.google.com’
PORT ::= ’80’
PATH ::= ’/command’
QUERY ::= ’foo=bar&lorem=ipsum’
REF ::= ’fragment’
Fig. 2: Initial concrete grammar derived by AUTOGRAM from java.net.URL processing the input from Figure 1. Using membership queries, AUTOGRAM generalizes this into the grammar in Figure 4.
Fig. 3: How AUTOGRAM works. Given a program and a sample input (1), AUTOGRAM tracks the flow of input characters through the program to derive a concrete grammar (2). In each grammar rule, AUTOGRAM then determines whether parts are optional, can be repeated, or can be generalized to predefined items by querying the program whether inputs produced from the generalization step are still valid (3). The resulting grammar (4) generalizes the original grammar and can quickly produce millions of tests for the program under test—and all other programs with the same input language.

For each rule in the grammar, AUTOGRAM now attempts to generalize it. To this end, AUTOGRAM applies three rules:

Detecting optional fragments.

For each fragment of a rule, AUTOGRAM determines whether it is optional—that is, whether removing it still results in a valid input. This is decided by a membership query with the program whose grammar is to be learned.

In our example (Figure 2), AUTOGRAM can determine that the fragments PATH, USERINFO ’@’, ’:’ PORT, ’?’ QUERY, and ’#’ REF are all optional, because http://www.google.com is still a valid URL accepted by the java.net.URL parser. The PROTOCOL and HOST parts are not optional, though, because neither www.google.com (no protocol) nor http:// (no host) is a valid URL.

Detecting repetitions.

If AUTOGRAM detects that some item is repeated (i.e. it occurs at least twice in a rule), it also attempts to generalize the repetition by synthesizing alternate numbers of occurrences, again extending the grammar accordingly if the input is valid. If an item is found to be repeatable zero to five times, AUTOGRAM assumes an unbounded number of repetitions. In Figure 3, we show how AUTOGRAM generalizes possible repetitions given a JSON sample.

Generalizing grammar items.

AUTOGRAM maintains a user-configurable list of commonly occurring grammar items, encompassing identifiers, numbers, and strings. For each string in a rule that matches a predefined grammar item, AUTOGRAM determines the most general item that still produces valid inputs, but where the next most general item is invalid.

Using this rule, AUTOGRAM can determine that PORT generalizes to any natural number (as a sequence of digits), as all of these would be accepted; however, generalizations such as negative numbers or floating point numbers would be rejected. HOST generalizes to letters, numbers, and dots, but no other characters. PATH generalizes to an arbitrary sequence of printable characters, as do QUERY and REF. The PROTOCOL fragment, however, cannot be generalized to anything beyond ’http’.
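The three generalization rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AUTOGRAM implementation: the function names are hypothetical, and the membership query is answered by a rough regular expression standing in for java.net.URL rather than by the real parser. The probe strings for PORT are the ones shown in Figure 5.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical oracle: a rough regex standing in for the membership query
# that would be sent to java.net.URL. It is NOT the real URL parser.
_TOY_URL = re.compile(
    r"http://([^@/?#]+@)?[A-Za-z0-9.-]+(:\d+)?(/[^?#]*)?(\?[^#]*)?(#.*)?"
)


def accepts(candidate: str) -> bool:
    """Membership query: does the (toy) program accept this input?"""
    return _TOY_URL.fullmatch(candidate) is not None


def optional_fragments(fragments: List[str]) -> List[bool]:
    """Rule 1: a fragment is optional if the input stays valid without it."""
    return [accepts("".join(fragments[:i] + fragments[i + 1:]))
            for i in range(len(fragments))]


def repeatable(prefix: str, item: str, suffix: str, max_count: int = 5) -> bool:
    """Rule 2: if 0..max_count occurrences are accepted, assume any number is."""
    return all(accepts(prefix + item * n + suffix) for n in range(max_count + 1))


def most_general_item(template: str, ladder: List[Tuple[str, str]]) -> Optional[str]:
    """Rule 3: climb from specific to general items until a probe is rejected."""
    best = None
    for name, probe in ladder:
        if not accepts(template.format(probe)):
            break
        best = name
    return best


fragments = ["http", "://", "user:pass@", "www.google.com", ":80",
             "/command", "?foo=bar&lorem=ipsum", "#fragment"]
optional = optional_fragments(fragments)
port_item = most_general_item(
    "http://user:pass@www.google.com:{}/command",
    [("POSITIVEINTEGER", "10"), ("DIGITS", "01"), ("ALPHANUMS", "Az1")],
)
path_repeats = repeatable("http://www.google.com", "/a", "")
```

Each rule costs only a handful of queries per grammar fragment, which is why a program that merely accepts or rejects inputs suffices as an oracle.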

As a result, AUTOGRAM produces the generalized grammar shown in Figure 4. This grammar identifies all optional parts (shown in square brackets), and generalizes all strings to the most general matching items (shown as underlined). The resulting grammar effectively describes valid ’http’ URLs; adding one input sample each for additional protocols (’ftp’, ’mailto’, ’file’, …) easily extends the set towards these protocols, too.

Being able to infer an accurate input grammar from a minimum of samples can be a great help in understanding programs and their input formats. First and foremost, though, our approach opens the door for widespread, varied, and fully automatic robustness testing (“fuzzing”) of arbitrary programs that process serial input. In our case, given the inferred URL grammar as a producer, we can easily generate millions of valid and varied URL inputs, which can then be fed into any (now uninstrumented) URL parser; other input and file formats would automatically be derived and tested in a similar fashion. In one sentence, given only a bare minimum of samples, AUTOGRAM produces a grammar that easily allows the creation of millions and millions of valid inputs.

SPEC ::= STRING [[’?’ [QUERY]] [’#’ [REF]]]
STRING ::= PROTOCOL ’://’ AUTHORITY [PATH]
AUTHORITY ::= [USERINFO ’@’] HOST [’:’ PORT]
PROTOCOL ::= ’http’
USERINFO ::= ’user:pass’
HOST ::= HOSTNAME
PORT ::= DIGITS
PATH ::= ABSOLUTEPATH
QUERY ::= ALPHANUMWITHSPECIALS
REF ::= ALPHANUMWITHSPECIALS
Fig. 4: AUTOGRAM grammar generalizing over Figure 2. Optional parts are enclosed in […]. Predefined nonterminals (Figure 8) are underlined.
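
To illustrate how such a grammar can serve as a producer, the following minimal sketch expands the grammar of Figure 4 at random. It is an illustration only and not the test generator used with AUTOGRAM: the helper nonterminals prefixed with OPT_ are hypothetical encodings of the bracketed optional parts, and the predefined items (HOSTNAME, DIGITS, ABSOLUTEPATH, ALPHANUMWITHSPECIALS) are replaced by a few fixed sample strings instead of their actual definitions.

```python
import random
from typing import Dict, List

# Figure 4, with each optional part encoded as alternatives that include
# the empty expansion. Predefined items are approximated by sample strings.
GRAMMAR: Dict[str, List[List[str]]] = {
    "SPEC": [["STRING", "OPT_QUERY", "OPT_REF"]],
    "OPT_QUERY": [[], ["?"], ["?", "QUERY"]],
    "OPT_REF": [[], ["#"], ["#", "REF"]],
    "STRING": [["PROTOCOL", "://", "AUTHORITY", "OPT_PATH"]],
    "OPT_PATH": [[], ["PATH"]],
    "AUTHORITY": [["OPT_USER", "HOST", "OPT_PORT"]],
    "OPT_USER": [[], ["USERINFO", "@"]],
    "OPT_PORT": [[], [":", "PORT"]],
    "PROTOCOL": [["http"]],
    "USERINFO": [["user:pass"]],
    "HOST": [["www.google.com"], ["sub.domain-0.top"]],
    "PORT": [["80"], ["10"], ["01"]],
    "PATH": [["/command"], ["/some/path-to/file0123.ext"]],
    "QUERY": [["foo=bar"], ["Az1"]],
    "REF": [["fragment"], ["Az1"]],
}


def produce(symbol: str, rng: random.Random) -> str:
    """Expand a symbol by picking one alternative at random, recursively."""
    if symbol not in GRAMMAR:
        return symbol  # anything without a rule is a terminal string
    return "".join(produce(s, rng) for s in rng.choice(GRAMMAR[symbol]))


rng = random.Random(0)  # fixed seed so that runs are repeatable
samples = [produce("SPEC", rng) for _ in range(200)]
```

Every string in samples is derived from the grammar alone, so the program under test no longer needs to be instrumented once the grammar has been inferred.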

In the remainder of this paper, we detail the following contributions:

  • After discussing the state of the art in inferring input grammars (Section II), Section III contributes a formalization of how AUTOGRAM determines grammars from inputs, extending the informal description in the new idea paper of Höschele and Zeller [11].

  • In Section IV, we describe the generalization steps specific to AUTOGRAM, showing how AUTOGRAM can derive general grammars from a minimum of sample inputs. No other technique can infer general input models from single samples alone.

  • Section V evaluates AUTOGRAM in terms of completeness and soundness of the inferred grammars; we show that the grammars produced are both complete and sound.

The paper closes with conclusions and future work in Section VI. The AUTOGRAM prototype and all experimental data are available for research purposes upon request.

0 http://user:pass@www.google.com:80/command?foo=bar… initial input from Figure 1
1 http://user:pass@www.google.com:10/command?foo=bar… generalize PORT to POSITIVEINTEGER
2 http://user:pass@www.google.com:01/command?foo=bar… generalize PORT to DIGITS
4 http://user:pass@www.google.com:Az1/command?foo=bar… can’t generalize PORT to ALPHANUMS
5 http://user:pass@sub.domain-0.top:80/command?foo=bar… generalize HOST to HOSTNAME
6 http://user:pass@Az1;:-+=!?*()/#$%&@:80/command?foo=bar… can’t generalize to ALPHANUMWITHSPECIALS
7 http://…:80/command?…bar#Az1;:-+=!?*()/#$%& generalize REF to ALPHANUMWITHSPECIALS
8 http://…:80/command?…bar#Az1;:-+=!?*()/#$%& ntr can’t generalize REF to PRINTABLES
9 http://Az1;:-+=!?*()/#$%&@@www.google.com:80/command?… can’t generalize USERINFO to PRINTABLES
10 Az://user:pass@www.google.com:80/command?foo=bar… can’t generalize PROTOCOL to ALPHAS
11 http://…:80/command?Az1;:-+=!?*()/#$%&@#fragment generalize QUERY to ALPHANUMWITHSPECIALS
12 http://…:80/command?Az1;:-+=!?*()/#$%&@ ntr#fragment can’t generalize QUERY to PRINTABLES
13 http://…:80/some/path-to/file0123.ext?foo=bar… generalize PATH to ABSOLUTEPATH
14 http://…:80some/path-to/file0123.ext?foo=bar… can’t generalize PATH to PATH
Fig. 5: Refining the grammar through membership queries. For each rule, AUTOGRAM synthesizes input fragments (underlined) that may further generalize the rule. By querying the program for whether a synthesized input is valid or not, AUTOGRAM can systematically generalize the concrete grammar from Figure 2 to the abstract grammar in Figure 4.

II Background

AUTOGRAM contributes to three fields: language induction, test generation, and specification mining.

Language Induction.

AUTOGRAM addresses the problem of language induction, that is, finding an abstraction that best describes a set of inputs. Traditionally, language induction was motivated from natural language processing, learning from (typically tagged) sets of concrete inputs; the recent book by Heinz et al. [9] very well represents the state of the art in this field.

Only recently have researchers turned towards learning the languages of program inputs. The first approach to infer context-free grammars from programs is AUTOGRAM by Höschele and Zeller [11]; given a program and a set of sample inputs, AUTOGRAM dynamically tracks the data flow of inputs through the program and aggregates inputs that share the same path into syntactical entities, resulting in well-structured, very readable grammars. The approach presented here follows this original AUTOGRAM technique to infer grammars.

The GLADE tool by Bastani et al. [1] starts from a set of samples and then uses membership queries (that is, querying the program under test whether synthesized inputs are part of the language or not) to derive a context-free grammar that is useful for later fuzzing. The Learn&Fuzz approach by Godefroid et al. [8] uses machine learning to infer context-free structures from program inputs for better fuzzing, and is shown to be applicable to formats as complex as PDF. Compared to AUTOGRAM, both in its original form [11] and as extended here, neither GLADE nor Learn&Fuzz needs or makes use of program structure. This results in a simpler application, but also possibly less structured and less readable grammars; for the purpose of fuzzing, however, these deficits need not matter.

Whether they focus on natural language or program input, though, all of these approaches rely on the presence of a large set of sample inputs to induce the language from; consequently, features and quality of the resulting grammars depend on the variability of the input samples. AUTOGRAM is unique in requiring only a minimal set of samples; instead, it leverages active learning to systematically generalize the well-structured grammar that a single sample input already induces.

Test Generation.

Techniques for test generation also have seen a rise in popularity in the last decade. For small programs, and at the unit level, a wide range of techniques focuses on establishing a wide variance between generated runs, reaching branch coverage through symbolic constraint solving [2, 12] or search-based approaches [5]. For the system level of larger programs, the length of paths typically prohibits pure constraint-solving and search-based approaches. To scale, test generators thus either need a model of how the input is structured [6, 10], or again sample inputs to mutate. Given these, tools can again use search-based [13] or symbolic approaches [7] to achieve coverage.

Specification Mining.

The grammars inferred by AUTOGRAM can also be interpreted as specifications, notably as system-level preconditions. By learning from executions, AUTOGRAM is similar in spirit to the DAIKON approach by Ernst et al. [4], which infers pre- and postconditions at the function level. In contrast to DAIKON, though, AUTOGRAM needs only a minimum of sample inputs, as it generalizes these automatically. In the absence of function-level preconditions, such active learning is only possible at the system level, as the program under test decides whether an input is valid or not. Generally speaking, tools like AUTOGRAM can dramatically improve dynamic specification mining (including grammar mining), as they provide a large variety of inputs (and thus executions) for a large class of programs.

III Grammar Inference

We start with a formal description of how AUTOGRAM infers grammars from sample inputs. In short, AUTOGRAM tracks the path of each input character as well as values derived thereof throughout a program execution, annotating variables and functions with the character intervals that flow into them (Section III-A). These intervals are then arranged to form hierarchical interval trees (Section III-B) which reflect the subsumption hierarchy in the grammar. After resolving possible overlaps (Section III-C), AUTOGRAM clusters those elements that are processed in the same way by a program (Section III-D) to finally derive grammars (Section III-E).

III-A Tainting

The first step in AUTOGRAM is to use dynamic tainting to track the dataflow of input characters.

III-A1 Machine Model

Definition 1.

A program is a mapping of fully-qualified method names to sequences of program statements.

Definition 2.

A program state is a tuple , where is a list of tuples of fully-qualified method names and a program counter . In further discussion, we will write , also we will use to refer to the top-most tuple in , for the second tuple from the top and so on. is a function which maps program variables to values.

The JVM is a stack machine, which means that, for example, the instruction iadd is defined to pop two integers off the stack (the value stack, not the function stack), add them, and push the result onto the stack. For our presentation, we will assume that the JVM is a register machine, so our version of iadd, written as %p = iadd %a %b, adds the values in registers %a and %b and stores the result in %p. This does not hurt the correctness of our model, because JVM byte code can be translated to code for a register machine (e.g. [3]), but it does make the formalization much more readable. Also, the formalization can be applied to other instruction sets (e.g. LLVM-IR) without significant changes.

Thereby the set of program variables contains heap locations, class variables and object variables, as well as local variables and registers. Registers correspond to elements on the stack in executions on the JVM.

If a program is executed in a program state , the input state, this leads to a state , the output state. We will denote program execution as . A statement consists of an Opcode, as defined in the Java Byte Code Specification and parameters, which are part of the program itself.

is defined in terms of a helper function , which executes a single program statement. Again, means that execution of the program statement in program state leads to a program state . With this helper, is defined as

This definition of is not total, as the sequence of applications of may never end, i.e. the program may not terminate.

In further discussion, we will use to refer to the sequence of program statements that is used in a program execution. We write for statements and iff occurs before in an execution. Consequently, statements from different executions can not be compared.

Also, we will use for the set of variables such that in .

We are not going to report the definition of for all opcodes in the Java Byte Code, as most of them are straightforward. Table I provides some examples.

For branching instructions, we change the value of , such that the next application of executes the instruction after the true- or false-side of the branch respectively. Method invocations push a new to the list , such that the lookup in the next step returns the newly invoked method. Method return statements pop the topmost value from the list .

III-A2 Taint Propagation

During execution, the program reads from files. We refer to an input byte sequence, the content of an input file, as . If there is more than one input file, each file gets an id , we refer to the input from this file as . In our implementation, we use the file names of the input files rather than the id.

Definition 3.

For each , the tuple is a taint tag . A function which maps variables to taint tags is a taint tag mapping.

Taint tag members compare by the byte index. That is, iff (they are from the same input source) and (the smaller tag is to the left of the larger one).

In the following taint tags usually appear as sets . The distinction between individual tags and sets of tags is merely a technical one, so we will refer to those sets as taint tags as well.

A taint tag is consecutive with respect to iff for all , all such that it is .
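
One possible reading of this definition, sketched below in Python, treats a taint tag as a set of byte indices into one input and calls it consecutive when there are no holes between its smallest and its largest index. The function name and the set-of-integers representation are assumptions made for illustration; they are not taken from the AUTOGRAM implementation.

```python
from typing import Set


def is_consecutive(tags: Set[int]) -> bool:
    """True if the indices form one gap-free run.

    The empty set is treated as consecutive, since there is no pair of
    indices with a missing index in between (an assumption of this sketch).
    """
    if not tags:
        return True
    return max(tags) - min(tags) + 1 == len(tags)


port_tags = {29, 30}           # e.g. two adjacent input characters
scattered_tags = {0, 1, 2, 7}  # characters from two separate places
```

Here is_consecutive(port_tags) holds, while is_consecutive(scattered_tags) does not, because indices 3 to 6 are missing.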

For our grammar mining, there is just one relevant input file. We will then assume that all taint tags refer to this input file and skip mentioning the id of this file.

TABLE I: Definitions of the most common JVM instructions.

For tainting, program states are extended with a taint tag mapping, and the program execution function is extended to also update the taint tag mapping. Table I gives the semantics for the most common JVM instructions. In all cases, is defined to reflect the updates to .
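
As a hedged illustration of such an extended execution step, the following sketch handles only the register form of iadd introduced above. It assumes the common tainting rule that a result carries the union of the taint tags of its operands; this rule, like the function and variable names, is an assumption of the sketch and is not quoted from Table I.

```python
from typing import Dict, FrozenSet

Values = Dict[str, int]
Taints = Dict[str, FrozenSet[int]]


def step_iadd(values: Values, taints: Taints, p: str, a: str, b: str) -> None:
    """Execute `%p = iadd %a %b` and update the taint tag mapping in place."""
    values[p] = values[a] + values[b]
    # Assumed propagation rule: the result depends on both operands, so it
    # inherits the input indices of both. Untainted registers contribute nothing.
    taints[p] = taints.get(a, frozenset()) | taints.get(b, frozenset())


values: Values = {"%a": 8, "%b": 0, "%c": 1}
taints: Taints = {"%a": frozenset({29}), "%b": frozenset({30})}
step_iadd(values, taints, "%p", "%a", "%b")  # both operands tainted
step_iadd(values, taints, "%q", "%p", "%c")  # %c is an untainted constant
```

After these two steps, %p and %q both carry the tags of input characters 29 and 30, although %q was computed partly from a constant.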

For the sake of readability, we will use for the taint tags of all variables in a set . For a sequence of statements that were executed in a program run, We will use , and .

III-B Interval Trees

Fig. 6: Interval tree for the URL example in Figure 1

After obtaining the taints for all variables during an execution, the next step in AUTOGRAM is to create a structural representation of the provided sample inputs. To this end, we create tree representations that are approximations of the parse trees which we will later use to infer a grammar. An example of such an interval tree is shown in Figure 6, representing the decomposition of the URL in Figure 1 by java.net.URL. We now show how to obtain such an interval tree from the program execution.

Let be the sequence of statements that was executed in a program run. Let be the subsequence of such that is the first statement executed in a method  and is the return of the same invocation of .

Due to the semantics of Java, any two method invocations are either nested, with the statements of one entirely contained in those of the other, or disjoint. That is because if a method calls another method, then the callee has to return before the caller can return.

We now associate each method with the characters it processes. First, we extend the definition of taint tag mappings from variables to statements and methods:

Definition 4.

Let be the set of variables during the execution of a statement , then .

Definition 5.

Let be the set of statements executed in method , then .

Next, we assign each method a consecutive interval  between the first and the last character processed in this method:

Definition 6.

Let be the part of the input that has been processed within the method invocation . Then, the method input interval of  is

Lemma 1.

In Definition 6 is consecutive .

Proof.

If is consecutive, then holds by construction. If , then is consecutive because is consecutive by construction. ∎

Definition 7.

Let a block be a parameter or return of a method, a load or store of a field or array, or a sequence of method invocations such that , is the caller of and there is no method invocation with such that is the caller of or the callee of . We extend the definition of to blocks such that . We only consider blocks such that is consecutive.

Definition 8.

Two blocks and are similar if they are both the same parameter, return, a load or store of a field or array in and off the same method. If both blocks are sequences of method calls and they are similar if for all , and are calls of the same method.

As method calls in a program run form a tree, those intervals can be arranged in a so-called interval tree.

Definition 9.

For a program execution , the interval tree of consists of nodes all intervals of all blocks . If method  was called by , the interval is a child of in the interval tree or . We associate each node with the set of blocks that contains all blocks such that .

In the interval tree in Figure 6, the topmost node has a block containing the constructor URL(). This constructor called the method String.substring() with "http://user:pass@www.google.com:80/command" and also called URL.isValidProtocol() with "http".

Definition 10.

For an interval tree , let be all intervals that were derived from calls to a method .

Lemma 2.

If an interval is a child of (that is, was called by ), then holds.

Proof.

Let . Thus, there is a program statement such that and . was called by , thereby , but then and thereby . ∎

Input characters can be processed at multiple program locations, so the converse does not hold.

For building the interval trees that are the input of our grammar learning heuristics we only consider method invocations and other program events like field accesses such that

is consecutive. Let be the set of all occurring consecutive intervals. We can build a tree by creating nodes for each with children such that and .
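
The construction by containment can be sketched as follows. This is a minimal illustration under assumptions: intervals are plain (first index, last index) pairs with inclusive bounds, blocks are ignored, and each interval is attached to the most recently seen interval that fully contains it. The function name is hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (first index, last index)


def build_interval_tree(
    intervals: List[Interval],
) -> Tuple[List[Interval], Dict[Interval, List[Interval]]]:
    """Nest intervals by containment; returns (roots, children per interval)."""
    # Sort by start, widest first, so a parent is always seen before its children.
    ordered = sorted(set(intervals), key=lambda iv: (iv[0], -iv[1]))
    children: Dict[Interval, List[Interval]] = {iv: [] for iv in ordered}
    roots: List[Interval] = []
    stack: List[Interval] = []
    for iv in ordered:
        # Discard candidates that do not fully contain the current interval.
        while stack and not (stack[-1][0] <= iv[0] and iv[1] <= stack[-1][1]):
            stack.pop()
        (children[stack[-1]] if stack else roots).append(iv)
        stack.append(iv)
    return roots, children


roots, children = build_interval_tree([(0, 9), (0, 3), (4, 9), (4, 6)])
```

Siblings produced this way may still overlap, for instance when a parser reads one character of lookahead; the sketch deliberately leaves such overlap in place.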

III-C Resolving Overlap

In recursive descent parsers, we observe quite often that intervals which are siblings in the interval tree overlap. This is caused by lookahead. As recursive descent parsers usually read input from left to right, this is the most common type of overlap. This leads to a definition of overlap-free interval trees.

Definition 11.

An interval tree is overlap-free iff for all , either or .

In an overlap-free interval tree, an interval overlaps only with its children, and children are always contained in their parents entirely.
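
Read this way, overlap-freeness amounts to every pair of intervals being either nested or disjoint. The following check is a small sketch of that reading; the inclusive (first index, last index) representation and the function name are assumptions for illustration.

```python
from itertools import combinations
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # inclusive (first index, last index)


def is_overlap_free(intervals: Iterable[Interval]) -> bool:
    """True if any two intervals are nested or do not share a character."""
    for (s1, e1), (s2, e2) in combinations(list(intervals), 2):
        nested = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
        disjoint = e1 < s2 or e2 < s1
        if not (nested or disjoint):
            return False
    return True


clean = [(0, 9), (0, 3), (4, 9), (4, 6)]
with_lookahead = [(0, 9), (0, 4), (4, 9)]  # siblings share character 4
```

The second list fails the check because the two siblings both claim character 4, which is the typical footprint of a one-character lookahead.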

There is a simple algorithm to derive an overlap-free interval tree from an interval tree. We exploit the fact that recursive descent parsers read their input from left to right, so any overlap occurs between children of the same parent, and the overlap corresponds to lookahead. Thereby, we can resolve overlap in an interval tree as follows:

First of all, we derive an order on intervals. An interval is smaller than an interval if or and .

For any node with children and such that , , and , we derive a replacement interval . We recursively remove all children with if , or replace them with nodes with if .

In cases where there exist blocks in such that they occur after all blocks in in the pre-order of the call tree, we instead derive a replacement node with . We recursively remove all children with if , or replace them with nodes with if . The intuition behind this is to deal with parser implementations that remember the last character and therefore show patterns that could be interpreted as a kind of lookback, analogous to lookahead. This allows us to identify input fragments that are later used in the program and avoid splitting them up during the overlap resolving stage.

As another observation, if is valid, needs to read the entire input. That is because in an language, a word with a valid prefix can always be invalidated by an invalid suffix. So can only accept after it read all of the input. Thereby, for all valid inputs the root of the interval tree is the entire input.

III-C1 Fixing Alignment of Last Leaf

Due to the way interval trees are constructed, the last input fragment is likely to be misaligned, especially if it is a single character. For inner fragments, the resolving of overlap uses the call tree to remove ambiguities and determine where a node should belong. We also use the dynamic call graph to check whether we might propagate the leaf corresponding to the last input fragment to be a child of a node closer to the root. For an ancestor with and , we check if there is a block of method calls in such that a block in contains the caller of . In this case we can recursively remove from all children of and propagate it to be a direct child of .

III-D Clustering Interval Nodes

After the construction of a set of overlap-free interval trees we are now at the stage to identify syntactic elements. The intuition for this stage is that syntactic elements of the same type or more precisely derivations of the same non-terminal symbols are processed in the same way by a program. This means the corresponding characters will be processed by the same functions and will be stored in the same fields and variables. We will therefore apply a simple clustering to the set of all nodes of all trees , that groups together nodes with similar labels.

Definition 12.

A cluster is defined as a pair of a set of interval nodes and a representative node . All clusters form a partition of such that . Let be the cluster such that .

We implemented this as a greedy algorithm that starts with an initially empty set of clusters. Our heuristic sequentially processes all nodes such that it tries to find a cluster such that is similar to . If a cluster could be found, is added to such that . Otherwise we add a new cluster to . The relative similarity of two nodes and is computed by checking for how many of the first blocks according to pre-order in the dynamic call tree in for which we can find similar blocks in . We call this number and compute the same with the roles of and reversed as . Let be the number of blocks in a node .

Definition 13.

The relative similarity is computed as

(1)

In our experiments we applied this heuristic with which showed to be a reasonable number for our subjects. For nodes and a relative similarity value is considered to be sufficient in our experiments in order to add them to the same cluster and therefore being called similar. In future work it might be necessary to define alternative heuristics for similarity, especially if the technique might be applied to parsing code that frequently uses backtracking for which a fixed limited number of blocks might not be sufficient to determine the similarity of nodes since large amounts of blocks might correspond to invocations that have been discarded.
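
The greedy clustering loop itself can be sketched independently of the similarity measure. In the sketch below, the similarity test is passed in as a function; the stand-in used in the example (equal first two block labels) is purely hypothetical and is not Equation (1). Nodes are simplified to tuples of method names.

```python
from typing import Callable, List, Tuple

Node = Tuple[str, ...]              # simplified: the block labels of a node
Cluster = Tuple[Node, List[Node]]   # (representative, members)


def greedy_cluster(nodes: List[Node],
                   similar: Callable[[Node, Node], bool]) -> List[Cluster]:
    """Add each node to the first cluster whose representative is similar."""
    clusters: List[Cluster] = []
    for node in nodes:
        for representative, members in clusters:
            if similar(node, representative):
                members.append(node)
                break
        else:
            # No existing cluster fits: the node founds a new one.
            clusters.append((node, [node]))
    return clusters


def same_first_two(a: Node, b: Node) -> bool:
    """Hypothetical stand-in for the relative similarity threshold test."""
    return a[:2] == b[:2]


nodes = [
    ("readValue", "readArray"),
    ("readValue", "readNumber"),
    ("readValue", "readArray", "skipWhiteSpace"),
]
clusters = greedy_cluster(nodes, same_first_two)
```

Because the loop is greedy, the result can depend on the order in which nodes are processed; the first node of each cluster becomes its representative.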

III-E Deriving Grammars

After clustering the nodes of all interval trees as described in Section III-D the clusters can be used by our heuristics to derive a context-free grammar with non-terminal symbols and the corresponding production rules.

III-E1 Complex and Single Character Clusters

The first observation after the clustering stage is that input fragments consisting of single characters usually end up in clusters with other single characters. They are also usually not similar to nodes that are supposed to belong to the same grammatical categories. The reason for this is that due to lookahead a parser will treat these single character fragments differently than the complex ones. In addition to parsing them as a numeric character, parsers will frequently access them to determine what parsing function should be called next or for common tasks like trying to skip whitespace. This results in additional blocks in the corresponding nodes that have no similar blocks in nodes corresponding to longer fragments.

Therefore we identify a set of complex clusters and a set of single-character clusters that only contain nodes with such that . The next steps will only use clusters in to identify non-terminal symbols and derive productions and the knowledge from clusters in is integrated into the derived grammar in a post processing step.

III-E2 Identifying Single Non-Terminal Substitution

Input languages frequently represent entities from the input domain of a program in a way that closely corresponds to the way these entities are represented in the code. Such entities are usually modeled as structs or classes by programmers, and the relationship between those entities will be modeled as type relationships. These types therefore are closely related to non-terminal symbols in a formal grammar for the input language. When we look at JSON as an example, we can see such a correspondence for the symbol VALUE:

VALUE ::= OBJECT | ARRAY | STRING | TRUE | FALSE | NULL | NUMBER

The symbol VALUE can be replaced by any of 7 other non-terminal symbols, which correspond to subclasses of an abstract class JsonValue. Since our JSON library implements recursive descent, these values are read by the method readValue(), which decides, depending on the next character, which specialized parsing method (for example, readArray()) it needs to call. This also means that the first block in an interval tree node for such an input fragment will be a sequence of method calls such that is an invocation of readValue() and is an invocation of one of the specialized parsing methods. According to the heuristics described in Section III-D, all nodes corresponding to input fragments representing arrays are put in the same cluster, but they are not similar to nodes for objects or numbers. The fact that arrays and the other entities are all values is not explicitly visible in our structural decomposition at this point in the inference process.

We address this by searching for common prefixes in the first blocks of nodes. Let and be clusters, and are the first blocks in and . If and have a common prefix of size we can split all nodes in and . For a node in or let be the first block in with . We split that consist of sequences of calls to the same methods and their postfixes , . Using these blocks we derive new nodes and such that and with being the only child of . We replace in the interval tree and cluster with and transfer all children of to . We also create a new cluster that will contain all nodes .

Using this transformation, we can make the relationship between input fragments explicitly visible in the form of intermediate nodes in the interval trees and a common cluster.
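The core of the prefix search can be illustrated with a minimal sketch. It assumes that a first block is represented as a plain list of method names; the helper names `common_prefix` and `split_block` and the two example blocks are hypothetical and not taken from AUTOGRAM.

```python
from typing import List, Tuple


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest common prefix of two call sequences."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


def split_block(block: List[str], prefix_len: int) -> Tuple[List[str], List[str]]:
    """Split a first block into the shared prefix and the remaining postfix."""
    return block[:prefix_len], block[prefix_len:]


# Hypothetical first blocks of a node for an array and a node for an object.
array_block = ["readValue", "readArray"]
object_block = ["readValue", "readObject"]

prefix = common_prefix(array_block, object_block)
array_pre, array_post = split_block(array_block, len(prefix))
object_pre, object_post = split_block(object_block, len(prefix))
```

In this toy setting, the shared prefix `readValue` would give rise to the intermediate node for VALUE, while the postfixes keep arrays and objects apart.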

III-E3 Create Non-Terminal Symbols and Productions

At this stage, we derive the non-terminal symbols from the complex clusters. For each cluster, we define a corresponding non-terminal symbol. We can derive a simple set of productions by identifying all observed substitution sequences from the children of each node in the cluster. If we order the children of a node by the position of the corresponding input fragments, the sequence of the symbols of these children is a possible substitution for the symbol of the node. If a node is a leaf, the character sequence corresponding to its interval is a possible substitution for its symbol. Child nodes that correspond to single characters are represented by a corresponding terminal symbol rather than by a non-terminal symbol. Applying these definitions provides us with a preliminary set of production rules that precisely capture the structure of the observed sample inputs and do not include any generalization. The only generalization up to this point comes from the assumptions in Section III-D.
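As a rough illustration of this step, the following sketch derives productions from a tiny interval tree. The `Node` class, the symbol names, and the sample text are assumptions made for the example; the actual data structures in AUTOGRAM may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Node:
    symbol: str  # non-terminal symbol of the node's cluster
    start: int   # start of the input interval (inclusive)
    end: int     # end of the input interval (exclusive)
    children: List["Node"] = field(default_factory=list)


def derive_productions(root: Node, text: str) -> Dict[str, Set[Tuple[str, ...]]]:
    """Collect every observed substitution sequence per non-terminal."""
    productions: Dict[str, Set[Tuple[str, ...]]] = {}

    def visit(node: Node) -> None:
        if not node.children:
            # Leaf: the covered character sequence is the substitution.
            rhs: Tuple[str, ...] = (repr(text[node.start:node.end]),)
        else:
            parts = []
            for child in sorted(node.children, key=lambda c: c.start):
                if child.end - child.start == 1:
                    # Single characters become terminal symbols.
                    parts.append(repr(text[child.start:child.end]))
                else:
                    parts.append(child.symbol)
                    visit(child)
            rhs = tuple(parts)
        productions.setdefault(node.symbol, set()).add(rhs)

    visit(root)
    return productions


sample = "k1=v1"
root = Node("PAIR", 0, 5, [Node("KEY", 0, 2), Node("EQ", 2, 3), Node("VALUE", 3, 5)])
productions = derive_productions(root, sample)
```

The resulting rules describe exactly the observed sample and nothing more, which matches the remark that no generalization happens at this stage.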

III-E4 Post Processing

Merge Symbols

The clustering in Section III-D and the processing in Section III-E2 might still result in similar syntactic elements being part of different clusters, which are therefore represented by different non-terminal symbols. We address this with a tunable heuristic that merges compatible symbols. For two clusters, we compare the first blocks of their nodes. We merge the clusters if these first blocks match exactly, or if they are similar enough according to a relaxed heuristic that computes a similarity score allowing for partial prefix matches of blocks. The new cluster is assigned a new symbol that is substituted for all occurrences of the symbols of the two merged clusters.

Process Single Characters

Up to this stage, we did not account for the possibility that single-character fragments can also be instances of non-terminal symbols. To address this, we apply a heuristic that computes a similarity score between each cluster and each node representing a single character. The score allows for partial prefix matches of blocks and is unidirectional: it only tries to find blocks similar to those of one side among the blocks of the other. If the character occurs in a fragment represented by a node in a cluster, and the score for one cluster is greater than for all other clusters and also exceeds a configurable threshold, we replace the terminal symbol in the corresponding production with the non-terminal symbol of that cluster.

III-F Naming Nonterminals

In order to make it easier for users to read the grammars learned by AUTOGRAM, we try to propose meaningful names for non-terminal symbols. For each symbol and its cluster, we collect the names from the elements of the blocks of the cluster's nodes, e.g., names of methods that processed the corresponding fragments, or parameters and fields in which a value derived from the fragment was stored. For each cluster, we therefore get a multiset of strings that we use to propose a name. We implemented a simple heuristic that first filters the strings by removing common prefixes or suffixes like get, set, and parse, and then identifies the most frequently occurring substring that is as long as possible.
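One possible reading of this naming heuristic is sketched below. The function names, the minimum substring length, the tie-breaking order, and the restriction to prefixes are assumptions of this sketch; the paper does not specify these details.

```python
from collections import Counter
from typing import Iterable, Optional

# Prefixes named in the text; real configurations may contain more entries.
STRIP_PREFIXES = ("get", "set", "parse")


def strip_prefix(name: str) -> str:
    """Lower-case a name and drop one leading get/set/parse prefix."""
    lowered = name.lower()
    for prefix in STRIP_PREFIXES:
        if lowered.startswith(prefix) and len(lowered) > len(prefix):
            return lowered[len(prefix):]
    return lowered


def propose_name(names: Iterable[str], min_len: int = 3) -> Optional[str]:
    """Pick the substring that occurs in most names, preferring longer ones."""
    counts: Counter = Counter()
    for name in (strip_prefix(n) for n in names):
        # Count each distinct substring once per name of the multiset.
        substrings = {
            name[i:j]
            for i in range(len(name))
            for j in range(i + min_len, len(name) + 1)
        }
        counts.update(substrings)
    if not counts:
        return None
    best = max(counts, key=lambda s: (counts[s], len(s), s))
    return best.upper()


port_name = propose_name(["getPort", "setPort", "port", "parsePort"])
host_name = propose_name(["getHost", "hostName", "parseHostname"])
```

Ranking by frequency first and length second is one way to interpret "most frequently occurring substring that is as long as possible"; other orderings are conceivable.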

IV Generalizing Grammars

Since we aim to learn grammars from very small sample sets, down to a single sample input, our initially derived grammar will be a very close fit to these samples. In the following, we describe several heuristics for generalizing these grammars.

IV-A Optional Elements

The first generalization is the identification of optional elements. For all sequences of symbols on the right-hand side of productions, we try to identify optional subsequences of decreasing length. For each initial sample that contributed to the derivation of such a sequence and each subsequence under consideration, we derive new inputs by omitting the fragments corresponding to the subsequence. We execute the program with these inputs and check whether they are accepted by the program and whether the data flow of the fragments following the omitted part has remained unchanged. In that case, we consider the subsequence optional and modify the grammar accordingly. In the case of our URL example, we start with the concrete grammar in Figure 2 and, starting at SPEC, derive new inputs by omitting decreasingly long subsequences. Figure 7 shows the accepted and rejected inputs that lead to the generalizations in Figure 4.

0 http://user:pass@www.google.com:80/command?foo=bar&lorem=ipsum#fragment (initial input from Figure 1)
[Rows 1, 3, 4, 5, 6, 7, and 15: variants of the initial input with individual fragments omitted, each marked as accepted or rejected; the omission and acceptance marks are not recoverable here.]
Fig. 7: Identifying optional elements through membership queries. For each rule, AUTOGRAM synthesizes new inputs that omit the fragments corresponding to a subsequence of elements. By querying the program for whether a synthesized input is valid or not, AUTOGRAM can systematically generalize the concrete grammar from Figure 2 to the abstract grammar in Figure 4.
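The membership queries for optional elements can be sketched as follows. This is a simplified illustration under several assumptions: the input is already decomposed into a flat list of fragments, a regular expression stands in for the program under test, and the data-flow check described in the text is omitted. All names are hypothetical.

```python
import re
from typing import Callable, List, Tuple


def find_optional(fragments: List[str],
                  accepts: Callable[[str], bool]) -> List[Tuple[int, int]]:
    """Return index ranges [start, end) whose omission is still accepted."""
    n = len(fragments)
    optional = []
    # Try decreasingly long subsequences, as described in Section IV-A.
    for length in range(n - 1, 0, -1):
        for start in range(n - length + 1):
            candidate = "".join(fragments[:start] + fragments[start + length:])
            if accepts(candidate):
                optional.append((start, start + length))
    return optional


# Stand-in oracle; the real oracle is the program under test.
URL_PATTERN = re.compile(
    r"http://[a-z.]+(:[0-9]+)?(/[a-z]*)?(\?[a-z=&]+)?(#[a-z]+)?"
)


def accepts_url(text: str) -> bool:
    return URL_PATTERN.fullmatch(text) is not None


fragments = ["http://", "www.google.com", ":80", "/command", "?foo=bar", "#fragment"]
optional = find_optional(fragments, accepts_url)
```

With this toy oracle, the port, path, query, and fragment parts are reported as optional, whereas omitting the protocol or the host leads to rejection.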

IV-B Generalizing Strings

DIGIT ::= /[0-9]/
DIGITS ::= DIGIT+
POSITIVEINTEGER ::= /[1-9]/ [DIGITS]
INTEGER ::= ['-'] POSITIVEINTEGER
ALPHA ::= /[A-Z]/ | /[a-z]/
ALPHAS ::= ALPHA+
ALPHANUM ::= ALPHA | DIGIT
ALPHANUMS ::= ALPHANUM+
WHITESPACE ::= ' ' | '\t'
WHITESPACES ::= WHITESPACE+
WHITESPACENEWLINE ::= ' ' | '\t' | '\n' | '\r'
WHITESPACENEWLINES ::= WHITESPACENEWLINE+
HOSTNAME ::= (ALPHANUM | '.' | '-')+
PATH ::= (ALPHANUM | '.' | '-' | '/')+
ALPHANUMWHITESPACE ::= ALPHA | DIGIT | WHITESPACE
ALPHANUMWHITESPACES ::= ALPHANUMWHITESPACE+
ABSOLUTEPATH ::= '/' [PATH]
ALPHANUMWITHSPECIAL ::= ALPHANUM | ';' | ':' |
     '-' | '+' | '=' | '!' | '*' | '(' |
     ')' | '/' | '$' | '%' | '&' | '@'
ALPHANUMWITHSPECIALS ::= ALPHANUMWITHSPECIAL+
ALPHANUMWITHSPECIALWHITESPACE ::=
    ALPHANUMWHITESPACE
ALPHANUMWITHSPECIALWHITESPACES ::=
    ALPHANUMWITHSPECIALWHITESPACE+
PRINTABLENEWLINE ::= ALPHANUMWITHSPECIALWHITESPACENEWLINE
PRINTABLENEWLINES ::= PRINTABLE+
PRINTABLE ::= ALPHANUMWITHSPECIALWHITESPACE | '?' | '#'
PRINTABLES ::= PRINTABLE+
Fig. 8: Predefined (user-configurable) nonterminals in AUTOGRAM.

Another important step is to generalize terminal strings in grammars. When we learn a grammar from an input like the one in Figure 1, we only observe one specific value for HOST, PORT, and other symbols. We try to generalize the observed values to regular expressions. Figure 8 shows a set of non-terminal symbols that define regular languages. The inclusion relationship between these regular languages can be used to derive a directed graph in which there is a path from one language to another if the first is a subset of the second. For a set of observed strings, our heuristic starts by identifying the smallest regular language that contains all of them. In order to check whether we can safely generalize the strings to this language, we derive new inputs from the initial sample set in which every occurrence of each string is replaced by a representative member of the language. Similar to the heuristic for optionality, we run the program with these inputs and check whether they are accepted and whether the data flow for the unmodified parts of the input remains the same. If this is successful for all derived inputs, we generalize the strings to the language. Our heuristic then traverses the graph step by step towards larger languages until the derived inputs are no longer accepted or show modified data flow.

For our URL example, we generalize individual non-terminal symbols as illustrated in Figure 5. For the concrete port number 80, we find the smallest regular language, POSITIVEINTEGER, that contains this fragment. To confirm that the program will also accept other elements of POSITIVEINTEGER, we replace 80 with 10 and run the program to check whether this derived input is accepted, which in this case is successful. The next possible step is to try to generalize to the next larger regular language, which is DIGITS. We again test by replacing 80 with 01, which is again successful. However, we are not able to generalize further to ALPHANUMS, since replacing 80 with Az1 is rejected by the program.
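The port example can be replayed with a small sketch. It assumes a linear chain of three languages instead of the full inclusion graph, uses the representatives 10, 01, and Az1 from the text, replaces the program under test by a regular expression, and skips the data-flow check. The function and variable names are hypothetical.

```python
import re
from typing import Callable, Optional

# (name, regular expression, representative), ordered from small to large.
CHAIN = [
    ("POSITIVEINTEGER", r"[1-9][0-9]*", "10"),
    ("DIGITS", r"[0-9]+", "01"),
    ("ALPHANUMS", r"[A-Za-z0-9]+", "Az1"),
]


def generalize(sample: str, start: int, end: int,
               accepts: Callable[[str], bool]) -> Optional[str]:
    """Return the largest language in CHAIN the fragment generalizes to."""
    value = sample[start:end]
    # Find the smallest language that contains the observed value.
    first = next(
        (i for i, (_, regex, _) in enumerate(CHAIN) if re.fullmatch(regex, value)),
        None,
    )
    if first is None:
        return None
    best = None
    for name, _, representative in CHAIN[first:]:
        candidate = sample[:start] + representative + sample[end:]
        if not accepts(candidate):
            break  # stop at the first rejected generalization
        best = name
    return best


def accepts_url(text: str) -> bool:
    # Stand-in oracle: only numeric ports are valid.
    return re.fullmatch(r"http://[a-z.]+:[0-9]+/[a-z]*", text) is not None


sample = "http://www.google.com:80/command"
port_start = sample.index(":80") + 1
result = generalize(sample, port_start, port_start + 2, accepts_url)
```

Here `result` ends up as DIGITS: the replacements 10 and 01 are accepted, whereas Az1 is rejected, mirroring the steps described above.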

IV-C Repetitions

Many context-free languages use repeating features like sequences. We implemented a simple heuristic to detect repeating expressions and try to generalize them using the same mechanism that is used for finding optional elements. For each occurrence of a repetition of an expression in the set of sample inputs, we derive new inputs in which the repetition is replaced by a different number of occurrences of the expression. We currently do this for a fixed set of counts. We run the program with these inputs and check whether they are accepted and whether the data flow for the unmodified parts of the input remains the same. If this is successful for all derived inputs, we generalize the repetition to a non-empty sequence of the expression of arbitrary length.

If the repeated expression is a sequence of a non-terminal symbol, we check whether all nodes of occurrences of the first or the last element in the initial sample set correspond to the application of a specific production rule while all others use a different production rule. If this is the case, we specialize the first or last element and the other occurrences to these production rules. This helps to deal with common patterns like repeating lines, where all lines except the last one are terminated by a newline symbol.
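A minimal sketch of the repetition check follows. It uses Python's `json.loads` as a stand-in for the program under test; the tested counts (1, 2, 5) are an arbitrary assumption, since the counts used by AUTOGRAM are not given here, and the data-flow check is again left out.

```python
import json
from typing import Callable, Iterable


def accepts_json(text: str) -> bool:
    """Membership oracle: does the JSON parser accept the input?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def generalizes_to_repetition(prefix: str, item: str, suffix: str,
                              accepts: Callable[[str], bool],
                              counts: Iterable[int] = (1, 2, 5)) -> bool:
    """Check whether inputs with varying numbers of items are all accepted."""
    return all(accepts(prefix + item * count + suffix) for count in counts)


# The repeated expression ",2" inside an array such as [1,2,2].
array_repeats = generalizes_to_repetition("[1", ",2", "]", accepts_json)
# A malformed expression does not generalize.
broken_repeats = generalizes_to_repetition("[1", ",,", "]", accepts_json)
```

If all derived inputs are accepted, the repetition would be rewritten as a non-empty sequence of the expression in the grammar.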

V Evaluation

Let us now demonstrate AUTOGRAM on a set of example subjects. As test subjects, we use the programs listed in Table II; this extends the set from AUTOGRAM [11] with INI4J.

Subject | Data Format | Format Purpose | Soundness | Completeness
java.net.URL | URL | Uniform Resource Locators; used as Web addresses | 100.0% | 50.3%
Apache Commons CSV (https://commons.apache.org/proper/commons-csv/) | CSV | Comma-separated values; used in spreadsheets | 100.0% | 100.0%
java.util.Properties | Java property files | Configuration files for Java programs using key/value pairs | 100.0% | 13.4%
INI4J (http://ini4j.sourceforge.net/) | INI | Configuration files consisting of sections with key/value pairs | 100.0% | 16.1%
Minimal JSON (https://github.com/ralfstx/minimal-json) | JSON | Human-readable text to transmit nested data objects | 100.0% | 100.0%
TABLE II: Test Subjects and Evaluation Results

V-A Soundness and Completeness

We evaluate each grammar by two measures:

Soundness.

For each grammar produced by AUTOGRAM, we use it as a producer to derive 20,000 randomly generated strings. These then serve as inputs for the subject the grammar was derived from. The soundness measure is the percentage of inputs that are accepted as valid. A soundness of 100% indicates that all inputs generated from the grammar are valid.

Completeness.

For each of the languages in Table II, we create a golden grammar based on the language specification. We test whether the respective AUTOGRAM-generated grammar accepts 20,000 random strings generated by the golden grammar. The completeness measure is the percentage of these strings that are accepted. A completeness of 100% indicates that the grammar encompasses all inputs of the golden grammar.

When the grammars are used for test generation (fuzzing), a high soundness translates into a high test efficiency, as only few inputs would be rejected. A high completeness correlates with test effectiveness, since the grammar would cover more features, and thus exercise more functionality.
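The soundness measure can be sketched as follows, assuming a grammar given as a dictionary of alternatives and, as a stand-in for a test subject, Python's `json.loads`. The tiny grammar, the depth limit, and the sample size are illustrative assumptions; completeness would additionally require a parser for the inferred grammar and is not shown.

```python
import json
import random
from typing import Callable, Dict, List

Grammar = Dict[str, List[List[str]]]

# Illustrative grammar; the first alternative of each rule is non-recursive.
GRAMMAR: Grammar = {
    "VALUE": [["NUMBER"], ["true"], ["null"], ["ARRAY"]],
    "ARRAY": [["[", "]"], ["[", "VALUE", "]"], ["[", "VALUE", ",", "VALUE", "]"]],
    "NUMBER": [["0"], ["7"], ["42"]],
}


def produce(grammar: Grammar, symbol: str, rng: random.Random,
            depth: int = 0, max_depth: int = 6) -> str:
    """Randomly expand a symbol; beyond max_depth, take the first alternative."""
    if symbol not in grammar:
        return symbol  # terminal string
    alternatives = grammar[symbol]
    chosen = alternatives[0] if depth >= max_depth else rng.choice(alternatives)
    return "".join(produce(grammar, s, rng, depth + 1, max_depth) for s in chosen)


def accepts_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def soundness(grammar: Grammar, start: str, accepts: Callable[[str], bool],
              n: int = 1000, seed: int = 0) -> float:
    """Percentage of generated inputs that the subject accepts."""
    rng = random.Random(seed)
    accepted = sum(accepts(produce(grammar, start, rng)) for _ in range(n))
    return 100.0 * accepted / n


sound_score = soundness(GRAMMAR, "VALUE", accepts_json)

# Adding an alternative with an unclosed bracket makes the grammar unsound.
UNSOUND: Grammar = dict(GRAMMAR, ARRAY=GRAMMAR["ARRAY"] + [["[", "VALUE", ","]])
unsound_score = soundness(UNSOUND, "VALUE", accepts_json)
```

Every string derivable from `GRAMMAR` is valid JSON, so its score is 100%; the modified grammar occasionally produces an unclosed array and therefore scores lower.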

Table II summarizes our results for soundness and completeness across all subjects. We now discuss these in detail.

V-B URLs

In Figure 4, we show the AUTOGRAM grammar obtained from the java.net.URL class and the one sample in Figure 1. In our evaluation, the inferred URL grammar is 100% sound (all inputs derived from it are valid); however, it is only 50.3% complete. This is due to the rule for USERINFO, whose format user:password cannot be generalized using our predefined items; in our evaluation, this leads to every URL containing a random user/password combination being rejected. This, however, can very easily be fixed by either amending the produced grammar or introducing a predefined item for user:password.

Like all grammars produced by AUTOGRAM, the grammar is very readable; this is due to AUTOGRAM naming nonterminals after associated variables and functions. We have not formally evaluated readability, but we list the grammars as raw outputs so that readers can assess their understandability themselves.

V-C Comma-Separated Values


Firstname;Lastname;Phone
Roger;Smith;34534534
"Anne";"Perkins Watson";"204059"
Leila;Jackson;9569784
Dough;Clinton;1298483

MAIN ::= CSVPARSER NEXTRECORD*
CSVPARSER ::= HASH NEXTTOKEN*
HASH ::= ALPHANUMWHITESPACES
NEXTTOKEN ::= ';' [KEY] |
     ';' [ENCAPSULATEDTOKEN]
ENCAPSULATEDTOKEN ::= '"' KEY '"'
KEY ::= ALPHANUMWHITESPACES
NEXTRECORD ::= NEXTTOKEN_FIRST NEXTTOKEN*
NEXTTOKEN_FIRST ::= '\n' [KEY] |
     '\n' [ENCAPSULATEDTOKEN]
Fig. 9: CSV grammar generalized and derived from a single sample.

The Apache Commons CSV reader processes its input purely lexically, which is also reflected in the resulting grammar (Figure 9). However, AUTOGRAM nicely identifies that values are optional and that strings can contain arbitrary characters, features not present in the original sample. The grammar is both 100% sound and 100% complete.

V-D Java Properties


Version=1
WorkingDir=mydir
# comment
User=Bob
Password=12345

MAIN ::= LOAD
LOAD ::= LINE* LINE_LAST
LINE ::= [S2] '\n'
S2 ::= [S3] ARG ' '* '=' [' '* ARG]
S3 ::= '# comment \n'
ARG ::= ALPHANUMWHITESPACES
LINE_LAST ::= S2
Fig. 10: Java properties grammar generalized and derived from a single sample.

The Java properties parser is also a simple scanner, resulting in nonterminals for which AUTOGRAM cannot find a name (S2 and S3). The key=value structure is well identified, as are the optional values (S2). The grammar is 100% sound; the comment (S3), however, cannot be generalized further by AUTOGRAM, resulting in a low completeness of 13.4% in our evaluation, as 86.6% of golden inputs would sport different comments. Again, this is very easy to amend.

V-E INI Files


[Application]
Version = 1
WorkingDir = mydir
[User]
User = Bob
Password = 12345

LOAD ::= LINE_FIRST LINE* LINE1
LINE_FIRST ::= LINE2_FIRST '\n'
LINE2_FIRST ::= '[' SECTION ']'
SECTION ::= ALPHANUMWHITESPACES
LINE ::= LINE2 '\n'
LINE2 ::= '[' SECTION ']' |
    KEY ' '* '=' [' '*  [VALUE]]
KEY ::= ALPHANUMWHITESPACES
VALUE ::= ALPHANUMWHITESPACES
LINE1 ::= KEY ' '* '=' [' '* [VALUE]]
Fig. 11: INI grammar generalized and derived from a single sample.

From the INI4J parser and one input sample, AUTOGRAM derives the grammar in Figure 11. As with Java properties, the key=value structure is well identified (LINE1 and LINE2), and the grammar is 100% sound. However, it is only 16.1% complete, because the golden grammar also produces underscores for identifiers, which are not found in our sample. Adding a sample identifier with an underscore would easily fix this.

V-F JSON Inputs


{
  "glossary": {
    "title": "example glossary",
        "GlossSeeAlso":
            ["GML", "XML"] ,
        "bool1"  : true,
        "bool2"  : false,
        "number1" : 2349872,
        "number2" : -45242,
        "number3" : 2349.872,
        "number4" : -98.72,
        "empty" : null,
        "number5" : 2372e71,
        "number6" : 123e-31,
        "number7" : 23.72e71,
        "number8" : 12.83e-33
  }
}

MAIN ::= VALUE
VALUE ::= STRING | FALSE | TRUE | OBJECT | ARRAY | NULL | NUMBER
STRING ::= '"' HASH '"'
HASH ::= ALPHANUMWITHSPECIALWHITESPACES
FALSE ::= 'false'
TRUE ::= 'true'
OBJECT ::= '{'
    [ STRINGINTERNAL ':' VALUE
     (',' STRINGINTERNAL ':' VALUE)* ] '}'
STRINGINTERNAL ::= '"' HASH '"'
ARRAY ::= '[' [ VALUE ( ',' VALUE )* ] ']'
NULL ::= 'null'
NUMBER ::= INTEGER [[FRACTION] [EXPONENT]]
EXPONENT ::= 'e' ['-'] POSITIVEINTEGER
FRACTION ::= '.' POSITIVEINTEGER
Fig. 12: JSON grammar generalized and derived from a single sample. Whitespace processing omitted.

The most complex input language we have studied with AUTOGRAM is JSON from the Minimal JSON parser. Our input sample is set up to cover the JSON data types, and these are all reflected in the resulting grammar (Figure 12). Via its membership queries, AUTOGRAM has generalized that objects and arrays can have an arbitrary number of members, and also generalized all number rules. (Minimal JSON has its own parser for numeric values.) This grammar represents all valid JSON inputs; in our experiments, it is 100% sound and 100% complete.

V-G Threats to Validity

All results in the evaluation are subject to threats to validity. In terms of external validity, we do not claim that any of the results or measures shown would or even can generalize to general-purpose programs; rather, our results show the potential of grammar inference even in the absence of large sample sets. In terms of internal validity, it is clear that the completeness of samples easily determines the completeness of the resulting grammars; we have thus set up our samples with a minimum of features.

VI Conclusion

With AUTOGRAM, it is possible to derive a complete input grammar given a program and only a minimum of input samples, tracking the flow of input characters and derived values through the program to derive an initial structure, and using active learning to systematically generalize the inferred grammar. The resulting grammars have a large variety of applications, ranging from reverse engineering of input formats to automatically deriving parsers that decompose inputs into their constituents. The first and foremost application, however, will be test generation, allowing for massive exploration of the input space simply by deriving a producer from the grammar.

Despite its successes, AUTOGRAM is but a milestone on a long path of challenges. Future topics in learning input languages from programs include:

Sample-free input learning.

This challenge is easily posed: Given a program, is it possible to synthesize an input sample that could serve as starting point for AUTOGRAM? This is a question of test generation: Essentially, we are looking for a sample input (or a set thereof) that covers a maximum of different input features, and thus functionality. Unfortunately, the conditions a valid input has to fulfill are so numerous and complex that current approaches to test generation are easily challenged.

Context-sensitive features.

Many input formats use context-sensitive features like prepending sizes to data blocks or checksums in binary formats. Since these are usually checked to verify the integrity of an input, it is possible to observe these events during program execution by adding instrumentation to the corresponding methods and including them in the trace. This could be used to learn input specifications that are more powerful than context-free grammars.

Multi-Layer Grammars.

Complex input formats come at various layers, such as a lexical and a syntactical layer for parsing programming languages first into tokens and then into trees; or a transport layer with checksums around the actual content, which comes in its own format. As each layer will need its own language description, the challenge will be to identify and separate them, and to come up with a coupled model for both.

If we were to put all these into one grand challenge, then it would be as follows: Given a compiler binary and no other information, derive an input model that can be used as a sound and precise language reference both for humans and automated testing tools. This is a hard challenge, and it may still take many years to solve it; but with AUTOGRAM, we feel we may have gotten a bit closer.

References

  • [1] O. Bastani, R. Sharma, A. Aiken, and P. Liang. Synthesizing program input grammars. CoRR, abs/1608.01723, 2016. To appear at PLDI 2017.
  • [2] C. Cadar, D. Dunbar, and D. Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI’08, pages 209–224, Berkeley, CA, USA, 2008. USENIX Association.
  • [3] B. Davis, A. Beatty, K. Casey, D. Gregg, and J. Waldron. The case for virtual register machines. In Proceedings of the 2003 workshop on Interpreters, virtual machines and emulators, pages 41–49. ACM, 2003.
  • [4] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin. Dynamically discovering likely program invariants to support program evolution. IEEE Trans. Softw. Eng., 27(2):99–123, Feb. 2001.
  • [5] G. Fraser and A. Arcuri. Evosuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pages 416–419. ACM, 2011.
  • [6] P. Godefroid, A. Kiezun, and M. Y. Levin. Grammar-based whitebox fuzzing. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’08, pages 206–215, New York, NY, USA, 2008. ACM.
  • [7] P. Godefroid, M. Y. Levin, D. A. Molnar, et al. Automated whitebox fuzz testing. In NDSS, volume 8, pages 151–166, 2008.
  • [8] P. Godefroid, H. Peleg, and R. Singh. Learn&Fuzz: Machine learning for input fuzzing. CoRR, abs/1701.07232, 2017.
  • [9] J. Heinz and J. M. Sempere. Topics in Grammatical Inference. Springer Publishing Company, Incorporated, 1st edition, 2016.
  • [10] C. Holler, K. Herzig, and A. Zeller. Fuzzing with code fragments. In USENIX Security Symposium, pages 445–458, 2012.
  • [11] M. Höschele and A. Zeller. Mining input grammars from dynamic taints. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, pages 720–725, New York, NY, USA, 2016. ACM.
  • [12] N. Tillmann and J. De Halleux. Pex: White box test generation for .NET. In Proceedings of the 2nd International Conference on Tests and Proofs, TAP’08, pages 134–153, Berlin, Heidelberg, 2008. Springer-Verlag.
  • [13] M. Zalewski. American fuzzy lop. http://lcamtuf.coredump.cx/afl/.