Learning a high-level language description from a set of examples in that language is a long-studied and difficult problem. While early interest in this problem was motivated by the desire to automatically learn human languages from examples, more recently the problem has been of interest in the context of learning program input languages. Learning a language of program inputs has several relevant applications, including generation of randomized test inputs [GopinathF1ArXiV2019, AschermannNautilusNDSS2019, Wang19], as well as providing a high-level specification of inputs, which can aid both comprehension and debugging.
In this paper we focus on the problem of learning context-free grammars (CFGs) from a set of positive examples and a Boolean-valued oracle. This setting is similar to that of GLADE [BastaniGladePLDI2017]. Like GLADE, and unlike other recent related works [WuReinamFSE2019, HoscheleAutogramASE2016, GopoinathMimidFSE2020], we assume the oracle is black-box: our technique can only see the Boolean return value of the oracle. We adopted the use of an oracle as we believe that in practice, an oracle (e.g. in the form of a parser) is easier to obtain than good, information-carrying negative examples.
In this paper, we describe a novel algorithm, Arvada, for learning CFGs from example strings and an oracle. At a high level, Arvada attempts to create the smallest CFG possible that accommodates all the examples. It uses two key operations, bubbling and merging, to generalize the language as much as possible, while not overgeneralizing beyond the language accepted by the oracle.
To create this context-free grammar, Arvada repeatedly performs the bubbling and merging operations on tree representations of the input examples. This set of trees is initialized with one “flat” tree per input example, i.e. the tree with a single root node whose children are the characters of the input string. The bubbling operation takes sequences of sibling nodes in the trees and adds a layer of indirection by replacing the sequence with a new node. This new node has the bubbled sequence of sibling nodes as children.
Then Arvada decides whether to accept or reject the proposed bubble by checking whether a relabeling of the new node enables sound generalization of the learned language. Essentially, labels of non-leaf nodes correspond to nonterminals in the learned grammar. Merging the labels of two distinct nodes in the trees adds new strings to the grammar’s language: the strings derivable from subtrees with the same label can be swapped. We call this the merge operation since it merges the labels of two nodes in the tree. If a valid merge occurs, the structure introduced by the bubble is preserved. Thus, merges introduce recursion when a parent node is merged with one of its descendants. If the label of the new node added in the bubbling operation cannot merge with any existing node in the trees, the bubble is rejected. That is, the introduced indirection node is removed, and the bubbled sequence of sibling nodes is restored to its original parent. These operations are repeated until no remaining bubbled sequence enables a valid merge.
In this paper, we formalize this algorithm in Arvada. We introduce heuristics in the ordering of bubble sequences to minimize the number of bubbles Arvada must check before finding a successful relabeling. We implement Arvada in 2.2k LoC in Python, and make it available as open source. We compare Arvada to GLADE [BastaniGladePLDI2017], a state-of-the-art grammar learning engine for black-box oracles. We evaluate it on parsers for several grammars taken from the evaluation of GLADE, Reinam [WuReinamFSE2019], Mimid [GopoinathMimidFSE2020], as well as a few new highly-recursive grammars. On average across these benchmarks, Arvada achieves 4.98× higher recall and 3.13× higher F1 score than GLADE. Arvada incurs on a slowdown of over GLADE, while requiring as many oracle calls. We believe this slowdown is reasonable, especially given the difference in implementation language: Arvada is implemented in Python, while GLADE is implemented in Java. Our contributions are as follows:
We introduce Arvada, which learns grammars from input strings and an oracle via bubble-and-merge operations.
We distribute Arvada’s implementation as open source: https://github.com/neil-kulkarni/arvada.
We evaluate Arvada on a variety of benchmarks against the state-of-the-art method GLADE.
II Motivating Example
Arvada takes as input a set of example strings and an oracle. The oracle returns True if its input string is valid and False otherwise. Arvada’s goal is to learn a grammar which maximally generalizes the example strings in a manner consistent with the oracle. That is, strings in the language of the learned grammar should with high probability be accepted by the oracle. We formally describe maximal generalization in Section III.
Fundamentally, Arvada learns a grammar by learning “parse trees” for the examples. These parse trees are initialized with one flat tree per example. Then, Arvada adds structure, turning sequences of sibling nodes into new subtrees. The particular subtrees Arvada keeps are those which enable generalization in the induced grammar.
From any set of trees we can derive an induced grammar. In particular, each non-leaf node with label $t$ and children with labels $c_1, \dots, c_n$ induces the rule $t \to c_1 \cdots c_n$. The induced grammar of a set of trees is then the set of induced rules for all nodes in the trees.
For example, the trees in Fig. 2 induce the grammar:
Because of this mapping from trees to grammars, we will use the term “nonterminal” interchangeably with “label of a non-leaf node” when discussing relabeling trees.
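The mapping from trees to rules can be sketched in a few lines of Python. This is a minimal illustration, not Arvada's actual code: the `Node` class, the `flat_tree` helper, and the start label `t0` are hypothetical names, and leaves are the characters themselves (the per-character nonterminals are omitted for brevity).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Node:
    """A parse-tree node; a node without children is a leaf (terminal)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def flat_tree(example: str, start: str = "t0") -> Node:
    """Flat tree: a single root whose children are the example's characters."""
    return Node(start, [Node(ch) for ch in example])


def induced_grammar(trees: List[Node]) -> Dict[str, Set[Tuple[str, ...]]]:
    """Collect one rule `label -> child labels` for every non-leaf node."""
    rules: Dict[str, Set[Tuple[str, ...]]] = {}
    stack = list(trees)
    while stack:
        node = stack.pop()
        if node.children:  # leaves induce no rule
            rhs = tuple(child.label for child in node.children)
            rules.setdefault(node.label, set()).add(rhs)
            stack.extend(node.children)
    return rules


# Two flat trees induce two alternatives for the start label.
grammar = induced_grammar([flat_tree("L = n"), flat_tree("n+n")])
```

Because rules are collected in a set, structurally identical nodes contribute a single rule.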
We illustrate Arvada on a concrete example. We take the set of examples and the oracle shown in Fig. 1. This oracle accepts inputs as valid only if they are in the language of the while grammar shown at the top of the figure. Arvada treats the oracle as a black box: it has no structural knowledge of the grammar, which is shown only to clarify the behavior of the oracle.
Arvada begins by constructing naïve, flat parse trees from the examples. These are shown in Fig. 2. Essentially, these trees simply go from the start nonterminal to the sequence of characters in each example. Let $\mathcal{T}$ designate the set of trees Arvada maintains at any point in its algorithm.
The fundamental operation Arvada performs is to bubble up a sequence of sibling nodes in the current trees into a new nonterminal. To bubble a sequence $s$ of siblings, we create a new nonterminal node $t_s$ whose children are the nodes of $s$. Then we replace all occurrences of $s$ in each tree with $t_s$. Fig. 3 shows two such bubbles applied to the trees in Fig. 2. On top, we have bubbled the sequence hile into a new nonterminal; the second tree, unchanged, is not illustrated. On the bottom, we have bubbled (n+n) into a new nonterminal; the first tree is unchanged.
After bubbling a sequence, Arvada either accepts or rejects the bubble. Arvada only accepts a bubble if it enables valid generalization of the examples: that is, if a relabeling of the bubbled nonterminal (merging its label with the label of another existing node) expands the language accepted by the induced grammar, while maintaining the oracle-validity of the strings produced by the induced grammar.
Consider again Fig. 3. On top, we have the bubble of hile. There is no terminal or nonterminal whose label can be merged with this bubble's label and retain a valid grammar: it can’t be merged with the start nonterminal, since “hile” on its own is not accepted by the oracle. Nor can it be merged with the label of any individual character: as just one example, merging with L would cause the oracle-invalid generalization “hile = n ; hile = (n+n)”.
On the bottom of Fig. 3, we have the bubble of (n+n). We can in fact merge its label with the label of the implicit nonterminal expanding to n. Notice that if we replace n with the strings derivable from the bubble, we get examples like while true & false do L = (n+n) and L = (n+n) ; L = ((n+n)+(n+n)), which are all valid. Conversely, if we replace occurrences of the bubble with n, we get examples like L = n ; L = n. We accept this bubble, which expands the language accepted by the induced grammar. Thus, the label of the bubble and the label of n are merged and relabeled under a single new label. The trees after the relabel are shown after (1) in Fig. 4. Note this merge has introduced recursive generalization; writing $t$ for the merged label, the induced grammar now includes the rules $t \to n$ and $t \to ( t + t )$.
In practice, Arvada checks whether two labels can be merged by checking candidate strings against the oracle. If the oracle accepts all these candidate strings, the relabeling is valid and the labels are merged. To create these candidates, Arvada creates mutated trees from the current trees where (1) subtrees rooted at the first label are replaced by subtrees rooted at the second, and (2) subtrees rooted at the second label are replaced by subtrees rooted at the first. The candidate strings are then the ones derived from these trees, i.e. the ordered sequence of a tree’s leaf nodes. Section III-C describes the conditions under which a bubble is accepted in more detail. Section III-D describes how to create these candidate strings, and the soundness issues this introduces.
[Fig. 4: a potential run of Arvada on the flat trees for “while true & false do L = n” and “L = n ; L = (n+n)”, showing the trees after each step, with the affected nodes highlighted. (1) Bubble “(n+n)”; merge its label with that of n. (2) Bubble the assignment sequences starting with “L =”; merge their label with the start nonterminal. (3) 2-bubble (“true”, “false”); merge both new labels with each other. (4) Bubble the sequence joining the two Boolean nodes with “&”; merge its label with the label from (3).]
II-A3 Double bubbling
After accepting a bubble, Arvada continues to try and create new bubbles. It bubbles different sequences of children in the current trees , checking if they are accepted, and updating accordingly. Fig. 4 shows a potential run of Arvada, with the state of the trees as they are updated by bubbles and label merging.
In Fig. 4, after (1) accepting the bubble , Arvada (2) finds and accepts the bubble , whose label can be merged with the start nonterminal . At this point, Arvada will find no more bubbles which can be merged with any existing nodes in . For example, if Arvada creates the bubble , it will find that the label cannot be merged with the label of any existing node and reject it.
To cope with this, Arvada also considers 2-bubbles. In a 2-bubble, two distinct sequences of children in the trees, say $s_1$ and $s_2$, are bubbled at the same time, i.e. replacing $s_1$ with a new nonterminal $t_{s_1}$ and $s_2$ with a new nonterminal $t_{s_2}$. The two sequences can be totally distinct, or one can contain the other, but they cannot partially overlap: for example, (rue␣&␣f, e␣&␣fal) is not allowed. Arvada accepts a 2-bubble only if the labels $t_{s_1}$ and $t_{s_2}$ can be merged with each other, not with another existing node. Otherwise, either $s_1$ or $s_2$ could be accepted as a 1-bubble.
In the run in Fig. 4, (3) Arvada applies and accepts the 2-bubble (true, false) and merges the two new labels into a single nonterminal. This 2-bubble enables one final single bubble to be applied and accepted: (4) the sequence joining the two new nodes with “&” is bubbled, and its label can be merged with the nonterminal from (3). After this, no more 1-bubbles or 2-bubbles can be accepted, so Arvada simply outputs the grammar induced by the final set of trees. Fig. 5 shows the grammar.
II-A5 Effect of bubbling order
First, note that multiple orderings of bubbles can result in an equivalent grammar. For example, we could have applied (, ) in (3), then bubbled up false alone in (4). Second, while Fig. 4 shows an ideal run, some accepted bubbles may impede further generalization of the grammar. For example, in the initial flat parse trees, can be merged with e. In the presence of the additional example “while n == n do skip”, this merge prevents maximal generalization.
As such, the order in which bubbles are applied and checked has a large impact on Arvada’s performance. In Section III-B, we describe heuristics that order the bubbles for exploration based on the context and frequency of the bubbled subsequence. These heuristics increase Arvada’s probability of successfully finding the maximal generalization of the examples with respect to the oracle, as discussed in Section IV-B.
II-A6 Maximality of learned grammar
The grammar in Fig. 5 is not identical to that in Fig. 1. However, it contains all the rules in demonstrated by the examples : has taken on the role of numexpr, in the role of boolexpr, and is effectively stmt. However, the rule does not appear in Fig. 5. Fundamentally, this is because no substring derivable from this rule exists in ; as such, it is not part of ’s maximal generalization.
First, we formalize our problem statement. Arvada accepts as input a set of example strings and a Boolean-valued oracle which judges the validity of the strings. Arvada’s goal is to learn a context-free grammar which maximally generalizes the set of examples in a manner consistent with the oracle.
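Let $S$ be a set of input strings and $\mathcal{O}$ be a Boolean-valued oracle accepting strings as input. Assume each $s \in S$ is accepted by the oracle, i.e., $\mathcal{O}(s) = \text{True}$. Let $G_{\mathcal{O}}$ be a context-free grammar such that its language of strings $L(G_{\mathcal{O}})$ is equal to $\{ s \mid \mathcal{O}(s) = \text{True} \}$, the set of strings accepted by the oracle $\mathcal{O}$. Since $\mathcal{O}(s) = \text{True}$ for each $s \in S$, each $s \in L(G_{\mathcal{O}})$. We call $G_{\mathcal{O}}$ the target grammar.
Thus, for each $s \in S$, there exists a derivation $D_s$ from the start symbol $T_0$ of $G_{\mathcal{O}}$ to $s$, i.e. $T_0 \Rightarrow^{*} s$. This derivation is a sequence of nonterminal expansions according to some rules of $G_{\mathcal{O}}$. Let $R_s$ be the set of rules in $G_{\mathcal{O}}$ used in the derivation $D_s$. Let $R_S = \bigcup_{s \in S} R_s$, and let $G_S$ be the subset of $G_{\mathcal{O}}$ which contains only those rules in $R_S$. Intuitively, $G_S$ is the sub-grammar of $G_{\mathcal{O}}$ which is exercised by the examples $s \in S$.
Finally: a grammar which maximally generalizes $S$ w.r.t. $\mathcal{O}$ is a grammar $G$ such that $L(G) = L(G_S)$, i.e. it accepts the same language as $G_S$.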
III-A Main Algorithm
Algorithm 1 shows the main Arvada algorithm. It works as follows. First, Arvada builds naïve, flat parse trees from the input strings (Line 3). Considering each example $s$ as a sequence of characters $c_1 \cdots c_n$, the tree constructed for $s$ has a root node with the start symbol label $t_0$ and children with labels $t_{c_1}, \dots, t_{c_n}$. Each $t_{c_i}$ has a single child whose label is the corresponding character $c_i$. Fig. 2 shows these flat parse trees for the example strings in Fig. 1, although the $t_{c_i}$ are not illustrated for simplicity.
Arvada tries to generalize these parse trees by merging nodes in the trees into new nonterminal labels (Line 4). To merge two nodes with labels $t_a$, $t_b$, we replace all occurrences of the labels $t_a$, $t_b$ with a new label $t_c$. This creates new trees; the merge is valid if the language of the grammar induced by the new trees only includes strings accepted by the oracle.
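The relabeling step of a merge is purely syntactic, as the following sketch suggests. It assumes a hypothetical tuple encoding of trees, `(label, children)`, and does not perform the oracle-based validity check, which is a separate step.

```python
from typing import List, Tuple

# Hypothetical tree encoding: (label, list of child trees); leaves have no children.
Tree = Tuple[str, List["Tree"]]


def merge_labels(tree: Tree, a: str, b: str, new: str) -> Tree:
    """Return a copy of `tree` with every label `a` or `b` renamed to `new`."""
    label, children = tree
    if label in (a, b):
        label = new
    return (label, [merge_labels(child, a, b, new) for child in children])


# t1 -> n and t2 -> ( t1 ): merging t1 with t2 yields the recursive t3 -> n | ( t3 ).
tree: Tree = ("t0", [
    ("t1", [("n", [])]),
    ("+", []),
    ("t2", [("(", []), ("t1", [("n", [])]), (")", [])]),
])
merged = merge_labels(tree, "t1", "t2", "t3")
```

In this toy tree the merged label ends up as both a parent and its own descendant, which is how a merge introduces recursion into the induced grammar.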
In practice, we check if a merge of $t_a$, $t_b$ is valid by checking whether $t_a$ can replace $t_b$ in the example strings, and vice versa. The strings derivable from an arbitrary nonterminal $t$ in the trees are the concatenated leaves of the subtrees rooted at $t$. We check whether $t_a$ replaces $t_b$ by checking whether the strings produced by replacing strings derivable from $t_b$ with strings derivable from $t_a$ are accepted by the oracle. That is, we take the strings derivable from the trees, with holes in place of strings derived from $t_b$. Then we fill the holes with strings derivable from $t_a$. If all the resulting strings are accepted by the oracle, Arvada judges the replacement as valid. Section III-D details this check and its soundness.
Now the main Arvada loop starts. From the current trees, Arvada gets all potential “bubbles” (Algorithm 1, Line 8). For each tree, GetBubbles collects all proper contiguous subsequences of children in the tree. That is, if the tree contains a node with children $c_1, \dots, c_n$, the potential bubbles include all contiguous subsequences of $c_1, \dots, c_n$ of length greater than one and less than $n$. GetBubbles returns all these subsequences as 1-bubbles, and all non-conflicting pairs of these subsequences as 2-bubbles. Two subsequences are non-conflicting if they do not strictly overlap: they can be disjoint or one can be a proper subsequence of the other. For instance, rue␣&␣f and e␣&␣fal conflict. The order in which Arvada explores these bubbles is important for efficiency; we discuss this further in Section III-B.
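A minimal sketch of this enumeration follows. It simplifies the description above by working on index spans over a single node's child list rather than on label sequences across all trees; the names `proper_subsequences`, `conflicts`, and `two_bubbles` are illustrative and not taken from Arvada's code.

```python
from itertools import combinations
from typing import List, Tuple

Span = Tuple[int, int]  # half-open interval [start, end) over one node's children


def proper_subsequences(n_children: int) -> List[Span]:
    """All contiguous spans of length greater than one and less than n_children."""
    return [
        (i, j)
        for i in range(n_children)
        for j in range(i + 2, n_children + 1)
        if j - i < n_children
    ]


def conflicts(a: Span, b: Span) -> bool:
    """Spans conflict when they overlap and neither contains the other."""
    overlap = a[0] < b[1] and b[0] < a[1]
    nested = (a[0] <= b[0] and b[1] <= a[1]) or (b[0] <= a[0] and a[1] <= b[1])
    return overlap and not nested


def two_bubbles(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """All unordered pairs of distinct spans that do not conflict."""
    return [(a, b) for a, b in combinations(spans, 2) if not conflicts(a, b)]


one_bubbles = proper_subsequences(4)
pairs = two_bubbles(one_bubbles)
```

For a node with four children this yields five 1-bubble spans and five non-conflicting pairs; the number of pairs grows quickly with the number of children, which is why the exploration order matters.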
Then, for each potential bubble, Arvada tries applying it to the existing set of trees. Suppose we have a 1-bubble consisting of the subsequence $c_i, \dots, c_j$. To apply this bubble, we replace any sequence of siblings with labels $c_i, \dots, c_j$ in the trees with a new subtree whose root is a fresh nonterminal and whose children are those siblings. Fig. 3 shows two such bubblings: hile is bubbled into a new nonterminal at the top, and (n+n) is bubbled into a new nonterminal on the bottom. If the bubbled nodes have structure under them, that structure is maintained: see, e.g., the bubbling at (4) in Fig. 4. For a 2-bubble, the same process is repeated for the two subsequences involved.
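Applying a 1-bubble can be sketched as a recursive rewrite of sibling lists. The `Node` class and the fresh label `t2` are assumptions made for illustration; the sketch wraps only proper subsequences, matching the definition of a bubble.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def apply_bubble(node: Node, seq: List[str], new_label: str) -> Node:
    """Copy `node`, wrapping each run of siblings labelled `seq` in a new node."""
    kids = [apply_bubble(child, seq, new_label) for child in node.children]
    out: List[Node] = []
    i = 0
    while i < len(kids):
        window = kids[i:i + len(seq)]
        # Only proper subsequences are bubbled, so never wrap the full child list.
        if len(seq) < len(kids) and [k.label for k in window] == seq:
            out.append(Node(new_label, window))  # subtrees under the window are kept
            i += len(seq)
        else:
            out.append(kids[i])
            i += 1
    return Node(node.label, out)


def leaves(node: Node) -> str:
    """Concatenate the leaf labels, i.e. the string the tree derives."""
    if not node.children:
        return node.label
    return "".join(leaves(child) for child in node.children)


tree = Node("t0", [Node(ch) for ch in "L=(n+n)"])
bubbled = apply_bubble(tree, list("(n+n)"), "t2")
```

Bubbling never changes the string a tree derives; it only adds a layer of indirection that a later merge can exploit.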
After applying the bubble, Arvada checks whether it should be accepted (Line 11). Section III-C formalizes CheckBubble, but essentially, CheckBubble accepts a bubble if the new nonterminals introduced in its application can be validly merged with some other nonterminal node in the tree.
If the new bubbled nonterminal allows a valid merge with some other nonterminal, CheckBubble returns True as well as the trees with the merge applied (Line 11). We update the best trees to reflect the successful merge (Line 13), and GetBubbles is called again on the new trees. If the bubble is not accepted, Arvada continues to check the next bubble returned by GetBubbles (Line 9).
The algorithm terminates when none of the bubbles are accepted, i.e. when the trees cannot be further generalized, and returns the grammar induced by the trees (Line 17).
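The control flow of this loop can be summarized by the following skeleton. It is a sketch under assumptions, not a transcription of Algorithm 1: the three callables stand in for GetBubbles, bubble application, and CheckBubble, the initial merge pass (Line 4) is omitted, and `check_bubble` is assumed to return the merged trees or `None` on rejection.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")  # whatever represents the current set of trees
B = TypeVar("B")  # whatever represents a bubble


def learn(
    trees: T,
    get_bubbles: Callable[[T], Iterable[B]],
    apply_bubble: Callable[[T, B], T],
    check_bubble: Callable[[T], Optional[T]],
) -> T:
    """Repeat bubble-and-merge until no bubble is accepted."""
    accepted = True
    while accepted:
        accepted = False
        for bubble in get_bubbles(trees):
            merged = check_bubble(apply_bubble(trees, bubble))
            if merged is not None:
                trees = merged   # keep the generalized trees
                accepted = True
                break            # recompute bubbles on the updated trees
    return trees


# Toy stand-ins: "trees" are integers, a bubble adds to them, and only even
# values up to 5 count as valid generalizations.
result = learn(
    0,
    lambda t: [1, 2, 3],
    lambda t, b: t + b,
    lambda c: c if c % 2 == 0 and c <= 5 else None,
)
```

The early `break` mirrors the fact that bubbles are recomputed on the new trees after every accepted bubble.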
We can guarantee the following about Arvada as long as merges are sound, once we consider the notion of partially merging two nonterminals, discussed in Section III-C2.
Existence Theorem: There exists a sequence of $k$-bubbles that, when considered by Arvada in order, enables Arvada to return a grammar whose language is equal to that of the target grammar, so long as the input examples exercise all rules of the target grammar.
Proof Outline: The optimal bubble order always chooses the right-hand side of some rule in the target grammar as the sequence to bubble, either as a 1-bubble if there exists an expansion for that rule's nonterminal in the trees already, or as a 2-bubble otherwise.
Refer to Appendix A-B for a formal treatment of this and of the Generalization Theorem, which shows that $k$-bubbles monotonically increase the language of the learned grammar.
III-B Ordering Bubbles for Exploration
As described in paragraph 5) of Section II and alluded to above, the order of bubbles impacts the eventual grammar returned by Arvada. Unfortunately, the number of orderings of bubbles is exponential. To have an efficient algorithm in practice, we must make sure the algorithm finds the correct order of bubbles early in its exploration of bubble orders. As such, GetBubbles returns bubbles in an order more likely to enable sound generalization of the grammar being learned.
As described in the prior section, bubble sequences consist of proper contiguous subsequences of children in the current trees. We increase the maximum length of subsequences considered once all bubbles of shorter length do not enable any valid merges. These subsequences (and their pairs) form the base of 1-bubbles (and 2-bubbles) returned by GetBubbles.
Recall that a bubble should be accepted if the bubbled nonterminal(s) can be merged with an existing nonterminal (or each other). Thus, GetBubbles should first return those bubbles that are likely to be mergeable. We leverage the following observation to return bubbles likely to enable merges: expansions of a given nonterminal often occur in a similar context. The $k$-context of a sequence $s$ of sibling terminals/nonterminals in a tree is the tuple of the $k$ siblings to the left of $s$ and the $k$ siblings to the right of $s$.
Fig. 6 shows an example of a run of Arvada on the while language, after the application of the 1-bubble “skip” and the 2-bubble (“false”, “true”). The set of 4-contexts for the sequence “n␣==␣n” is . Similarly, “”’s 4-contexts are ; “” is a dummy element indicating the start of the example string. Note that “n==n” and “” share the 4-context
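Collecting contexts for a sequence can be sketched as below. The padding markers `<start>` and `<end>`, and the choice to pad with them on both sides so every context has the same width, are assumptions for illustration; the text above only states that a dummy element marks the start of the example string.

```python
from typing import List, Tuple

START, END = "<start>", "<end>"  # hypothetical dummy elements

Context = Tuple[Tuple[str, ...], Tuple[str, ...]]


def k_contexts(siblings: List[str], seq: List[str], k: int) -> List[Context]:
    """Return (left, right) sibling tuples of width k around each match of seq."""
    n = len(seq)
    padded = [START] * k + siblings + [END] * k
    contexts: List[Context] = []
    for i in range(k, k + len(siblings) - n + 1):
        if padded[i:i + n] == seq:
            left = tuple(padded[i - k:i])
            right = tuple(padded[i + n:i + n + k])
            contexts.append((left, right))
    return contexts


ctxs = k_contexts(list("a=b;a=c"), ["a", "="], k=2)
```

Two sequences that share a context in this sense are natural candidates for a 2-bubble, since they appear in the same surroundings.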
With this in mind, GetBubbles orders the bubbles in terms of their context similarity. Given two contexts and , where and , we have , where
where is the indicator function, returning 1 if its arguments are equal and 0 otherwise. This similarity function gives most weight to the context elements closest to the bubble.
With this in mind, we define set context similarity as the maximum similarity of two contexts within the set:
In our running example, the context similarity is 1 because n==n’s 4-context set is a subset of ’s 4-context set.
To form bubbles, GetBubbles first traverses all the trees currently maintained by Arvada. It considers each proper contiguous subsequence of siblings in the trees. For each subsequence , it collects the -contexts for , as well as the occurrence count of the subsequence . In Fig. 6, , and . In our implementation we take .
Arvada then creates a 2-bubble for each pair of sequences where both and . The similarity score of this 2-bubble is and its frequency score is the average frequency of the two sequences in the bubble . Additionally, for each sequence with , Arvada creates a 1-bubble . Let be the set of length-one subsequences. The similarity score of is and its frequency score is .
Finally, GetBubbles takes the top- bubbles as sorted primarily by similarity, and secondarily by frequency. Intuitively, high-frequency sequences may correspond to tokens in the oracle’s language. The order of bubbles is shuffled to prevent all runs of Arvada from getting stuck in the same manner. We find to be effective in practice.
III-C Accepting Bubbles
The second key component of Arvada is deciding whether a given bubble should be accepted: this section formalizes how CheckBubble works. At the core of CheckBubble is the concept of whether two labels can be merged. We say that labels $t_a$ and $t_b$ can be merged if and only if Replaces($t_a$, $t_b$) holds (that is, all occurrences of $t_b$ can be replaced by $t_a$ in the grammar) and Replaces($t_b$, $t_a$) holds. We formalize how Replaces is checked in the next section.
Arvada accepts a 2-bubble with labels $t_a$, $t_b$ only if $t_a$ and $t_b$ can be merged with each other. Intuitively, this is because both bubbles should be kept only if they together expand the grammar. For example, suppose we apply the 2-bubble (“n␣==␣n”, “lse”) to the trees in Fig. 6, resulting in two new nonterminals. While the nonterminal for “n␣==␣n” can merge with an existing nonterminal, the nonterminal for “lse” does not contribute to this merging. So, (“n␣==␣n”) should be accepted only as a 1-bubble.
Recall that Arvada scores 1-bubbles highly if they are likely to merge with an existing nonterminal. Given a 1-bubble with label $t_b$, we go through each nonterminal label $t$ present in the current set of trees and check whether $t_b$ and $t$ can be merged. If this holds for some $t$, then CheckBubble accepts the bubble.
However, if the bubble's label cannot merge with any existing nonterminal, Arvada also looks for partial merges. Partial merging works as follows. Consider the character nonterminal labels present in the current set of trees. A character nonterminal is a nonterminal whose expansions consist only of a single terminal element.
For each character nonterminal, the partial merging algorithm identifies all the different occurrences of that nonterminal in the right-hand sides of expansions in the trees' induced grammar. For instance, in the grammar fragment (1) of Fig. 7, we see that the nonterminal corresponding to “n” occurs 4 distinct times in right-hand sides of expansions. The partial merging algorithm then modifies the grammar so that each occurrence of the character nonterminal is replaced with its own fresh nonterminal. Each fresh nonterminal expands to the same bodies as the original. This replacement process is illustrated in the grammar fragment (2) of Fig. 7: the four occurrences have been replaced with four fresh nonterminals. Finally, we get to the merging in partial merging: for each fresh nonterminal, the algorithm checks whether it can be merged with the bubble's label. If this holds for any of them, Arvada accepts the bubble, and the bubble's label is merged with all such fresh nonterminals. The fresh nonterminals which cannot be merged with the bubble's label are restored to the original character nonterminal.
The term partial merge refers to the fact that we have effectively merged the bubble's label with some of the occurrences of the character nonterminal in rule expansions. This step is useful when Arvada’s initial trees, which map each character to a single nonterminal, use the same nonterminal for characters that are conceptually separate. For instance, consider the 1-bubble ((n+n)). Given the tree in Fig. 7, merging its label with the nonterminal for “n” fails because “(n+n)” cannot replace the “n” in “then”. In fact, the bubble's label cannot merge with any existing nonterminal initially. But the partial merge process splits the nonterminal for “n” into four fresh nonterminals, and Arvada finds that the bubble's label in fact merges with three of them. So, it is merged with those nonterminals and accepted.
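The occurrence-splitting step of partial merging can be sketched on a dictionary encoding of the induced grammar. The encoding, the function name `split_occurrences`, and the `_1`, `_2`, … suffix scheme are all hypothetical; the subsequent merge checks against the oracle and the restoration of unmerged occurrences are not shown.

```python
from typing import Dict, List

# Hypothetical grammar encoding: nonterminal -> list of bodies (lists of symbols).
Grammar = Dict[str, List[List[str]]]


def split_occurrences(grammar: Grammar, target: str) -> Grammar:
    """Give each right-hand-side occurrence of `target` its own fresh nonterminal."""
    out: Grammar = {}
    fresh: List[str] = []
    for lhs, bodies in grammar.items():
        new_bodies = []
        for body in bodies:
            new_body = []
            for sym in body:
                if sym == target:
                    name = f"{target}_{len(fresh) + 1}"
                    fresh.append(name)
                    new_body.append(name)
                else:
                    new_body.append(sym)
            new_bodies.append(new_body)
        out[lhs] = new_bodies
    # Every fresh nonterminal expands to the same bodies as the original.
    for name in fresh:
        out[name] = [list(body) for body in grammar[target]]
    return out


grammar: Grammar = {
    "t0": [["tn", "+", "tn"], ["t", "h", "e", "tn"]],
    "tn": [["n"]],
}
split = split_occurrences(grammar, "tn")
```

After the split, the “n” in “then” and the “n” operands are distinct nonterminals, so a bubble's label can merge with the operands without touching the keyword.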
Note: though we consider only partial merges on character nonterminals for efficiency reasons, the concept of partial merging can be applied to any pair of nonterminals.
In summary, a 1-bubble is accepted if either: (1) its label can be merged with some existing nonterminal label, or (2) its label can be partially merged with some character nonterminal.
III-D Sampling Strings for Replacement Checks
The final important element affecting the performance of Arvada is how exactly we determine whether the merge of two nonterminal labels is valid. Recall that two labels $t_a$ and $t_b$ can be merged if and only if Replaces($t_a$, $t_b$) and Replaces($t_b$, $t_a$).
We implement Replaces($t_a$, $t_b$) as follows. From the current parse trees, we derive the replacee strings: the strings derivable from the parse trees, but with holes instead of the strings derived from $t_b$. Then, we derive a set of replacer strings: the strings derivable from $t_a$ in the trees. Finally, we create the set of candidate strings by replacing the holes in the replacee strings with the replacer strings. If the oracle rejects any candidate string, the merge is rejected, and Replaces returns false.
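Once replacee and replacer strings are available, the check itself is a nested loop over candidates. In this sketch the hole marker `□`, the function signature, and the toy oracle (Python's own expression parser) are assumptions for illustration; sampling of candidates is not shown.

```python
from typing import Callable, Iterable

HOLE = "\u25a1"  # hypothetical hole marker; assumed absent from all examples


def replaces(
    replacees: Iterable[str],
    replacers: Iterable[str],
    oracle: Callable[[str], bool],
) -> bool:
    """True if every candidate (all holes filled by one replacer) is accepted."""
    replacer_list = list(replacers)
    for template in replacees:
        for filler in replacer_list:
            if not oracle(template.replace(HOLE, filler)):
                return False
    return True


def toy_oracle(candidate: str) -> bool:
    """Toy black-box oracle: accepts strings that parse as Python expressions."""
    try:
        compile(candidate, "<candidate>", "eval")
        return True
    except SyntaxError:
        return False


ok = replaces([HOLE + "+n", "(" + HOLE + ")"], ["n", "(n+n)"], toy_oracle)
bad = replaces(["(" + HOLE + ")"], [")"], toy_oracle)
```

Note that `str.replace` fills every hole in a template with the same replacer string, which is one of the sources of unsoundness for this style of check.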
Fig. 8 shows how replacer and replacee strings are computed in the call to , i.e. whether can replace . Replacee strings for a node in the parse tree are computed by taking the product of replacee strings for all its children; the nonterminal being replaced becomes a hole.
Level-0 replacer strings for a nonterminal are just the strings that are directly derivable from it in the tree; in Fig. 8, the level-0 derivable strings of one nonterminal are 44+4, (3), 3, and the level-0 derivable strings of another are 44, 4. Then, the set of level-$n$ derivable strings for a node is the set derived from taking the product of all level-($n-1$) derivable strings for each child of the node. The level-1 replacer strings are shown in Fig. 8.
When Replaces is run in the full MergeAllValid call or while evaluating a 1-bubble, we use only level-0 replacer strings. However, we found that level-1 replacer strings greatly increased soundness at a low runtime cost for 2-bubbles. Intuitively this is because nonterminals from new bubbles tend to have less structure underneath them than existing nonterminals in the trees. So it is faster to compute level-1 replacer strings for these new bubble-induced nonterminals.
Note that both the number of replacee strings and the number of level-n derivable replacer strings grow exponentially. So, instead of taking the entire set of strings derivable in this manner, if there are more than of them, we uniformly sample of them. In our implementation we use , to make the number of parse calls reasonable in terms of runtime.
Unfortunately, this process allows unsound merges, where all candidate strings are accepted by the oracle, but the merge adds oracle-invalid inputs to the language of the learned grammar. First, because only candidates are sampled. Second, because the replacee strings are effectively “level 0”, and thus, not reflective of the current induced grammar from the trees. Third, because a candidate string is produced by replacing all its holes with a single replacer string, rather than filling holes with different replacer strings. Taking , for the level- replacer strings, and filling different holes with different replacer strings would ensure sound merges.
Since Arvada considers 2-bubbles, it is effectively in the total length of examples . So, to improve performance as gets large and reduce the likelihood of creating “breaking” bubbles, in our implementation we use a simple heuristic to pre-tokenize the values at leaves, rather than considering each character as a leaf. We group together sequences of contiguous characters of the same class (lower-case, upper-case, whitespace, digits) into leaf tokens. Punctuation and non-ASCII characters are still treated as individual characters. We then run Arvada as described previously. To ensure generalization, we add a last stage which tries to expand these tokens into the entire character class: e.g. if , we check whether can be replaced by any sequence of lower-case letters, letters, or alphanumeric characters. We construct the replacee strings as described above, and sample 10 strings from the expanded character classes as replacer strings.
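The pre-tokenization heuristic can be sketched with `itertools.groupby`. The function names and the exact class tests are assumptions; only the four classes named above are grouped, and everything else stays a single-character token.

```python
from itertools import groupby
from typing import List, Optional


def char_class(ch: str) -> Optional[str]:
    """Class used for grouping; None means 'keep as an individual character'."""
    if ch.isascii():
        if ch.islower():
            return "lower"
        if ch.isupper():
            return "upper"
        if ch.isdigit():
            return "digit"
        if ch.isspace():
            return "space"
    return None  # punctuation and non-ASCII characters


def pre_tokenize(example: str) -> List[str]:
    """Group contiguous same-class characters into leaf tokens."""
    tokens: List[str] = []
    for cls, group in groupby(example, key=char_class):
        chars = list(group)
        if cls is None:
            tokens.extend(chars)           # one token per character
        else:
            tokens.append("".join(chars))  # one token per run
    return tokens


tokens = pre_tokenize("while n1 == (n+22)")
```

Fewer leaves per tree means fewer candidate subsequences, and runs such as `while` can no longer be split by a bubble like `hile`.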
We seek to answer the following research questions:
RQ1. Do Arvada’s mined grammars generalize better (have higher recall) than the state of the art?
RQ2. Do Arvada’s mined grammars produce more valid inputs (have higher precision) than the state of the art?
RQ3. How does the nondeterminism in Arvada cause its behavior to vary across different invocations?
RQ4. How does Arvada’s performance compare to that of deep-learning approaches?
RQ5. What are Arvada’s major performance bottlenecks?
RQ6. What do Arvada’s mined grammars look like?
We evaluate Arvada against the state-of-the-art black-box grammar inference tool GLADE [BastaniGladePLDI2017] on 11 benchmarks.
The first 8 benchmarks consist of an ANTLR4 [ParrANTLRSPE1995] parser for the ground-truth grammar as oracle and a randomly generated set of training examples. The training set is sampled to cover all of the rules in the ground-truth grammar, while keeping the length of each example small. The test set is randomly sampled from the ground-truth grammar. Essentially, this ensures that the maximal generalization of the training set covers the entire test set. Other than turtle and while, these benchmarks come from prior work [BastaniGladePLDI2017, WuReinamFSE2019, GopoinathMimidFSE2020]:
arith: operations between integers, can be parenthesized
fol: a representation of first order logic, including qualifiers, functions, and predicates
json: JSON with objects, lists, strings with alpha-numeric characters, Booleans, null, integers, and floats
lisp: generic s-expression language with “.” cons’ing
mathexpr: binary operations and a set of function calls on integers, floats, constants, and variables
turtle: LOGO-like DSL for Python’s turtle
while: simple while language as shown in Fig. 1
xml: supporting arbitrary attributes, text, and a few labels
The next 3 benchmarks use a runnable program as oracle, and use a random input generator to create the training examples and the test set. The training set consists of the first 25 oracle-valid inputs generated by the generator, and the test set of the next 1000 oracle-valid inputs generated. In this case, there is no guarantee that the maximal generalization of the training set covers the test set.
curl: the oracle is the curl [curl] URL parser. We use the grammar in RFC 1738 [URLRFC] to generate the training examples and the test set.
tinyc: the oracle is the parser for tinyc [tinyc], a compiler for a subset of C. We use the same golden grammar as in Mimid [GopoinathMimidFSE2020] to generate the training examples and the test set.
IV-B Accuracy Evaluation
First, we evaluate the accuracy of Arvada and GLADE’s mined grammars with respect to the ground-truth grammar. We ran both Arvada and GLADE with the same oracle and example strings. Three key metrics are relevant here:
Recall: the proportion of inputs from the held-out test set—generated by sampling the golden grammar/generator—that are accepted by the mined grammar. We use a test set size of 1000 for all benchmarks.
Precision: the proportion of inputs sampled from the mined grammar that are accepted by the golden grammar/oracle. We sample 1000 inputs from the mined grammar to evaluate this.
F1 Score: the harmonic mean of precision and recall. It is trivial to achieve high recall but low precision (the mined grammar captures any string) or low recall but high precision (the mined grammar captures only the strings in the training set); F1 measures the tradeoff between the two.
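The three metrics above can be computed with a few lines of code. The sketch below is illustrative only: `acceptance_rate` and `f1_score` are hypothetical helper names, and the membership checks (`mined_accepts`, `oracle`) are assumed to be Boolean callables supplied by the caller.

```python
from typing import Callable, Iterable


def acceptance_rate(inputs: Iterable[str], accepts: Callable[[str], bool]) -> float:
    """Fraction of inputs for which accepts(...) returns True."""
    items = list(inputs)
    if not items:
        return 0.0
    return sum(1 for s in items if accepts(s)) / len(items)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recall: held-out test inputs checked against the mined grammar.
#   recall = acceptance_rate(test_set, mined_accepts)
# Precision: samples drawn from the mined grammar checked against the oracle.
#   precision = acceptance_rate(mined_samples, oracle)
```

A grammar that accepts every string has recall 1.0, but its F1 score is pulled toward its low precision: with precision 0.1, `f1_score(0.1, 1.0)` is roughly 0.18.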
Results. As Arvada is nondeterministic in the order of bubbles explored, we ran it 10 times per benchmark. As GLADE is deterministic, we ran it only once per benchmark.
Table 11 shows the overall averaged results, and Fig. 9 the individual runs. We see from the table that on average, Arvada achieves higher recall than GLADE on all benchmarks, and a higher F1 score on all but 2 benchmarks. Arvada achieves over 2× higher recall on 9 benchmarks, and over 2× higher F1 score on 7 benchmarks.
Even for those benchmarks where Arvada does not have a higher F1 score on average, Fig. 8(c) shows that Arvada outperforms GLADE on some runs. For nodejs, on 5 runs, Arvada achieves a higher F1 score, ranging from 0.37 to 0.55. For curl, on 2 runs Arvada achieves F1 scores greater than or equal to GLADE’s: 0.78 and 0.86. It makes sense that GLADE performs well for curl: the URL language is regular, and the first phase of GLADE’s algorithm works by building up a regular expression. Nonetheless, Fig. 8(a) shows that Arvada achieves consistently higher recall on curl.
Overall, on average across all runs and benchmarks, Arvada achieves 4.98× higher recall than GLADE, while maintaining its precision. So, on our benchmarks, the answer to RQ1 is in the affirmative, while the answer to RQ2 is not. Given that Arvada still achieves a 3.13× higher F1 score on average, and that higher generalization (in the form of recall) is much more useful if the mined grammar is used for fuzzing, we find this to be a very positive result.
However, we see from the standard deviations in Table 11 that Arvada’s performance varies widely on some benchmarks, notably fol, lisp, and while. Fig. 9, which shows the raw data, confirms this. In Fig. 8(a), we see that the performance on the lisp benchmark is quite bimodal. All of the mined grammars with recall around 0.25 fail to learn to cons parenthesized s-expressions. This may be because the minimal example set did not actually contain an example of this nesting. On nodejs, the two runs with recall less than 0.1 find barely any recursive structures, suggesting that on larger example sets, Arvada may get lost in bubble order. Overall, the answer to RQ3 is that Arvada’s nondeterministic bubble ordering can have very large impacts on the results. We discuss possible mitigations in Section V.
IV-C Comparison to Deep Learning Approaches
Recently there has been interest in using machine learning to learn input structures. For instance, Learn&Fuzz trains a seq-2-seq model of the structure of PDF objects [LearnFuzz]; it uses information about the start and end of PDF objects, as well as the importance of different characters, in its sampling strategy. DeepSmith [Cummins19] trains an LSTM to model OpenCL kernels for compiler fuzzing, adding tokenization and pre-processing stages to CLGen [Cummins17].
A natural question is how Arvada compares to these generative models. We trained the LSTM model from CLGen [Cummins17], the generative model behind DeepSmith, on our benchmarks. We removed all the OpenCL-specific preprocessing stages from the pipeline. We used the parameters given as an example in the CLGen repo, creating a 2-layer LSTM with hidden dimension 128, trained for 32 epochs. We used \n!!\n as an EOF separator. Each sample consisted of 100 characters, split into different inputs where the EOF separator appeared.
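Splitting a fixed-length model sample into individual inputs at the EOF separator might look like the sketch below. The helper name `split_sample` is hypothetical, and the handling of the final piece is an assumption: a 100-character sample can end mid-input, and the text does not say whether such a truncated tail is kept, so this sketch simply keeps every piece.

```python
from typing import List

EOF_SEP = "\n!!\n"  # separator placed between training inputs


def split_sample(sample: str, sep: str = EOF_SEP) -> List[str]:
    """Split one generated sample into candidate inputs at each separator.

    All pieces are returned, including a possibly truncated last one
    (an assumption; the source does not specify this detail).
    """
    return sample.split(sep)


pieces = split_sample("1+2\n!!\n(3)\n!!\n4")
```

Each resulting piece can then be passed to the oracle to count how many generated inputs are valid.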
Table 11 shows the runtime of the model on each benchmark, as well as the precision achieved on the first 1000 samples taken from the model. Generally, we see that the precision is much lower than that of GLADE or Arvada. On arith, the model over-trains on the EOF separator, adding \n and ! throughout samples. Since the model is generative (it can generate samples but cannot judge whether a given sample is valid), we cannot measure recall as in Table 11. However, qualitative analysis of the samples suggests there is not much learned recursive generalization. For json, 602 of the 625 valid samples are a single string (e.g., "F"); the other 21 valid samples are numbers, false, or . For nodejs, of the 111 valid samples, 26 are empty, 24 are a single identifier (e.g., a_0), 18 are a parenthesized integer or identifier (e.g., (242)), and 17 are a single-identifier throw (e.g., throw (a_0)).
These results are not entirely unexpected, because the LSTM underlying CLGen learns solely from the input examples. Both Arvada and GLADE extensively leverage the oracle, effectively creating new input examples from which to learn. This explains why the runtimes look so different between Tables 11 and 11. We see in Table 11 that the total time to set up and train the model is around 3 minutes for all benchmarks, and the core training time is around 10-20 seconds. The model training time is slightly higher for tinyc and nodejs, which had longer input examples.
Overall, we expect these deep-learning approaches to be better suited to a case where an oracle is not available, but large amounts of sample inputs are. These models may also be more reliant on input-format-specific pre-processing steps, like those used on OpenCL kernels in CLGen and DeepSmith.
IV-D Performance Analysis
The next question is about Arvada’s performance. Table 11 shows the average Arvada runtime and number of queries performed for each benchmark, and the same statistics for GLADE. On 7 of 11 benchmarks, Arvada is on average slower than GLADE; overall across benchmarks, this amounts to an average slowdown. This is quite respectable, since Arvada has a natural runtime disadvantage due to being implemented in Python rather than Java. For the three benchmarks on which Arvada is over slower than GLADE, it has huge increases in F1 score: for fol, for xml, and for tinyc.
The story for oracle queries performed is reversed; Arvada requires more oracle queries on average on only 4 benchmarks. For all of these except nodejs, Arvada also had much higher F1 scores. However, nodejs is a benchmark with high variance. On the run with highest F1 score (0.55, higher than GLADE’s 0.34), Arvada takes 86,051 s to run and makes 270k oracle calls. On the fastest run, where Arvada only gets F1 score 0.14, Arvada takes 17,775 s and makes 41k oracle calls. That is, on this benchmark the higher performance cost correlates with higher F1 scores: 5 of the 6 slower runs also have higher F1 scores.
Overall, across all benchmarks, Arvada performs only 0.87× as many oracle queries as GLADE. This is encouraging, as it gives more room for performance optimizations.
Fig. 12 breaks down the average percent of runtime spent in Arvada’s 3 most costly components: calling the oracle; creating, scoring, and ordering bubbles; and sampling strings for replacement checks. The error bars show standard deviation; note the aforementioned high variance for nodejs appears here too. On the minutes-long benchmarks on which Arvada is at least 10 seconds slower than GLADE, of the runtime is spent in sampling strings for replacement. The current implementation re-traverses the trees after each bubble to create these examples.
On the particularly slow benchmarks, tinyc and nodejs, Arvada spends a long time ordering bubbles. This makes sense because of the larger example length of these benchmarks. It is nonetheless encouraging to see this room for improvement, as GetBubbles re-scores the full set of bubbles each time a bubble is accepted. It should be possible to bring down runtime by only scoring the bubbles that are modified by the application of the just-accepted bubble. On nodejs, Arvada also spends a long time in oracle queries, because the time for each query is much longer (300 ms vs. 3 ms for tinyc).
Overall, Arvada has runtime and number of oracle queries comparable with GLADE, while achieving much higher recall and F1 score. As for RQ3, when the examples in the training set are short, oracle calls dominate runtime. As example length grows, the ordering and scoring of bubbles, particularly computing context similarity, starts to dominate runtime.