Extracting Grammars from a Neural Network Parser for Anomaly Detection in Unknown Formats

by Alexander Grushin, et al.
Galois, Inc.

Reinforcement learning has recently shown promise as a technique for training an artificial neural network to parse sentences in some unknown format. A key aspect of this approach is that rather than explicitly inferring a grammar that describes the format, the neural network learns to perform various parsing actions (such as merging two tokens) over a corpus of sentences, with the goal of maximizing the total reward, which is roughly based on the estimated frequency of the resulting parse structures. This can allow the learning process to more easily explore different action choices, since a given choice may change the optimality of the parse (as expressed by the total reward), but will not result in the failure to parse a sentence. However, the approach also exhibits limitations: first, the neural network does not provide production rules for the grammar that it uses during parsing; second, because this neural network can successfully parse any sentence, it cannot be directly used to identify sentences that deviate from the format of the training sentences, i.e., that are anomalous. In this paper, we address these limitations by presenting procedures for extracting production rules from the neural network, and for using these rules to determine whether a given sentence is nominal or anomalous, when compared to structures observed within training data. In the latter case, an attempt is made to identify the location of the anomaly. Additionally, a two-pass mechanism is presented for dealing with formats containing high-entropy information. We empirically evaluate the approach on artificial formats, demonstrating effectiveness, but also identifying limitations. By further improving parser learning, and leveraging rule extraction and anomaly detection, one might begin to understand common errors, either benign or malicious, in practical formats.




1 Introduction

Grammatical inference is desirable in many domains, not limited to Natural Language Processing (NLP), but extending to the processing of data that adheres to unknown or underspecified formats. For example, one use case arises when enterprise systems seek security by pre-processing data, before passing it to any application. To ensure compatibility with an application, it may be necessary to infer the grammar for the format that the application accepts. With large, well-adopted formats, such as Portable Document Format (PDF), there exists an ecosystem of programs (parsers) for reading and writing files in the format. These parsers do not always adhere to the specification’s original intent, and sometimes add features or contain bugs, resulting in a modified, de facto specification, which is difficult to interpret. Understanding these modifications to design an effective pre-filter requires grammatical inference algorithms, which can represent and capture the complicated data structures often found in data formats.

Given a set of example input sentences in some unknown format, an effective grammatical inference algorithm must generate a grammar that is general enough to correctly and completely describe this format, i.e., that will parse all sentences in that format, but no other sentences. In order to improve the generality of grammatical inference for non-trivial formats, recent work has often leveraged advances in Machine Learning (ML), particularly via artificial neural networks such as deep recurrent autoencoders (Drozdov et al., 2019, 2020) and transformers (Wang et al., 2019). These referenced studies have focused on constituency parsing, which is performed by concatenating adjacent, related atoms within a sentence to form larger atoms. Here, an atom can be a single token in a parsed sentence, or can consist of multiple such tokens that were merged together during previous parsing steps; e.g., 'if' would be an atom formed from the atoms 'i' and 'f'. (We use single quotes to indicate an atom, and not otherwise; since a separate lexer was not used in this study, each input token is a character, though the fundamental algorithm does not have that constraint.) The aforementioned grammatical inference approaches are designed primarily for NLP applications, and do not have the appropriate expressiveness for data type recurrency. For example, in the JavaScript Object Notation (JSON) format, a nested object may appear as follows: { "a": { "b": { "c": "d" } } }. Here, both the "a" and the "b" keys map to object values, but the above approaches might represent the first object as the atom '{ "b": { "c": "d" } }', and the second one as the atom '{ "c": "d" }', i.e., as different data types.

Recently, Reinforcement Learning (RL) has been used as the basis for a novel, flexible grammatical inference scaffolding called RL-GRIT (Cowger et al., 2020; Woods, 2021). The goal of RL is to find a policy, defined as a mapping of observations to actions, which attempts to maximize some reward. In RL-GRIT, an observation is the current set of atoms, while an action involves the merging of two atoms into a new higher-level atom (parsing is thus performed in a bottom-up fashion). Importantly, to overcome the aforementioned limitations associated with simply concatenating atoms, there are special merge action types that enable data type recurrency: when merging two atoms, the parser has the option of replacing one of the atoms with a special subgrammar token, which can act as a wildcard that can match multiple atoms, or removing one of the atoms entirely, which allows multiple atoms to be recursively collapsed into a single atom, achieving an effect similar to that of the Kleene star. The reward is roughly based on the estimated frequency of the atoms generated when performing parsing actions on a corpus of sentences. The learned policy, represented as a neural network, can then be used as a parser, which can be applied to generate parse trees for new sentences. This RL-based approach learns a parser directly, rather than first learning a grammar. The primary reason for this is that the learning process benefits from a search space with progressive improvements in the measured reward, rather than the "all-or-nothing" design of traditional grammars, which either accept or reject a sentence. While this approach was shown to be promising at automatically generating parsers for formats with data type recurrency (Cowger et al., 2020; Woods, 2021), it also has limitations. First, as is often the case with neural network-based approaches, there is a lack of explainability: the parser has learned some grammar, but it does not provide the production rules for that grammar. 
Furthermore, because the parser is always expected to take an action at each step of the parsing process, it will produce a parse tree for any input sentence, even if the format of this sentence is entirely different from the format of sentences that were used for training. Thus, unlike traditional parsers, which will fail upon encountering a deviation from the grammar, the RL-based parser cannot be used directly to determine whether some sentence is nominal (i.e., valid) or anomalous (invalid) compared to the training data.

In this paper, we extend the RL-based grammatical inference approach by introducing a rule extraction technique, which analyzes parse trees output by the trained parser and extracts a set of production rules describing the format. We design a representation for the production rules, initially based directly on the actions that are taken by the parser. Subsequently, we identify and mitigate limitations of this approach to increase the expressive power of the representation. As the merges are binary, the resulting representation somewhat resembles the Chomsky Normal Form (Chomsky, 1959), though proving equivalence (or lack thereof) is a subject for future work. We also provide a technique for anomaly detection, which uses the resulting representation to determine whether some new sentence adheres to or deviates from the format. To achieve this, our technique checks whether the new sentence is parsed using some of the same production rules that were extracted from sentences that are known to be nominal, or whether certain unexpected (new) production rules are applied. In the latter case, the technique also attempts to find the location of the anomaly within the sentence, by identifying regions of the sentence that are parsed by the unexpected rules. Finally, we further extend rule extraction and anomaly detection to formats where sentences can contain regions with a high degree of entropy, interspersed with low-entropy regions – e.g., the mixing of control bytes, which denote structure within a format, with user data contained in the payload of a format. These high-entropy regions are not amenable to grammatical inference; therefore, when such a region appears in some new sentence, it will likely be parsed by some unexpected rules, and will thus be labeled as anomalous, even if it is not anomalous in the context of the grammar being learned, as it represents a subformat which is not integral to the containing format’s structure. 
We propose to deal with this by applying our approach in two passes: once anomaly detection (potentially, with some modifications) has identified high-entropy regions, they are removed, yielding a simpler set of sentences; then, the “train a parser, extract rules, detect anomalies” pipeline is applied a second time, to these simplified sentences, to identify potential problems in the low-entropy regions while allowing the algorithm to ignore high-entropy regions.

2 Approach

Figure 1: An outline of the approach that is presented in this paper, with very simple examples (on the right side of the figure) of a sentence, a parse tree, and a rule. The thick black arrow (on the left) represents the training of the parser on nominal sentences; thin black arrows represent data flows. Details are provided in the text.

Our approach is outlined in Fig. 1. Given some set of example sentences that are nominal (i.e., valid, according to some unknown format), we use RL to train a parser on some subset of these sentences. We apply the trained parser to another subset of these sentences, in order to generate parse trees, and to then extract a set of rules (we use disjoint subsets of the example sentences for training vs. rule extraction, though this need not be the case). Given some new sentences (whose format is unknown), we generate parse trees for these sentences as well, and extract the corresponding rules. By comparing the two sets of rules, determinations can be made regarding whether each new sentence is nominal or anomalous, and in the latter case, where the anomalies might potentially exist. We elaborate upon these steps in the following subsections.

2.1 Learned actions and production rules

Details of the reinforcement learning approach are provided in Woods (2021); here, we focus specifically on the actions that the artificial neural network-based parser learns to perform. At every step of the parse, an action involves selecting two atoms 'a' and 'b', and merging them into a new atom. There are three types of merges; the most straightforward of these simply concatenates the two atoms into the atom 'ab'; this can be expressed via the production rule 'ab' -> 'a' 'b'. The anchored merge takes two atoms, 'a' and 'b', and merges them into the atom 'a' or 'b' (essentially, deleting the other atom), depending on whether the anchored merge is left- or right-biased; these merges are described, respectively, as: 'a' -> 'a' 'b' and 'b' -> 'a' 'b'. Finally, in the subgrammar merge, one of the two merged atoms is replaced with a special, wildcard-like token 'G' not occurring in the input language. Like the anchored merges, subgrammar merges can be left- or right-biased: 'aG' -> 'a' 'b' and 'Gb' -> 'a' 'b'. Notably, the anchored merge is designed to fill a similar role to that of the Kleene star * (e.g., it can be applied multiple times to parse a sentence such as abbb to produce abb, then ab, and then a), and the subgrammar merge is designed to fill a similar role to that of the alternation operator |.
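The three merge types can be sketched as follows; this is a minimal illustration of the action semantics described above, not the paper's implementation, and the function and variable names are ours:

```python
# Sketch of the three RL-GRIT merge action types (illustrative only).
SUBGRAMMAR = "G"  # wildcard token, assumed not to occur in the input language

def merge(a, b, kind):
    """Merge two adjacent atoms a and b into a new atom, and return the
    new atom together with the production rule describing the merge."""
    if kind == "concat":          # 'ab' -> 'a' 'b'
        new = a + b
    elif kind == "anchor_left":   # 'a' -> 'a' 'b'
        new = a
    elif kind == "anchor_right":  # 'b' -> 'a' 'b'
        new = b
    elif kind == "sub_left":      # 'aG' -> 'a' 'b' (replace b with the wildcard)
        new = a + SUBGRAMMAR
    elif kind == "sub_right":     # 'Gb' -> 'a' 'b' (replace a with the wildcard)
        new = SUBGRAMMAR + b
    else:
        raise ValueError(kind)
    rule = f"'{new}' -> '{a}' '{b}'"
    return new, rule

# Repeatedly applying a left-anchored merge mimics the Kleene star:
atoms = list("abbb")
while len(atoms) > 1 and atoms[1] == "b":
    new, _ = merge(atoms[0], atoms[1], "anchor_left")
    atoms = [new] + atoms[2:]
# atoms is now ['a']
```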

2.2 Production rule extraction

Figure 2: Example nominal (left) and anomalous (right) Simple-JSON sentences, with parse trees; the leaves of the parse tree correspond to single-character tokens in the sentence. Each node of the parse tree is labeled with the atom that it represents. Some nodes are also labeled with the production rule that generated the atom (i.e., that has this atom on the left side). Purple rules (and generated atoms) correspond to regular merges; grey rules correspond to anchored merges, and red rules correspond to subgrammar merges. Unexpected rules are underlined.

For a sentence consisting of n tokens, there are n - 1 possible merges as a first action, since all such actions from Woods (2021) involve two adjacent atoms. For the same reason, the resulting tree from a sentence with n tokens will be a binary tree with n - 1 non-leaf nodes. Each of these nodes corresponds to a higher-level atom, generated via the application of some production rule. Thus, by examining the parse tree, we can easily obtain the set of production rules that were used to parse the sentence. For example, consider the sentence {{a}{b}{c}}, and its parse tree, illustrated on the left side of Fig. 2; some of the non-leaf nodes are labeled with the associated rules (additional extracted rules, not shown in the figure, are 'G' -> '{' 'a' and 'G' -> '{' 'b'). In the general case, if we apply the parser to a different input sentence, then this might result not only in a different parse tree, but also, in a different set of rules. Thus, to obtain a comprehensive view of the grammar that the parser has learned, it is desirable to apply it to a large set of input sentences in a particular format, and to combine the sets of rules that were extracted.
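Rule extraction from a parse tree can be sketched as a simple traversal; the encoding here is an assumption of ours (a non-leaf node is a nested (atom, left, right) tuple and a leaf is a token string), not the paper's data structure:

```python
# Extract the set of production rules used in a binary parse tree.
# A non-leaf node is (atom, left_subtree, right_subtree); a leaf is a token.

def atom_of(tree):
    """The atom represented by a subtree (a leaf is its own atom)."""
    return tree if isinstance(tree, str) else tree[0]

def extract_rules(tree, rules=None):
    if rules is None:
        rules = set()
    if isinstance(tree, str):          # leaf: a single input token
        return rules
    atom, left, right = tree
    rules.add(f"'{atom}' -> '{atom_of(left)}' '{atom_of(right)}'")
    extract_rules(left, rules)
    extract_rules(right, rules)
    return rules

# A sentence with n tokens yields a binary tree with n - 1 non-leaf nodes,
# hence at most n - 1 distinct rules; e.g., a parse of the sentence '{a}':
tree = ("G", ("G", "{", "a"), "}")
assert extract_rules(tree) == {"'G' -> '{' 'a'", "'G' -> 'G' '}'"}
```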

For the specific example in Fig. 2 (left), the production rules happen to cover all sentences described by the grammar S -> '{' ('a' | 'b' | 'c' | S+) '}', referred to as the Simple-JSON grammar (Cowger et al., 2020; Woods, 2021); in particular, the 'G' atom (appearing on the left side of the production rule 'G' -> 'G' '}') approximately corresponds to S. However, while the rules will successfully parse any valid Simple-JSON sentence, will they also fail at parsing any sentence that is not in the Simple-JSON format, i.e., that is anomalous from the perspective of Simple-JSON? We will address this question in the following two subsections.
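For reference, membership in the Simple-JSON language S -> '{' ('a' | 'b' | 'c' | S+) '}' can be checked directly with a small recursive-descent recognizer; this is a ground-truth checker of ours for experimentation, not part of the learned parser:

```python
# Recursive-descent recognizer for S -> '{' ('a' | 'b' | 'c' | S+) '}'.

def is_simple_json(s):
    def parse_S(i):
        """Try to parse one S starting at index i; return the index past
        it on success, or None on failure."""
        if i >= len(s) or s[i] != "{":
            return None
        i += 1
        if i < len(s) and s[i] in "abc":   # a single letter ...
            i += 1
        else:                              # ... or one or more nested objects
            j = parse_S(i)
            if j is None:
                return None
            i = j
            while (j := parse_S(i)) is not None:
                i = j
        if i < len(s) and s[i] == "}":
            return i + 1
        return None
    return parse_S(0) == len(s)

assert is_simple_json("{a}")
assert is_simple_json("{{a}{b}{c}}")
assert not is_simple_json("{a")      # missing closing bracket
assert not is_simple_json("{a{b}}")  # letter mixed with an object
```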

2.3 Anomaly detection

During both parser training and rule extraction, our assumption was that all sentences are valid examples of a given format. Now, suppose that we are presented with some new sentence, whose format is unknown. Our goal is to use the trained parser and the extracted rules to determine whether this sentence is valid in the given format, or whether it deviates from it. This determination can be made by applying the parser to the sentence, and obtaining a set of rules for that sentence. Then, the sentence is labeled as nominal if every rule in this set also appears among the rules extracted from the nominal sentences; otherwise, it is labeled as anomalous. In other words, we determine whether some of the rules that were extracted when parsing the new sentence are unexpected, in that they were never used to parse valid/nominal sentences. As an example, consider the right side of Fig. 2, where we apply the parser to a sentence that is a corrupted version of the sentence on the left side, with the 'a' token deleted. The analysis of the parse tree reveals a number of unexpected rules, such as 'GG' -> 'G' 'G', and the anomaly is thus correctly detected.
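This check reduces to a set comparison; the sketch below is an illustration of ours, with the unexpected rules returned explicitly, since they are reused for localization later in the paper:

```python
# Anomaly detection as a set comparison: a sentence is anomalous iff
# parsing it used any rule never extracted from nominal sentences.

def unexpected_rules(nominal_rules, sentence_rules):
    """Rules used for this sentence that never appeared in nominal parses."""
    return sentence_rules - nominal_rules

def is_anomalous(nominal_rules, sentence_rules):
    return bool(unexpected_rules(nominal_rules, sentence_rules))

nominal = {"'G' -> '{' 'a'", "'G' -> 'G' '}'", "'G' -> 'G' 'G'"}
# Only previously-seen rules: the sentence is labeled nominal.
assert not is_anomalous(nominal, {"'G' -> '{' 'a'", "'G' -> 'G' '}'"})
# A corrupted sentence introduces an unseen rule: labeled anomalous.
assert unexpected_rules(nominal, {"'GG' -> 'G' 'G'"}) == {"'GG' -> 'G' 'G'"}
```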

2.4 Enhancing the representation

There are, however, situations where our approach may fail to detect that a sentence is anomalous. First, consider a sentence such as {a. While it is anomalous (due to a missing bracket), it is successfully parsed via the rule 'G' -> '{' 'a', which is expected (i.e., extracted from nominal sentences); thus, the sentence is labeled as nominal. This shortcoming exists because our representation lacks the notion of a start symbol, which is a nonterminal symbol that must correspond to the entire sentence that is being parsed. In our grammar representation, we now introduce a similar notion of a start rule, which is any rule that appears at the root of the parse trees that are produced by the parser. In particular, when rules are extracted from a parse tree, any rule at the root is labeled as a start rule; in our notation, we denote a start rule by enclosing its left side with hyphens. For example, in the left (nominal) parse tree of Fig. 2, the root of the tree is represented by the rule 'G' -> 'G' '}', so we write it as -'G'- -> 'G' '}'. Given this enhancement, when performing anomaly detection on the aforementioned example {a, the resulting parse tree contains a single rule -'G'- -> '{' 'a' at its root. As the original set of rules does not contain the rule -'G'- -> '{' 'a', even though it does contain the rule 'G' -> '{' 'a', the sentence {a is now correctly labeled as anomalous.
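The start-rule notation is easy to generate mechanically; in this sketch of ours, a parse tree is assumed to be a nested (atom, left, right) tuple with string leaves, and only the root invocation sets the flag:

```python
# Render the rule at a non-leaf node, marking it as a start rule (hyphens
# around the left side) when the node is the root of the parse tree.

def atom_of(tree):
    return tree if isinstance(tree, str) else tree[0]

def rule_of(tree, is_root=False):
    atom, left, right = tree
    lhs = f"-'{atom}'-" if is_root else f"'{atom}'"
    return f"{lhs} -> '{atom_of(left)}' '{atom_of(right)}'"

# For the parse of '{a}', the root rule is distinguished from inner rules:
tree = ("G", ("G", "{", "a"), "}")
assert rule_of(tree, is_root=True) == "-'G'- -> 'G' '}'"
assert rule_of(tree[1]) == "'G' -> '{' 'a'"
```

With this distinction, a truncated sentence such as {a yields the start rule -'G'- -> '{' 'a', which never occurs at the root of nominal parse trees, so the subset check now catches it.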

Figure 3: An example sentence that does not adhere to the Simple-JSON format, but that would be labeled as nominal without precedence constraints (left); the blue arrow indicates that the left side of one production rule is fed into the right side of another production rule. We illustrate some of the precedence constraints between production rules that parse valid/nominal Simple-JSON sentences (right); each node is labeled with a rule; a blue edge indicates that the left side of one production rule becomes the first atom on the right side of another production rule; an orange edge indicates that one rule produces the second atom that is used by another rule.

Now, consider the sentence {a{b}}, and its parse tree, shown on the left side of Fig. 3. This sentence is anomalous, due to an extra letter between the first two brackets. However, the anomaly is not detected, because none of the extracted rules (including the start rule) are unexpected. To detect this sort of anomaly, we must consider not only which rules are applied to parse it, but also, in what order. In particular, the goal is to determine exactly which production rules (or terminal symbols) might be used as inputs for which subsequent production rules. For example, is it acceptable for the output of the rule 'G' -> '{' 'a' to be used as an input to the rule 'G' -> 'G' 'G', as indicated by the dashed arrow in the figure? To allow such questions to be answered, when the rule extraction procedure analyzes parse trees derived from nominal sentences, it considers not only rules, but also, precedence constraints of the form ( r > r1 ^ r2 ), indicating that in a parse tree, there is a non-leaf node with a rule r where the first (left) child has a rule r1, and the second (right) child has a rule r2; each of r1 or r2 might be a token, rather than a rule, if it comes from a leaf node in the parse tree. Together, the rules and the precedence constraints can be illustrated as a graph, with the former represented via nodes, and the latter captured as pairs of edges. If we ignore, for clarity, the distinction between start rules and other rules, and only consider constraints where both r1 and r2 are rules, rather than tokens, then for Simple-JSON, the graph is shown on the right side of Fig. 3. Now, when performing anomaly detection, we determine not only whether the rules extracted from a new input sentence were also extracted from nominal sentences, but also, whether the same holds for its precedence constraints; the sentence is labeled as nominal if and only if both conditions are satisfied. Returning to the example in Fig. 3, we observe that one of the extracted precedence constraints is as follows: ( 'G' -> 'G' 'G' > 'G' -> '{' 'a' ^ 'G' -> 'G' '}' ). This constraint does not appear in the original set of constraints in Fig. 3 (there is no edge from 'G' -> '{' 'a' to 'G' -> 'G' 'G'); thus, the example is now correctly labeled as anomalous.
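Extracting the ( r > r1 ^ r2 ) triples amounts to recording, at each non-leaf node, which rule (or leaf token) produced each child; the tuple-based tree encoding and function names in this sketch are assumptions of ours:

```python
# Extract precedence constraints (r, r1, r2): r is the rule at a non-leaf
# node, r1/r2 are the rules at its children, or the tokens themselves when
# a child is a leaf. A non-leaf node is (atom, left, right); a leaf is a token.

def atom_of(tree):
    return tree if isinstance(tree, str) else tree[0]

def producer(tree):
    """The rule that produced a subtree, or its token for a leaf."""
    if isinstance(tree, str):
        return tree
    atom, left, right = tree
    return f"'{atom}' -> '{atom_of(left)}' '{atom_of(right)}'"

def extract_constraints(tree, out=None):
    if out is None:
        out = set()
    if isinstance(tree, str):
        return out
    _, left, right = tree
    out.add((producer(tree), producer(left), producer(right)))
    extract_constraints(left, out)
    extract_constraints(right, out)
    return out

# For a parse of '{a}': the root rule consumes the output of an inner rule
# as its first child and a leaf token as its second child.
tree = ("G", ("G", "{", "a"), "}")
assert ("'G' -> 'G' '}'", "'G' -> '{' 'a'", "}") in extract_constraints(tree)
```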

2.5 Anomaly localization

It is of practical interest to not only determine whether a sentence is anomalous, but also to localize the anomalous token(s) within the sentence. We attempt to achieve this by determining which of the sentence’s tokens are “covered” by the unexpected rules, or by rules with unexpected precedence constraints. A token is covered by a rule if this rule applies at some node in the parse tree for the sentence, and the token is a child of that node. As an example, returning to the right side of Fig. 2, the unexpected rule 'G' -> '{' '{' covers the first two tokens in the sentence. We consider this anomaly to be correctly localized, because these tokens are adjacent to the token that was deleted. We note that the unexpected rule 'G' -> 'G' '}' covers the last token, and thus gives another candidate (though incorrect) location for an anomaly. The rule 'GG' -> 'G' 'G' does not cover any tokens, because it applies at a node that has no leaf nodes as children. We have experimented with an alternative notion of coverage, where a token need not be a child of a node with an unexpected rule, so long as it is a descendant, but found that this resulted in poorer localization performance.
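Topological coverage can be computed in one traversal that threads a token position through the tree; as in the other sketches, the (atom, left, right) tuple encoding and names are illustrative assumptions of ours:

```python
# Topological coverage: a token is covered by an unexpected rule if that
# rule applies at a node whose direct child is that token (a leaf).
# Non-leaf node: (atom, left, right); leaf: a token string.
# Token positions are indexed left to right, starting at 0.

def atom_of(tree):
    return tree if isinstance(tree, str) else tree[0]

def covered_positions(tree, unexpected, pos=0):
    """Return (next_pos, set of token positions covered by unexpected rules)."""
    if isinstance(tree, str):
        return pos + 1, set()
    atom, left, right = tree
    rule = f"'{atom}' -> '{atom_of(left)}' '{atom_of(right)}'"
    start = pos
    pos, cov_left = covered_positions(left, unexpected, pos)
    mid = pos
    pos, cov_right = covered_positions(right, unexpected, pos)
    covered = cov_left | cov_right
    if rule in unexpected:
        if isinstance(left, str):     # first child is a token
            covered.add(start)
        if isinstance(right, str):    # second child is a token
            covered.add(mid)
    return pos, covered

# The corrupted sentence '{a', parsed entirely by one unexpected rule:
_, cov = covered_positions(("G", "{", "a"), {"'G' -> '{' 'a'"})
assert cov == {0, 1}
```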

2.6 Sentence simplification

While the Simple-JSON format is highly structured, there are many formats where example sentences tend to contain a high degree of randomness/entropy. For example, consider the Key-List language, where each sentence is a list of keys, such as /cjc /i /sp; each key begins with a forward slash, and the keys are separated by spaces. The keys resemble those encountered in a dictionary within the PDF file format, but with randomly-generated content for each key. Due to this randomness, the language cannot be described with a compact set of rules; thus, each sentence will likely result in unexpected rules, and be labeled as anomalous. To overcome this issue, we can attempt to differentiate between the low-entropy and high-entropy regions of an example sentence by applying our rule extraction algorithm, but filtering out production rules that occur infrequently within the parse trees from which they were extracted, treating such rules as unexpected. Then, leveraging the hypothesis that high-entropy regions will be labeled as anomalous, we can use the remaining rules to perform anomaly localization as described previously. Subsequently, we simplify each sentence in our datasets, by replacing each high-entropy region with the special high-entropy token '&'; then, a sentence such as /cjc /i /sp will be converted to /& /& /&. Finally, we apply the pipeline of Fig. 1 a second time to the resulting simplified sentences, which should now be describable by a compact set of rules. Unfortunately, for the Key-List language, the approach is problematic, because each low-entropy region consists of at most two tokens: a space and a forward slash.
Suppose that the second forward slash is removed in /cjc /i /sp to produce the anomalous sentence /cjc i /sp; then, the entire region cjc i will likely be covered by unexpected rules, treated as high-entropy, and collapsed to '&'; the simplified sentence will be /& /&, which is nominal; the anomaly will thus be missed. To address this issue, in addition to the aforementioned topological notion of coverage, we alternatively introduce symbolic coverage, where a token is covered symbolically if there exists some rule (after filtering) with this token contained on the right side of the rule. For the Key-List language, the rule '/ ' -> '/' ' ' would cover any instance of a forward slash or space token, and so /cjc i /sp would be correctly simplified to /& & /&, which preserves the anomaly. On the other hand, topological coverage can be useful in situations where the low-entropy regions are larger, and where the same token may appear in both high-entropy and low-entropy regions, which is the case for many formats.
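Under symbolic coverage, the simplification step reduces to collapsing maximal runs of uncovered tokens; this is a minimal sketch of ours, assuming character tokens and a precomputed set of tokens that appear on the right side of some frequent rule:

```python
# Sentence simplification: replace each maximal run of tokens that are not
# symbolically covered by any frequent rule with the high-entropy token '&'.

def simplify(tokens, covered_tokens):
    """covered_tokens: tokens that appear on the right side of some
    frequent (i.e., not filtered-out) rule."""
    out, in_run = [], False
    for t in tokens:
        if t in covered_tokens:
            out.append(t)
            in_run = False
        elif not in_run:        # start of a new high-entropy run
            out.append("&")
            in_run = True
    return "".join(out)

# For Key-List, '/' and ' ' are covered by the frequent rule '/ ' -> '/' ' ':
assert simplify(list("/cjc /i /sp"), {"/", " "}) == "/& /& /&"
# A deleted slash remains visible after simplification:
assert simplify(list("/cjc i /sp"), {"/", " "}) == "/& & /&"
```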

3 Evaluation

In our initial set of experiments, we applied our approach to randomly-generated sentences in the Simple-JSON grammar; the generation procedure is described in Woods (2021). We generated separate sets of nominal sentences for training, for production rule extraction, and for validation (explained in the caption to Table 1), along with a set of sentences for the evaluation of anomaly detection, a portion of which were made anomalous in some way, as described below, with the rest left nominal. Training sentences were generated first, and duplicate sentences were permitted; when generating the remaining sentences, duplicates were not permitted, and the sentences were randomly shuffled, before being split into sets for rule extraction, validation and evaluation. Since longer sentences are more likely to be unique, the training sentences tended to be shorter, in tokens (characters), on average, than the extraction, validation, and nominal evaluation sentences (for anomalous evaluation sentences, the exact token counts depended on the type of anomaly).

                   True Positive Rate   Localization Rate   Localization Ratio
Deleted Bracket    94.7% (100.0%)       18.7% (18.2%)        9.4% (9.2%)
Deleted Letter     100.0% (100.0%)      100.0% (100.0%)     12.6% (12.3%)
Inserted Letter    100.0% (100.0%)      61.0% (62.1%)        7.4% (7.5%)

Table 1: Anomaly detection and localization performance for Simple-JSON. Each row corresponds to a particular anomaly, while columns capture performance metrics. A given cell lists the average values of each metric without vs. with a validation set; specifically, the first value is obtained by averaging across all trials, while the second value (given in parentheses) is obtained by averaging only across those trials where the false positive rate was 0% on the validation set. For nominal evaluation sentences, false positive rates were likewise measured both without and with a validation set. The localization rate denotes the percentage of sentences where the anomaly is correctly localized (i.e., adjacent to or within the set of tokens that were labeled as anomalous by the algorithm); the localization ratio denotes the percentage of tokens that were labeled as anomalous, amongst the sentences where the anomaly was correctly localized.

The production rule extraction, anomaly detection, and anomaly localization algorithms are deterministic, though their results depend on the stochastically-trained RL-based parser. To determine the effects of this stochasticity, we performed multiple independent experimental trials, where in each trial, a parser was trained on the training sentences, with a distinct set of randomly-generated neural network weights, and different random choices made during the RL process. Training details are found in Woods (2021). For a given trial, we applied the anomaly detection and localization procedures once to the nominal sentences in the evaluation dataset, and three times to the anomalous sentences, each time with a different anomaly injected into the sentences: the first anomaly was the deletion of a single, randomly-chosen bracket ({ or }); the second one was the deletion of a single, randomly-chosen letter (a, b, or c); the final anomaly was the insertion of a single, randomly-chosen letter (a, b, or c) into some randomly-chosen location within the sentence. To gauge the performance of anomaly detection, we captured the proportion (expressed as a percentage) of sentences that were labeled as anomalous; if the sentences were nominal, then this provides us with the false positive rate; otherwise, this is the true positive rate. For the anomalous sentences, we additionally captured two anomaly localization metrics, which are both based on the set of tokens that the algorithm labels as potentially anomalous. The first such metric is the localization rate, which is the proportion of sentences where this set of tokens contains the inserted token (for letter insertion anomalies), or a token that was immediately adjacent to the deleted token (for letter and bracket deletion anomalies).
The localization ratio is the proportion of tokens that were labeled as potentially anomalous, amongst only those sentences where localization was correct; for example, if only a single token in a sentence is labeled as potentially anomalous, then the ratio for that sentence is the reciprocal of the sentence length; if all tokens are labeled as such, then the ratio is 100%. A lower ratio indicates a more precise localization.
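The two localization metrics can be computed as follows; this sketch of ours assumes that, for each anomalous sentence, we have the set of token positions labeled by the algorithm, the set of positions counted as correct (the inserted token, or tokens adjacent to a deletion), and the sentence length:

```python
# Localization rate and ratio over a set of anomalous sentences.
# results: list of (labeled_positions, correct_positions, sentence_length).

def localization_metrics(results):
    # A sentence is correctly localized if the labeled set intersects
    # the set of positions counted as correct.
    localized = [r for r in results if r[0] & r[1]]
    rate = len(localized) / len(results)
    # The ratio averages labeled-token fractions over localized sentences only.
    ratio = (
        sum(len(labeled) / n for labeled, _, n in localized) / len(localized)
        if localized else 0.0
    )
    return rate, ratio

# Two anomalous sentences; localization succeeds in the first one only:
rate, ratio = localization_metrics([
    ({0, 1}, {1}, 8),   # labeled set overlaps the correct positions
    ({7}, {3}, 8),      # labeled set misses the correct positions
])
assert rate == 0.5 and ratio == 0.25
```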

The results of the experiment are summarized in Table 1. For anomaly detection, the true and false positive rates show that using a validation set can potentially mitigate the variability that exists in the effectiveness of model training between different trials. In a practical setting, we might discard a parser if its validation false positive rate is not sufficiently low, and attempt training again. It is worth noting that even across the trials with perfect anomaly detection performance on the validation set, there was some variation in the rules that were learned from nominal sentences: for example, in some cases, the parser learned to perform right-biased merges, such as 'G' -> 'a' '}', rather than left-biased merges, such as 'G' -> '{' 'a'. Furthermore, instead of the anchored merge rule 'G' -> 'G' 'G' (presented earlier in the paper), it was common to see either '}' -> 'G' '}' or '{' -> '{' 'G' (with a bias that is opposite of the bias of other rules).

While our approach can thus be very effective for detecting several different types of anomalies in the Simple-JSON format, the effectiveness of anomaly localization depends significantly upon the anomaly type, and is not necessarily improved by using a validation set. Letter deletions are correctly localized in all cases; however, for bracket deletions and letter insertions, localization rates are much lower, though still above the localization ratios; note that with a purely random strategy, which labels some proportion of tokens as potentially anomalous, the expected localization rate and ratio would both equal that proportion, i.e., they would be the same. For both anomaly types, the localization problem can be inherently ambiguous; for example, given a sentence with a deleted right bracket, it may not be possible to determine which of several positions in the uncorrupted sentence the bracket originally occupied; similarly, for an inserted letter, it may not be clear which of two adjacent letters is the extra one. However, we have observed situations where unexpected rules do not cover any tokens at all (for example, this is the case for the sentence in Fig. 3), or only cover the tokens that could not possibly be anomalous; thus, the localization algorithm could potentially be improved.

To evaluate our simplification capability (Section 2.6), we applied it to the Key-List dataset, where each sentence consisted of one to five keys, and each key consisted of a slash and one to three lowercase letter tokens, with the number of keys, the number of letters, and the choice for each letter drawn from uniform distributions; for the anomalous sentences, one randomly-chosen space or forward slash was deleted from a sentence. The pipeline in Fig. 1 was applied in two passes; the first application was done to simplify the sentences, following the symbolic notion of coverage, with any rule that appears fewer than a threshold number of times (in the parse trees from which the rule set was extracted) filtered out; this would result in the rules ’/ ’ -> ’ ’ ’/’ and ’/ ’ -> ’/’ ’/’ being the only ones remaining. Then, the pipeline was applied to the simplified sentences, in order to perform anomaly detection. During training (on both the original and the simplified sentences), we increased the reward associated with anchor merges relative to the Simple-JSON case (by raising the corresponding parameter in Woods (2021)); this encouraged the learning of anchor merge rules such as ’/&’ -> ’/&’ ’ /&’. Apart from this change, procedures were similar to those used for Simple-JSON. On the evaluation dataset (which also consisted of nominal and anomalous sentences), the false positive rate was low even without a validation dataset, but the true positive rate was imperfect: missed detections occurred when the sentence consisted of just one key with the slash removed (e.g., cjc); such a sentence is simplified to the single token ’&’, which was treated as nominal.
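A possible generator for the Key-List dataset described above is sketched below. The key counts and letter counts match the text (one to five keys, one to three lowercase letters per key); the use of a single space as the key separator is an assumption made for illustration.

```python
# Illustrative Key-List generator with the anomaly-injection step
# (deleting one randomly chosen space or forward slash).
import random
import string

def key_list_sentence(rng):
    """One to five keys; each key is a slash plus 1-3 lowercase letters."""
    keys = []
    for _ in range(rng.randint(1, 5)):
        letters = ''.join(rng.choice(string.ascii_lowercase)
                          for _ in range(rng.randint(1, 3)))
        keys.append('/' + letters)
    return ' '.join(keys)  # space separator is an assumption

def corrupt(sentence, rng):
    """Delete one randomly chosen space or forward slash."""
    idxs = [i for i, c in enumerate(sentence) if c in ' /']
    i = rng.choice(idxs)
    return sentence[:i] + sentence[i + 1:]

rng = random.Random(0)
s = key_list_sentence(rng)
print(s, '|', corrupt(s, rng))
```

Every nominal sentence contains at least one slash, so the corruption step always has a deletion candidate; as noted above, deleting the only slash of a single-key sentence yields an anomaly that simplification erases.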

Finally, we performed experiments where we applied different variations of the two-pass procedure to the Simple-JSON-Stream dataset, which was described in Woods (2021). This dataset consisted of sentences in the Simple-JSON format, but with a prefix and a suffix, each consisting of a random number of random tokens, which could include all lowercase letters and the two bracket tokens, with equal probability. We attempted different variations on the procedure, e.g., with both symbolic and topological coverage, and with different rule filtering thresholds, but were not able to obtain adequate anomaly detection results. Topological simplification was able to perfectly identify the high-entropy prefixes and suffixes for some sentences; for example, hfsawplaygictfxk was correctly simplified to &a&. However, in many other cases, simplification errors were present; e.g., vuptffaxlnnjhbbaplvalinjhxrmcjb was simplified to &&ba&, with the first b region erroneously treated as high-entropy. These errors occurred because the RL process, when applied to the original sentences, was somewhat sensitive to the high-entropy regions, and the rules that were learned for parsing the low-entropy regions of Simple-JSON-Stream were not the same as those learned for Simple-JSON, and did not capture these regions as effectively. In turn, the simplification errors resulted in a set of simplified sentences that often deviated from the Simple-JSON format, and could not be described as compactly. As a result, the rule sets that were learned during the second pass were not sufficiently effective for anomaly detection. We postulate that future improvements to the RL algorithm would mitigate these issues.

4 Discussion and Future Work

While anomaly detection is a heavily researched problem, most work has focused on detecting anomalies in signals that are not modeled by production rules; for example, when detecting anomalies in the behavior of an aircraft, these signals may consist of real-valued variables and low-level switching events (Das et al., 2010). Where production rules are an appropriate modeling paradigm, anomaly detection in unknown formats can, in theory, be solved if grammatical inference is solved. Given a grammar, we can generate a parser for that grammar, and then apply that parser to a sentence; if the parser fails, then the sentence can be labeled as anomalous. However, it can sometimes be more natural to learn a parser, rather than a grammar, as was done in (Cowger et al., 2020; Woods, 2021). As the resulting parser successfully produces a parse tree for any sentence, it cannot be used directly for anomaly detection. A key contribution of this paper is that we have extended this approach to extract a grammar from the parser, and to use this grammar for detecting and localizing anomalies.
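The detection principle described above ("if the parser fails, then the sentence is anomalous") can be sketched with a standard CYK membership test over an extracted grammar: a sentence that the grammar cannot derive is labeled anomalous. The toy grammar below, in Chomsky Normal Form, is an illustrative assumption, not the paper's actual rule format.

```python
# CYK membership test: sentences outside the language are anomalous.

def cyk(tokens, rules, start='S'):
    """rules: (lhs, rhs) pairs; rhs is a 1-tuple (terminal) or 2-tuple."""
    n = len(tokens)
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        for lhs, rhs in rules:
            if rhs == (tok,):
                table[i][1].add(lhs)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for lhs, rhs in rules:
                    if (len(rhs) == 2 and rhs[0] in table[i][split]
                            and rhs[1] in table[i + split][length - split]):
                        table[i][length].add(lhs)
    return start in table[0][n]

# Toy CNF grammar deriving bracketed letters like "[ a ]".
grammar = [('S', ('L', 'X')), ('X', ('A', 'R')),
           ('L', ('[',)), ('R', (']',)), ('A', ('a',))]
assert cyk(['[', 'a', ']'], grammar)   # nominal: derivable
assert not cyk(['[', 'a'], grammar)    # missing bracket: anomalous
```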

The approach raises a key theoretical question: can our production rules (based on regular merges, anchored merges, and subgrammar merges), together with precedence constraints, describe any context-free language, or only a subset? This question remains the subject of future work; specifically, we must formally analyze the extent to which the use of precedence constraints compensates for the fact that only a single subgrammar token ’G’ is used. This allows for more effective learning, by decreasing the space of possible actions; however, it also results in a scarcity of nonterminal symbols: in our rule representation, each nonterminal symbol (e.g., ’G’) is a sequence of terminal symbols (tokens) and/or the subgrammar token ’G’. On the other hand, in a standard grammar formulation (e.g., the Chomsky Normal Form of Chomsky (1959)), an unlimited number of nonterminal symbols can be used, and this eliminates the need for explicitly defining precedence constraints. For example, in Section 2.4 the rules ’G’ -> ” ’a’ and ’G’ -> ’G’ ’G’ might be represented as X -> ” ’a’ and Y -> Y Z, respectively. Then, there would be no concern about the output of the first rule being used as an input to the second rule, since X, Y and Z are distinct symbols.
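The nonterminal-scarcity point can be illustrated with a minimal sketch: with a single subgrammar symbol, one rule's output can always appear on another rule's right-hand side, so a separate precedence constraint is needed to forbid it; renaming to distinct nonterminals encodes the same restriction structurally. The symbol names below are illustrative.

```python
# Minimal illustration: single-symbol rules need precedence constraints;
# distinct nonterminals (X, Y, Z) rule out the unwanted composition by name.

single_symbol_rules = [("'G'", ("''", "'a'")),   # 'G' -> '' 'a'
                       ("'G'", ("'G'", "'G'"))]  # 'G' -> 'G' 'G'
renamed_rules = [('X', ("''", "'a'")),           # X -> '' 'a'
                 ('Y', ('Y', 'Z'))]              # Y -> Y Z

def can_feed(producer, consumer):
    """Can the producer rule's output appear on the consumer's right side?"""
    lhs, _ = producer
    _, rhs = consumer
    return lhs in rhs

assert can_feed(single_symbol_rules[0], single_symbol_rules[1])  # constraint needed
assert not can_feed(renamed_rules[0], renamed_rules[1])          # excluded by naming
```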

On the empirical side, it is of interest to reevaluate the approach with a broader range of formats, as well as more complex anomalies (including anomalies that involve the insertion, deletion or substitution of multiple tokens, rather than just a single token). Another possible research direction is to not only improve the accuracy of anomaly localization, but to provide automated suggestions on how a given anomaly could be corrected. Here, we can potentially leverage machine learning approaches that have been developed for localizing and correcting syntax errors in programs written in languages such as Java (Santos et al., 2018) or Python (Bhatia and Singh, 2016). Roughly speaking, these approaches train models that predict the probability of a given token in a program, given previous tokens; if the probability is low, then the token might be labeled as anomalous, and a suggested fix might involve replacing this token with a higher-probability token. In some sense, these approaches may be viewed as complementary to our work, because they have been developed for programming languages with well-known formats, and rely on the presence of a compiler to determine if a particular suggested fix eliminates the anomaly; for unknown languages/formats, our rule anomaly detection procedure could potentially take on this role.
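The token-probability idea from this line of work can be reduced to a bigram sketch: tokens whose conditional probability given the previous token falls below a threshold are flagged, and the highest-probability alternative is suggested as a fix. The corpus, threshold, and start-of-sentence marker are illustrative assumptions; the cited approaches use far stronger neural language models.

```python
# Bigram sketch of probability-based error localization and correction.
from collections import Counter, defaultdict

def train_bigrams(corpus):
    counts = defaultdict(Counter)
    for sent in corpus:
        for prev, tok in zip(['<s>'] + sent, sent):
            counts[prev][tok] += 1
    return counts

def flag(sentence, counts, threshold=0.1):
    """Return (index, token, suggested replacement) for unlikely tokens."""
    flagged = []
    for i, (prev, tok) in enumerate(zip(['<s>'] + sentence, sentence)):
        total = sum(counts[prev].values()) or 1
        if counts[prev][tok] / total < threshold:
            best = counts[prev].most_common(1)
            flagged.append((i, tok, best[0][0] if best else None))
    return flagged

corpus = [['{', 'a', '}'], ['{', 'b', '}'], ['{', 'a', '}']]
counts = train_bigrams(corpus)
print(flag(['{', 'a', 'a'], counts))  # -> [(2, 'a', '}')]
```

In the complementary setup suggested above, the rule-based anomaly detector would play the role that the compiler plays for Java or Python: checking whether a suggested replacement actually removes the anomaly.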

5 Conclusions

In this paper, we presented an approach for extracting production rules from a neural network parser that was trained via reinforcement learning, and for using the extracted rules to detect and localize anomalies. We demonstrated the effectiveness of the approach on datasets consisting of sentences in a non-regular (context-free) format (Simple-JSON) and a format that contains high-entropy regions (Key-List). This suggests that, with these extensions, the approach shows promise for performing anomaly detection in unknown formats. At the same time, we have found that it is a challenge to apply the approach when high-entropy and low-entropy regions may contain some of the same tokens, as is the case in the Simple-JSON-Stream format. It may be possible to mitigate this issue by further tuning the underlying RL algorithm, such that the presence of high-entropy regions has a less significant impact on the production rules that are learned for the low-entropy regions. Our hope is that as these improvements are made, our approach will be sufficiently powerful to help better understand unknown formats. Since the RL-based approach is promising for the inference of non-trivial grammars, which underlie many real-world data formats, we believe this work will prove useful for improving pre-filters aimed at making enterprise systems safer and more secure.


This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-19-C-0073. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). The authors thank Richard Jones for his feedback and advice in editing this work.


  • Bhatia and Singh (2016) Sahil Bhatia and Rishabh Singh. Automated Correction for Syntax Errors in Programming Assignments using Recurrent Neural Networks. arXiv e-prints, art. arXiv:1603.06129, March 2016.
  • Chomsky (1959) Noam Chomsky. On certain formal properties of grammars. Information and Control, 2(2):137 – 167, 1959. ISSN 0019-9958. doi: https://doi.org/10.1016/S0019-9958(59)90362-6. URL http://www.sciencedirect.com/science/article/pii/S0019995859903626.
  • Cowger et al. (2020) Sam Cowger, Yerim Lee, Nichole Schimanski, Mark Tullsen, Walter Woods, Richard Jones, EW Davis, William Harris, Trent Brunson, Carson Harmon, et al. Icarus: Understanding de facto formats by way of feathers and wax. In 2020 IEEE Security and Privacy Workshops (SPW), pages 327–334. IEEE, 2020.
  • Das et al. (2010) Santanu Das, Bryan Matthews, Ashok Srivastava, and Nikunj Oza. Multiple kernel learning for heterogeneous anomaly detection: Algorithm and aviation safety case study. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 47–56. ACM, 2010.
  • Drozdov et al. (2019) Andrew Drozdov, Patrick Verga, Yi-Pei Chen, Mohit Iyyer, and Andrew McCallum. Unsupervised labeled parsing with deep inside-outside recursive autoencoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1507–1512, 2019.
  • Drozdov et al. (2020) Andrew Drozdov, Subendhu Rongali, Yi-Pei Chen, Tim O’Gorman, Mohit Iyyer, and Andrew McCallum. Unsupervised parsing with s-diora: Single tree encoding for deep inside-outside recursive autoencoders. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4832–4845, 2020.
  • Santos et al. (2018) Eddie Santos, Joshua Campbell, Dhvani Patel, Abram Hindle, and José Amaral. Syntax and sensibility: Using language models to detect and correct syntax errors. In Proceedings of the 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 311–322. IEEE, 2018.
  • Wang et al. (2019) Yau-Shian Wang, Hung-Yi Lee, and Yun-Nung Chen. Tree transformer: Integrating tree structures into self-attention. arXiv preprint arXiv:1909.06639, 2019.
  • Woods (2021) Walt Woods. RL-GRIT: Reinforcement learning for grammar inference. In 2021 IEEE Security and Privacy Workshops (SPW). IEEE, 2021.