Recurrent neural networks (RNNs) are one of the most powerful learning models for processing sequential data and, like neural networks in general, have often been considered “black-box” models. This black-box nature is largely due to the fact that RNNs, like any neural architecture, are designed to capture structural information from the data and store learned knowledge in the synaptic connections between nodes (or weights). This makes inspection, analysis, and verification of captured knowledge difficult or near-impossible. One approach to this problem is to investigate if and how we might extract symbolic knowledge from trained RNNs, since symbolic knowledge is usually regarded as easier to understand. Surprisingly, this is an old problem that was treated by Minsky in the chapter titled “Neural Networks. Automata Made up of Parts” in his text “Computation: Finite and Infinite Machines”. Specifically, if one treats the information processing of a RNN as a mechanism for representing knowledge in symbolic form, where a set of rules that govern transitions between symbolic representations is learned, then the RNN can be viewed as an automated reasoning process with production rules, which should be easier to understand.
Prior work focused on extracting symbolic knowledge from recurrent networks. As an example, Borges et al. proposed to extract symbolic knowledge from a nonlinear autoregressive model with exogenous inputs (NARX). Also, it has been demonstrated that by representing information of long-term dependencies in the form of symbolic knowledge, RNNs’ ability to handle long-term dependencies can be improved. In sentiment analysis, recent work shows that recurrent networks can be explained by decomposing their decision making process and identifying patterns of words which are “believed” to be important for the decision making. If the words are viewed as symbols, then their patterns can be regarded as representing the rules for determining that sentiment. One of the most frequently adopted rule extraction approaches is to extract deterministic finite automata (DFA) from recurrent networks trained to perform grammatical inference [8, 9, 10, 11, 12, 13, 14]. Approaches following this direction are categorized as compositional. In particular, the vector space of a RNN’s hidden layer is first partitioned into a finite number of parts, where each part is treated as a state of a certain DFA. Then, transition rules (also known as production rules), which connect these states and are labeled with symbols from the alphabet, are extracted. Using a DFA to represent production rules is motivated by the need for conducting comprehensive analysis and understanding of the computational abilities of recurrent networks.
Recent work demonstrates that extracting DFAs from a second-order RNN is not only possible, but that the extraction is relatively stable even when the hidden layer of a second-order RNN is randomly initialized. The latter is important because it has been argued that when a RNN is viewed as a dynamical system, its training process is too sensitive to the initial state of the model. This implied that the subsequent DFA extraction may be unstable, which has been shown not to be the case. However, little is known about whether the aforementioned compositional approaches can be effectively applied to other recurrent networks, especially those that have demonstrated impressive performance on various sequence learning tasks, e.g. long short-term memory (LSTM) networks, gated recurrent unit (GRU) networks, multiplicative integration recurrent neural networks (MI-RNN), etc. Another equally important yet missing study is whether and how DFA extraction is affected by the data source on which recurrent networks are trained.
In this work, we greatly expand upon previous work [13, 8] and study the performance of DFA extraction on different types of recurrent networks and on data sets with different levels of complexity. Specifically, the recurrent networks investigated in this study include Elman-RNN, second-order RNN, MI-RNN, LSTM and GRU. We follow previous work [8, 9, 22, 12, 11, 13, 14] by adopting a family of seven relatively simple, yet important, regular grammars, which were originally proposed by Tomita and have been widely studied and used as benchmarks for DFA extraction. Given a recurrent model and a Tomita grammar, the performance is evaluated by measuring both the quality of the DFAs extracted from this model and the success rate of extracting the correct DFA, i.e. one identical to the unique DFA associated with the grammar used for training that model. Both metrics are evaluated over multiple random trials in order to assess the overall performance of DFA extraction for all recurrent networks on all Tomita grammars. In summary, we make the following contributions:
We analyze and categorize regular grammars with a binary alphabet (including Tomita grammars) by defining an entropy that describes their complexity. We discuss the difference between our defined entropy and the entropy of shift spaces defined for describing symbolic dynamics, and show that our definition is more informative for describing the complexity of regular grammars and can be extended to multi-class classification problems.
We propose an alternative metric, the average edit distance, for describing the complexity of regular grammars with a binary alphabet. We show that this metric is closely related to our defined entropy of regular grammars. In addition, through experiments, we demonstrate that the average edit distance gives a finer-grained description of the complexity of Tomita grammars, while our defined entropy is more computationally efficient to calculate.
We conduct a careful experimental study evaluating and comparing different recurrent networks for DFA extraction. Our results show that among all RNNs investigated, RNNs with quadratic (or approximately quadratic) forms of hidden layer interaction, i.e. second-order RNN and MI-RNN, provide the most accurate and stable DFA extraction for all Tomita grammars. In particular, on grammars with a high level of complexity, second-order RNN and MI-RNN achieve much better success rates of DFA extraction than other recurrent networks.
In this section, we first briefly introduce DFAs and regular grammars, followed by the set of Tomita grammars used in this study. Then we introduce existing rule extraction methods, especially the compositional approaches which are most widely studied for extracting DFA from recurrent networks.
II-A Deterministic Finite Automata
Based on the Chomsky hierarchy of phrase structured grammars, a regular grammar is associated with one of the simplest automata, a deterministic finite automaton (DFA). Specifically, a regular grammar $G$ can be recognized and generated by a DFA $M$, which can be described by a five-tuple $\{\Sigma, S, s_0, F, P\}$. $\Sigma$ is the input alphabet (a finite, non-empty set of symbols) and $S$ is a finite, non-empty set of states. $s_0 \in S$ represents the initial state while $F \subseteq S$ represents the set of final states (note that $F$ can be the empty set, $\emptyset$). $P$ denotes a set of deterministic production rules. Every grammar also recognizes and generates a corresponding language, a set of strings of symbols from alphabet $\Sigma$. It is important to realize that DFAs cover a wide range of languages: all languages whose string length and alphabet size are bounded can be recognized and generated by a DFA. Also, when replacing the deterministic transitions with stochastic transitions, a DFA can be converted into a probabilistic automaton or hidden Markov model, which enables the use of graphical models for grammatical inference. For a more thorough and detailed treatment of regular languages and finite state machines, we refer the reader to the standard literature.
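To make the five-tuple description concrete, the following Python sketch stores a DFA as a small data structure and runs it on a string. The class name `DFA`, its field names, and the example automaton for the language (10)* are illustrative assumptions and are not taken from the original study.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    alphabet: frozenset  # input alphabet
    states: frozenset    # finite, non-empty set of states
    initial: object      # initial state, a member of `states`
    finals: frozenset    # final (accepting) states; may be empty
    rules: dict          # deterministic production rules: (state, symbol) -> state

    def accepts(self, string):
        """Run the DFA on `string` and report whether it ends in a final state."""
        state = self.initial
        for symbol in string:
            state = self.rules[(state, symbol)]
        return state in self.finals


# Illustrative instance (assumed, not from the paper): the language (10)*.
# State 2 is an explicit rejecting sink so that every transition is defined.
dfa_10_star = DFA(
    alphabet=frozenset("01"),
    states=frozenset({0, 1, 2}),
    initial=0,
    finals=frozenset({0}),
    rules={
        (0, "1"): 1, (0, "0"): 2,
        (1, "0"): 0, (1, "1"): 2,
        (2, "0"): 2, (2, "1"): 2,
    },
)
```

For example, `dfa_10_star.accepts("1010")` follows the states 0, 1, 0, 1, 0 and ends in the final state, whereas `"101"` ends in the non-final state 1.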
II-B Tomita Grammars
Tomita grammars denote a set of seven regular grammars that have been widely adopted in the study of extracting DFA from RNNs. In principle, when compared with regular grammars associated with large finite-state automata, Tomita grammars should be easily learnable, given that the DFAs associated with Tomita grammars have between three and six states. These grammars all have alphabet $\Sigma = \{0, 1\}$, and generate an infinite language over $\{0, 1\}^{*}$. For each Tomita grammar, we refer to the binary strings generated by this grammar as its associated positive examples and other binary strings as negative examples. A description of Tomita grammars is provided in Table I and the DFA for grammar 2 is shown as an example in Figure 1.
Despite being relatively simple, Tomita grammars actually represent regular grammars with a wide range of complexity. As shown in Table I, the balance between positive and negative samples differs greatly across grammars. For instance, grammars 1, 2 and 7 represent the class of regular languages that define a string set with extremely unbalanced positive and negative strings. This could represent real-world cases where positive samples are significantly outnumbered by negative ones. In contrast, grammars 5 and 6 define the class of regular languages that have an equal or relatively balanced number of positive and negative strings. In particular, for grammar 5, the numbers of positive and negative strings are the same for string sets with even length. This indicates that the difference between positive and negative strings in these grammars is much smaller than for grammars 1, 2 and 7. The difference between the numbers of positive and negative strings for grammars 3 and 4 lies between the above cases. When constructing RNNs to learn Tomita grammars, in either case discussed, RNNs are forced to recognize the various levels of difference between positive and negative samples.
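The imbalance described above can be checked by brute force for short strings. The sketch below counts positive strings of a fixed length for three of the grammars. The membership tests are written from the usual descriptions of these grammars (grammar 1 as 1*, grammar 2 as (10)*, grammar 5 as an even number of 0s and an even number of 1s), which is an assumption here; Table I remains the authoritative definition.

```python
import re
from itertools import product

# Membership tests, assumed from the usual descriptions of these grammars.
GRAMMARS = {
    1: lambda s: re.fullmatch(r"1*", s) is not None,
    2: lambda s: re.fullmatch(r"(10)*", s) is not None,
    5: lambda s: s.count("0") % 2 == 0 and s.count("1") % 2 == 0,
}


def count_positive(accepts, length):
    """Number of binary strings of the given length accepted by `accepts`."""
    return sum(accepts("".join(bits)) for bits in product("01", repeat=length))


def positive_fraction(accepts, length):
    """Share of positive strings among all 2**length binary strings."""
    return count_positive(accepts, length) / 2 ** length


for g in sorted(GRAMMARS):
    print(g, [count_positive(GRAMMARS[g], n) for n in (6, 7, 8)])
```

Under these assumed definitions, grammars 1 and 2 accept at most one string per length, while grammar 5 accepts exactly half of all strings of even length.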
It is worth mentioning that the popularity of Tomita grammars in studying the DFA extraction problem is also due to the fact that the ground truth DFAs for Tomita grammars are available. This enables previous studies [13, 8] to determine the impact of different factors on the performance of DFA extraction by comparing extracted DFAs with ground truth DFAs. More complex or real-world data sets may not support preliminary studies of DFA extraction well, since for those data sets uncertainties are introduced into the evaluation (e.g. what is the ground truth DFA, or does a ground truth DFA that defines the data even exist?). We believe this uncertainty can affect any conclusion of whether DFA extraction can be stably performed.
II-C Rule Extraction for Recurrent Networks
A survey on rule extraction methods for recurrent networks categorizes them as (1) compositional approaches, in which rules are constructed based on the hidden layers (ensembles of hidden neurons) of a RNN; (2) decompositional approaches, where rules are constructed based on individual neurons; (3) pedagogical approaches, which construct rules by regarding the target RNN as a black box and have no access to the inner state of this RNN; and (4) eclectic approaches, which represent a hybrid of decompositional and pedagogical approaches. Most of the aforementioned approaches conduct rule extraction in a post hoc manner. That is, rule extraction is performed with an already trained RNN and a data set containing samples to be processed by this RNN.
Using DFAs as the rules extracted from RNNs [13, 11, 12, 8, 9] has been very common. In these studies, a RNN is viewed as representing the state transition diagram of a state process, (input, state) → next state, of a DFA. Correspondingly, a DFA extracted from a RNN can globally describe the behavior of this RNN. Recent work [7, 28, 29] proposes to extract instance-based rules. Specifically, individual rules are extracted from data instances and each extracted rule represents a pattern. As shown in the case of sentiment analysis, a pattern is a combination of important words identified from a sentence processed by the RNN to be interpreted. To construct a global rule set that describes the most important patterns learned by the target RNN, the extracted individual rules need to be aggregated using statistical methods. A rule set constructed in this manner usually lacks a formal representation and may not be suitable for conducting a more thorough analysis of the behaviors of a RNN. In this work, we follow previous work [13, 11, 12, 8, 9] and represent rules by DFAs.
Among all of the above mentioned approaches, both pedagogical and compositional approaches have been applied to extract DFAs from RNNs. In the former category, a recent work proposes to build a DFA by only querying the outputs of a RNN for certain inputs. This method can be effectively applied to regular languages with small alphabets; however, it does not scale to languages with a large alphabet. This is mainly because the method relies on the $L^{*}$ algorithm, which has polynomial complexity. As a result, the extraction process becomes extremely slow when a target RNN performs complicated analysis when processing sophisticated data. The compositional approaches are much more commonly adopted in previous studies [13, 32, 15, 8]. In these works, it is commonly assumed that the vector space of a RNN’s hidden layer can be approximated by a finite set of discrete states, where each rule refers to the transitions between states. As such, a generic compositional approach can be described by the following basic steps:
(1) Collect the values of a RNN’s hidden layers when processing every sequence at every time step. Then quantize the collected hidden values into different states.
(2) Use the quantized states and the alphabet-labeled arcs that connect these states to construct a transition diagram.
(3) Reduce the diagram to a minimal representation of state transitions.
Previous research has mostly focused on improving the quantization step. (For a more detailed discussion of the other two steps, please refer to our previous work and to the survey on rule extraction.) The efficacy of different quantization methods relies on the following hypothesis: the state space of a RNN that is well trained to learn a regular grammar should already be fairly well separated, with distinct regions that represent the corresponding states in some DFA. This hypothesis, if true, implies that much less effort is required for the quantization step. Indeed, various quantization approaches, including equipartition-based methods [13, 11] and clustering methods [33, 8], have been adopted and have demonstrated that this hypothesis holds for second-order recurrent networks.
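The three steps of the generic compositional approach can be sketched end to end on a toy example. The code below is a minimal illustration under strong assumptions: `toy_rnn_step` and `toy_rnn_accepts` are hypothetical stand-ins for a trained RNN (a hand-written two-dimensional update that tracks the parity of 1s), quantization uses a simple equipartition of each hidden dimension, and minimization uses Moore-style partition refinement. It is not the extraction procedure used in the experiments.

```python
from collections import Counter, defaultdict
from itertools import product

ALPHABET = "01"
H0 = (0.2, 0.8)  # hypothetical initial hidden state


def toy_rnn_step(h, symbol):
    """Hypothetical stand-in for a trained RNN update (tracks parity of 1s)."""
    a, b = (h[1], h[0]) if symbol == "1" else h
    return (0.9 * a + 0.1 * round(a), 0.9 * b + 0.1 * round(b))


def toy_rnn_accepts(h):
    return h[0] < 0.5


def quantize(h, bins=2):
    """Step 1: equipartition, mapping each dimension to one of `bins` intervals."""
    return tuple(min(int(v * bins), bins - 1) for v in h)


def extract_dfa(max_len=6, bins=2):
    """Steps 1 and 2: collect quantized states and build a transition diagram."""
    votes = defaultdict(Counter)         # (state, symbol) -> next-state counts
    accept_votes = defaultdict(Counter)  # state -> accept/reject counts
    for n in range(max_len + 1):
        for string in product(ALPHABET, repeat=n):
            h = H0
            q = quantize(h, bins)
            accept_votes[q][toy_rnn_accepts(h)] += 1
            for symbol in string:
                h = toy_rnn_step(h, symbol)
                q_next = quantize(h, bins)
                votes[(q, symbol)][q_next] += 1
                accept_votes[q_next][toy_rnn_accepts(h)] += 1
                q = q_next
    delta = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    finals = {q for q, c in accept_votes.items() if c.most_common(1)[0][0]}
    return quantize(H0, bins), delta, finals


def minimize(initial, delta, finals):
    """Step 3: merge equivalent states by Moore-style partition refinement."""
    states = {initial} | {s for s, _ in delta} | set(delta.values())
    block = {s: int(s in finals) for s in states}
    while True:
        signature = {
            s: (block[s],) + tuple(block.get(delta.get((s, a))) for a in ALPHABET)
            for s in states
        }
        ids = {sig: i for i, sig in
               enumerate(sorted(set(signature.values()), key=repr))}
        new_block = {s: ids[signature[s]] for s in states}
        if len(set(new_block.values())) == len(set(block.values())):
            break  # no block was split, so the partition is stable
        block = new_block
    min_delta = {(block[s], a): block[t] for (s, a), t in delta.items()}
    return block[initial], min_delta, {block[s] for s in finals}
```

Calling `minimize(*extract_dfa())` on this toy model yields a two-state automaton for the parity of 1s. With a real network, the hidden states would be collected from the trained model, and the choice of quantization method is the part that previous research has mostly focused on.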
III Problem Formulation
The different rule extraction approaches proposed in the previous work introduced in Section II essentially describe the process of developing or finding a rule that approximates the behaviors of a target RNN. In the following, we formalize the rule extraction problem.
Definition 1 (Rule Extraction Problem).
Given a RNN denoted as a function $f: X \rightarrow Y$, where $X$ is the data space and $Y$ is the target space, and a data set $D$ with samples $x \in X$ and $y \in Y$. Let $r$ denote a rule, which is also a function with its data and target space identical to those of $f$. The rule extraction problem is to find a function $T$ such that $T$ takes as input a RNN $f$ and a data set $D$, then outputs a rule $r$.
As introduced in Section I, in this study we aim to investigate whether and how DFA extraction is affected when it is applied to different recurrent networks trained on data sets with different levels of complexity. More specifically, in our case where the underlying data sets are generated by Tomita grammars, we denote by $D_G$ a data set generated by a grammar $G$, whose target space represents the space of labels for positive and negative strings recognized by $G$. Then in our evaluation framework, we fix the extraction method as the compositional approach (introduced in Section II-C) and evaluate the performance it obtains when its inputs, i.e. the data set $D_G$ and the RNN trained on $D_G$, vary across different grammars and different recurrent networks, respectively. (The data set $D_G$ is split into a training set and a test set, as typically done for supervised learning.) It is important to note that, by comparing the extraction performance obtained for a given model across different grammars, we examine how sensitive each model is to the underlying data in the DFA extraction problem.
According to the above definition, the performance of DFA extraction can be evaluated by measuring the quality of extracted DFAs. To be more specific, the extraction performance is evaluated by two metrics. The first metric is the accuracy of an extracted DFA when it is tested on the test set for a certain grammar. The second metric is the success rate of extracting DFAs that are identical to the ground truth DFA associated with a certain grammar and hence perform perfectly on the test set generated by this grammar. In other words, given a grammar, the success rate of a recurrent model measures how frequently the correct DFA associated with this grammar can be extracted. These metrics quantitatively measure the abilities of different recurrent networks for learning different grammars. In particular, the first metric reflects the abilities of different recurrent networks for learning “good” DFAs. Due to its generality, the first metric is also frequently adopted in other work [34, 35, 7, 9]. The second metric, which is more rigorous than the first, reflects the abilities of these models for learning correct DFAs. It is important to note that our evaluation framework is agnostic to the underlying extraction method, since we impose no constraint on the extraction function. As such, this evaluation framework can also be adopted for comparing different rule extraction methods, which will be included in our future work.
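The two metrics can be written down compactly. The sketch below is an illustration under assumptions: a DFA is represented as a hypothetical triple `(initial, delta, finals)` with a complete transition table, and “identical to the ground truth DFA” is interpreted as language equivalence, which for minimized complete DFAs coincides with identity up to a renaming of states. The example automata are made up for the demonstration.

```python
from collections import deque


def run_dfa(dfa, string):
    initial, delta, finals = dfa
    state = initial
    for symbol in string:
        state = delta[(state, symbol)]
    return state in finals


def accuracy(dfa, test_set):
    """First metric: fraction of (string, label) pairs classified correctly."""
    correct = sum(run_dfa(dfa, s) == label for s, label in test_set)
    return correct / len(test_set)


def equivalent(dfa_a, dfa_b, alphabet="01"):
    """Language equality, checked by breadth-first search on the product DFA."""
    (ia, da, fa), (ib, db, fb) = dfa_a, dfa_b
    seen, queue = {(ia, ib)}, deque([(ia, ib)])
    while queue:
        a, b = queue.popleft()
        if (a in fa) != (b in fb):
            return False  # some string is accepted by exactly one DFA
        for symbol in alphabet:
            nxt = (da[(a, symbol)], db[(b, symbol)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


def success_rate(extracted_dfas, ground_truth):
    """Second metric: share of trials whose extracted DFA is correct."""
    hits = sum(equivalent(d, ground_truth) for d in extracted_dfas)
    return hits / len(extracted_dfas)


# Made-up example: parity of 1s as ground truth, one correct and one wrong trial.
truth = (0, {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}, {0})
renamed = ("e", {("e", "0"): "e", ("e", "1"): "o",
                 ("o", "0"): "o", ("o", "1"): "e"}, {"e"})
wrong = (0, {(0, "0"): 0, (0, "1"): 0}, {0})
```

In this example `success_rate([renamed, wrong], truth)` is one half, since only the renamed automaton recognizes the same language as the ground truth. The accuracy metric tolerates partially wrong automata, while the success rate does not, which is why the second metric is the more rigorous of the two.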
IV The Complexity of Tomita Grammars
In this section, we analyze the complexity of Tomita grammars by defining two metrics, the entropy and the average edit distance, for regular grammars. In principle, these metrics are defined to measure how balanced the sets of positive and negative strings are, and the difference between these sets. Accordingly, a grammar with higher complexity has more balanced string sets and less difference between these sets.
Following prior work, we plot two example graphs for grammars 2 and 5 in Figure 2 to better illustrate the differences between Tomita grammars. (The plots for the other grammars are provided in the appendix.) For each grammar, every concentric ring of its plot reflects the distribution of its associated positive and negative strings with a certain length (ranging from 1 to 8). Specifically, in each concentric ring, we arrange all strings with the same length in lexicographic order. (We do not impose any constraint on the order of string arrangement; other orders, e.g. Gray codes, can also be selected for visualization.) As previously discussed in Section II-B, the percentages of positive (or negative) strings for different grammars are very different. In particular, for grammar 5, the number of positive strings is equal to that of negative strings when the length of strings is even. Empirically, a data set consisting of balanced positive and negative samples should be desirable for training a model. However, as will be shown in the evaluation part of this study, this may make the learning difficult. For instance, when there are equal numbers of positive and negative strings for grammar 5, flipping any binary digit of a string to its opposite (e.g. flipping a 0 to 1 or vice versa) converts any positive or negative string into a string with the opposite label. This implies that, in order to correctly recognize all positive strings for grammar 5, a RNN will need to handle such subtle changes. Moreover, since this change can happen to any digit, a RNN must account for all digits without neglecting any.
In the following, we introduce our defined entropy, followed by the definition of average edit distance. Then we show that these two metrics are closely related.
IV-A Entropy of Tomita Grammars
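Given an alphabet $\Sigma = \{0, 1\}$, we denote the collection of all strings of symbols from $\Sigma$ with length $N$ as $X^{N}$. For grammar $G$, let $X^{N}_{p}$ and $X^{N}_{n}$ represent the sets of positive and negative strings defined by $G$ in $X^{N}$, respectively. Then we have $X^{N} = X^{N}_{p} \cup X^{N}_{n}$. Let $m^{N}_{p}$ and $m^{N}_{n}$ denote the size of $X^{N}_{p}$ and $X^{N}_{n}$, i.e., $m^{N}_{p} = |X^{N}_{p}|$ and $m^{N}_{n} = |X^{N}_{n}|$, hence we have $m^{N}_{p} + m^{N}_{n} = 2^{N}$. The percentage of positive strings in $X^{N}$ is $\rho^{N} = m^{N}_{p} / 2^{N}$. To simplify the notation, here we use $m_{p}$ and $m_{n}$ to represent $m^{N}_{p}$ and $m^{N}_{n}$ respectively.
Assuming that all strings in $X^{N}$ are randomly distributed, we then denote the expected times of occurrence for an event $F$ (two consecutive strings having different labels) by $E[F]$. We have the following definition of entropy for regular grammars with a binary alphabet.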
Definition 2 (Entropy).
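Given a regular grammar $G$ with alphabet $\Sigma = \{0, 1\}$, its entropy is defined as:
$$H(G) = \limsup_{N \to \infty} H^{N}(G) = \limsup_{N \to \infty} \frac{1}{N} \log_2 E[F],$$
where $H^{N}(G)$ is the entropy calculated for strings with the length of $N$, and $E[F]$ is the expected number of occurrences of the event that two consecutive strings have different labels.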
From Definition 2, we have the following proposition:
A detailed proof is provided in Appendix A. It is easy to see that, following our definition, the value of entropy lies between 0 and 1. Based on the values of entropy for different grammars, we have the following theorem for categorizing regular grammars with a binary alphabet.
Theorem 1. Given any regular grammar $G$ with alphabet $\Sigma = \{0, 1\}$, it can be categorized into one of the following classes:
(a) Polynomial class. Any grammar $G$ in this class has the entropy $H(G) = 0$, if and only if the number of positive strings defined by $G$ has a polynomial form of $N$, i.e. $m_{p} \sim P(N)$;
(b) Exponential class. Any grammar $G$ in this class has the entropy $H(G) = \log_2 b \in (0, 1)$, if and only if the number of positive strings defined by $G$ has an exponential form of $N$ with the base less than 2, i.e. $m_{p} \sim \beta \cdot b^{N}$ where $b < 2$ and $\beta > 0$;
(c) Proportional class. Any grammar $G$ in this class has the entropy $H(G) = 1$, if and only if the number of positive strings defined by $G$ is proportional to $2^{N}$, i.e. $m_{p} \sim k \cdot 2^{N}$, where $k \in (0, 1)$.
Here $\sim$ indicates that some negligible terms are omitted when $N$ approaches infinity.
For Tomita grammars, we categorize grammars 1, 2 and 7 into the polynomial class, grammars 3 and 4 into the exponential class, and grammars 5 and 6 into the proportional class according to the values of their entropy. A detailed proof for Theorem 1 and the calculation of entropy for Tomita grammars are provided in Appendix B.
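For short strings the entropy at a fixed length can be estimated by enumeration. The sketch below rests on two assumptions that should be read as such: the expected number of label flips is computed for a uniformly random ordering of all strings of length $N$, which gives $2 m_p m_n / 2^{N}$ over the $2^{N} - 1$ adjacent pairs, and the entropy at length $N$ is taken to be $\frac{1}{N} \log_2$ of that expectation. The function names are illustrative.

```python
import math
from itertools import product


def expected_flips(m_pos, m_neg):
    """Expected number of adjacent pairs with different labels when
    m_pos + m_neg labelled strings are placed in a uniformly random order."""
    total = m_pos + m_neg
    # Each of the (total - 1) adjacent pairs differs with probability
    # 2 * m_pos * m_neg / (total * (total - 1)).
    return 2 * m_pos * m_neg / total


def entropy_at_length(accepts, length):
    """Assumed finite-length entropy: log2 of the expected flips, divided by N."""
    strings = ["".join(bits) for bits in product("01", repeat=length)]
    m_pos = sum(accepts(s) for s in strings)
    flips = expected_flips(m_pos, len(strings) - m_pos)
    return math.log2(flips) / length if flips > 0 else float("-inf")


# Membership tests assumed from the usual descriptions of grammars 1 and 5.
def grammar_1(s):
    return set(s) <= {"1"}


def grammar_5(s):
    return s.count("0") % 2 == 0 and s.count("1") % 2 == 0


for n in (8, 10, 12):
    print(n, entropy_at_length(grammar_1, n), entropy_at_length(grammar_5, n))
```

Under these assumptions the estimate for grammar 1 shrinks toward 0 as the length grows, while the estimate for grammar 5 at even lengths equals $(N - 1)/N$ and grows toward 1, in line with the polynomial and proportional classes.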
It should be noted that the concept of entropy has previously been introduced in the field of grammatical inference [36, 37]. The definition of entropy introduced in these studies is derived from information theory and is used to measure the relative entropy between stochastic regular grammars. An alternative definition of entropy that is closely related to our definition is introduced for measuring the “information capacity” of a wide class of shift spaces in symbolic dynamics . More formally, the definition of shift space is as follows.
Definition 3 (Shift Space).
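Given a full shift $\Sigma^{\mathbb{Z}}$, which is the collection of all bi-infinite sequences of symbols from $\Sigma$, denote a sequence as $x = (x_i)_{i \in \mathbb{Z}}$. A shift space $X$ is a subset of $\Sigma^{\mathbb{Z}}$ such that $X = X_{\mathcal{F}}$ for some collection $\mathcal{F}$ of blocks that are forbidden over $\Sigma$.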
An example shift space is the set of binary strings with no three consecutive 1’s, i.e. $X = X_{\mathcal{F}}$, where $\mathcal{F} = \{111\}$. This shift space describes the same set of strings accepted by grammar 4. The entropy of a shift space is as follows.
Definition 4 (Entropy of Shift Space).
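The entropy of a shift space $X$ is defined by:
$$h(X) = \lim_{N \to \infty} \frac{1}{N} \log_2 |\mathcal{B}_{N}(X)|,$$
where $|\mathcal{B}_{N}(X)|$ denotes the number of $N$-blocks in $X$.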
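As a small worked example of the shift-space entropy, the number of allowed blocks can be counted with a dynamic program over the most recent symbols. The sketch assumes the standard form of this entropy, $\frac{1}{N} \log_2$ of the number of $N$-blocks, and uses the forbidden block 111 mentioned above; the function names are illustrative.

```python
import math


def count_blocks_avoiding(forbidden, length, alphabet="01"):
    """Count strings of the given length that contain no forbidden block.

    Only the last (k - 1) symbols are remembered, where k is the length of
    the longest forbidden block, since a new occurrence must end at the
    symbol that was just appended.
    """
    keep = max(len(f) for f in forbidden) - 1
    counts = {"": 1}  # suffix -> number of allowed strings ending in it
    for _ in range(length):
        nxt = {}
        for suffix, c in counts.items():
            for a in alphabet:
                word = suffix + a
                if any(word.endswith(f) for f in forbidden):
                    continue
                key = word[-keep:] if keep else ""
                nxt[key] = nxt.get(key, 0) + c
        counts = nxt
    return sum(counts.values())


def block_entropy(forbidden, length):
    """Finite-length estimate of the shift-space entropy."""
    return math.log2(count_blocks_avoiding(forbidden, length)) / length


for n in (8, 16, 32, 64):
    print(n, count_blocks_avoiding(["111"], n), block_entropy(["111"], n))
```

The counts for the forbidden block 111 follow a tribonacci-type recurrence (7, 13, 24, 44, ... from length 3 onward), so the estimate settles strictly between 0 and 1. Note that this quantity depends only on the allowed blocks, not on how they are interleaved with the forbidden ones.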
In Definition 4, when the blocks and forbidden blocks of a shift space are regarded as the positive and negative samples of a grammar $G$, the shift space can then be viewed as representing the data space described by a regular grammar. Although both Definition 2 and Definition 4 can describe the complexity of $G$, Definition 2 is constructed by considering the distributions of both positive and negative strings, while Definition 4 only considers positive strings. This more informative nature of Definition 2 has various benefits. For instance, when training a RNN and assuming that a training set coarsely reflects the real distributions of positive and negative samples, then by calculating the entropy according to Definition 2, we can estimate the complexity of the entire data set. In addition, it is important to note that Definition 2 can be easily generalized to a multi-class classification case, as one can always calculate the expected number of flips from samples. More formally, for a $k$-class classification task with strings of length $N$, let $m_{i}$ denote the number of strings in the $i$th class. Then we have:
By substituting (4) in Definition 2, we are able to compute the entropy for this $k$-class classification task. As for Definition 4, it only defines the entropy in a one-versus-all manner. Moreover, it should be noted that not all regular grammars can have their data space represented as a shift space, especially grammars that lack the shift-invariant and closure properties. As such, Definition 4 cannot be applied for estimating the complexity of these grammars.
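The multi-class generalization can be illustrated with a short calculation. The sketch below is an assumption-laden illustration rather than the expression used in the paper: it computes the expected number of label flips for a uniformly random ordering of items from $k$ classes, which works out to $(M^{2} - \sum_i m_i^{2}) / M$ with $M = \sum_i m_i$, and checks this closed form against brute-force enumeration on small inputs.

```python
from itertools import permutations


def expected_flips_multiclass(class_sizes):
    """Expected number of adjacent pairs with different labels in a uniformly
    random ordering, given the number of items in each class."""
    total = sum(class_sizes)
    return (total ** 2 - sum(m * m for m in class_sizes)) / total


def brute_force_flips(class_sizes):
    """Average flip count over all orderings (only feasible for tiny inputs)."""
    labels = [c for c, m in enumerate(class_sizes) for _ in range(m)]
    orders = list(permutations(range(len(labels))))
    flips = sum(labels[a] != labels[b]
                for order in orders
                for a, b in zip(order, order[1:]))
    return flips / len(orders)


print(expected_flips_multiclass([2, 1, 2]), brute_force_flips([2, 1, 2]))
```

For two classes the closed form reduces to $2 m_p m_n / (m_p + m_n)$, so the binary case is recovered as a special case.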
IV-B Average Edit Distance of Tomita Grammars
We now formally define the average edit distance of regular grammars with a binary alphabet in order to measure the difference between the sets of positive and negative strings for a given grammar $G$. We first revisit the definition of edit distance, which measures the minimum number of operations needed to convert one string into another. In general, the operations include insertion, deletion and substitution of one symbol in a string. Since we fix the length of strings, we omit insertion and deletion and only consider substitution, i.e. flipping a “1” to “0” or vice versa in our case.
Definition 5 (Edit Distance).
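Given two strings $x$ and $y$ of the same length $N$, $x$ rewrites into $y$ in a one-step operation if $y$ can be obtained from $x$ by a single-symbol substitution, i.e. $x$ and $y$ differ in exactly one position.
Let $x \xrightarrow{k} y$ denote that $x$ rewrites into $y$ by $k$ operations of single-symbol substitution. Then the edit distance between $x$ and $y$, denoted by $d(x, y)$, is the smallest $k$ such that $x \xrightarrow{k} y$.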
Since we only consider single-symbol substitution, in our case the edit distance $d(x, y)$ is equal to the Hamming distance between $x$ and $y$. In the following, we expand Definition 5 to calculate the average edit distance between the set of positive strings and the set of negative strings for grammar $G$. Given a positive string $x$ and a negative string $y$, the edit distance between $x$ and all negative strings, and the edit distance between $y$ and all positive strings can be expressed as:
Then we have the following definition of average edit distance:
Definition 6 (Average Edit Distance).
Given a grammar , the average edit distance between the positive and negative strings defined by is:
calculates the average edit distance for strings with their length equal to . and denote and , respectively.
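A brute-force sketch may help to see how such a quantity behaves on short strings. It makes two assumptions that are not confirmed by the text above: the distance between a string and the opposite set is taken to be the minimum Hamming distance to any string in that set, and the two class-wise averages are combined with equal weight. The grammar membership tests are likewise assumed from the usual descriptions of grammars 1 and 5.

```python
from itertools import product


def hamming(x, y):
    """Edit distance under single-symbol substitution only."""
    return sum(a != b for a, b in zip(x, y))


def average_edit_distance(accepts, length):
    """Assumed form: mean over each class of the distance to the nearest
    string of the opposite class, averaged over the two classes."""
    strings = ["".join(bits) for bits in product("01", repeat=length)]
    pos = [s for s in strings if accepts(s)]
    neg = [s for s in strings if not accepts(s)]
    if not pos or not neg:
        return None  # undefined when one of the two sets is empty
    d_pos = sum(min(hamming(x, y) for y in neg) for x in pos) / len(pos)
    d_neg = sum(min(hamming(x, y) for x in pos) for y in neg) / len(neg)
    return (d_pos + d_neg) / 2


def grammar_1(s):
    return set(s) <= {"1"}


def grammar_5(s):
    return s.count("0") % 2 == 0 and s.count("1") % 2 == 0


for n in (4, 6, 8):
    print(n, average_edit_distance(grammar_1, n),
          average_edit_distance(grammar_5, n))
```

Under these assumptions the value for grammar 5 at even lengths is exactly 1, because a single flip always changes the label, while the value for grammar 1 grows with the string length. The enumeration compares every positive string with every negative string, which hints at why this metric is more expensive to compute than the entropy.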
Using Definition 6, we can categorize Tomita grammars into three classes that are identical to the classes previously introduced in Theorem 1. A detailed calculation of the average edit distance for each grammar is provided in Appendix C.
For grammars 1, 2 and 7, the average edit distance grows without bound as $N$ increases;
For grammar 3 and 4, ;
For grammars 5 and 6, the average edit distance is equal to 1.
IV-C Relationship Between Entropy and Average Edit Distance
By comparing the values of entropy and average edit distance defined for Tomita grammars, it is evident that these two metrics are closely related to each other. In particular, while the entropy of grammars 5 and 6 has the maximum value of 1, their average edit distance has the minimum value of 1. Additionally, the entropy of grammars 1, 2 and 7 has the minimum value of 0, while their average edit distance approaches infinity as $N$ increases.
More formally, recall that in (7) the two sums calculate the summed edit distance for all positive strings and all negative strings, respectively, when the length of strings is $N$. Then we have the following proposition.
A detailed proof is provided in Appendix D. Assuming that a random variable takes one of these two summed edit distances as its value, each with its own selection probability, then Proposition 2 calculates the expectation of this random variable, which is the expected summation of the edit distance between positive and negative strings for any grammar $G$.
In Table II, we show the values of entropy and average edit distance for each grammar, calculated by varying the length $N$ of generated strings from 8 to 14. Clearly, as $N$ increases, the entropy of grammars 1, 2 and 7 monotonically decreases while their average edit distance monotonically increases. For the other grammars, both their entropy and average edit distance change in the directions opposite to those for grammars 1, 2 and 7. In addition, the results shown in Table II also demonstrate the difference between entropy and average edit distance. Specifically, when only observing the entropy, it is difficult to distinguish grammars 3 and 4 from grammars 5 and 6. The difference between these two classes of grammars is more clearly demonstrated when comparing their average edit distance. In particular, the average edit distance of grammars 5 and 6 is constantly equal to 1, while the average edit distance for grammars 3 and 4 keeps increasing as $N$ increases. This indicates that the average edit distance reveals more information about a regular grammar than the entropy does. However, it is also important to note that the time and space cost of calculating the average edit distance can be significantly higher than that of calculating the entropy. Thus there is a trade-off between granularity and computational efficiency when using these two metrics.
In this section, we first present our experiment setup, including the data sets generated and configurations of the recurrent networks selected. We then introduce the DFA extraction procedure adopted in this study. Last, we provide experimental results and discussion.
V-A Data Set
We followed the approach introduced by [13, 8] to generate string sets for Tomita grammars. To be specific, we drew strings from the grammar specified in Table I and from an oracle generating random strings of 0s and 1s. The end of each string is set to the symbol 2, which represents the “stop” symbol (or end-token as in language modeling). The strings drawn from a grammar were designated as positive samples, while those from the random oracle were designated as negative samples. Note that we verified that each string from the random oracle is not in the string set represented by the corresponding grammar before treating it as a negative sample. It should be noted that each grammar in our experiments represents one set of strings with an unbounded size. As such, we restricted the length of the strings drawn to a certain range (listed in the column “Length” of Table I). In our experiments, we specify a lower bound on the string lengths to avoid training RNNs with empty strings. In order to use as many strings as possible to build the data sets, the lower bound should be set sufficiently small. We set the lower bound equal to the minimal number of states present in the corresponding DFA, and chose the upper bound to allow each state of a DFA to be visited at least twice. In particular, for grammars 1, 2 and 7, it can easily be checked that the data sets are very imbalanced between positive and negative samples. We therefore up-sampled positive strings for these three grammars when training RNNs.
We split the strings generated within the specified range of length for each grammar to build the training and testing sets according to the ratios listed in Table I. Both training and testing sets were used to train and test the RNNs accordingly, while only the testing sets were used for evaluating the extracted DFAs.
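The data preparation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the generation procedure of the study: strings are enumerated exhaustively within the length range instead of being drawn from a grammar and a random oracle, the split ratio `train_ratio` is a hypothetical value (the actual ratios are those of Table I), and positives are up-sampled by random repetition in the training split only.

```python
import random
from itertools import product

STOP = "2"  # "stop" symbol appended to the end of every string


def build_dataset(accepts, min_len, max_len, train_ratio=0.8, seed=0):
    """Label all binary strings in the length range, split, and up-sample."""
    rng = random.Random(seed)
    samples = [("".join(bits) + STOP, accepts("".join(bits)))
               for n in range(min_len, max_len + 1)
               for bits in product("01", repeat=n)]
    rng.shuffle(samples)
    cut = int(train_ratio * len(samples))
    train, test = samples[:cut], samples[cut:]

    # Up-sample positives (training split only) until both classes match.
    pos = [x for x in train if x[1]]
    neg = [x for x in train if not x[1]]
    if pos and len(pos) < len(neg):
        pos += [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    train = pos + neg
    rng.shuffle(train)
    return train, test


# Membership test assumed from the usual description of grammar 1 (1*).
def grammar_1(s):
    return set(s) <= {"1"}


train, test = build_dataset(grammar_1, min_len=3, max_len=6)
print(len(train), len(test))
```

Keeping the test split untouched by up-sampling means that the accuracy of an extracted DFA is measured on the original class balance.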
V-B Recurrent Networks Setup
Table III (columns: Model | Hidden Layer Update | Parameters)
Here we provide a unified view of the update activity of recurrent neurons for the recurrent networks used here. We investigated Elman-RNN, second-order RNN, MI-RNN, LSTM and GRU. These models were selected because they were frequently adopted either in previous work on DFA extraction or in recent work on processing sequential data.
A recurrent model consists of a hidden layer containing recurrent neurons (an individual neuron designated as $h_i$), and an input layer containing input neurons (each designated as $x_k$). We denote the values of its hidden layer at the $t$th and $(t+1)$th discrete times as $h^{t}$ and $h^{t+1}$. Then the hidden layer is updated by:
$$h^{t+1} = \phi(x^{t}, h^{t}, W),$$
where $\phi$ is the nonlinear activation function, and $W$ denotes the weight parameters, which modify the strength of interaction among input neurons, hidden neurons, output neurons and any other auxiliary units. In most recurrent networks, the weight parameters usually comprise two separate weights, i.e. $U$ and $V$. The input and the hidden value at the $t$th discrete time are then multiplied by $U$ and $V$, respectively. Due to space constraints, a detailed description of the hidden layer update for each model is presented in Table III.
Here we follow previous research, which mainly uses one of two activation functions, sigmoid and tanh, to build recurrent networks. We use both for Elman and second-order RNNs. We do not evaluate the impact of ReLU on DFA extraction even though it has been broadly applied in recurrent networks. This is because DFA extraction requires clustering of hidden vectors, and the unbounded range of the ReLU function, from 0 to infinity, makes this clustering less straightforward.
We used recurrent networks with approximately the same number of weight and bias parameters (shown in Table I). Specifically, with Elman-RNN and second-order RNN as examples, we denote the number of hidden neurons for these models as and , and denote the number of weight parameters of these models as and , respectively. Then we have and . By setting , we then determine the size of the hidden layer for each model.
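As an illustration of how hidden layer sizes can be matched, the sketch below counts parameters under explicitly assumed forms: an Elman-RNN with an input weight matrix, a recurrent weight matrix and a bias vector, and a second-order RNN with a third-order weight tensor and a bias vector. These counts and the function names are assumptions made for the example and may differ from the exact expressions used in the experiments.

```python
def elman_params(n_hidden, n_input):
    """Input weights (n_hidden x n_input), recurrent weights (n_hidden x n_hidden), bias."""
    return n_hidden * n_input + n_hidden * n_hidden + n_hidden


def second_order_params(n_hidden, n_input):
    """Weight tensor (n_hidden x n_hidden x n_input) and a bias vector."""
    return n_hidden * n_hidden * n_input + n_hidden


def match_elman_size(n_second_order, n_input):
    """Elman hidden size whose parameter count is closest to the second-order model's."""
    target = second_order_params(n_second_order, n_input)
    return min(range(1, target + 1),
               key=lambda n: abs(elman_params(n, n_input) - target))


n_input = 3  # one input neuron per symbol: 0, 1 and the stop symbol 2
print(second_order_params(10, n_input))                  # 310
print(match_elman_size(10, n_input))                     # 16
print(elman_params(match_elman_size(10, n_input), n_input))  # 320
```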
For each model, we use one-hot encoding to process the input symbols. With this configuration, the input layer is designed to contain a single input neuron for each character in the alphabet of the target language. Thus, only one input neuron is activated at each time step.
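A minimal sketch of one-hot input encoding together with a first-order (Elman) and a second-order hidden update is given below. It uses the commonly cited forms of these two updates; the index order of the weight tensor `W[i][j][k]`, the function names, and the toy weights are assumptions for illustration only.

```python
import math

ALPHABET = "012"  # binary symbols plus the stop symbol


def one_hot(symbol):
    """Return a vector with a single active input neuron for `symbol`."""
    return [1.0 if symbol == a else 0.0 for a in ALPHABET]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def elman_step(x, h, U, V, b, phi=sigmoid):
    """First-order update: h_i' = phi(sum_k U[i][k] x_k + sum_j V[i][j] h_j + b_i)."""
    return [
        phi(sum(U[i][k] * x[k] for k in range(len(x)))
            + sum(V[i][j] * h[j] for j in range(len(h)))
            + b[i])
        for i in range(len(h))
    ]


def second_order_step(x, h, W, b, phi=sigmoid):
    """Second-order update: h_i' = phi(sum_{j,k} W[i][j][k] h_j x_k + b_i)."""
    return [
        phi(sum(W[i][j][k] * h[j] * x[k]
                for j in range(len(h)) for k in range(len(x)))
            + b[i])
        for i in range(len(h))
    ]


# Toy weights for a network with two hidden neurons (hypothetical values).
U = [[0.1, -0.2, 0.0], [0.3, 0.0, 0.1]]
V = [[0.0, 0.5], [-0.5, 0.0]]
b = [0.0, 0.0]
h = [0.0, 0.0]
for symbol in "101" + "2":  # input string followed by the stop symbol
    h = elman_step(one_hot(symbol), h, U, V, b)
print(h)
```

Because the input is one-hot, the second-order update effectively selects one hidden-to-hidden weight matrix per input symbol, which is the multiplicative interaction between input and hidden neurons.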
We follow a previously introduced approach and apply the following loss function to all recurrent networks:
$$L = \frac{1}{2}\left(y - h_0^{T}\right)^2.$$
This loss function can be viewed as selecting a special “response” neuron $h_0$ and comparing it to the label $y$. Here $h_0^{T}$ indicates the value of $h_0$ at time $T$, after a model receives the final input symbol. By using this simple loss function, we expect to eliminate the potential effect of adding an extra output layer and introducing more weight and bias parameters. Through this design, we can ensure that the knowledge learned by a model resides in the hidden layer and its transitions. During training, we optimize parameters through stochastic gradient descent and employ the RMSprop adaptive learning rate scheme.
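The response-neuron loss can be sketched as follows, assuming a squared-error form on the hidden neuron with index 0. The 0.5 decision threshold and the function names are illustrative assumptions.

```python
def response_loss(final_hidden, label, response_index=0):
    """Squared error between the response neuron's final value and the label."""
    return 0.5 * (label - final_hidden[response_index]) ** 2


def predict(final_hidden, response_index=0, threshold=0.5):
    """Accept (1) if the response neuron exceeds the threshold, else reject (0)."""
    return 1 if final_hidden[response_index] > threshold else 0


final_hidden = [0.9, 0.1, 0.4]  # hypothetical hidden vector after the stop symbol
print(response_loss(final_hidden, label=1), predict(final_hidden))
```

No output layer is involved: the label is read directly from one coordinate of the final hidden vector.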
V-C Procedure of DFA Extraction
Configuration of DFA Extraction
Recall the basic procedure for DFA extraction introduced in Section II. Here we specify our configurations for each step as follows:
By collecting all hidden vectors computed by a RNN on all strings from a data set generated as previously discussed, we quantize the continuous space of hidden vectors into a discrete space consisting of a finite set of states. In most previous research, this quantization is usually implemented with clustering methods, including k-means clustering [33, 41, 32][42], and self-organizing maps. Here, we use k-means due to its simplicity and computational efficiency.
With each hidden vector assigned to a unique cluster, we construct a state transition table. In prior studies, this is done by breadth-first search (BFS) or by sampling approaches. A survey of these methods shows that BFS approaches can construct a transition table relatively consistently but incur a high computational cost as the size of the alphabet increases. Compared with BFS approaches, sampling approaches are more computationally efficient. However, they may introduce inconsistencies when constructing a transition table. To achieve a trade-off between computational efficiency and construction consistency, we follow [44, 8] and count the number of transitions observed between states. We then preserve only the more frequently observed transitions.
With a transition diagram constructed, we apply a standard and efficient DFA minimization algorithm, which has been broadly adopted in previous work for minimizing DFAs extracted from different recurrent networks and in other DFA minimization tasks.
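The last two steps can be sketched as below, assuming the hidden vectors have already been assigned to clusters. The transition table keeps the most frequently observed successor for each (state, symbol) pair. For minimization, the sketch uses simple Moore-style partition refinement on a complete DFA in place of whichever standard algorithm was used in the experiments, and it does not remove unreachable states. The trace format and function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_transition_table(traces):
    """Keep, for each (state, symbol), the most frequently observed next state.

    `traces` is a list of (symbols, states) pairs, where `states[t]` is the
    cluster index of the hidden vector before reading `symbols[t]`.
    """
    counts = defaultdict(Counter)
    for symbols, states in traces:
        for t, sym in enumerate(symbols):
            counts[(states[t], sym)][states[t + 1]] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def minimize(states, alphabet, delta, accepting):
    """Moore-style partition refinement; returns a map from state to block index."""
    block_of = {s: int(s in accepting) for s in states}
    while True:
        signature = {
            s: (block_of[s],) + tuple(block_of[delta[(s, a)]] for a in alphabet)
            for s in states
        }
        ids = {sig: i for i, sig in enumerate(sorted(set(signature.values())))}
        new_block_of = {s: ids[signature[s]] for s in states}
        if len(set(new_block_of.values())) == len(set(block_of.values())):
            return new_block_of
        block_of = new_block_of


# Toy cluster traces for the language 1*: cluster 0 accepts, 1 and 2 both reject.
traces = [
    ("110", [0, 0, 0, 1]),
    ("0101", [0, 1, 2, 1, 2]),
    ("001", [0, 1, 2, 2]),
    ("00", [0, 2, 1]),  # contains one infrequent transition (0, '0') -> 2
]
delta = build_transition_table(traces)
blocks = minimize(states=[0, 1, 2], alphabet="01", delta=delta, accepting={0})
print(delta[(0, "0")], len(set(blocks.values())))
```

In this toy case the infrequent transition is discarded by the majority count, and the two rejecting clusters are merged into a single state by the refinement step.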
Random Trials for DFA Extraction
In order to more comprehensively evaluate the performance of DFA extraction for different recurrent networks trained on different grammars, we perform multiple trials of DFA extraction for each RNN on every grammar. In particular, given a RNN and the data set generated by a grammar, we vary two factors: the initial value of the hidden vector of this RNN and the pre-specified value of $K$ (the number of clusters) for the k-means clustering used in DFA extraction. In our prior study, we empirically demonstrated that for a sufficiently well trained (100.0% accuracy on the training set) second-order RNN, the initial value of the hidden layer has a significant influence on the extraction performance when $K$ is set to small values. This impact is gradually alleviated as $K$ increases. We observe that when $K$ is sufficiently large, the influence of randomly initializing the hidden layer is negligible. As such, for every pair of a recurrent model and a grammar, we conducted 10 trials with random initialization of the hidden layer of that model (for each trial, we select a different seed for generating the initial hidden activations randomly). Within each trial, we train the recurrent model on the training set associated with the grammar until convergence. Then we apply DFA extraction to this model multiple times, with $K$ ranging from 3 to 15. In total we perform DFA extraction 130 times for each model on each grammar. We tested and recorded the accuracy of each extracted DFA using the same test set constructed for evaluating the corresponding recurrent model. The extraction performance is then evaluated based on the results obtained from these trials. Through this, we believe we alleviate the impact of different recurrent networks being sensitive to certain initial state settings and clustering configurations.
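The count of 130 extractions per model and grammar follows from 10 random initializations times 13 cluster settings (3 through 15). A small sketch of the trial grid, with hypothetical variable names:

```python
import itertools

seeds = range(10)              # one seed per random hidden-layer initialization
cluster_counts = range(3, 16)  # number of k-means clusters: 3, 4, ..., 15

trials = list(itertools.product(seeds, cluster_counts))
print(len(trials))  # 10 * 13 = 130
```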
V-D Comparison of the Quality of the Extracted DFAs
Figure 3: Mean and variance of the accuracy obtained by DFAs extracted from all models on grammars 2, 4 and 5. We denote second-order RNNs with sigmoid and tanh activation functions by 2nd-Sig and 2nd-Tanh. Similarly, Elman-RNNs with these two activation functions are denoted as Elman-Sig and Elman-Tanh, respectively.
In the following experiments, we evaluate and compare the quality of DFAs extracted from different RNNs trained on different Tomita grammars. All models are trained to achieve 100.0% accuracy on the training sets constructed for all grammars. Particularly, LSTM and GRU converge much faster than other models. This is as expected, since the data sets are rather simple in comparison with more sophisticated sequential data, e.g. natural language and programming code, on which LSTM and GRU have demonstrated impressive successes. The training results are omitted here due to space constraints.
Given a particular recurrent model and a grammar, the quality of extracted DFAs is evaluated by calculating, over multiple trials, the mean and variance of the accuracy obtained by the extracted DFAs. In Figure 3, we show the results for grammars 2, 4 and 5 (as representatives of the three categories of Tomita grammars introduced in Section IV). For each category of grammars, the results for the other grammars are similar to those obtained from these three representative grammars and are provided in the Appendix. In Figure 3, we observe that on grammars with lower complexity, i.e. grammars 2 and 4, different models behave similarly. Specifically, all models produce DFAs with gradually increasing accuracy and decreasing variance of accuracy, and eventually produce DFAs with accuracy near or equal to 100.0%. Random initialization of the hidden layer affects the quality of extracted DFAs only when $K$ is relatively small, and this effect is alleviated when $K$ is sufficiently large. In particular, among all extracted DFAs, those extracted from second-order RNNs achieve the highest accuracy with the lowest variance. Upon closer examination, we observe that the values of $K$ needed by second-order RNNs to extract DFAs with accuracy near or equal to 100.0% are smaller than those needed by other models.
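As a sketch of this evaluation, the accuracies of extracted DFAs can be grouped by the number of clusters and summarized over the random initializations. The use of the population variance, the record format, and the toy accuracy values are assumptions for illustration, not reported results.

```python
from collections import defaultdict
from statistics import mean, pvariance


def summarize(results):
    """results: iterable of (seed, k, accuracy); returns {k: (mean, variance)}."""
    by_k = defaultdict(list)
    for _seed, k, acc in results:
        by_k[k].append(acc)
    return {k: (mean(v), pvariance(v)) for k, v in sorted(by_k.items())}


# Made-up accuracies for two seeds and two cluster settings.
toy_results = [(0, 3, 0.8), (1, 3, 1.0), (0, 4, 1.0), (1, 4, 1.0)]
for k, (m, v) in summarize(toy_results).items():
    print(k, round(m, 3), round(v, 3))
```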
The quality of DFAs extracted from different recurrent models is rather diverse for more complex grammars. More specifically, on grammar 5, only second-order RNNs with sigmoid and tanh activation yield extracted DFAs that achieve 100.0% accuracy, while all other models fail. This suggests that all models other than the second-order RNN are sensitive to the complexity of the underlying grammars on which they are trained. In particular, DFAs extracted from Elman-RNN with sigmoid and tanh activation, LSTM, and GRU perform much worse than those extracted from other models. For the Elman-RNN, the worse extraction performance may be due to its simple recurrent architecture, which limits its ability to capture complicated symbolic knowledge. However, the worse results obtained by LSTM and GRU on grammar 5 are surprising. One possible explanation is that for a recurrent model with a more complicated update activity for its hidden layer, the vector space of its hidden layer may not be spatially separable. As a result, clustering methods based on Euclidean distance, including k-means, may not effectively identify different states. Instead, for LSTM and GRU, the gate units constructed in these models might function as decoders for recognizing states that are not spatially separated. Nevertheless, how to extract DFAs from these more complicated models remains an open question.
To better illustrate the influence of different grammars on the quality of extracted DFAs, Figure 4 plots the average accuracy of the 130 DFAs extracted from each model trained on each grammar, and the entropy of each grammar calculated by setting . We only show the results for second-order RNN and Elman-RNN with a sigmoid activation function, since the results obtained by these models with the tanh activation are usually worse. As shown in Figure 4, except for second-order RNN and MI-RNN, the average accuracy obtained by DFAs extracted from each model decreases as the entropy of the grammars increases. This result indicates that it is generally more difficult for recurrent models to learn a grammar with a higher level of complexity. In general, DFAs extracted from second-order RNNs have consistently higher accuracy across all grammars. This better performance of second-order RNNs on DFA extraction raises questions regarding the quadratic interaction between input and hidden layers used by this model, and whether such an interaction could improve DFA extraction for other models.
V-E Comparison of the Success Rate of DFA Extraction
In the following experiments, we evaluate and compare different models by their rate of success in extracting the correct DFAs associated with the Tomita grammars. The results are presented in Figure 5. Recall that the success rate of a recurrent model on a grammar is calculated as the percentage of extracted DFAs with 100.0% accuracy among all 130 DFAs. The vertical axis shows the success rate, ranging from 0.0% to 100.0% in increments of 10.0%. The horizontal axis is labeled with the index of each Tomita grammar.
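The success rate defined here can be computed as in the following sketch. The function name and the toy accuracy values are illustrative only; note that DFAs with accuracy close to but below 100.0% do not count as successes.

```python
def success_rate(accuracies):
    """Percentage of extracted DFAs whose test accuracy is exactly 100%."""
    return 100.0 * sum(acc == 1.0 for acc in accuracies) / len(accuracies)


# Made-up example: 13 correct DFAs and 117 nearly correct ones out of 130.
toy_accuracies = [1.0] * 13 + [0.99] * 117
print(success_rate(toy_accuracies))  # 10.0
```

This shows how a model can have a high average accuracy and still a low success rate.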
In Figure 5 we observe that the alignment between the overall variation of success rates and the changes in the grammars’ complexity is not as obvious as in Figure 4. This is expected, because when calculating the success rate, DFAs with accuracy close to but below 100.0% are excluded. Recall from the results shown in Figure 4 that there is a considerable number of “good” but not correct extracted DFAs. Despite this difference in the success rates obtained by different models across different grammars, we find that on grammars with lower complexity, all models are capable of producing correct DFAs. In particular, all models achieve much higher success rates on grammar 1. This may be because the DFA associated with grammar 1 has the fewest states (two states, as shown in Figure 6) and the simplest state transitions among all DFAs. Thus, the hidden vector space of all models is much easier to separate during training and to identify during extraction. The DFAs associated with the other grammars of lower complexity have both a larger number of states and more complicated state transitions. Consider grammar 2 for example. While this grammar has the same level of complexity as grammar 1, its associated DFA has both a larger number of states and more complicated state transitions, as shown in Figure 1. As such, the success rates obtained by most recurrent models on these grammars (except for grammar 1) rarely exceed 50.0%.
Another interesting observation is that the experimental results shown in this and the previous section both indicate that the performance of DFA extraction for second-order RNN is generally better than or comparable to that of other models on grammars with lower levels of complexity. For grammars with higher levels of complexity, the second-order RNN enables a much more accurate and stable DFA extraction. Also, generally speaking, MI-RNN provides the second best extraction performance. Recall that second-order RNN has a quadratic form of interaction between weights and neurons (introduced in Section III). The multiplicative form of interaction used in MI-RNN can be regarded as an approximation to the quadratic interaction used in second-order RNN. This may imply that the special forms of interaction adopted by second-order RNN and MI-RNN are more suitable for generating spatially separable states and representing the state transition diagrams. In particular, we observe that on grammars 5 and 6, which have the highest complexity, only second-order RNN and MI-RNN are able to provide correct DFAs through extraction, while all other recurrent models fail. It is also worth noting that the Elman-RNN, especially Elman-RNN with the tanh activation function, obtains the worst success rates on most grammars. This result implies that the choice of activation function may also affect DFA extraction; a study of activation effects is a problem that could be included in future work. In particular, on grammar 2, while the accuracy of DFAs extracted from Elman-RNN is close to 100.0% (as shown in Figure 2(a)), the success rate of Elman-RNN is only around 10%. As for LSTM and GRU, their success rates are consistent with the results shown in Section V-D.
VI Conclusion and Future Work
We conducted a careful experimental study on learning and extracting deterministic finite state automata (DFA) from different recurrent networks, in particular the Elman-RNN, second-order RNN, MI-RNN, LSTM and GRU, from the Tomita grammars. We observe that the second-order RNN provides the best and most stable performance of DFA extraction on Tomita grammars in general. In particular, on certain grammars, the performance of DFA extraction for second-order RNN is significantly better than other recurrent models. Our experiments also show that, for all models except for second-order RNN, their performance of DFA extraction varies significantly across different Tomita grammars. This inconsistency is explained through our analysis on the complexity of Tomita grammars. Specifically, we introduce two metrics – the entropy and average edit distance – for describing the complexity of regular grammars with binary alphabet. Based on our metrics, we categorize seven Tomita grammars into three classes where each class has similar complexity. The categorization is consistent with the results observed in the experiments.
We apply a generic compositional DFA extraction approach to all recurrent networks studied. Future work will include evaluating and comparing different DFA extraction approaches under the evaluation framework introduced in this study. We could also study and exploit the quadratic interaction used by second-order RNN when training recurrent networks, in order to combine desirable performance with reliable rule extraction. In addition, we intend to study the performance of DFA extraction on real-world applications. Another direction would be to see whether other activation functions, such as ReLU, can be used. It would also be interesting to explore this approach on large-scale grammar problems and to extend our theoretical analysis to more general grammars and those with larger alphabets.
We gratefully acknowledge partial support from the National Science Foundation.
-  J. Du, S. Zhang, G. Wu, J. M. F. Moura, and S. Kar, “Topology adaptive graph convolutional networks,” CoRR, vol. abs/1710.10370, 2017.
-  C. W. Omlin and C. L. Giles, “Symbolic knowledge representation in recurrent neural networks: Insights from theoretical models of computation,” Knowledge based neurocomputing, pp. 63–115, 2000.
-  M. L. Minsky, Computation: finite and infinite machines. Prentice-Hall, Inc., 1967.
-  R. V. Borges, A. S. d’Avila Garcez, and L. C. Lamb, “Learning and representing temporal knowledge in recurrent networks,” IEEE Trans. Neural Networks, vol. 22, no. 12, pp. 2409–2421, 2011.
-  T. Lin, B. G. Horne, P. Tiño, and C. L. Giles, “Learning long-term dependencies in NARX recurrent neural networks,” IEEE Trans. Neural Networks, vol. 7, no. 6, pp. 1329–1338, 1996.
-  B. Dhingra, Z. Yang, W. W. Cohen, and R. Salakhutdinov, “Linguistic knowledge as memory for recurrent neural networks,” CoRR, vol. abs/1703.02620, 2017.
-  W. J. Murdoch and A. Szlam, “Automatic rule extraction from long short term memory networks,” CoRR, vol. abs/1702.02540, 2017.
-  Q. Wang, K. Zhang, A. G. Ororbia II, X. Xing, X. Liu, and C. L. Giles, “An empirical evaluation of rule extraction from recurrent neural networks,” Neural Computation, vol. 30, no. 9, pp. 2568–2591, 2018.
-  G. Weiss, Y. Goldberg, and E. Yahav, “Extracting automata from recurrent neural networks using queries and counterexamples,” Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 5244–5253, 2018.
-  M. Cohen, A. Caciularu, I. Rejwan, and J. Berant, “Inducing regular grammars using recurrent neural networks,” CoRR, vol. abs/1710.10453, 2017.
-  C. W. Omlin and C. L. Giles, “Extraction of rules from discrete-time recurrent neural networks,” Neural Networks, vol. 9, no. 1, pp. 41–52, 1996.
-  M. Casey, “The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction,” Neural computation, vol. 8, no. 6, pp. 1135–1178, 1996.
-  C. L. Giles, C. B. Miller, D. Chen, H.-H. Chen, G.-Z. Sun, and Y.-C. Lee, “Learning and extracting finite state automata with second-order recurrent neural networks,” Neural Computation, vol. 4, no. 3, pp. 393–405, 1992.
-  R. L. Watrous and G. M. Kuhn, “Induction of finite-state languages using second-order recurrent networks,” Neural Computation, vol. 4, pp. 406–414, 1992.
-  H. Jacobsson, “Rule extraction from recurrent neural networks: A taxonomy and review,” Neural Computation, vol. 17, no. 6, pp. 1223–1263, 2005.
-  C. L. Giles, D. Chen, C. Miller, H. Chen, G. Sun, and Y. Lee, “Second-order recurrent neural networks for grammatical inference,” in Neural Networks, 1991., IJCNN-91-Seattle International Joint Conference on, vol. 2. IEEE, 1991, pp. 273–281.
-  J. F. Kolen, “Fool’s gold: Extracting finite state machines from recurrent network dynamics,” in Advances in neural information processing systems, 1994, pp. 501–508.
-  J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14, no. 2, pp. 179–211, 1990.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014, 2014, pp. 103–111.
-  Y. Wu, S. Zhang, Y. Zhang, Y. Bengio, and R. Salakhutdinov, “On multiplicative integration with recurrent neural networks,” in Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 2856–2864.
-  K. Li and J. C. Príncipe, “The kernel adaptive autoregressive-moving-average algorithm,” IEEE transactions on neural networks and learning systems, vol. 27, no. 2, pp. 334–346, 2016.
-  M. Tomita, “Dynamic construction of finite automata from example using hill-climbing,” Proceedings of the Fourth Annual Cognitive Science Conference, pp. 105–108, 1982.
-  N. Chomsky, “Three models for the description of language,” IRE Transactions on information theory, vol. 2, no. 3, pp. 113–124, 1956.
-  J. Du, S. Ma, Y. Wu, S. Kar, and J. M. F. Moura, “Convergence analysis of distributed inference with vector-valued gaussian belief propagation,” Journal of Machine Learning Research, vol. 18, pp. 172:1–172:38, 2017.
-  J. E. Hopcroft, R. Motwani, and J. D. Ullman, “Introduction to automata theory, languages, and computation - international edition (2. ed),” 2003.
-  C. L. Giles, G.-Z. Sun, H.-H. Chen, Y.-C. Lee, and D. Chen, “Higher order recurrent networks and grammatical inference,” in Advances in neural information processing systems, 1990, pp. 380–387.
-  W. J. Murdoch, P. J. Liu, and B. Yu, “Beyond word importance: Contextual decomposition to extract interactions from lstms,” CoRR, vol. abs/1801.05453, 2018.
-  C. Singh, W. J. Murdoch, and B. Yu, “Hierarchical interpretations for neural network predictions,” CoRR, vol. abs/1806.05337, 2018.
-  G. Weiss, Y. Goldberg, and E. Yahav, “Extracting automata from recurrent neural networks using queries and counterexamples,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80. PMLR, 10–15 Jul 2018, pp. 5247–5256.
-  D. Angluin, “Learning regular sets from queries and counterexamples,” Inf. Comput., vol. 75, no. 2, pp. 87–106, 1987.
-  M. Gori, M. Maggini, E. Martinelli, and G. Soda, “Inductive inference from noisy examples using the hybrid finite state filter,” IEEE Transactions on Neural Networks, vol. 9, no. 3, pp. 571–575, 1998.
-  Z. Zeng, R. M. Goodman, and P. Smyth, “Learning finite state machines with self-clustering recurrent networks,” Neural Computation, vol. 5, no. 6, pp. 976–990, 1993.
-  M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’: Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, 2016, pp. 1135–1144.
-  N. Frosst and G. E. Hinton, “Distilling a neural network into a soft decision tree,” in Proceedings of the First International Workshop on Comprehensibility and Explanation in AI and ML 2017 co-located with 16th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2017), Bari, Italy, November 16th and 17th, 2017, 2017.
-  R. C. Carrasco, “Accurate computation of the relative entropy between stochastic regular grammars,” RAIRO-Theoretical Informatics and Applications, vol. 31, no. 5, pp. 437–444, 1997.
-  F. Thollard, P. Dupont, C. de la Higuera et al., “Probabilistic DFA inference using Kullback-Leibler divergence and minimality,” in ICML, 2000, pp. 975–982.
-  D. Lind and B. Marcus, An introduction to symbolic dynamics and coding. Cambridge university press, 1995.
-  C. De la Higuera, Grammatical inference: learning automata and grammars. Cambridge University Press, 2010.
-  T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.
-  P. Frasconi, M. Gori, M. Maggini, and G. Soda, “Representation of finite state automata in recurrent radial basis function networks,” Machine Learning, vol. 23, no. 1, pp. 5–32, 1996.
-  A. Sanfeliu and R. Alquezar, “Active grammatical inference: a new learning methodology,” in Shape, Structure and Pattern Recognition, D. Dori and A. Bruckstein (eds.), World Scientific Pub. Citeseer, 1994.
-  P. Tiňo and J. Šajda, “Learning and extracting initial mealy automata with a modular neural network model,” Neural Computation, vol. 7, no. 4, pp. 822–844, 1995.
-  I. Schellhammer, J. Diederich, M. Towsey, and C. Brugman, “Knowledge extraction and recurrent neural networks: An analysis of an elman network trained on a natural language learning task,” in Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 1998, pp. 73–78.
-  M. Feinberg, “Fibonacci-tribonacci,” The Fibonacci Quarterly, vol. 1, no. 1, pp. 71–74, 1963.
Appendix A Proof of Proposition 1
Given any concentric ring shown in Figure 2, let denote the number of consecutive runs of strings and and denote the number of consecutive runs of positive strings and negative strings in this concentric ring respectively. Then we have . Without loss of generality, we can choose the first position as in the concentric ring. Then we introduce an indicator function by representing that a run of positive strings starts at the -th position and otherwise. Since , we have
As such, we have
By substituting into (1), we have
Appendix B Proof of Theorem 1
In both Definition 2 and Proposition 1, we use $\limsup$ to cover certain particular cases, for instance when $N$ is set to an odd value for grammar 5. In the following proof, without loss of generality, we use $\lim$ instead of $\limsup$ for simplicity.
According to Proposition 1, for any regular grammar with binary alphabet, its entropy . It can be checked that the maximum value of is when . Also, the minimum value of is and can be reached when or . However, or are only allowed for grammars that either accept or reject any binary string, and hence are not considered in this theorem. As such, in our case, we take the value of entropy as minimum when or . In the following, we only discuss the former case due to space limits; the latter can be derived similarly.
For each class of grammars, given that their takes the corresponding form shown in Theorem 1, the proof for the sufficient condition is trivial and can be checked by applying L’Hospital’s Rule. As such, in the following we only provide a proof for the necessary condition.
From (9), we have:
It is easy to check that exists for regular grammars, then we separate the above equation as follows:
It should be noted that the second term in the above equation equals . Specifically, assuming that has the form of where ( cannot be larger than 2 for binary alphabet), then the denominator of the second term is infinity. If has the form of , then the numerator tends to zero while the denominator is finite. As such, we have
If , then we have , indicating that the dominant part of has a polynomial form of hence .
If , then we have , which gives that , where . If , then we have where . Furthermore, if , we have where .
Here we calculate the for grammars 4, 5 and 7, which fall into each of the three classes of grammars, respectively.
, where and . The calculation is similar to calculating Tribonacci number ;
. When is odd/even, /;
We can classify all positive strings associated with grammar 7 into groups: , , , , , and , where indicates 1 or more repetitions. By simple combinatorics, we have .
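The number of positive strings of a given length for these three grammars can be checked by brute force. The sketch below uses the standard definitions of Tomita grammars 4, 5 and 7; the recurrence and closed forms it compares against follow from those definitions by elementary counting and are stated here as a cross-check, not as expressions copied from the text above. All names are hypothetical.

```python
import re
from itertools import product


def count_positive(accepts, n):
    """Count binary strings of length n accepted by the membership test."""
    return sum(accepts("".join(bits)) for bits in product("01", repeat=n))


g4 = lambda s: "000" not in s                                     # no substring 000
g5 = lambda s: s.count("0") % 2 == 0 and s.count("1") % 2 == 0    # even 0s and even 1s
g7 = lambda s: re.fullmatch(r"0*1*0*1*", s) is not None           # 0*1*0*1*


def tribonacci_like(n):
    """a(n) = a(n-1) + a(n-2) + a(n-3) with a(1), a(2), a(3) = 2, 4, 7."""
    a = [2, 4, 7]
    while len(a) < n:
        a.append(a[-1] + a[-2] + a[-3])
    return a[n - 1]


for n in range(1, 11):
    # Grammar 4: Tribonacci-type recurrence (exponential growth with base < 2).
    assert count_positive(g4, n) == tribonacci_like(n)
    # Grammar 5: half of all strings when n is even, none when n is odd.
    assert count_positive(g5, n) == (2 ** (n - 1) if n % 2 == 0 else 0)
    # Grammar 7: a cubic polynomial in n.
    assert count_positive(g7, n) == (n ** 3 + 5 * n + 6) // 6

print([count_positive(g4, n) for n in range(1, 6)])
```

The three growth patterns (exponential with base below 2, exponential with base 2, and polynomial) correspond to the three classes of grammars discussed in the theorem.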
Appendix C Calculation of the Average Edit Distance of Tomita Grammars
For grammar 3 and 4, their average edit distance . Take grammar 3 as an example, it is easy to check that for any . For a negative string or , where denotes any substring that is recognized by grammar 3, and denotes any string that is rejected by grammar 3. The minimum substitutions required to convert into any , depends on the number of occurrences of , which can be larger than 1. As such, , hence .
For grammar 5 and 6, their average edit distance . Specifically, for grammar 5, we only consider the case when is even, otherwise there are no positive strings hence is empty. Given that is even, it is clear that and . Then we have . For grammar 6, it is easy to check that for any , and
Then we have when .
Appendix D Proof of Proposition 2
It is clear to show that
As such we have
where . Since the sequence is bounded and converges to zero, we have .