We address the problem of learning a probabilistic deterministic finite automaton (PDFA) from a trained recurrent neural network (RNN) Elman (1990). RNNs, and in particular their gated variants GRU Cho et al. (2014); Chung et al. (2014) and LSTM Hochreiter and Schmidhuber (1997), are well known to be very powerful for sequence modelling, but are not interpretable. PDFAs, which explicitly list their states, transitions, and weights, are more interpretable than RNNs Hammerschmidt et al. (2016), while still being analogous to them in behaviour: both emit a single next-token distribution from each state, and have deterministic state transitions given a state and token. They are also much faster to use than RNNs, as their sequence processing does not require matrix operations.
We present an algorithm for reconstructing a PDFA from any given black-box distribution over sequences, such as an RNN trained with a language modelling objective (LM-RNN). The algorithm is applicable for reconstruction of any weighted deterministic finite automaton (WDFA), and is guaranteed to return a PDFA when the target is stochastic – as an LM-RNN is.
Weighted Finite Automata (WFA) A WFA is a weighted non-deterministic finite automaton, capable of encoding language models but also other, non-stochastic weighted functions. Ayache et al. (2018) and Okudono et al. (2019) show how to apply spectral learning Balle et al. (2014) to an LM-RNN to learn a WFA approximating its behaviour.
Probabilistic Deterministic Finite Automata (PDFAs) are a weighted variant of DFAs where each state defines a categorical next-token distribution. Processing a sequence in a PDFA is simple: input tokens are processed one by one, getting the next state and probability for each token by table lookup.
WFAs are non-deterministic and so not immediately analogous to RNNs. They are also slower to use than PDFAs, as processing each token in an input sequence requires a matrix multiplication. Finally, spectral learning algorithms are not guaranteed to return stochastic hypotheses even when the target is stochastic – though this can be remedied by using quadratic weighted automata Bailly (2011) and normalising their weights. For these reasons we prefer PDFAs over WFAs for RNN approximation. Formally:
Problem Definition Given an LM-RNN $R$, find a PDFA $W$ approximating $R$, such that for any prefix $p$ its next-token distributions in $W$ and in $R$ have low total variation distance between them.
Existing works on PDFA reconstruction assume a sample-based paradigm: the target cannot be queried explicitly for a sequence’s probability or conditional probabilities Clark and Thollard (2004); Carrasco and Oncina (1994); Balle et al. (2013). As such, these methods cannot take full advantage of the information available from an LM-RNN. (It is possible to adapt these methods to an active learning setting, in which they may explicitly query an oracle for the target for exact probabilities. However, this raises other questions: on which suffixes are prefixes compared? If each has its own set, this requires calculation from every other prefix during comparison. How does one pool the probabilities of two prefixes when merging them? We leave such an adaptation to future work.) Meanwhile, most work on the extraction of finite automata from RNNs has focused on “binary” deterministic finite automata (DFAs) Omlin and Giles (1996); Cechin et al. (2003); Wang et al. (2017); Weiss et al. (2018); Mayr and Yovine (2018), which cannot fully express the behaviour of an LM-RNN.
Our Approach Following the successful application of L* Angluin (1987) to RNNs for DFA extraction Weiss et al. (2018), we develop an adaptation of L* for the weighted case. The adaptation returns a PDFA when applied to a stochastic target such as an LM-RNN. It interacts with an oracle using two types of queries:
Membership Queries: requests to give the target probability of the last token in a sequence.
Equivalence Queries: requests to accept or reject a hypothesis PDFA, returning a counterexample — a sequence for which the hypothesis automaton and the target language diverge beyond the tolerance on the next token distribution — if rejecting.
The algorithm alternates between filling an observation table with observations of the target behaviour, and presenting minimal PDFAs consistent with that table to the oracle for equivalence checking. This continues until an automaton is accepted. The use of conditional probabilities in the observation table prevents the observations from vanishing to 0 on low-probability sequences. To the best of our knowledge, this is the first work on learning PDFAs from RNNs.
A key insight of our adaptation is the use of an additive variation tolerance $t \in [0,1]$ when comparing rows in the table. In this framework, two probability vectors are considered $t$-equal if their probabilities for each event are within $t$ of each other. Using this tolerance enables us to extract a much smaller PDFA than the original target, while still making locally similar predictions to it on any given sequence. This is necessary because RNN states are real valued vectors, making the potential number of reachable states in an LM-RNN unbounded. The tolerance is non-transitive, making construction of PDFAs from the table more challenging than in L*. Our algorithm suggests a way to address this.
Even with this tolerance, reaching equivalence may take a long time for large target PDFAs, and so we design our algorithm to allow anytime stopping of the extraction. The method allows the extraction to be limited while still maintaining certain guarantees on the reconstructed PDFA.
Note. While this paper only discusses RNNs, the algorithm itself is actually agnostic to the underlying structure of the target, and can be applied to any language model. In particular it may be applied to transformers Vaswani et al. (2017); Devlin et al. (2018). However, in this case the analogy to PDFAs breaks down.
The main contributions of this paper are:
An algorithm for reconstructing a WDFA from any given weighted target, and in particular a PDFA if the target is stochastic.
A method for anytime extraction termination while still maintaining correctness guarantees.
An implementation of the algorithm and an evaluation over extraction from LM-RNNs, including a comparison to other LM reconstruction techniques.
2 Related Work
Weiss et al. (2018) apply the exact learning algorithm L* Angluin (1987) to RNNs, successfully extracting deterministic finite automata (DFAs) from given binary-classifier RNNs. This work expands on this by adapting L* to extract PDFAs from LM-RNNs. To apply exact learning to RNNs, one must implement equivalence queries: requests to accept or reject a hypothesis. Okudono et al. (2019) show how to adapt the equivalence query presented in Weiss et al. (2018) to the weighted case.
There exist many methods for PDFA learning, originally for acyclic PDFAs Rulot and Vidal (1988); Ron et al. (1998); Carrasco and Oncina (1999), and later for PDFAs in general Clark and Thollard (2004); Carrasco and Oncina (1994); Thollard et al. (2000); Palmer and Goldberg (2007); Castro and Gavaldà (2008); Balle et al. (2013). These methods split and merge states in the learned PDFAs according to sample-based estimations of their conditional distributions. Unfortunately, they require very large sample sets to succeed (e.g., Clark and Thollard (2004) requires ~13m samples for a PDFA with ).
Distributions over sequences can also be represented by WFAs, though these are non-deterministic. These can be learned using spectral algorithms, which use singular value decomposition (SVD) and matrices of observations from the target to build a WFA Bailly et al. (2009); Balle et al. (2014); Balle and Mohri (2015); Hsu et al. (2008). Spectral algorithms have recently been applied to RNNs to extract WFAs representing their behaviour Ayache et al. (2018); Okudono et al. (2019); Rabusseau et al. (2019); we compare to Ayache et al. (2018) in this work. The choice of observations used is also a focus of research in this field Quattoni et al. (2017).
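Sequences and Notations For a finite alphabet $\Sigma$, the set of finite sequences over $\Sigma$ is denoted by $\Sigma^*$, and the empty sequence by $\varepsilon$. For any $\Sigma$ and stopping symbol $\$ \notin \Sigma$, we denote $\Sigma_\$ \triangleq \Sigma \cup \{\$\}$, and $\Sigma^{*\$} \triangleq \Sigma^* \cdot \{\$, \varepsilon\}$ – the set of $s \in \Sigma_\$^*$ where the stopping symbol may only appear at the end.
For a sequence $w \in \Sigma^*$, its length is denoted $|w|$, its concatenation after another sequence $u$ is denoted $u \cdot w$, its $i$-th element is denoted $w_i$, and its prefix of length $k \le |w|$ is denoted $w_{:k} = w_1 \cdot w_2 \cdots w_k$. We use the shorthand $w_{-1} \triangleq w_{|w|}$ and $w_{:-1} \triangleq w_{:|w|-1}$. A set of sequences $S \subseteq \Sigma^*$ is said to be prefix closed if for every $w \in S$ and $k \le |w|$, $w_{:k} \in S$. Suffix closedness is defined analogously.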
For any finite alphabet and set of sequences , we assume some internal ordering of the set’s elements to allow discussion of vectors of observations over those elements.
Probabilistic Deterministic Finite Automata (PDFAs) are tuples such that is a finite set of states, is the initial state, is the finite input alphabet, is the transition function and is the transition weight function, satisfying for every .
The recurrent application of to a sequence is denoted by , and defined: and for every , . We abuse notation to denote: for every . If for every there exists a series of non-zero transitions reaching a state with , then defines a distribution over as follows: for every , .
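The definition above can be illustrated with a minimal sketch of a PDFA as two lookup tables. The class name `PDFA`, the field names, the `STOP` symbol and the two-state toy automaton are all hypothetical choices made for illustration; they are not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

STOP = "$"  # hypothetical stopping symbol, assumed not to be in the alphabet


@dataclass
class PDFA:
    initial: int
    delta_q: Dict[Tuple[int, str], int]    # (state, token) -> next state
    delta_w: Dict[Tuple[int, str], float]  # (state, token or STOP) -> weight

    def state_after(self, seq):
        """Follow the deterministic transitions from the initial state."""
        q = self.initial
        for tok in seq:
            q = self.delta_q[(q, tok)]
        return q

    def next_token_distribution(self, prefix):
        """Weights of all tokens (and STOP) from the state reached by prefix."""
        q = self.state_after(prefix)
        return {tok: w for (s, tok), w in self.delta_w.items() if s == q}

    def sequence_probability(self, seq):
        """Product of the transition weights along seq, then the STOP weight."""
        q, prob = self.initial, 1.0
        for tok in seq:
            prob *= self.delta_w[(q, tok)]
            q = self.delta_q[(q, tok)]
        return prob * self.delta_w[(q, STOP)]


# Toy two-state PDFA over {a, b}; the weights leaving each state sum to 1.
toy = PDFA(
    initial=0,
    delta_q={(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1},
    delta_w={
        (0, "a"): 0.5, (0, "b"): 0.25, (0, STOP): 0.25,
        (1, "a"): 0.25, (1, "b"): 0.25, (1, STOP): 0.5,
    },
)
```

Processing a sequence needs only dictionary lookups, one per token, which is the property that makes PDFAs cheap to run compared with models requiring a matrix operation per token.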
Language Models (LMs) Given a finite alphabet , a language model over is a model defining a distribution over . For any , and , induces the following:
Prefix Probability: .
Last Token Probability: if , then and .
Last Token Probabilities Vector: if , .
Next Token Distribution: , defined: .
Variation Tolerance Given two categorical distributions $p$ and $q$, their total variation distance is defined $\delta(p, q) \triangleq \|p - q\|_\infty$, i.e., the largest difference in probabilities that they assign to the same event. Our algorithm tolerates some variation distance between next-token probabilities, as follows:
Two event probabilities are called -equal and denoted if . Similarly, two vectors of probabilities are called -equal and denoted if , i.e. if . For any distribution over , , and , we denote if , or simply if is clear from context. For any two language models over and , we say that are -consistent on if for every prefix of . We call the variation tolerance.
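As a small illustration of the tolerance, the sketch below compares probability vectors entry by entry and shows why the relation is not transitive. The function names and the example vectors are hypothetical, and the tolerance value 0.125 is chosen only because it is exactly representable in floating point.

```python
def total_variation(p, q):
    """Largest absolute difference between corresponding probabilities."""
    return max(abs(a - b) for a, b in zip(p, q))


def t_equal(p, q, t):
    """True if every pair of corresponding entries differs by at most t."""
    return total_variation(p, q) <= t


t = 0.125
a = [0.5, 0.5]
b = [0.625, 0.375]
c = [0.75, 0.25]

# a is within t of b, and b is within t of c, yet a is not within t of c.
chain_breaks = t_equal(a, b, t) and t_equal(b, c, t) and not t_equal(a, c, t)
```

Here `chain_breaks` evaluates to `True`, so grouping rows by this relation cannot be done with a simple equivalence-class construction.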
Oracles and Observation Tables Given an oracle , an observation table for is a sequence indexed matrix of observations taken from it, with the rows indexed by prefixes and the columns by suffixes . The observations are for every , . For any we denote , and for every the -th row in is denoted . In this work we use an oracle for the last-token probabilities of the target, for every , and maintain .
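A minimal sketch of such a table row follows, assuming the suffix set holds every single token including the stopping symbol, so that a row contains at least the prefix's full next-token distribution. The function `toy_target` is a hypothetical stand-in for an LM-RNN, and all names are illustrative.

```python
STOP = "$"  # hypothetical stopping symbol


def toy_target(prefix):
    """Stand-in for an LM-RNN: the next-token distribution depends only on
    the parity of the number of 'a' tokens seen so far."""
    if prefix.count("a") % 2 == 0:
        return {"a": 0.5, "b": 0.25, STOP: 0.25}
    return {"a": 0.25, "b": 0.25, STOP: 0.5}


def last_token_probability(next_token_dist, seq):
    """Probability of the last token of seq given everything before it."""
    if not seq:
        raise ValueError("undefined for the empty sequence")
    return next_token_dist(seq[:-1])[seq[-1]]


def table_row(next_token_dist, prefix, suffixes):
    """One observation-table row: an entry for each suffix appended to prefix."""
    return [last_token_probability(next_token_dist, prefix + s) for s in suffixes]


suffixes = [("a",), ("b",), (STOP,)]
row_after_a = table_row(toy_target, ("a",), suffixes)
```

Because each entry is a conditional probability of a single token, the entries do not shrink as prefixes and suffixes grow longer.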
Recurrent Neural Networks (RNNs) An RNN is a recursive parametrised function with initial state , such that is the state after time and is the input at time . A language model RNN (LM-RNN) over an alphabet is an RNN coupled with a prediction function , where is a vector representation of a next-token distribution. RNNs differ from PDFAs only in that their number of reachable states (and so number of different next-token distributions for sequences) may be unbounded.
4 Learning PDFAs with Queries and Counterexamples
In this section we describe the details of our algorithm. We explain why a direct application of L* to PDFAs will not work, and then present our non-trivial adaptation. Our adaptation does not rely on the target being stochastic, and can in fact be applied to reconstruct any WDFA from an oracle.
Direct application of L* does not work for LM-RNNs: L* is a polynomial-time algorithm for learning a deterministic finite automaton (DFA) from an oracle. It can be adapted to work with oracles giving any finite number of classifications to sequences, and can be naively adapted to a probabilistic target with finite possible next-token distributions by treating each next-token distribution as a sequence classification. However, this will not work for reconstruction from RNNs. This is because the set of reachable states in a given RNN is unbounded, and so also the set of next-token distributions. Thus, in order to practically adapt L* to extract PDFAs from LM-RNNs, we must reduce the number of classes L* deals with.
Variation Tolerance Our algorithm reduces the number of classes it considers by allowing an additive variation tolerance $t \in [0,1]$, and considering $t$-equality (as presented in Section 3) as opposed to actual equality when comparing probabilities. In introducing this tolerance we must handle the fact that it may be non-transitive: there may exist $a, b, c \in [0,1]$ such that $a \approx_t b$ and $b \approx_t c$, but $a \not\approx_t c$. (We could define a variation tolerance by quantisation of the distribution space, which would be transitive. However, this may be unnecessarily aggressive at the edges of the intervals.)
To avoid potentially grouping together all predictions on long sequences, which are likely to have very low probabilities, our algorithm observes only local probabilities. In particular, the algorithm uses an oracle that gives the last-token probability for every non-empty input sequence.
4.1 The Algorithm
The algorithm loops over three main steps: (1) expanding an observation table until it is closed and consistent, (2) constructing a hypothesis automaton, and (3) making an equivalence query about the hypothesis. The loop repeats as long as the oracle returns counterexamples for the hypotheses. In our setting, counterexamples are sequences after which the hypothesis and the target have next-token distributions that are not -equal. They are handled by adding all of their prefixes to .
Our algorithm expects last token probabilities from the oracle, i.e.: where is the target distribution. The oracle is not queried on , which is undefined. To observe the entirety of every prefix’s next-token distribution, is initiated with .
Step 1: Expanding the observation table
The observation table is expanded as in L* Angluin (1987), but with the definition of row equality relaxed. Precisely, it is expanded until:
Closedness For every and , there exists some such that .
Consistency For every such that , for every , .
The table expansion is managed by a queue initiated to , from which prefixes are processed one at a time as follows: If , and there is no s.t. , then is added to . If already, then it is checked for inconsistency, i.e. whether there exist s.t. but . In this case a separating suffix , is added to , such that now , and the expansion restarts. Finally, if then is updated with .
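The two conditions can be sketched as plain searches over the current prefix set. This is an illustrative sketch only: the row function is backed by a hypothetical three-state toy target, rows hold just the next-token distribution, and the queue-driven bookkeeping described above is omitted.

```python
from itertools import combinations

ALPHABET = ["a", "b"]

# Hypothetical toy target: states 0 and 2 share a distribution but behave
# differently after reading "a".
TRANS = {0: {"a": 1, "b": 0}, 1: {"a": 2, "b": 1}, 2: {"a": 2, "b": 2}}
DIST = {0: [0.5, 0.25, 0.25], 1: [0.25, 0.25, 0.5], 2: [0.5, 0.25, 0.25]}


def row(prefix):
    """Table row of a prefix: here, the distribution of the state it reaches."""
    q = 0
    for tok in prefix:
        q = TRANS[q][tok]
    return DIST[q]


def t_equal(u, v, t):
    return all(abs(x - y) <= t for x, y in zip(u, v))


def find_unclosed(prefixes, alphabet, row, t):
    """Return an extension p.tok whose row matches no prefix row, else None."""
    for p in prefixes:
        for tok in alphabet:
            ext = p + (tok,)
            if not any(t_equal(row(ext), row(q), t) for q in prefixes):
                return ext
    return None


def find_inconsistency(prefixes, alphabet, row, t):
    """Return (p1, p2, tok) where the rows of p1 and p2 are t-equal but the
    rows of p1.tok and p2.tok are not, else None."""
    for p1, p2 in combinations(prefixes, 2):
        if not t_equal(row(p1), row(p2), t):
            continue
        for tok in alphabet:
            if not t_equal(row(p1 + (tok,)), row(p2 + (tok,)), t):
                return p1, p2, tok
    return None


P = [(), ("a",), ("a", "a")]
```

With tolerance 0.1, the table holding only the empty prefix is not closed (the row of `("a",)` has no match), while `P` is closed but inconsistent: the empty prefix and `("a", "a")` have equal rows yet diverge after reading `"a"`, so a separating suffix would be needed.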
As in L*, checking closedness and consistency can be done in arbitrary order. However, if the algorithm may be terminated before the table is closed and consistent, it is better to process the queue in order of prefix probability (see section 4.2).
Step 2: PDFA construction
Intuitively, we would like to group equivalent rows of the observation table to form the states of the PDFA, and map transitions between these groups according to the table’s observations. The challenge in the variation-tolerating setting is that -equality is not transitive.
Formally, let be a partitioning (clustering) of , and for each let be the partition (cluster) containing . should satisfy:
Determinism For every , , : .
-equality (Cliques) For every and , .
For , , we denote the next-clusters reached from with , and . Note that satisfies determinism iff for every . Note also that the constraints are always satisfiable by the clustering
We present a 4-step algorithm to solve these constraints while trying to avoid excessive partitions (we describe our implementation of these stages in appendix C):
Initialisation: The prefixes are partitioned into some initial clustering according to the -equality of their rows, .
Determinism I: is refined until it satisfies determinism: clusters with tokens for which are split by next-cluster equivalence into new clusters.
Cliques: Each cluster is refined into cliques (with respect to -equality).
Determinism II: is again refined until it satisfies determinism, as in (2).
Note that refining a partitioning into cliques may break determinism, but refining into a deterministic partitioning will not break cliques. In addition, when only allowed to refine clusters (and not merge them), all determinism refinements are necessary. Hence the order of the last 3 stages.
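The clique and determinism refinements can be sketched as follows. This is a simplified illustration and not the implementation of appendix C: cliques are formed greedily, and the hypothetical `successor` function is assumed to map every prefix and token to a prefix that is already in the table.

```python
def t_equal(u, v, t):
    return all(abs(x - y) <= t for x, y in zip(u, v))


def split_into_cliques(cluster, rows, t):
    """Greedily split a cluster so that all members of a part are pairwise t-equal."""
    cliques = []
    for p in cluster:
        for c in cliques:
            if all(t_equal(rows[p], rows[q], t) for q in c):
                c.append(p)
                break
        else:
            cliques.append([p])
    return cliques


def refine_determinism(clusters, alphabet, successor):
    """Split clusters until all members of a cluster agree, for every token,
    on the cluster reached after that token. Clusters are only ever split."""
    changed = True
    while changed:
        changed = False
        index = {p: i for i, c in enumerate(clusters) for p in c}
        refined = []
        for c in clusters:
            groups = {}
            for p in c:
                signature = tuple(index[successor(p, tok)] for tok in alphabet)
                groups.setdefault(signature, []).append(p)
            changed = changed or len(groups) > 1
            refined.extend(groups.values())
        clusters = refined
    return clusters


# Hypothetical rows and successor map over three prefixes named p0, p1, p2.
rows = {"p0": [0.5, 0.5], "p1": [0.625, 0.375], "p2": [0.75, 0.25]}
next_prefix = {"p0": "p0", "p1": "p2", "p2": "p2"}

cliques = split_into_cliques(["p0", "p1", "p2"], rows, t=0.125)
final = refine_determinism(cliques, ["a"], lambda p, tok: next_prefix[p])
```

In this toy run the clique step yields `[["p0", "p1"], ["p2"]]`, and the determinism step then separates `p0` from `p1` because they lead to different clusters. Splitting a clique leaves cliques, which is why the determinism refinement can safely run last.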
Step 3: Answering Equivalence Queries
We sample the target LM-RNN and hypothesis PDFA a finite number of times, testing every prefix of each sample to see if it is a counterexample. If none is found, we accept . Though simple, we find this method to be sufficiently effective in practice. A more sophisticated approach is presented in Okudono et al. (2019).
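A sampling-based equivalence check of this kind can be sketched as below. The toy target, the single-state hypothesis, the alternation between the two models, the sample count and the length cap are all assumptions made for illustration.

```python
import random

STOP = "$"  # hypothetical stopping symbol


def parity_target(prefix):
    """Toy target: the distribution depends on the parity of 'a' tokens."""
    if prefix.count("a") % 2 == 0:
        return {"a": 0.5, "b": 0.25, STOP: 0.25}
    return {"a": 0.25, "b": 0.25, STOP: 0.5}


def one_state_hypothesis(prefix):
    """Toy hypothesis with a single state, wrong after an odd number of 'a's."""
    return {"a": 0.5, "b": 0.25, STOP: 0.25}


def sample_sequence(next_dist, rng, max_len=50):
    """Sample tokens until STOP is drawn or max_len is reached."""
    seq = []
    while len(seq) < max_len:
        dist = next_dist(tuple(seq))
        tokens = list(dist)
        tok = rng.choices(tokens, weights=[dist[x] for x in tokens])[0]
        if tok == STOP:
            break
        seq.append(tok)
    return tuple(seq)


def find_counterexample(target, hypothesis, t, n_samples=100, seed=0):
    """Return a prefix after which the two next-token distributions differ by
    more than t on some token, or None if no sampled prefix does."""
    rng = random.Random(seed)
    for i in range(n_samples):
        model = target if i % 2 == 0 else hypothesis  # sample both models
        seq = sample_sequence(model, rng)
        for k in range(len(seq) + 1):
            prefix = seq[:k]
            p, q = target(prefix), hypothesis(prefix)
            if any(abs(p[x] - q[x]) > t for x in p):
                return prefix
    return None
```

Returning `None` corresponds to accepting the hypothesis. Since only finitely many prefixes are checked, acceptance here is evidence of equivalence rather than proof of it.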
4.2 Practical Considerations
We present some methods and heuristics that allow a more effective application of the algorithm to large (with respect to , ) or poorly learned grammars.
Anytime Stopping In case the algorithm runs for too long, we allow termination before is closed and consistent, which may be imposed by size or time limits on the table expansion. If reaches its limit, the table expansion continues but stops checking consistency. If the time or limits are reached, the algorithm stops, constructing and accepting a PDFA from the table as is. The construction is unchanged up to the fact that some of the transitions may not have a defined destination, for these we use a “best cluster match” as described in section 4.2. This does not harm the guarantees on -consistency between and the returned PDFA discussed in Section 5.
Order of Expansion As some prefixes will not be added to under anytime stopping, the order in which rows are checked for closedness and consistency matters. We sort by prefix weight. Moreover, if a prefix being considered is found inconsistent w.r.t. some , then all such pairs are considered and the separating suffix , with the highest minimum conditional probability is added to .
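Processing prefixes in order of prefix weight can be sketched with a priority queue. This shows the ordering only, under the assumption of a hypothetical stateless `toy_prefix_probability`; the closedness and consistency checks performed on each prefix are left out.

```python
import heapq

TOKEN_PROB = {"a": 0.5, "b": 0.25}  # hypothetical per-token probabilities


def toy_prefix_probability(prefix):
    """Stand-in for the target's prefix probability."""
    prob = 1.0
    for tok in prefix:
        prob *= TOKEN_PROB[tok]
    return prob


def expansion_order(prefix_probability, alphabet, limit):
    """Return the first `limit` prefixes visited, most probable first."""
    heap = [(-prefix_probability(()), ())]  # negate: heapq is a min-heap
    visited = []
    while heap and len(visited) < limit:
        _, prefix = heapq.heappop(heap)
        visited.append(prefix)
        for tok in alphabet:
            ext = prefix + (tok,)
            heapq.heappush(heap, (-prefix_probability(ext), ext))
    return visited


order = expansion_order(toy_prefix_probability, ["a", "b"], limit=4)
```

Under a size or time limit, this ordering means the prefixes that reach the table are the ones the target is most likely to produce.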
Best Cluster Match
Given a prefix and set of clusters , we seek a best fit for . First we filter for the following qualities until one is non-empty, in order of preference: (1) is a clique w.r.t. -equality. (2) There exists some such that , and is not a clique. (3) There exists some such that . If no clusters satisfy these qualities, we remain with . From the resulting group of potential matches, the best match could be the cluster minimising , . In practice, we choose from arbitrarily for efficiency.
Suffix and Prefix Thresholds
Occasionally when checking the consistency of two rows , a separating suffix will be found that is actually very unlikely to be seen after or . In this case it is unproductive to add to . Moreover – especially as RNNs are unlikely to perfectly learn a probability of for some event – it is possible that going through will reach a large number of ‘junk’ states. Similarly when considering a prefix , if is very low then it is possible that it is the failed encoding of probability , and that all states reachable through are not useful.
We introduce thresholds and for both suffixes and prefixes. When a potential separating suffix is found from prefixes and , it is added to only if . Similarly, potential new rows are only added to if .
Finding Close Rows
We maintain in a KD-tree indexed by row entries , with one level for every column . When considering of a prefix , we use to get the subset of all potentially -equal prefixes. ’s levels are split into equal-length intervals, we find to work well.
Choosing the Variation Tolerance
In our initial experiments (on SPiCes 0-3), we used . The intuition was that given no data, the fairest distribution over the alphabet is the uniform distribution, and so this may also be a reasonable threshold for a significant difference between two probabilities. In practice, we found that often strongly differentiates states even in models with larger alphabets – except for SPiCe 1, where quickly accepted a model of size 1. A reasonable strategy for choosing the tolerance is to begin with a large one, and reduce it if equivalence is reached too quickly.
We note some guarantees on the extracted model’s qualities and relation to its target model. Formal statements and full proofs for each of the guarantees listed here are given in appendix A.
The model is guaranteed to be deterministic by construction. Moreover, if the target is stochastic, then the returned model is guaranteed to be stochastic as well.
If the algorithm terminates successfully (i.e., having passed an equivalence query), then the returned model is -consistent with the target on every sequence , by definition of the query. In practice we have no true oracle and only approximate equivalence queries by sampling the models, and so can only attain a probable guarantee of their relative -consistency.
-Consistency and Progress
No matter when the algorithm is stopped, the returned model is always -consistent with its target on every , where is the set of prefixes in the table . Moreover, as long as the algorithm is running, the prefix set is always increased within a finite number of operations. This means that the algorithm maintains a growing set of prefixes on which any PDFA it returns is guaranteed to be -consistent with the target. In particular, this means that if equivalence is not reached, at least the algorithm’s model of the target improves for as long as it runs.
6 Experimental Evaluation
We apply our algorithm to 2-layer LSTMs trained on grammars from the SPiCe competition Balle et al. (2016), adaptations of the Tomita grammars Tomita (1982) to PDFAs, and small PDFAs representing languages with unbounded history. The LSTMs have input dimensions - and hidden dimensions -. The LSTMs and their training methods are fully described in Appendix E.
Compared Methods We compare our algorithm to the sample-based method ALERGIA Carrasco and Oncina (1994), the spectral algorithm used in Ayache et al. (2018), and n-grams. An n-gram is a PDFA whose states are a sliding window of length $n-1$ over the input sequence, with transition function $\delta(w_1 \cdots w_{n-1}, \sigma) = w_2 \cdots w_{n-1} \cdot \sigma$. The probability of a token $\sigma$ from state $s$ is the MLE estimate $N(s \cdot \sigma) / N(s)$, where $N(w)$ is the number of times the sequence $w$ appears as a subsequence in the samples. For ALERGIA, we use the PDFA/DFA inference toolkit flexfringe Verwer and Hammerschmidt (2017).
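An n-gram model viewed as a PDFA can be sketched as a table of counts keyed by the sliding window. The sketch below is illustrative: it uses shorter windows at the start of a sequence instead of padding, appends a hypothetical stopping symbol to every sample, and normalises by the counts observed from each state.

```python
from collections import Counter, defaultdict


def train_ngram(samples, n, stop="$"):
    """Count, for every window of up to n-1 tokens, the tokens that follow it."""
    counts = defaultdict(Counter)
    for seq in samples:
        tokens = list(seq) + [stop]
        for i, tok in enumerate(tokens):
            state = tuple(tokens[max(0, i - (n - 1)):i])
            counts[state][tok] += 1
    return counts


def ngram_probability(counts, state, tok):
    """MLE estimate: count of tok after state, over all counts after state."""
    total = sum(counts[state].values())
    return counts[state][tok] / total if total else 0.0


def next_state(state, tok, n):
    """Deterministic transition: slide the window forward by one token."""
    window = state + (tok,)
    return window[max(0, len(window) - (n - 1)):]


bigram = train_ngram(["ab", "ab", "aa"], n=2)
```

For the three toy samples, the state `("a",)` is followed by `b` twice, `a` once and the stopping symbol once, so `ngram_probability(bigram, ("a",), "b")` is 0.5. The number of possible states grows with the alphabet size raised to the window length.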
Target Languages We train RNNs on a subset of the SPiCe grammars, covering languages generated by HMMs, and languages from the NLP, software, and biology domains. We train RNNs on PDFA adaptations of the Tomita languages Tomita (1982), made from the minimal DFA for each language by giving each of its states a next-token distribution as a function of whether it is accepting or not. We give a full description of the Tomita adaptations and extraction results in appendix D. As we show in (6.1), the n-gram models prove to be very strong competitors on the SPiCe languages. To this end, we consider three additional languages that need to track information for an unbounded history, and thus cannot be captured by any n-gram model. We call these UHLs (unbounded history languages).
UHLs 1 and 2 are PDFAs that cycle through 9 and 5 states with different next token probabilities. UHL 3 is a weighted adaptation of the 5th Tomita grammar, changing its next-token distribution according to the parity of the seen 0s and 1s. The UHLs are drawn in appendix D.
Extraction Parameters Most of the extraction parameters differ between the RNNs, and are described in the results tables (1, 2). For our algorithm, we always limited the equivalence query to samples. For the spectral algorithm, we made WFAs for all ranks , , and . For the n-grams we used all . For these two, we always show the best results for NDCG and WER. For ALERGIA in the flexfringe toolkit, we use the parameters symbol_count=50 and state_count=N, with N given in the tables.
Evaluation Measures We evaluate the extracted models against their target RNNs on word error rate (WER) and on normalised discounted cumulative gain (NDCG), which was the scoring function for the SPiCe challenge. In particular the SPiCe challenge evaluated models on , and we evaluate the models extracted from the SPiCe RNNs on this as well. For the UHLs, we use as they have smaller alphabets. We do not use probabilistic measures such as perplexity, as the spectral algorithm is not guaranteed to return probabilistic automata.
Word error rate (WER): The WER of model A against B on a set of predictions is the fraction of next-token predictions (most likely next token) that are different in A and B.
Normalised discounted cumulative gain (NDCG): The NDCG of A against B on a set of sequences scores A’s ranking of the top most likely tokens after each sequence , , in comparison to the actual most likely tokens given by B, . Formally:
For NDCG we sample the RNN repeatedly, taking all the prefixes of each sample until we have prefixes. We then compute the NDCG for each prefix and take the average. For WER, we take full samples from the RNN, and return the fraction of errors over all of the next-token predictions in those samples. An ideal WER and NDCG is and , we note this with in the tables.
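The two measures can be sketched as below. The NDCG shown assumes the usual SPiCe form of the score, in which the target's probabilities of the candidate's top-k tokens are discounted by log2 of the rank plus one and normalised by the same sum over the target's own top-k tokens. The function names and toy models are hypothetical.

```python
import math


def top_k(dist, k):
    """The k most likely tokens, most likely first (ties keep dict order)."""
    return sorted(dist, key=lambda tok: -dist[tok])[:k]


def wer(model_a, model_b, prefixes):
    """Fraction of prefixes on which the most likely next tokens differ."""
    errors = sum(top_k(model_a(p), 1) != top_k(model_b(p), 1) for p in prefixes)
    return errors / len(prefixes)


def ndcg(model_a, model_b, prefixes, k):
    """Average over prefixes of the discounted gain of A's top-k ranking,
    scored with B's probabilities and normalised by B's own top-k ranking."""
    total = 0.0
    for p in prefixes:
        truth = model_b(p)
        gain = sum(truth[tok] / math.log2(i + 2)
                   for i, tok in enumerate(top_k(model_a(p), k)))
        ideal = sum(truth[tok] / math.log2(i + 2)
                    for i, tok in enumerate(top_k(truth, k)))
        total += gain / ideal
    return total / len(prefixes)


def target_model(prefix):
    return {"a": 0.5, "b": 0.25, "$": 0.25}


def swapped_model(prefix):
    return {"a": 0.25, "b": 0.5, "$": 0.25}


prefixes = [(), ("a",), ("a", "b")]
```

A model identical to the target scores a WER of 0 and an NDCG of 1. The swapped model always predicts `b` first, giving a WER of 1 and, with k = 1, an NDCG of 0.5.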
6.1 Results and Discussion
Tables 1 and 2 show the results of extraction from the SPiCe and UHL RNNs, respectively. In them, we list our algorithm as WL* (Weighted L*). For the WFAs and n-grams, which are generated with several values of the rank and of n, we show the best scores for each metric. We list the size of the best model for each metric. We do not report the extraction times separately, as they are very similar: the majority of time in these algorithms is spent generating the samples or Hankel matrices.
For PDFAs and WFAs the size columns present the number of states; for the WFAs this is equal to the rank with which they were reconstructed. For n-grams the size is the number of table entries in the model, and the chosen value of n is listed in brackets. In the SPiCe languages, our algorithm did not reach equivalence, and used between 1 and 6 counterexamples for every language before being stopped – with the exception of SPiCe 1 with , which reached equivalence on a single state. The UHLs and Tomitas used 0-2 counterexamples each before reaching equivalence.
The SPiCe results show a strong advantage to our algorithm in most of the small synthetic languages (1-3), with the spectral extraction taking a slight lead on SPiCe 0. However, in the remaining SPiCe languages, the n-gram strongly outperforms all other methods. Nevertheless, n-gram models are inherently restricted to languages that can be captured with bounded histories, and the UHLs demonstrate cases where this property does not hold. Indeed, all the algorithms outperform the n-grams on these languages (Table 2).
| Language | Model | WER | NDCG | Time (h) | WER Size | NDCG Size |
|---|---|---|---|---|---|---|
| SPiCe 0 | WL* | 0.084 | 0.987 | 0.3 | 4988 | 4988 |
| | N-Gram | 0.096 | 0.991 | 0.8 | 1118 (n=6) | 1118 (n=6) |
| SPiCe 1 | WL* | 0.093 | 0.971 | 0.4 | 152 | 152 |
| | N-Gram | 0.337 | 0.897 | 0.8 | 8421 (n=4) | 421 (n=3) |
| SPiCe 2 | WL* | 0.08 | 0.972 | 0.8 | 962 | 962 |
| | N-Gram | 0.278 | 0.894 | 0.8 | 1111 (n=4) | 1111 (n=4) |
| SPiCe 3 | WL* | 0.327 | 0.928 | 1.0 | 675 | 675 |
| | N-Gram | 0.46 | 0.847 | 0.8 | 1111 (n=4) | 11110 (n=5) |
| SPiCe 4 | WL* | 0.301 | 0.829 | 0.7 | 4999 | 4999 |
| | N-Gram | 0.099 | 0.968 | 0.8 | 186601 (n=6) | 61851 (n=5) |
| SPiCe 6 | WL* | 0.593 | 0.644 | 2.5 | 5000 | 5000 |
| | N-Gram | 0.285 | 0.888 | 0.8 | 127817 (n=5) | 127817 (n=5) |
| SPiCe 7 | WL* | 0.626 | 0.642 | 0.5 | 4996 | 4996 |
| | N-Gram | 0.441 | 0.812 | 0.7 | 133026 (n=5) | 133026 (n=5) |
| SPiCe 9 | WL* | 0.503 | 0.721 | 0.5 | 4992 | 4992 |
| | N-Gram | 0.123 | 0.961 | 1.0 | 44533 (n=6) | 44533 (n=6) |
| SPiCe 10 | WL* | 0.651 | 0.593 | 0.9 | 4987 | 4987 |
| | N-Gram | 0.348 | 0.845 | 0.8 | 153688 (n=5) | 153688 (n=5) |
| SPiCe 14 | WL* | 0.442 | 0.716 | 0.8 | 4999 | 4999 |
| | N-Gram | 0.079 | 0.977 | 0.7 | 125572 (n=6) | 46158 (n=5) |
Our algorithm succeeds in perfectly reconstructing the target PDFA structure for each of the UHL languages, and giving it transition weights within the given variation tolerance (when extracting from the RNN and not directly from the original target, the weights can only be as good as the RNN has learned). The sample-based PDFA learning method, ALERGIA, achieved good WER and NDCG scores but did not manage to reconstruct the original PDFA structure. This may be improved by taking a larger sample size, though it comes at the cost of efficiency.
| Language | Model | WER | NDCG | Time (s) | WER Size | NDCG Size |
|---|---|---|---|---|---|---|
| UHL 1 | WL* | 0.0 | 1.0 | 15 | 9 | 9 |
| | N-Gram | 0.129 | 0.966 | 259 | 63 (n=6) | 63 (n=6) |
| UHL 2 | WL* | 0.0 | 1.0 | 73 | 5 | 5 |
| | N-Gram | 0.12 | 0.94 | 269 | 3859 (n=6) | 3859 (n=6) |
| UHL 3 | WL* | 0.0 | 1.0 | 55 | 4 | 4 |
| | N-Gram | 0.189 | 0.991 | 268 | 63 (n=6) | 63 (n=6) |
Tomita Grammars The full results for the Tomita extractions are given in Appendix D.
All of the methods reconstruct them with perfect or near-perfect WER and NDCG, except for n-grams, which sometimes fail. For each of the Tomita RNNs, our algorithm extracted and accepted a PDFA with identical structure to the original target in approximately 1 minute (the majority of this time was spent on sampling the RNN and hypothesis before accepting the equivalence query). These PDFAs had transition weights within the variation tolerance of the corresponding target transition weights.
On the effectiveness of n-grams
The n-gram models prove to be very strong competitors for many of the languages. Indeed, n-gram models are very effective for learning in cases where the underlying languages have strong local properties, or can be well approximated using local properties, which is rather common (see e.g., Sharan et al. (2016)). However, there are many languages, including ones that can be modeled with PDFAs, for which the locality property does not hold, as demonstrated by the UHL experiments.
As n-grams are merely tables of observed samples, they are very quick to create. However, their simplicity also works against them: the table grows exponentially in n and polynomially in the alphabet size. In the future, we hope that our algorithm can serve as a base for creating reasonably sized finite state machines that will be competitive on real world tasks.
We present a novel technique for learning a distribution over sequences from a trained LM-RNN. The technique allows for some variation between the predictions of the RNN’s internal states while still merging them, enabling extraction of a PDFA with fewer states than in the target RNN. It can also be terminated before completing, while still maintaining guarantees of local similarity to the target. The technique does not make assumptions about the target model’s representation, and can be applied to any language model – including LM-RNNs and transformers. It also does not require a probabilistic target, and can be directly applied to recreate any WDFA.
When applied to stochastic models such as LM-RNNs, the algorithm returns PDFAs, which are a desirable model for LM-RNN extraction because they are deterministic and therefore faster and more interpretable than WFAs. We apply it to RNNs trained on data taken from small PDFAs and HMMs, evaluating the extracted PDFAs against their target LM-RNNs and comparing to extracted WFAs and n-grams. When the LM-RNN has been trained on a small target PDFA, the algorithm successfully reconstructs a PDFA that has identical structure to the target, and local probabilities within tolerance of the target. For simple languages, our method is generally the strongest of all those considered. However, for natural languages n-grams maintain a strong advantage. Improving our method to be competitive on naturally occurring languages as well is an interesting direction for future work.
The authors wish to thank Rémi Eyraud for his helpful discussions and comments, and Chris Hammerschmidt for his assistance in obtaining the results with flexfringe. The research leading to the results presented in this paper is supported by the Israeli Science Foundation (grant No. 1319/16), and the European Research Council (ERC) under the European Union’s Seventh Framework Programme (FP7-2007-2013), under grant agreement no. 802774 (iEXTRACT).
-  (1987) Learning regular sets from queries and counterexamples. Inf. Comput. 75 (2), pp. 87–106. External Links: Cited by: §1, §2, §4.1.
-  (2018) Explaining black boxes on sequential data using weighted automata. In Proceedings of the 14th International Conference on Grammatical Inference, ICGI 2018, Wrocław, Poland, September 5-7, 2018, pp. 81–103. External Links: Cited by: §1, §2, §6.
Grammatical inference as a principal component analysis problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 33–40. Cited by: §2.
-  (2011) Quadratic weighted automata: spectral algorithm and likelihood maximization. In Proceedings of the Asian Conference on Machine Learning, C. Hsu and W. S. Lee (Eds.), Proceedings of Machine Learning Research, Vol. 20, pp. 147–163. External Links: Cited by: §1.
-  (2014) Spectral learning of weighted automata - A forward-backward perspective. Machine Learning 96 (1-2), pp. 33–63. External Links: Cited by: §1, §2.
-  (2013) Learning probabilistic automata: A study in state distinguishability. Theor. Comput. Sci. 473, pp. 46–60. External Links: Cited by: §1, §2.
-  (2016) Results of the sequence prediction challenge (spice): a competition on learning the next symbol in a sequence. In Proceedings of the 13th International Conference on Grammatical Inference, ICGI, pp. 132–136. External Links: Cited by: Appendix E, §2, §6.
-  (2015) Learning weighted automata. In Algebraic Informatics - 6th International Conference, CAI 2015, Stuttgart, Germany, September 1-4, 2015. Proceedings, pp. 1–21. External Links: Cited by: §2.
-  (1994) Learning stochastic regular grammars by means of a state merging method. In Grammatical Inference and Applications, R. C. Carrasco and J. Oncina (Eds.), Berlin, Heidelberg, pp. 139–152. External Links: Cited by: §1, §2, §6.
-  (1999) Learning deterministic regular grammars from stochastic samples in polynomial time. ITA 33 (1), pp. 1–20. External Links: Cited by: §2.
-  (2008) Towards feasible pac-learning of probabilistic deterministic finite automata. In Grammatical Inference: Algorithms and Applications, 9th International Colloquium, ICGI 2008, Saint-Malo, France, September 22-24, 2008, Proceedings, pp. 163–174. External Links: Cited by: §2.
State automata extraction from recurrent neural nets using k-means and fuzzy clustering. In Proceedings of the XXIII International Conference of the Chilean Computer Science Society, SCCC ’03, Washington, DC, USA, pp. 73–78. External Links: Cited by: §1.
On the properties of neural machine translation: encoder-decoder approaches. CoRR abs/1409.1259. External Links: Cited by: §1.
-  (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555. External Links: Cited by: §1.
-  (2004) PAC-learnability of probabilistic deterministic finite state automata. Journal of Machine Learning Research 5, pp. 473–497. External Links: Cited by: §1, §2.
-  (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. External Links: Cited by: §1.
-  (1990) Finding structure in time. Cognitive Science 14 (2), pp. 179–211. External Links: Cited by: §1.
-  (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pp. 226–231. External Links: Cited by: Appendix C.
-  (2001) A bit of progress in language modeling. Computer Speech & Language 15 (4), pp. 403–434. Cited by: §2.
-  (2016-11) Interpreting Finite Automata for Sequential Data. arXiv e-prints, pp. arXiv:1611.07100. External Links: Cited by: §1.
-  (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. External Links: Cited by: §1.
A spectral algorithm for learning hidden markov models. CoRR abs/0811.4413. External Links: Cited by: §2.
-  (2018) Regular inference on artificial neural networks. In Machine Learning and Knowledge Extraction - Second IFIP TC 5, TC 8/WG 8.4, 8.9, TC 12/WG 12.9 International Cross-Domain Conference, CD-MAKE 2018, Hamburg, Germany, August 27-30, 2018, Proceedings, pp. 350–369. External Links: Cited by: §1.
-  (2019) Weighted automata extraction from recurrent neural networks via regression on state spaces. External Links: Cited by: §1, §2, §2, §4.1.
-  (1996) Extraction of rules from discrete-time recurrent neural networks. Neural Networks 9 (1), pp. 41–52. External Links: Cited by: §1.
-  (2007) PAC-learnability of probabilistic deterministic finite state automata in terms of variation distance. Theor. Comput. Sci. 387 (1), pp. 18–31. External Links: Cited by: §2.
-  (2017) A maximum matching algorithm for basis selection in spectral learning. CoRR abs/1706.02857. External Links: Cited by: §2.
-  (2019-16–18 Apr) Connecting weighted automata and recurrent neural networks through spectral learning. In Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama (Eds.), Proceedings of Machine Learning Research, Vol. 89, pp. 1630–1639. External Links: Cited by: §2.
-  (1998) On the learnability and usage of acyclic probabilistic finite automata. J. Comput. Syst. Sci. 56 (2), pp. 133–152. External Links: Cited by: §2.
-  (2000) Two decades of statistical language modeling: where do we go from here?. Proceedings of the IEEE 88 (8), pp. 1270–1278. Cited by: §2.
An efficient algorithm for the inference of circuit-free automata. Syntactic and Structural Pattern Recognition, G. Ferraté, T. Pavlidis, A. Sanfeliu, and H. Bunke (Eds.), pp. 173–184. External Links: Cited by: §2.
-  (2016) Prediction with a short memory. CoRR abs/1612.02526. External Links: Cited by: §6.1, footnote 4.
Probabilistic DFA inference using kullback-leibler divergence and minimality. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pp. 975–982. Cited by: §2.
-  (1982) Dynamic construction of finite automata from examples using hill-climbing. In Proceedings of the Fourth Annual Conference of the Cognitive Science Society, Ann Arbor, Michigan, pp. 105–108. Cited by: §D.1, §6, §6.
-  (2017) Attention is all you need. CoRR abs/1706.03762. External Links: Cited by: §1.
-  (2017-Sep.) Flexfringe: a passive automaton learning package. In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 638–642. External Links: Cited by: §6.
-  (2014-07-01) PAutomaC: a probabilistic automata and hidden markov models learning competition. Machine Learning 96 (1), pp. 129–154. External Links: Cited by: §2.
-  (2017) An empirical evaluation of recurrent neural network rule extraction. CoRR abs/1709.10380. External Links: Cited by: §1.
-  (2018) Extracting automata from recurrent neural networks using queries and counterexamples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 5244–5253. External Links: Cited by: §1, §1, §2.
Appendix A Guarantees
We show that our algorithm returns a PDFA, and discuss the relation between the obtained PDFA and the target, both when anytime stopping is used and when it is not.
The algorithm returns a PDFA.
Let be the final clustering of achieved by the method in section 4.1. By construction, the algorithm returns a finite state machine with well-defined states, initial state, transition weights and stopping weights. We show that this machine is deterministic and probabilistic, i.e.:
(1) Deterministic: for every , is uniquely defined.
(2) Probabilistic: for every : , , and .
Proof of (1): By the final refinement of the clustering (Determinism II), and so by construction is assigned at most one value. If, and only if, , then is assigned some best available value. So is always assigned exactly one value.
Proof of (2): the values of and are weighted averages of probabilities, and so are also in [0, 1] themselves. They also sum to 1, as they are averages of distributions. Formally, for every :
where follows from the probabilistic behaviour of : for any . ∎
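The argument in the proof of (2) can be checked numerically. The following is a minimal sketch, assuming each prefix in a cluster contributes a next-token distribution (including a stopping symbol, written `$` here) and a positive weight; the names `weighted_average`, `dists` and `weights` are hypothetical and only illustrate the averaging step.

```python
def weighted_average(distributions, weights):
    """Weighted average of categorical distributions over the same keys."""
    total = sum(weights)
    keys = distributions[0].keys()
    return {
        k: sum(w * d[k] for d, w in zip(distributions, weights)) / total
        for k in keys
    }


# Two next-token distributions over tokens 'a', 'b' and a stop symbol '$'.
dists = [
    {"a": 0.5, "b": 0.3, "$": 0.2},
    {"a": 0.1, "b": 0.6, "$": 0.3},
]
weights = [0.75, 0.25]  # e.g. prefix weights of the cluster members (assumed)

state_distribution = weighted_average(dists, weights)

if __name__ == "__main__":
    print(state_distribution)
    print(sum(state_distribution.values()))
```

Each averaged value lies between the smallest and largest input value for that token, so it stays in [0, 1], and the total is the weighted average of the input totals, each of which is 1.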
We consider extraction using noise tolerance from some target . For the observation table at any stage, we denote the size of the largest set of pairwise -distinguishable rows .
Let be an automaton constructed by the algorithm, whether or not it was stopped ahead of time. Let be the observation table reached before making , be the clustering of attained when building from (i.e., the states of ), and denote . Denote the cluster for each prefix, i.e. for every . In addition, for every cluster , denote the prefix from which was defined when building .
We show that as the algorithm progresses, it defines a monotonically increasing group of sequences on which the target and the algorithm’s automata are -consistent, and that this group is .
is always prefix closed.
begins as , which is prefix closed. Only two operations add to : closedness and counterexamples. When adding from closedness, the new prefix added to is of the form for and so remains prefix closed. When adding from a counterexample , is added along with all of its prefixes, and so remains prefix closed. ∎
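The two ways of growing the prefix set described in this proof can be sketched as follows. This is an illustration only: it assumes sequences are Python strings whose characters are tokens, and the helper names `prefixes`, `is_prefix_closed`, `add_from_closedness` and `add_counterexample` are hypothetical.

```python
def prefixes(seq):
    """All prefixes of seq, from the empty sequence up to seq itself."""
    return [seq[:i] for i in range(len(seq) + 1)]


def is_prefix_closed(prefix_set):
    """True if every prefix of every member is also a member."""
    return all(q in prefix_set for p in prefix_set for q in prefixes(p))


def add_from_closedness(prefix_set, p, token):
    """Closedness step: extend an existing prefix p by a single token."""
    assert p in prefix_set
    prefix_set.add(p + token)


def add_counterexample(prefix_set, w):
    """Counterexample step: add w together with all of its prefixes."""
    prefix_set.update(prefixes(w))


P = {""}                        # start from the empty sequence only
add_from_closedness(P, "", "a")
add_counterexample(P, "abba")

if __name__ == "__main__":
    print(sorted(P, key=len))
    print(is_prefix_closed(P))
```

Both operations only ever add a sequence whose shorter prefixes are already present or are added at the same time, which is why the invariant is preserved.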
For every , , i.e. .
We show this by induction on the length of . For i.e. for , by definition of the recursive application of , and = by construction (in the algorithm). We assume correctness of the lemma for . Consider , , denote . By the prefix closedness of , , and so by the assumption . Now by the definition of , . By the construction of , is defined such that for every s.t. , and so in particular for , as ). This results in , as desired. ∎
For every and , .
By construction of , in particular by the clique requirement for the clusters of , all of the prefixes satisfy , and in particular for : (recall that is initiated to and never reduced). is defined as the weighted average of for each of these , and so it is also -equal to i.e. , as desired. ∎
Theorem A.5. For every , are -consistent on .
This concludes the proof that are always -consistent on . We now show that the algorithm increases every finite number of operations, beginning with a direct result from theorem A.5:
Corollary A.6. Every counterexample increases by at least one sequence.
Recall that counterexamples to proposed automata are sequences for which , and that they are handled by adding all their strict prefixes to . Assume by contradiction some counterexample for which does not increase. Then in particular , and by theorem A.5, , a contradiction. ∎
Always, . (i.e., every can only have had up to inconsistencies in its making.)
is initiated to , so its initial size is . is increased only following inconsistencies, i.e. cases in which there exist s.t. , but . Once some cause a suffix to be added to , by construction of the algorithm, for the remainder of the run (as is a suffix for which ). There are exactly pairs , and so that is the maximum number of times may have been increased in any run, giving the maximum size . ∎
(Note: If the -equality relation were transitive, it would be possible to obtain a linear bound in the size of . However, as it is not, it is possible that a separating suffix may be added to that separates and while leaving them both -equal to some other .)
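The non-transitivity mentioned in this note is easy to reproduce. The sketch below assumes that two vectors are t-equal when every pair of corresponding entries differs by at most t; this reading of the relation, the function name `t_equal`, and the example values are assumptions made for illustration.

```python
def t_equal(u, v, t):
    """Assumed tolerance relation: all coordinates differ by at most t."""
    return all(abs(a - b) <= t for a, b in zip(u, v))


t = 0.1
row_1 = [0.50, 0.50]
row_2 = [0.42, 0.58]
row_3 = [0.34, 0.66]

if __name__ == "__main__":
    print(t_equal(row_1, row_2, t))  # neighbours are within tolerance
    print(t_equal(row_2, row_3, t))  # neighbours are within tolerance
    print(t_equal(row_1, row_3, t))  # the two ends are not
```

Here the middle row is within tolerance of both others while the outer two are distinguishable, which is the situation in which a separating suffix splits one pair of rows and leaves each of them equal to a third.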
Corollary A.8 (Progress).
For as long as the algorithm runs, it strictly expands a group of sequences on which the automaton it returns is -consistent with its target .
From theorem A.5, is a group of sequences on which is always -consistent with . We show that is strictly expanding as the algorithm progresses, i.e. that within every finite number of operations, is increased by at least one sequence.
The algorithm can be split into four operations: searching for and handling an unclosed prefix, searching for and handling an inconsistency, building (and presenting) a hypothesis PDFA, and handling a counterexample. We show that each one runs in finite time, and that there cannot be infinitely many operations without increasing .
Finite Runtime of the Operations
Building : Finding and handling an unclosed prefix requires a pass over all , while comparing row values to – all finite as is finite (rows are also finite as is bounded by ’s size). Similarly finding and handling inconsistencies requires a pass over rows for all , also taking finite time.
Building an Automaton requires finding a clustering of satisfying the conditions and then a straightforward mapping of the transitions between these clusters. The clustering is built by one initial clustering (DBSCAN) over the finite set and then only refinement operations (without merges). As putting each prefix in its own cluster is a solution to the conditions, a satisfying clustering will be reached in finite time.
Counterexamples: Handling a counterexample requires adding at most new rows to . As is finite, this is a finite operation.
Finite Operations between Additions to P
Handling an unclosed prefix by construction increases , and as shown in corollary A.6, so does handling a counterexample. Building a hypothesis is followed by an equivalence query, after which the algorithm will either terminate or a counterexample will be returned (increasing ). Finally, by A.7, the number of inconsistencies between every increase of is bounded. ∎
Appendix B Example
We extract from the PDFA presented in B.1 using prefix and suffix thresholds and variation tolerance . We limit the number of samples per equivalence query to . This extraction will demonstrate both types of table expansions, both types of clustering refinements, and counterexamples. Notice that in our example, the state is -equal with respect to next-token distribution to both and , though they themselves are not -equal to each other.
Extraction begins by initiating the table with , and the queue with . We will pop from the queue in order of prefix weight, though this is not necessary when not considering anytime stopping. At this point the table is:
The first prefix considered is , which is already in . It is consistent simply because it is not similar to any other . However, it might not be closed. Its continuations are added to , to check its closedness later. is now .
Next is (which has prefix weight ). , which is not -equal to the only row in the table: . It follows that was not closed, and is added to . The table is now:
is also consistent simply as it has no -equal rows. Its continuations are added to