1 Introduction
We address the problem of learning a probabilistic deterministic finite automaton (PDFA) from a trained recurrent neural network (RNN) Elman (1990). RNNs, and in particular their gated variants GRU Cho et al. (2014); Chung et al. (2014) and LSTM Hochreiter and Schmidhuber (1997), are well known to be very powerful for sequence modelling, but are not interpretable. PDFAs, which explicitly list their states, transitions, and weights, are more interpretable than RNNs Hammerschmidt et al. (2016), while still being analogous to them in behaviour: both emit a single next-token distribution from each state, and have deterministic state transitions given a state and token. They are also much faster to use than RNNs, as their sequence processing does not require matrix operations.
We present an algorithm for reconstructing a PDFA from any given black-box distribution over sequences, such as an RNN trained with a language modelling objective (LM-RNN). The algorithm is applicable to the reconstruction of any weighted deterministic finite automaton (WDFA), and is guaranteed to return a PDFA when the target is stochastic – as an LM-RNN is.
Weighted Finite Automata (WFAs) A WFA is a weighted non-deterministic finite automaton, capable of encoding language models but also other, non-stochastic weighted functions. Ayache et al. (2018) and Okudono et al. (2019) show how to apply spectral learning Balle et al. (2014) to an LM-RNN to learn a WFA approximating its behaviour.
Probabilistic Deterministic Finite Automata (PDFAs) are a weighted variant of DFAs where each state defines a categorical next-token distribution. Processing a sequence in a PDFA is simple: input tokens are processed one by one, getting the next state and probability for each token by table lookup.
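To make the table-lookup claim concrete, here is a minimal Python sketch of PDFA sequence processing. The two-state automaton over {a, b} with stop symbol $ and its weights are illustrative, not taken from the paper:

```python
# Hypothetical toy PDFA: delta maps (state, token) -> next state;
# weights maps each state to its next-token distribution (including "$").
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
weights = {
    0: {"a": 0.5, "b": 0.3, "$": 0.2},
    1: {"a": 0.1, "b": 0.6, "$": 0.3},
}

def sequence_probability(tokens):
    """Probability the PDFA assigns to the sequence followed by the stop symbol."""
    state, prob = 0, 1.0
    for tok in tokens:                 # one table lookup per token, no matrix ops
        prob *= weights[state][tok]
        state = delta[(state, tok)]
    return prob * weights[state]["$"]  # weight of stopping in the final state
```

Each step is a dictionary lookup, which is the source of the speed advantage over an RNN's matrix multiplications.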
WFAs are non-deterministic and so not immediately analogous to RNNs. They are also slower to use than PDFAs, as processing each token in an input sequence requires a matrix multiplication. Finally, spectral learning algorithms are not guaranteed to return stochastic hypotheses even when the target is stochastic – though this can be remedied by using quadratic weighted automata Bailly (2011) and normalising their weights. For these reasons we prefer PDFAs over WFAs for RNN approximation. Formally:
Problem Definition Given an LM-RNN R, find a PDFA A approximating R, such that for any prefix w, the next-token distributions after w in A and in R have low total variation distance between them.
Existing works on PDFA reconstruction assume a sample-based paradigm: the target cannot be queried explicitly for a sequence’s probability or conditional probabilities Clark and Thollard (2004); Carrasco and Oncina (1994); Balle et al. (2013). As such, these methods cannot take full advantage of the information available from an LM-RNN.[1] Meanwhile, most work on the extraction of finite automata from RNNs has focused on “binary” deterministic finite automata (DFAs) Omlin and Giles (1996); Cechin et al. (2003); Wang et al. (2017); Weiss et al. (2018); Mayr and Yovine (2018), which cannot fully express the behaviour of an LM-RNN.

[1] It is possible to adapt these methods to an active learning setting, in which they may explicitly query an oracle for the target for exact probabilities. However, this raises other questions: on which suffixes are prefixes compared? (If each has its own set, this requires a calculation against every other prefix during comparison.) How does one pool the probabilities of two prefixes when merging them? We leave such an adaptation to future work.

Our Approach Following the successful application of L* Angluin (1987) to RNNs for DFA extraction Weiss et al. (2018), we develop an adaptation of L* for the weighted case. The adaptation returns a PDFA when applied to a stochastic target such as an LM-RNN. It interacts with an oracle using two types of queries:

Membership Queries: requests to give the target probability of the last token in a sequence.

Equivalence Queries: requests to accept or reject a hypothesis PDFA, returning a counterexample — a sequence for which the hypothesis automaton and the target language diverge beyond the tolerance on the next-token distribution — if rejecting.
The algorithm alternates between filling an observation table with observations of the target behaviour, and presenting minimal PDFAs consistent with that table to the oracle for equivalence checking. This continues until an automaton is accepted. The use of conditional probabilities in the observation table prevents the observations from vanishing to 0 on long, low-probability sequences. To the best of our knowledge, this is the first work on learning PDFAs from RNNs.
A key insight of our adaptation is the use of an additive variation tolerance t when comparing rows in the table. In this framework, two probability vectors are considered equal if their probabilities for each event are within t of each other. Using this tolerance enables us to extract a much smaller PDFA than the original target, while still making locally similar predictions to it on any given sequence. This is necessary because RNN states are real-valued vectors, making the potential number of reachable states in an LM-RNN unbounded. The tolerance is non-transitive, making the construction of PDFAs from the table more challenging than in L*. Our algorithm suggests a way to address this.

Even with this tolerance, reaching equivalence may take a long time for large target PDFAs, and so we design our algorithm to allow anytime stopping of the extraction. The method allows the extraction to be limited while still maintaining certain guarantees on the reconstructed PDFA.
Note. While this paper only discusses RNNs, the algorithm itself is actually agnostic to the underlying structure of the target, and can be applied to any language model. In particular it may be applied to transformers Vaswani et al. (2017); Devlin et al. (2018). However, in this case the analogy to PDFAs breaks down.
Contributions
The main contributions of this paper are:

An algorithm for reconstructing a WDFA from any given weighted target, and in particular a PDFA if the target is stochastic.

A method for anytime extraction termination while still maintaining correctness guarantees.

An implementation of the algorithm and an evaluation over extraction from LM-RNNs, including a comparison to other LM reconstruction techniques.
2 Related Work
In Weiss et al. (2018), we presented a method for applying Angluin’s exact learning algorithm L* Angluin (1987) to RNNs, successfully extracting deterministic finite automata (DFAs) from given binary-classifier RNNs. This work expands on that by adapting L* to extract PDFAs from LM-RNNs. To apply exact learning to RNNs, one must implement equivalence queries: requests to accept or reject a hypothesis. Okudono et al. (2019) show how to adapt the equivalence query presented in Weiss et al. (2018) to the weighted case.

There exist many methods for PDFA learning, originally for acyclic PDFAs Rulot and Vidal (1988); Ron et al. (1998); Carrasco and Oncina (1999), and later for PDFAs in general Clark and Thollard (2004); Carrasco and Oncina (1994); Thollard et al. (2000); Palmer and Goldberg (2007); Castro and Gavaldà (2008); Balle et al. (2013). These methods split and merge states in the learned PDFAs according to sample-based estimations of their conditional distributions. Unfortunately, they require very large sample sets to succeed (e.g., Clark and Thollard (2004) requires ~13m samples for a small PDFA).

Distributions over sequences can also be represented by WFAs, though these are non-deterministic. These can be learned using spectral algorithms, which use SVD decomposition and matrices of observations from the target to build a WFA Bailly et al. (2009); Balle et al. (2014); Balle and Mohri (2015); Hsu et al. (2008). Spectral algorithms have recently been applied to RNNs to extract WFAs representing their behaviour Ayache et al. (2018); Okudono et al. (2019); Rabusseau et al. (2019); we compare to Ayache et al. (2018) in this work. The choice of observations used is also a focus of research in this field Quattoni et al. (2017).
3 Background
Sequences and Notations For a finite alphabet Σ, the set of finite sequences over Σ is denoted Σ*, and the empty sequence by ε. For any Σ and stopping symbol $ ∉ Σ, we denote Σ$ = Σ ∪ {$}, and by Σ*$ the set of sequences over Σ$ in which the stopping symbol may only appear at the end.

For a sequence w, its length is denoted |w|, its concatenation after another sequence u is denoted u·w, its i-th element is denoted w_i, and its prefix of length i is denoted w_{:i}. We use the shorthands w·A = {w·a | a ∈ A} and A·w = {a·w | a ∈ A} for a set of sequences A. A set of sequences P is said to be prefix closed if for every w ∈ P and i ≤ |w|, w_{:i} ∈ P. Suffix closedness is defined analogously.

For any finite alphabet Σ and set of sequences S ⊆ Σ*$, we assume some internal ordering of the set’s elements to allow discussion of vectors of observations over those elements.
Probabilistic Deterministic Finite Automata (PDFAs) are tuples ⟨Σ, Q, q0, δ, P⟩ such that Q is a finite set of states, q0 ∈ Q is the initial state, Σ is the finite input alphabet, δ : Q × Σ → Q is the transition function and P : Q × Σ$ → [0,1] is the transition weight function, satisfying ∑_{σ ∈ Σ$} P(q, σ) = 1 for every q ∈ Q.
The recurrent application of δ to a sequence w is denoted δ(q, w), and defined: δ(q, ε) = q and δ(q, w·σ) = δ(δ(q, w), σ) for every w ∈ Σ*, σ ∈ Σ. We abuse notation to denote P(q, w) = ∏_{i ≤ |w|} P(δ(q, w_{:i−1}), w_i) for every w ∈ Σ*$. If for every q ∈ Q there exists a series of non-zero transitions reaching a state q' with P(q', $) > 0, then the PDFA defines a distribution A over Σ* as follows: for every w ∈ Σ*, A(w) = P(q0, w·$).
Language Models (LMs) Given a finite alphabet Σ, a language model D over Σ is a model defining a distribution over Σ*. For any w ∈ Σ*, σ ∈ Σ$, and i ≤ |w|, D induces the following:

Prefix Probability: P_D(w) = ∑_{u ∈ Σ*} D(w·u), the probability of sampling a sequence that starts with w.

Last Token Probability: if P_D(w_{:|w|−1}) > 0, then LTP_D(w) = P_D(w) / P_D(w_{:|w|−1}), and LTP_D(w·$) = D(w) / P_D(w).

Last Token Probabilities Vector: if P_D(w) > 0, LTPV_D(w) ∈ [0,1]^{|w|} is the vector satisfying LTPV_D(w)_i = LTP_D(w_{:i}).

Next Token Distribution: D_w, defined: D_w(σ) = LTP_D(w·σ) for every σ ∈ Σ$.
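The induced quantities above can be illustrated on a toy language model given as an explicit distribution over whole sequences; the distribution and the function names `prefix_prob` and `next_token_dist` are our illustrative choices:

```python
# Toy language model over {"a", "b"}: an explicit distribution over sequences.
D = {("a",): 0.3, ("a", "b"): 0.2, ("b",): 0.5}  # probabilities sum to 1

def prefix_prob(w):
    """P_D(w): probability that a sampled sequence starts with w."""
    return sum(p for seq, p in D.items() if seq[:len(w)] == tuple(w))

def next_token_dist(w):
    """D_w: next-token distribution after prefix w, including the stop symbol."""
    pw = prefix_prob(w)
    dist = {"$": D.get(tuple(w), 0.0) / pw}      # probability of stopping here
    for tok in ("a", "b"):
        dist[tok] = prefix_prob(list(w) + [tok]) / pw
    return dist
```

For example, after the prefix "a", the model above stops with probability 0.3/0.5 and continues with "b" with probability 0.2/0.5.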
Variation Tolerance Given two categorical distributions p1 and p2, their total variation distance is defined max_e |p1(e) − p2(e)|, i.e., the largest difference in probabilities that they assign to the same event. Our algorithm tolerates some variation distance between next-token probabilities, as follows:

For a tolerance t, two event probabilities p1, p2 ∈ [0,1] are called t-equal and denoted p1 =_t p2 if |p1 − p2| ≤ t. Similarly, two vectors of probabilities u, v ∈ [0,1]^n are called t-equal and denoted u =_t v if ‖u − v‖_∞ ≤ t, i.e. if max_i |u_i − v_i| ≤ t. For any distribution D over Σ* and w, u ∈ Σ*, we denote w ≡_{D,t} u if D_w =_t D_u, or simply w ≡_t u if D is clear from context. For any two language models D1, D2 over Σ and w ∈ Σ*, we say that D1, D2 are t-consistent on w if (D1)_p =_t (D2)_p for every prefix p of w. We call t the variation tolerance.
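A minimal sketch of t-equality on probability vectors, with a small hypothetical example of the non-transitivity discussed in Section 4:

```python
def t_equal(u, v, t):
    """Additive variation tolerance: vectors are t-equal if every
    coordinate differs by at most t (an infinity-norm comparison)."""
    return max(abs(a - b) for a, b in zip(u, v)) <= t

# Non-transitivity: p1 =_t p2 and p2 =_t p3 need not imply p1 =_t p3.
p1, p2, p3 = [0.50, 0.50], [0.58, 0.42], [0.66, 0.34]
assert t_equal(p1, p2, 0.1) and t_equal(p2, p3, 0.1)
assert not t_equal(p1, p3, 0.1)
```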
Oracles and Observation Tables Given an oracle O, an observation table for O is a sequence-indexed matrix of observations taken from it, with the rows indexed by a prefix-closed set of prefixes P ⊆ Σ* and the columns by a suffix-closed set of suffixes S ⊆ Σ*$. The observations are O(p·s) for every p ∈ P, s ∈ S. For any p ∈ P and s ∈ S we denote O_{p,s} = O(p·s), and for every p ∈ P the p-th row in the table is denoted O_p. In this work we use an oracle giving the last-token probabilities of the target D, O(w) = LTP_D(w) for every w ∈ Σ*$ \ {ε}, and maintain Σ$ ⊆ S.
Recurrent Neural Networks (RNNs) An RNN is a recursive parametrised function h_t = f(h_{t−1}, x_t) with initial state h_0, such that h_t is the state after time t and x_t is the input at time t. A language model RNN (LM-RNN) over an alphabet Σ is an RNN coupled with a prediction function g, where g(h_t) ∈ [0,1]^{|Σ$|} is a vector representation of a next-token distribution. RNNs differ from PDFAs only in that their number of reachable states (and so number of different next-token distributions for sequences) may be unbounded.
4 Learning PDFAs with Queries and Counterexamples
In this section we describe the details of our algorithm. We explain why a direct application of L* to PDFAs will not work, and then present our non-trivial adaptation. Our adaptation does not rely on the target being stochastic, and can in fact be applied to reconstruct any WDFA from an oracle.
Direct application of L* does not work for LM-RNNs: L* is a polynomial-time algorithm for learning a deterministic finite automaton (DFA) from an oracle. It can be adapted to work with oracles giving any finite number of classifications to sequences, and can be naively adapted to a probabilistic target with finitely many possible next-token distributions by treating each next-token distribution as a sequence classification. However, this will not work for reconstruction from RNNs. This is because the set of reachable states in a given RNN is unbounded, and so also is the set of next-token distributions. Thus, in order to practically adapt L* to extract PDFAs from LM-RNNs, we must reduce the number of classes L* deals with.
Variation Tolerance Our algorithm reduces the number of classes it considers by allowing an additive variation tolerance t, and considering t-equality (as presented in Section 3) as opposed to actual equality when comparing probabilities. In introducing this tolerance we must handle the fact that it may be non-transitive: there may exist p1, p2, p3 such that p1 =_t p2 and p2 =_t p3, but p1 ≠_t p3.[2]

[2] We could define a variation tolerance by quantisation of the distribution space, which would be transitive. However, this may be unnecessarily aggressive at the edges of the intervals.
To avoid potentially grouping together all predictions on long sequences, which are likely to have very low probabilities, our algorithm observes only local probabilities. In particular, the algorithm uses an oracle that gives the last-token probability for every non-empty input sequence.
4.1 The Algorithm
The algorithm loops over three main steps: (1) expanding an observation table until it is closed and consistent, (2) constructing a hypothesis automaton, and (3) making an equivalence query about the hypothesis. The loop repeats as long as the oracle returns counterexamples for the hypotheses. In our setting, counterexamples are sequences after which the hypothesis and the target have next-token distributions that are not t-equal. They are handled by adding all of their prefixes to P.
Our algorithm expects last-token probabilities from the oracle, i.e.: O(w) = LTP_D(w), where D is the target distribution. The oracle is not queried on ε, for which LTP_D is undefined. To observe the entirety of every prefix’s next-token distribution, S is initiated to Σ$.
Step 1: Expanding the observation table
The table is expanded as in L* Angluin (1987), but with the definition of row equality relaxed to t-equality. Precisely, it is expanded until:

Closedness For every p ∈ P and σ ∈ Σ, there exists some p' ∈ P such that O_{p·σ} =_t O_{p'}.

Consistency For every p1, p2 ∈ P such that O_{p1} =_t O_{p2}, for every σ ∈ Σ, O_{p1·σ} =_t O_{p2·σ}.
The table expansion is managed by a queue initiated to {ε}, from which prefixes p are processed one at a time as follows: If p ∉ P, and there is no p' ∈ P s.t. O_p =_t O_{p'}, then p is added to P. If p ∈ P already, then it is checked for inconsistency, i.e. whether there exist p' ∈ P, σ ∈ Σ, and s ∈ S s.t. O_p =_t O_{p'} but O_{p·σ, s} ≠_t O_{p'·σ, s}. In this case the separating suffix σ·s is added to S, such that now O_p ≠_t O_{p'}, and the expansion restarts. Finally, if p ∈ P, then the queue is updated with p·Σ.
As in L*, checking closedness and consistency can be done in arbitrary order. However, if the algorithm may be terminated before the table is closed and consistent, it is better to process the queue in order of prefix probability (see Section 4.2).
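The closedness and consistency conditions of Step 1 can be sketched as a single check over the table. The function `find_violation` and its table representation (a dict from full sequences to observed probabilities) are our illustrative choices, not the paper's implementation:

```python
def row(table, p, suffixes):
    """The observation-table row of prefix p: one entry per suffix column."""
    return tuple(table[p + s] for s in suffixes)

def find_violation(prefixes, suffixes, table, alphabet, t):
    """Return a closedness or consistency violation, or None if the table
    is closed and t-consistent.  Sequences are represented as tuples."""
    def eq(u, v):
        return max(abs(a - b) for a, b in zip(u, v)) <= t
    # Closedness: every one-token extension must be t-equal to some row in P.
    for p in prefixes:
        for a in alphabet:
            ext = row(table, p + (a,), suffixes)
            if not any(eq(ext, row(table, q, suffixes)) for q in prefixes):
                return ("unclosed", p + (a,))
    # Consistency: t-equal rows must stay t-equal after every token.
    for p in prefixes:
        for q in prefixes:
            if p != q and eq(row(table, p, suffixes), row(table, q, suffixes)):
                for a in alphabet:
                    if not eq(row(table, p + (a,), suffixes),
                              row(table, q + (a,), suffixes)):
                        return ("inconsistent", (p, q, a))
    return None
```

In the algorithm proper, an "unclosed" result adds the extension to P, while an "inconsistent" result contributes a new separating suffix to S.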
Step 2: PDFA construction
Intuitively, we would like to group t-equal rows of the observation table to form the states of the PDFA, and map transitions between these groups according to the table’s observations. The challenge in the variation-tolerating setting is that t-equality is not transitive.
Formally, let C be a partitioning (clustering) of P, and for each p ∈ P let c(p) ∈ C be the partition (cluster) containing p. C should satisfy:

Determinism For every p1, p2 ∈ P and σ ∈ Σ: if c(p1) = c(p2), then c(p1·σ) = c(p2·σ).

t-equality (Cliques) For every c ∈ C and p1, p2 ∈ c, O_{p1} =_t O_{p2}.
For c ∈ C and σ ∈ Σ, we denote the next-clusters reached from c by N(c, σ) = {c(p·σ) | p ∈ c, p·σ ∈ P}. Note that C satisfies determinism iff |N(c, σ)| ≤ 1 for every c ∈ C and σ ∈ Σ. Note also that the constraints are always satisfiable, by the singleton clustering C = {{p} | p ∈ P}.
We present a 4-step algorithm to solve these constraints while trying to avoid excessive partitions:[3]

[3] We describe our implementation of these stages in Appendix C.

Initialisation: The prefixes are partitioned into some initial clustering C according to the t-equality of their rows.

Determinism I: C is refined until it satisfies determinism: clusters c with tokens σ for which |N(c, σ)| > 1 are split by next-cluster equivalence into new clusters.

Cliques: Each cluster is refined into cliques (with respect to t-equality).

Determinism II: C is again refined until it satisfies determinism, as in (2).
Note that refining a partitioning into cliques may break determinism, but refining into a deterministic partitioning will not break cliques. In addition, when only allowed to refine clusters (and not merge them), all determinism refinements are necessary. Hence the order of the last 3 stages.
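The determinism refinement (stages 2 and 4) can be sketched as follows. The representation of clusters as sets of prefix strings and the `next_of` successor function are our illustrative assumptions:

```python
def refine_for_determinism(clusters, next_of, alphabet):
    """Split any cluster whose members reach different next-clusters on
    some token, repeating until determinism holds.  `clusters` is a list
    of sets of prefixes; `next_of(p, a)` returns p's successor prefix."""
    def cluster_of(p):
        for i, c in enumerate(clusters):
            if p in c:
                return i
        return None  # successor not (yet) a tracked prefix

    changed = True
    while changed:
        changed = False
        for c in list(clusters):
            for a in alphabet:
                # group members of c by the cluster their a-successor lands in
                by_next = {}
                for p in c:
                    by_next.setdefault(cluster_of(next_of(p, a)), set()).add(p)
                if len(by_next) > 1:       # determinism violated: split c
                    clusters.remove(c)
                    clusters.extend(by_next.values())
                    changed = True
                    break
            if changed:
                break
    return clusters
```

Since splits only refine the partitioning, the loop terminates: in the worst case it reaches the singleton clustering, which trivially satisfies determinism.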
Once the clustering C is found, a PDFA is constructed from it, with the clusters as states and c(ε) as the initial state. Where possible, δ is defined directly by C: for every c ∈ C and σ ∈ Σ with N(c, σ) ≠ ∅, δ(c, σ) is the single cluster in N(c, σ). For c, σ for which N(c, σ) = ∅, δ(c, σ) is set as the best cluster match for the rows of c’s σ-successors. This is chosen according to the heuristics presented in Section 4.2. The weights are defined as follows: for every c ∈ C and σ ∈ Σ$, P(c, σ) is the average of O_{p,σ} over the prefixes p ∈ c.

Step 3: Answering Equivalence Queries
We sample the target LM-RNN and hypothesis PDFA a finite number of times, testing every prefix of each sample to see if it is a counterexample. If none is found, we accept the hypothesis. Though simple, we find this method to be sufficiently effective in practice. A more sophisticated approach is presented in Okudono et al. (2019).
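The sampling-based equivalence query can be sketched as below; the callback names (`sample_from`, `target_dist`, `hyp_dist`) are our own, and the fixed sample count stands in for the budget used in the experiments:

```python
import random

def equivalence_check(sample_from, target_dist, hyp_dist, t, n_samples=100, rng=None):
    """Approximate equivalence query: sample sequences and test every
    prefix; return a counterexample prefix whose next-token distributions
    differ by more than t in some coordinate, or None to accept.
    `sample_from(rng)` draws one sequence; `*_dist(prefix)` returns a
    dict mapping token -> probability."""
    rng = rng or random.Random(0)
    for _ in range(n_samples):
        seq = sample_from(rng)
        for i in range(len(seq) + 1):
            prefix = seq[:i]
            d1, d2 = target_dist(prefix), hyp_dist(prefix)
            if any(abs(d1[tok] - d2.get(tok, 0.0)) > t for tok in d1):
                return prefix
    return None
```

Accepting after finitely many samples yields only a probable guarantee of t-consistency, as noted in Section 5.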
4.2 Practical Considerations
We present some methods and heuristics that allow a more effective application of the algorithm to targets with large alphabets or state counts, or to poorly learned grammars.
Anytime Stopping In case the algorithm runs for too long, we allow termination before the table is closed and consistent, which may be imposed by size or time limits on the table expansion. If S reaches its size limit, the table expansion continues but stops checking consistency. If the time or P limits are reached, the algorithm stops, constructing and accepting a PDFA from the table as is. The construction is unchanged except that some of the transitions may not have a defined destination; for these we use a “best cluster match” as described in Section 4.2. This does not harm the guarantees on consistency between the target and the returned PDFA discussed in Section 5.
Order of Expansion As some prefixes will not be added to P under anytime stopping, the order in which rows are checked for closedness and consistency matters. We sort the queue by prefix weight. Moreover, if a prefix p being considered is found inconsistent w.r.t. some p' ∈ P, then all such pairs are considered, and the separating suffix σ·s with the highest minimum conditional probability from p and p' is added to S.
Best Cluster Match
Given a prefix p and a set of clusters C, we seek a best fit c for p. First we filter C for the following qualities until one is non-empty, in order of preference: (1) c ∪ {p} is a clique w.r.t. t-equality. (2) There exists some p' ∈ c such that O_p =_t O_{p'}, though c ∪ {p} is not a clique. (3) There exists some p' ∈ c such that O_p and O_{p'} are within 2t of each other. If no clusters satisfy these qualities, we remain with all of C. From the resulting group of potential matches M, the best match could be the cluster c ∈ M minimising the greatest row distance max_{p' ∈ c} ‖O_p − O_{p'}‖_∞. In practice, we choose from M arbitrarily for efficiency.
Suffix and Prefix Thresholds
Occasionally when checking the consistency of two rows O_p, O_{p'}, a separating suffix will be found that is actually very unlikely to be seen after p or p'. In this case it is unproductive to add it to S. Moreover – especially as RNNs are unlikely to perfectly learn a probability of 0 for some event – it is possible that going through it will reach a large number of ‘junk’ states. Similarly when considering a prefix p, if P_D(p) is very low then it is possible that it is the failed encoding of probability 0, and that all states reachable through p are not useful.

We introduce thresholds for both suffixes and prefixes. When a potential separating suffix s is found from prefixes p and p', it is added to S only if its conditional probability after p or p' passes the suffix threshold. Similarly, potential new rows p are added to P only if their prefix probability passes the prefix threshold.
Finding Close Rows
We maintain the rows of the table in a KD-tree indexed by the row entries, with one level for every column s ∈ S. When considering the row of a prefix p, we use the tree to get the subset of all potentially t-equal prefixes. The tree’s levels are split into equal-length intervals, which we found to work well in practice.
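Rather than a full KD-tree, the same effect can be sketched with per-column interval bucketing: rows are keyed by quantised coordinates, and a lookup scans only adjacent buckets. The bucket `width` plays the role of the interval length (the concrete value used in the paper is not reproduced here):

```python
from collections import defaultdict
from itertools import product

def make_index(rows, width):
    """Bucket each row by quantising every column into intervals of the
    given width; rows that could be t-equal (for t <= width) land in the
    same or an adjacent bucket along every dimension."""
    index = defaultdict(list)
    for name, vec in rows.items():
        index[tuple(int(x // width) for x in vec)].append(name)
    return index

def candidates(index, vec, width):
    """All rows whose bucket is within one cell of vec's bucket in every
    coordinate -- a superset of the rows t-equal to vec for t <= width."""
    base = [int(x // width) for x in vec]
    out = []
    for offs in product((-1, 0, 1), repeat=len(base)):
        out.extend(index.get(tuple(b + o for b, o in zip(base, offs)), []))
    return out
```

The returned candidates still need an exact t-equality check; the index only prunes rows that are certainly too far in some column.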
Choosing the Variation Tolerance
In our initial experiments (on SPiCe 0–3), we used t = 1/|Σ$|. The intuition was that given no data, the fairest distribution over Σ$ is the uniform distribution, and so this may also be a reasonable threshold for a significant difference between two probabilities. In practice, we found that this tolerance often strongly differentiates states even in models with larger alphabets – except for SPiCe 1, where it quickly accepted a model of size 1. A reasonable strategy for choosing t is to begin with a large tolerance, and reduce it if equivalence is reached too quickly.

5 Guarantees
We note some guarantees on the extracted model’s qualities and relation to its target model. Formal statements and full proofs for each of the guarantees listed here are given in appendix A.
Model Qualities
The model is guaranteed to be deterministic by construction. Moreover, if the target is stochastic, then the returned model is guaranteed to be stochastic as well.
Reaching Equivalence
If the algorithm terminates successfully (i.e., having passed an equivalence query), then the returned model is t-consistent with the target on every sequence, by definition of the query. In practice we have no true oracle and only approximate equivalence queries by sampling the models, and so can only attain a probable guarantee of their relative consistency.
Consistency and Progress
No matter when the algorithm is stopped, the returned model is always t-consistent with its target on every p ∈ P, where P is the set of prefixes in the table. Moreover, as long as the algorithm is running, the prefix set P is always increased within a finite number of operations. This means that the algorithm maintains a growing set of prefixes on which any PDFA it returns is guaranteed to be consistent with the target. In particular, this means that if equivalence is not reached, at least the algorithm’s model of the target improves for as long as it runs.
6 Experimental Evaluation
We apply our algorithm to 2-layer LSTMs trained on grammars from the SPiCe competition Balle et al. (2016), adaptations of the Tomita grammars Tomita (1982) to PDFAs, and small PDFAs representing languages with unbounded history. The LSTMs and their training methods, including their input and hidden dimensions, are fully described in Appendix E.
Compared Methods We compare our algorithm to the sample-based method ALERGIA Carrasco and Oncina (1994), the spectral algorithm used in Ayache et al. (2018), and n-grams. An n-gram is a PDFA whose states are a sliding window over the last n − 1 tokens of the input sequence, with a transition function that appends the new token to the window and drops its oldest one. The probability of a token σ from a state (window) w is the MLE estimate #(w·σ)/#(w), where #(u) is the number of times the sequence u appears as a subsequence in the samples. For ALERGIA, we use the PDFA/DFA inference toolkit flexfringe Verwer and Hammerschmidt (2017).
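A minimal sketch of the n-gram baseline as described above, with MLE estimates from window counts (the handling of sequence starts, where the context is shorter than n − 1, is our simplification):

```python
from collections import Counter

def train_ngram(samples, n):
    """MLE n-gram counts: length-(n-1) contexts and their length-n
    extensions over the training samples, with an explicit stop symbol."""
    ctx, full = Counter(), Counter()
    for seq in samples:
        toks = list(seq) + ["$"]
        for i in range(len(toks)):
            context = tuple(toks[max(0, i - n + 1):i])  # last n-1 tokens
            ctx[context] += 1
            full[context + (toks[i],)] += 1
    return ctx, full

def ngram_prob(ctx, full, context, token):
    """P(token | context) = #(context.token) / #(context)."""
    context = tuple(context)
    return full.get(context + (token,), 0) / ctx[context]
```

For example, with n = 2 and samples "ab" and "aa", the context ("a",) is seen three times, so each of "b", "a", and "$" after "a" gets probability 1/3.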
Target Languages We train RNNs on a subset of the SPiCe grammars, covering languages generated by HMMs, and languages from the NLP, software, and biology domains. We also train RNNs on PDFA adaptations of the Tomita languages Tomita (1982), made from the minimal DFA for each language by giving each of its states a next-token distribution as a function of whether it is accepting or not. We give a full description of the Tomita adaptations and extraction results in Appendix D. As we show in Section 6.1, the n-gram models prove to be very strong competitors on the SPiCe languages. To this end, we consider three additional languages that need to track information over an unbounded history, and thus cannot be captured by any n-gram model. We call these UHLs (unbounded history languages).
UHLs 1 and 2 are PDFAs that cycle through 9 and 5 states respectively, with different next-token probabilities in each state. UHL 3 is a weighted adaptation of the 5th Tomita grammar, changing its next-token distribution according to the parity of the seen 0s and 1s. The UHLs are drawn in Appendix D.
Extraction Parameters Most of the extraction parameters differ between the RNNs, and are described in the results tables (1, 2). For our algorithm, we always limited the equivalence query to a fixed number of samples. For the spectral algorithm, we made WFAs for a range of ranks k; for the n-grams we used several values of n. For these two, we always show the best results for NDCG and WER. For ALERGIA in the flexfringe toolkit, we use the parameters symbol_count=50 and state_count=N, with N given in the tables.
Evaluation Measures We evaluate the extracted models against their target RNNs on word error rate (WER) and on normalised discounted cumulative gain (NDCG), which was the scoring function for the SPiCe challenge. In particular, we evaluate the models extracted from the SPiCe RNNs with the same ranking length used in the SPiCe challenge; for the UHLs we use a shorter ranking, as they have smaller alphabets. We do not use probabilistic measures such as perplexity, as the spectral algorithm is not guaranteed to return probabilistic automata.

Word error rate (WER): The WER of model A against B on a set of predictions is the fraction of nexttoken predictions (most likely next token) that are different in A and B.

Normalised discounted cumulative gain (NDCG): The NDCG of A against B on a set of sequences scores A’s ranking of the top k most likely tokens after each sequence w, σ^A_1, …, σ^A_k, in comparison to the actual k most likely tokens given by B, σ^B_1, …, σ^B_k. Formally: NDCG_k(A, B, w) = ( ∑_{i=1..k} P_B(σ^A_i | w) / log2(i+1) ) / ( ∑_{i=1..k} P_B(σ^B_i | w) / log2(i+1) ).
For NDCG we sample the RNN repeatedly, taking all the prefixes of each sample until we have enough prefixes. We then compute the NDCG for each prefix and take the average. For WER, we take full samples from the RNN, and return the fraction of errors over all of the next-token predictions in those samples. The ideal WER and NDCG are 0 and 1 respectively; ideal scores are marked in the tables.
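The two measures can be sketched as follows; `wer` compares argmax predictions position-wise, and `ndcg_k` follows the standard DCG discount log2(i + 1), normalised by the target model's own best ranking (the ranking length k is a parameter):

```python
import math

def wer(preds_a, preds_b):
    """Fraction of positions where A's most-likely next token differs from B's."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def ndcg_k(dist_a, dist_b, k):
    """NDCG of model A against B after one prefix: score A's top-k token
    ranking by B's probabilities, normalised by B's own best ranking.
    dist_a, dist_b map tokens to probabilities."""
    top_a = sorted(dist_a, key=dist_a.get, reverse=True)[:k]
    top_b = sorted(dist_b, key=dist_b.get, reverse=True)[:k]
    gain = sum(dist_b[s] / math.log2(i + 2) for i, s in enumerate(top_a))
    ideal = sum(dist_b[s] / math.log2(i + 2) for i, s in enumerate(top_b))
    return gain / ideal
```

A model that exactly matches the target's ranking scores 1.0; putting the target's least likely tokens first pushes the score toward 0.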
6.1 Results and Discussion
Tables 1 and 2 show the results of extraction from the SPiCe and UHL RNNs, respectively. In them, we list our algorithm as WL* (Weighted L*). For the WFAs and n-grams, which are generated with several values of k (rank) and n, we show the best scores for each metric, along with the size of the best model for each metric. We do not report the extraction times separately, as they are very similar: the majority of the time in these algorithms is spent generating the samples or Hankel matrices.
For PDFAs and WFAs the size columns present the number of states; for the WFAs this is equal to the rank with which they were reconstructed. For n-grams the size is the number of table entries in the model, and the chosen value of n is listed in brackets. In the SPiCe languages, our algorithm did not reach equivalence, and used between 1 and 6 counterexamples for every language before being stopped – with the exception of SPiCe 1, which with the larger tolerance reached equivalence on a single state. The UHLs and Tomita grammars used 0–2 counterexamples each before reaching equivalence.
The SPiCe results show a strong advantage to our algorithm in most of the small synthetic languages (1–3), with the spectral extraction taking a slight lead on SPiCe 0. However, in the remaining SPiCe languages, the n-gram strongly outperforms all other methods. Nevertheless, n-gram models are inherently restricted to languages that can be captured with bounded histories, and the UHLs demonstrate cases where this property does not hold. Indeed, all the algorithms outperform the n-grams on these languages (Table 2).
Language ()  Model  WER  NDCG  Time (h)  WER Size  NDCG Size 
SPiCe 0 (, )  WL  0.084  0.987  0.3  4988  4988 
Spectral  0.053  0.996  0.3  k=150  k=200  
NGram  0.096  0.991  0.8  1118 (n=6)  1118 (n=6)  
ALERGIA  0.353  0.961  2.9  66  66  
SPiCe 1 (, )  WL  0.093  0.971  0.4  152  152 
WL  0.376  0.891  0.1  1  1  
Spectral  0.319  0.909  2.9  k=12  k=11  
NGram  0.337  0.897  0.8  8421 (n=4)  421 (n=3)  
ALERGIA  0.376  0.892  1.2  7  7  
SPiCe 2 (, )  WL  0.08  0.972  0.8  962  962 
Spectral  0.263  0.893  1.6  k=7  k=5  
NGram  0.278  0.894  0.8  1111 (n=4)  1111 (n=4)  
ALERGIA  0.419  0.844  1.2  11  11  
SPiCe 3 (, )  WL  0.327  0.928  1.0  675  675 
Spectral  0.466  0.843  1.2  k=6  k=8  
NGram  0.46  0.847  0.8  1111 (n=4)  11110 (n=5)  
ALERGIA  0.679  0.79  1.2  8  8  
SPiCe 4 (, )  WL  0.301  0.829  0.7  4999  4999 
Spectral  0.453  0.727  1.2  k=450  k=250  
NGram  0.099  0.968  0.8  186601 (n=6)  61851 (n=5)  
ALERGIA  0.639  0.646  4.4  42  42  
SPiCe 6 (, )  WL  0.593  0.644  2.5  5000  5000 
Spectral  0.705  0.535  6.1  k=17  k=32  
NGram  0.285  0.888  0.8  127817 (n=5)  127817 (n=5)  
ALERGIA  0.687  0.538  1.9  26  26  
SPiCe 7 (, )  WL  0.626  0.642  0.5  4996  4996 
Spectral  0.801  0.472  2.4  k=50  k=27  
NGram  0.441  0.812  0.7  133026 (n=5)  133026 (n=5)  
ALERGIA  0.735  0.569  1.4  8  8  
SPiCe 9 (, )  WL  0.503  0.721  0.5  4992  4992 
Spectral  0.303  0.877  1.9  k=44  k=44  
NGram  0.123  0.961  1.0  44533 (n=6)  44533 (n=6)  
ALERGIA  0.501  0.739  1.1  44  44  
SPiCe 10 (, )  WL  0.651  0.593  0.9  4987  4987 
Spectral  0.845  0.4  1.7  k=42  k=41  
NGram  0.348  0.845  0.8  153688 (n=5)  153688 (n=5)  
ALERGIA  0.81  0.51  2.0  13  13  
SPiCe 14 (, )  WL  0.442  0.716  0.8  4999  4999 
Spectral  0.531  0.653  2.4  k=100  k=100  
NGram  0.079  0.977  0.7  125572 (n=6)  46158 (n=5)  
ALERGIA  0.641  0.611  1.2  19  19 
Our algorithm succeeds in perfectly reconstructing the target PDFA structure for each of the UHL languages, giving it transition weights within the given variation tolerance (when extracting from the RNN and not directly from the original target, the weights can only be as good as the RNN has learned). The sample-based PDFA learning method, ALERGIA, achieved good WER and NDCG scores but did not manage to reconstruct the original PDFA structure. This may be improved by taking a larger sample size, though at the cost of efficiency.
Language ()  Model  WER  NDCG  Time (s)  WER Size  NDCG Size 

UHL 1 (, )  WL  0.0  1.0  15  9  9 
Spectral  0.0  1.0  56  k=80  k=150  
NGram  0.129  0.966  259  63 (n=6)  63 (n=6)  
ALERGIA  0.004  0.999  278  56  56  
UHL 2 (, )  WL  0.0  1.0  73  5  5 
Spectral  0.002  1.0  126  k=49  k=47  
NGram  0.12  0.94  269  3859 (n=6)  3859 (n=6)  
ALERGIA  0.023  0.979  329  25  25  
UHL 3 (, )  WL  0.0  1.0  55  4  4 
Spectral  0.0  1.0  71  k=44  k=17  
NGram  0.189  0.991  268  63 (n=6)  63 (n=6)  
ALERGIA  0.02  0.999  319  47  47 
Tomita Grammars The full results for the Tomita extractions are given in Appendix D.
All of the methods reconstruct them with perfect or near-perfect WER and NDCG, except for the n-gram, which sometimes fails. For each of the Tomita RNNs, our algorithm extracted and accepted a PDFA with identical structure to the original target in approximately 1 minute (the majority of this time was spent on sampling the RNN and hypothesis before accepting the equivalence query). These PDFAs had transition weights within the variation tolerance of the corresponding target transition weights.
On the effectiveness of n-grams
The n-gram models prove to be very strong competitors on many of the languages. Indeed, n-gram models are very effective for learning in cases where the underlying languages have strong local properties, or can be well approximated using local properties, which is rather common (see e.g., Sharan et al. (2016)). However, there are many languages, including ones that can be modelled with PDFAs, for which the locality property does not hold, as demonstrated by the UHL experiments.

As n-grams are merely tables of observed samples, they are very quick to create. However, their simplicity also works against them: the table grows exponentially in n and polynomially in |Σ|. In the future, we hope that our algorithm can serve as a base for creating reasonably sized finite state machines that will be competitive on real-world tasks.
7 Conclusions
We present a novel technique for learning a distribution over sequences from a trained LM-RNN. The technique allows for some variation between the predictions of the RNN’s internal states while still merging them, enabling extraction of a PDFA with fewer states than in the target RNN. It can also be terminated before completion, while still maintaining guarantees of local similarity to the target. The technique makes no assumptions about the target model’s representation, and can be applied to any language model – including LM-RNNs and transformers. It also does not require a probabilistic target, and can be directly applied to recreate any WDFA.
When applied to stochastic models such as LMRNNs, the algorithm returns PDFAs, which are a desirable model for LMRNN extraction because they are deterministic and therefore faster and more interpretable than WFAs. We apply it to RNNs trained on data taken from small PDFAs and HMMs, evaluating the extracted PDFAs against their target LMRNNs and comparing to extracted WFAs and n-grams. When the LMRNN has been trained on a small target PDFA, the algorithm successfully reconstructs a PDFA that has identical structure to the target, and local probabilities within tolerance of the target. For simple languages, our method is generally the strongest of all those considered. However, for natural languages, n-grams maintain a strong advantage. Improving our method to be competitive on naturally occurring languages as well is an interesting direction for future work.
Acknowledgments
The authors wish to thank Rémi Eyraud for his helpful discussions and comments, and Chris Hammerschmidt for his assistance in obtaining the results with flexfringe. The research leading to the results presented in this paper is supported by the Israeli Science Foundation (grant No. 1319/16), and the European Research Council (ERC) under the European Union's Seventh Framework Programme (FP7/2007-2013), under grant agreement no. 802774 (iEXTRACT).
References
[1] (1987) Learning regular sets from queries and counterexamples. Inf. Comput. 75(2), pp. 87–106.
[2] (2018) Explaining black boxes on sequential data using weighted automata. In Proceedings of the 14th International Conference on Grammatical Inference, ICGI 2018, Wrocław, Poland, pp. 81–103.
[3] (2009) Grammatical inference as a principal component analysis problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 33–40.
[4] (2011) Quadratic weighted automata: spectral algorithm and likelihood maximization. In Proceedings of the Asian Conference on Machine Learning, C. Hsu and W. S. Lee (Eds.), Proceedings of Machine Learning Research, Vol. 20, pp. 147–163.
[5] (2014) Spectral learning of weighted automata: a forward-backward perspective. Machine Learning 96(1-2), pp. 33–63.
[6] (2013) Learning probabilistic automata: a study in state distinguishability. Theor. Comput. Sci. 473, pp. 46–60.
[7] (2016) Results of the sequence prediction challenge (SPiCe): a competition on learning the next symbol in a sequence. In Proceedings of the 13th International Conference on Grammatical Inference, ICGI, pp. 132–136.
[8] (2015) Learning weighted automata. In Algebraic Informatics: 6th International Conference, CAI 2015, Stuttgart, Germany, Proceedings, pp. 1–21.
[9] (1994) Learning stochastic regular grammars by means of a state merging method. In Grammatical Inference and Applications, R. C. Carrasco and J. Oncina (Eds.), Berlin, Heidelberg, pp. 139–152.
[10] (1999) Learning deterministic regular grammars from stochastic samples in polynomial time. ITA 33(1), pp. 1–20.
[11] (2008) Towards feasible PAC-learning of probabilistic deterministic finite automata. In Grammatical Inference: Algorithms and Applications, 9th International Colloquium, ICGI 2008, Saint-Malo, France, Proceedings, pp. 163–174.
[12] (2003) State automata extraction from recurrent neural nets using k-means and fuzzy clustering. In Proceedings of the XXIII International Conference of the Chilean Computer Science Society, SCCC '03, Washington, DC, USA, pp. 73–78.
[13] (2014) On the properties of neural machine translation: encoder-decoder approaches. CoRR abs/1409.1259.
[14] (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555.
[15] (2004) PAC-learnability of probabilistic deterministic finite state automata. Journal of Machine Learning Research 5, pp. 473–497.
[16] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
[17] (1990) Finding structure in time. Cognitive Science 14(2), pp. 179–211.
[18] (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, pp. 226–231.
[19] (2001) A bit of progress in language modeling. Computer Speech & Language 15(4), pp. 403–434.
[20] (2016) Interpreting finite automata for sequential data. arXiv:1611.07100.
[21] (1997) Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
[22] (2008) A spectral algorithm for learning hidden Markov models. CoRR abs/0811.4413.
[23] (2018) Regular inference on artificial neural networks. In Machine Learning and Knowledge Extraction: Second IFIP International Cross-Domain Conference, CD-MAKE 2018, Hamburg, Germany, Proceedings, pp. 350–369.
[24] (2019) Weighted automata extraction from recurrent neural networks via regression on state spaces. arXiv:1904.02931.
[25] (1996) Extraction of rules from discrete-time recurrent neural networks. Neural Networks 9(1), pp. 41–52.
[26] (2007) PAC-learnability of probabilistic deterministic finite state automata in terms of variation distance. Theor. Comput. Sci. 387(1), pp. 18–31.
[27] (2017) A maximum matching algorithm for basis selection in spectral learning. CoRR abs/1706.02857.
[28] (2019) Connecting weighted automata and recurrent neural networks through spectral learning. In Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama (Eds.), Vol. 89, pp. 1630–1639.
[29] (1998) On the learnability and usage of acyclic probabilistic finite automata. J. Comput. Syst. Sci. 56(2), pp. 133–152.
[30] (2000) Two decades of statistical language modeling: where do we go from here? Proceedings of the IEEE 88(8), pp. 1270–1278.
[31] (1988) An efficient algorithm for the inference of circuit-free automata. In Syntactic and Structural Pattern Recognition, G. Ferraté, T. Pavlidis, A. Sanfeliu, and H. Bunke (Eds.), pp. 173–184.
[32] (2016) Prediction with a short memory. CoRR abs/1612.02526.
[33] (2000) Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, pp. 975–982.
[34] (1982) Dynamic construction of finite automata from examples using hill-climbing. In Proceedings of the Fourth Annual Conference of the Cognitive Science Society, Ann Arbor, Michigan, pp. 105–108.
[35] (2017) Attention is all you need. CoRR abs/1706.03762.
[36] (2017) Flexfringe: a passive automaton learning package. In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 638–642.
[37] (2014) PAutomaC: a probabilistic automata and hidden Markov models learning competition. Machine Learning 96(1), pp. 129–154.
[38] (2017) An empirical evaluation of recurrent neural network rule extraction. CoRR abs/1709.10380.
[39] (2018) Extracting automata from recurrent neural networks using queries and counterexamples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, pp. 5244–5253.
Supplementary Material
Appendix A Guarantees
We show that our algorithm returns a PDFA, and discuss the relation between the obtained PDFA and the target when anytime stopping is and isn’t used.
A.1 Probability
Theorem A.1.
The algorithm returns a PDFA.
Proof.
Let C be the final clustering of P achieved by the method in section 4.1. By construction, the algorithm returns a finite state machine with well-defined states, initial state, transition weights, and stopping weights. We show that this machine is deterministic and probabilistic, i.e.:

(1) Deterministic: for every state c and token σ, the next state δ(c, σ) is uniquely defined.

(2) Probabilistic: for every state c, every transition weight and the stopping weight of c lies in [0, 1], and together they sum to 1.
Proof of (1): by the final refinement of the clustering (Determinism II), prefixes in the same cluster agree on the clusters of their continuations, and so by construction δ(c, σ) is assigned at most one value. If, and only if, no prefix in c has its σ-continuation in P, then δ(c, σ) is assigned some best available value. So δ(c, σ) is always assigned exactly one value.
Proof of (2): the transition and stopping weights of each state are weighted averages of probabilities, and so are themselves probabilities in [0, 1]. They also sum to 1, as they are weighted averages of distributions. Formally, for every state c, with averaging weights α_p over the prefixes p ∈ c:

Σ_{σ∈Σ∪{$}} w(c, σ) = Σ_{σ∈Σ∪{$}} Σ_{p∈c} α_p · P(σ | p) = Σ_{p∈c} α_p · Σ_{σ∈Σ∪{$}} P(σ | p) = Σ_{p∈c} α_p = 1

where the last step follows from the probabilistic behaviour of the target: Σ_{σ∈Σ∪{$}} P(σ | p) = 1 for any prefix p. ∎
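To make the determinism and probabilistic claims concrete, here is a minimal PDFA sketch in Python. The transition structure is hypothetical (invented for illustration); the weights reuse the two rows from the worked example in Appendix B. Processing a sequence is pure table lookup, with "$" holding each state's stopping weight.

```python
# Hypothetical minimal PDFA: state -> {token: (next_state, probability)};
# "$" is the end-of-sequence token, carrying the stopping weight.
pdfa = {
    0: {"a": (1, 0.5), "b": (0, 0.4), "$": (None, 0.1)},
    1: {"a": (1, 0.7), "b": (0, 0.25), "$": (None, 0.05)},
}

def sequence_probability(pdfa, seq, initial=0):
    """Probability of emitting `seq` and then stopping: pure table lookups."""
    state, p = initial, 1.0
    for tok in seq:
        state, w = pdfa[state][tok]   # deterministic: one (state, weight) pair
        p *= w
    return p * pdfa[state]["$"][1]    # multiply in the stopping weight

# Each state's weights form a distribution, so the machine is probabilistic:
for dists in pdfa.values():
    assert abs(sum(w for _, w in dists.values()) - 1.0) < 1e-9
```

Determinism is what makes this fast: no matrix products are needed, in contrast to WFA sequence processing.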
A.2 Progress
We consider extraction using noise tolerance t from some target M. For the observation table at any stage, we denote by d the size of its largest set of pairwise distinguishable rows.
Let A be an automaton constructed by the algorithm, whether or not it was stopped ahead of time. Let (P, S) be the observation table reached before making A, and let C be the clustering of P attained when building A from that table (i.e., the states of A). For every prefix p ∈ P, denote by c(p) its cluster. In addition, for every cluster c ∈ C, denote by p_c the prefix from which c was defined when building A.
We show that as the algorithm progresses, it defines a monotonically increasing set of sequences on which the target and the algorithm's automata are consistent, and that this set is P.
Lemma A.2.
P is always prefix closed.
Proof.
P begins as {ε}, which is prefix closed. Only two operations add to P: closedness and counterexamples. When adding from closedness, the new prefix added to P is of the form p·σ for p ∈ P and σ ∈ Σ, and so P remains prefix closed. When adding from a counterexample w, w is added along with all of its prefixes, and so P remains prefix closed. ∎
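The prefix-closure invariant above is easy to state in code; a minimal sketch (function names are ours) of both the check and the counterexample-handling rule that preserves it:

```python
def is_prefix_closed(prefixes):
    """A set P is prefix closed iff every strict prefix of each p is in P."""
    s = set(prefixes)
    return all(p[:i] in s for p in s for i in range(len(p)))

def add_with_prefixes(prefixes, cex):
    """Handle a counterexample: add it together with all of its prefixes.
    This is exactly the operation that keeps P prefix closed."""
    return set(prefixes) | {cex[:i] for i in range(len(cex) + 1)}

assert is_prefix_closed({"", "a", "ab", "aba"})
assert not is_prefix_closed({"", "ab"})   # missing the strict prefix "a"
```

Adding a single continuation p·σ of an existing p ∈ P clearly also preserves the invariant, matching the closedness case of the proof.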
Lemma A.3.
For every p ∈ P, A reaches the state c(p) after processing p, i.e. A(p) = c(p).
Proof.
We show this by induction on the length of p. For |p| = 0, i.e. for p = ε, A(ε) is the initial state of A by definition of the recursive application of A, and the initial state is c(ε) by construction (in the algorithm). We assume correctness of the lemma for prefixes of length n. Consider p·σ ∈ P with |p·σ| = n+1, and denote c = c(p). By the prefix closedness of P, p ∈ P, and so by the assumption A(p) = c. Now by the definition of A, A(p·σ) = δ(c, σ). By the construction of A, δ is defined such that δ(c, σ) = c(p'·σ) for every p' ∈ c s.t. p'·σ ∈ P, and so in particular δ(c, σ) = c(p·σ), as p·σ ∈ P. This results in A(p·σ) = c(p·σ), as desired. ∎
Lemma A.4.
For every p ∈ P and σ ∈ Σ ∪ {$}, the weight of σ at the state c(p) is within the variation tolerance of the target's probability for σ after p.
Proof.
By construction of C, in particular by the clique requirement for the clusters of C, all of the prefixes p' ∈ c(p) have rows similar to that of p, and in particular their next-token distributions are within the variation tolerance of p's (recall that the tolerance is initiated to t and never reduced). The weight of σ at c(p) is defined as the weighted average of the target's probabilities for σ after each of these p', and so it is also within the variation tolerance of the target's probability for σ after p, as desired. ∎
Theorem A.5.
For every automaton A returned by the algorithm and its target M, A and M are consistent on P.
Proof.
By Lemma A.3, processing any p ∈ P in A reaches the state c(p), and by Lemma A.4 the weights at c(p) are within the variation tolerance of M's next-token distribution after p. This concludes the proof that A and M are always consistent on P. We now show that the algorithm increases P every finite number of operations, beginning with a direct result of Theorem A.5:
Corollary A.6.
Every counterexample increases the size of P by at least 1.
Proof.
Recall that counterexamples to proposed automata are sequences w on which the automaton's prediction differs from the target's beyond the tolerance, and that they are handled by adding all their strict prefixes to P. Assume by contradiction some counterexample w for which P does not increase. Then in particular all of w's strict prefixes were already in P, and by Theorem A.5 the automaton and target are consistent on them, in contradiction to w being a counterexample. ∎
Lemma A.7.
Always, |S| ≤ |S₀| + |P|·(|P|−1)/2, where |S₀| is the initial size of S. (I.e., every S can only have had up to |P|·(|P|−1)/2 inconsistencies in its making.)
Proof.
S begins at its initial size |S₀|. S is increased only following inconsistencies: cases in which there exist p₁, p₂ ∈ P whose rows are equal (up to the tolerance), but whose continuations' rows are not. Once some p₁, p₂ cause a suffix s to be added to S, they are no longer equal, by construction of the algorithm, for the remainder of the run (as s is a suffix on which their rows differ beyond the tolerance). There are exactly |P|·(|P|−1)/2 pairs, and so that is the maximum number of times S may have been increased in any run, giving the maximum size |S| ≤ |S₀| + |P|·(|P|−1)/2. ∎
(Note: if the tolerance-based equality relation were transitive, it would be possible to obtain a bound linear in the size of P. However, as it is not, it is possible that a separating suffix is added to S that separates p₁ and p₂ while leaving them both equal to some other p₃.)
Corollary A.8 (Progress).
For as long as the algorithm runs, it strictly expands a set of sequences on which the automata it returns are consistent with the target M.
Proof.
From Theorem A.5, P is a set of sequences on which the returned automaton is always consistent with M. We show that P strictly expands as the algorithm progresses, i.e. that every finite number of operations, P is increased by at least one sequence.
The algorithm can be split into 4 operations: searching for and handling an unclosed prefix, searching for and handling an inconsistency, building (and presenting) a hypothesis PDFA, and handling a counterexample. We show that each one runs in finite time, and that there cannot be infinitely many operations without increasing P.
Finite Runtime of the Operations
Building the table: finding and handling an unclosed prefix requires a pass over the continuations of all p ∈ P, comparing their row values to those of P – all finite as P is finite (rows are also finite, as |S| is bounded in terms of |P| by Lemma A.7). Similarly, finding and handling inconsistencies requires a pass over the rows of all p ∈ P, also taking finite time.
Building an Automaton requires finding a clustering of P satisfying the conditions, and then a straightforward mapping of the transitions between these clusters. The clustering is built by one initial clustering (DBSCAN) over the finite set of rows, and then only refinement operations (without merges). As putting each prefix in its own cluster is a solution to the conditions, a satisfying clustering will be reached in finite time. Counterexamples: handling a counterexample w requires adding at most |w| new rows to the table. As w is finite, this is a finite operation.
Finite Operations between Additions to P: handling an unclosed prefix by construction increases P, and as shown in Corollary A.6, so does handling a counterexample. Building a hypothesis is followed by an equivalence query, after which the algorithm will either terminate or a counterexample will be returned (increasing P). Finally, by Lemma A.7, the number of inconsistencies between every increase of P is bounded. ∎
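The refinement-without-merges step in the argument above can be sketched as follows. This is a simplified illustration under our own naming (the paper's clustering also enforces a clique/similarity condition and starts from DBSCAN): clusters are split until prefixes sharing a cluster agree, per token, on the cluster of their continuations. Since the all-singletons clustering trivially satisfies this, the loop terminates.

```python
def refine(clustering, prefixes, alphabet):
    """Split clusters until determinism holds: prefixes in one cluster must
    agree, for each token, on the cluster of their one-token continuations.
    Only splits are performed (no merges), so the loop terminates: the
    singleton clustering is always a valid fixed point."""
    clustering = dict(clustering)
    fresh = 0                       # counter for fresh cluster ids
    changed = True
    while changed:
        changed = False
        for sigma in alphabet:
            # For each cluster, group its prefixes by the cluster of their
            # sigma-continuation (only where that continuation is known).
            groups = {}
            for p in prefixes:
                if p + sigma in clustering:
                    groups.setdefault(clustering[p], {}) \
                          .setdefault(clustering[p + sigma], set()).add(p)
            for c, by_successor in groups.items():
                if len(by_successor) > 1:    # determinism violated: split c
                    for succ in list(by_successor)[1:]:
                        fresh += 1
                        for p in by_successor[succ]:
                            clustering[p] = ("new", fresh)
                    changed = True
    return clustering

# Toy run: "a" and "b" share a cluster, but their "a"-continuations ("aa",
# "ba") do not, so the cluster {a, b} must be split.
refined = refine({"": 0, "a": 1, "b": 1, "aa": 2, "ba": 3},
                 ["", "a", "b", "aa", "ba"], ["a", "b"])
```

Each pass touches only the finite sets P and Σ, matching the finite-runtime argument for the automaton-building operation.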
Appendix B Example
We extract from the PDFA presented in B.1, using prefix and suffix thresholds and a variation tolerance. We limit the number of samples per equivalence query. This extraction will demonstrate both types of table expansions, both types of clustering refinements, and counterexamples. Notice that in our example, one state is equal with respect to next-token distribution to two other states, though those two are not equal to each other.
Extraction begins by initiating the table and the queue with the empty sequence ε. We will pop from the queue in order of prefix weight, though this is not necessary when not considering anytime stopping. At this point the table is:
P\S |  a  |  b  |  $
ε   | 0.5 | 0.4 | 0.1
The first prefix considered is ε; it is already in P. It is consistent, simply as it is not similar to any other row. However, it might not be closed. Its continuations are added to the queue, to check its closedness later. The queue is now a, b.
Next is a (which has prefix weight 0.5). Its row is (0.7, 0.25, 0.05), which is not equal to the only row in the table: (0.5, 0.4, 0.1). It follows that the table was not closed, and a is added to P. The table is now:
P\S |  a  |  b   |  $
ε   | 0.5 | 0.4  | 0.1
a   | 0.7 | 0.25 | 0.05
a is also consistent, simply as it has no equal rows. Its continuations are added to the queue to check closedness, giving a queue of b, aa, and ab.
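The closedness check driving this walkthrough can be sketched as follows. This is our own simplified illustration: rows are compared entrywise against a tolerance t (a simplification of the similarity criterion), and the row for b is an invented placeholder, not a value from the example.

```python
def rows_similar(r1, r2, t):
    """Rows are considered similar if every entry differs by less than t."""
    return all(abs(x - y) < t for x, y in zip(r1, r2))

def find_unclosed(table, queued, t=0.1):
    """Return a queued continuation whose row is far from every row of P,
    or None if the table is closed."""
    for prefix, row in queued.items():
        if not any(rows_similar(row, r, t) for r in table.values()):
            return prefix
    return None

table = {"": (0.5, 0.4, 0.1)}            # rows over suffixes (a, b, $)
queued = {"a": (0.7, 0.25, 0.05),        # row from the walkthrough
          "b": (0.52, 0.38, 0.10)}       # hypothetical placeholder row
unclosed = find_unclosed(table, queued)  # "a": no similar row in the table
```

When `find_unclosed` returns a prefix, that prefix is added to P, exactly as a was added above; when it returns None, the table is closed.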
Now, for each of the remaining continuations in the queue, the row is equal to an existing row in the table, meaning that the table is closed. None of the prefixes in