Learning state machines via efficient hashing of future traces

State machines are popular models for modeling and visualizing discrete systems such as software systems, and for representing regular grammars. Most algorithms that passively learn state machines from data assume all data to be available from the start and load it into memory. This makes them hard to apply to continuously streaming data and results in large memory requirements on large datasets. In this paper we propose a method to learn state machines from data streams using the count-min sketch data structure to reduce memory requirements. We apply state merging using the well-known red-blue framework to reduce the search space. We implemented our approach in an established framework for learning state machines and evaluated it on a well-known dataset to provide experimental data, showing the effectiveness of our approach with respect to the quality of the results and the run-time.





1 Introduction

State machines are a well-known means to model discrete systems and regular grammars. When learned from data, they provide a means to describe the underlying dynamics, as well as a method to visualize the behavior [chris_interpreting]. State machines can be learned from data in different manners. One of the most well-known ways to identify state machines is the evidence-driven state-merging algorithm, a method that uses statistical evidence and heuristics to find similarities between states of a state machine. We assume familiarity with standard algorithms for learning state machines from data, and refer the reader to [higuera] for an introduction. A drawback of most state machine learning algorithms, including evidence-driven state merging, is that they require all input data at once in a single pass, so they cannot learn in an adaptive manner. This also leads to a large memory footprint for large inputs, since all data has to be stored at once. Some of the few works tackling these two issues are [balle_2012, balle2014, schmidt2014online]. In this paper we propose an alternative approach to overcome these two limitations, and introduce a heuristic that uses count-min sketches to efficiently learn state machines from data streams in an adaptive manner.

In each state, a count-min sketch stores counts of hashes of observed futures. These futures are either subsequences starting in that state or sliding windows. Storing only counts allows us to process data streams, as we can forget most of the future sequences that are possible after reaching a state. The count-min sketch should of course provide a sufficiently good estimate of a state's future behavior. We therefore only allow merges between states with sufficient counts, determined by a threshold. To reduce memory, we only store the red core, the blue fringe, and the first layer of white states. As a consequence, the state-merging methods are much more efficient, as every merge only induces a handful of additional ones due to the determinization/folding routines. The behavior in all other states is estimated by the sketches. Although this greatly reduces the number of states in memory, we show in experiments on the PAutomaC dataset that this provides good estimates. All of our source code will be released open source as part of the FlexFringe tool.

2 Related Work

In the literature, several approaches exist for learning state machines. Active learning learns a model via actively asking queries 

[queries, vaandrager2017model], where the learner can actively pose queries to the system to extract information and propose a hypothesis. A drawback of this method, however, is that it assumes the presence of an oracle that processes and answers the queries. An efficient approach to learn from traces is [SATsolver], which reformulates the problem and uses SAT solvers to learn deterministic finite automata (DFA). The method solves the problem exactly; however, it becomes inefficient on larger datasets. Another approach to learn state machines from input traces is by means of state merging. In this case a prefix tree, the so-called Augmented Prefix Tree Acceptor (APTA), is built, which describes the set of input traces exactly. In line with Occam's razor, the goal of state merging then is to minimize the APTA while still representing the set of input traces. The algorithm achieves this by finding pairs of states that show similar behavior and merging them. Although the underlying problem is NP-hard [complexity], much research uses a state merging method; see e.g. [learning_grammars].

A popular approach is the evidence-driven state-merging algorithm (EDSM) [lang_1998]. The classical Alergia algorithm [alergia_1994] is a version of state merging that uses statistical tests to compare and merge states. The k-tails algorithm performs merges but requires identical futures up to a given depth k [ktails]. Updated search procedures improve both the quality of the learned state machines and the run-time [search1, search2, search3]. Other types of state machines can be learned via specialized algorithms, such as timed automata via the RTI algorithm [verwer_rti] or a likelihood-ratio test [likelihood], as well as extended finite state machines [walkinshaw_2016] or guarded finite state machines [gkplus]. Vodenčarević et al. proposed an algorithm that learns a hybrid state machine model by taking different aspects of a system into account during the learning process [vodencarevic_2011].

Despite the vast amount of research done on learning state machines, most state merging procedures assume that the complete data is available at the start of the program. Little work can be found on learning state machines in a streamed fashion. In [schmidt2014online] a method is presented that uses frequent-pattern data stream mining techniques to build a streaming state machine learner. [balle_2012] learns state machines using modified Space-Saving sketches, and proves properties such as convergence and memory consumption. This work was extended with a parameter search strategy in [balle2014]. In [schouten2018] a streamed merging method was implemented in the Apache framework, also based on count-min sketches to hash futures. Our approach is largely inspired by these works.

3 Background

3.1 Probabilistic deterministic finite automata

A PDFA is a tuple $A = \langle \Sigma, Q, q_0, \delta, S, F \rangle$, where $\Sigma$ is a finite alphabet, $Q$ is a finite set of states, $q_0 \in Q$ is a unique starting state, $\delta : Q \times \Sigma \to Q$ is the transition function, $S : Q \times \Sigma \to [0,1]$ is the symbol probability function, and $F : Q \to [0,1]$ is the final probability function, such that $F(q) + \sum_{a \in \Sigma} S(q, a) = 1$ for all $q \in Q$. Given a sequence of symbols $s = a_1 a_2 \ldots a_n$, a PDFA can be used to compute the probability for the given sequence: $P(s) = \left( \prod_{i=1}^{n} S(q_{i-1}, a_i) \right) \cdot F(q_n)$, where $q_i = \delta(q_{i-1}, a_i)$ for all $1 \le i \le n$. A PDFA is called probabilistic because it assigns probabilities based on the symbol and final probability functions. It is called deterministic because the transition function $\delta$ (and hence its structure) is deterministic. A PDFA computes a probability distribution over $\Sigma^*$, i.e., $\sum_{s \in \Sigma^*} P(s) = 1$.
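As a minimal illustration of the definitions above, the following sketch computes $P(s)$ for a hypothetical two-state PDFA; the dictionaries `delta`, `S`, and `F` are our own stand-ins for the transition, symbol probability, and final probability functions:

```python
# Hypothetical two-state PDFA over alphabet {a, b}.
# Sanity check: F(q) + sum_a S(q, a) = 1 holds for both states.
delta = {(0, "a"): 1, (1, "b"): 0}   # transition function
S = {(0, "a"): 0.6, (1, "b"): 0.5}   # symbol probability function
F = {0: 0.4, 1: 0.5}                 # final probability function

def pdfa_probability(seq, q0=0):
    """Compute P(seq) = (prod_i S(q_{i-1}, a_i)) * F(q_n)."""
    q, p = q0, 1.0
    for a in seq:
        if (q, a) not in delta:      # symbol not defined here: probability 0
            return 0.0
        p *= S[(q, a)]
        q = delta[(q, a)]
    return p * F[q]

# P("ab") = 0.6 * 0.5 * F(0) = 0.6 * 0.5 * 0.4 = 0.12
p_ab = pdfa_probability(["a", "b"])
```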

3.2 Learning State Machine Models

We focus on learning state machines via state merging [lang_1998]. Classical state merging starts by constructing a tree representing the input data, the APTA. The APTA is a tree accepting exactly the input traces. The goal of state merging is then to iteratively minimize this APTA while still representing the data; this is analogous to the generalization process in other machine learning algorithms. Minimization is done by comparing pairs of states on their behavior and merging states with similar futures. Since in each step multiple merges may be possible, the heuristic computes a score, and the highest-scoring possible merge is performed. In our work, we also employ the red-blue framework. The red-blue framework maintains a core of red states, with initially only the root state red. Non-red states that have a direct transition from a red state are called blue states; states that are neither red nor blue are called white. A candidate merge pair can then only be a pair of a red and a blue state.
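The APTA construction underlying state merging can be sketched as follows; `Node` and `build_apta` are hypothetical names for illustration, not FlexFringe's actual implementation:

```python
# Hedged sketch: building an APTA (prefix tree) from input traces.
class Node:
    def __init__(self):
        self.children = {}   # symbol -> Node
        self.count = 0       # number of traces passing through this state

def build_apta(traces):
    """Build a prefix tree accepting exactly the given traces."""
    root = Node()
    for trace in traces:
        node = root
        node.count += 1
        for symbol in trace:
            # Reuse the child for this symbol if it exists, else create it.
            node = node.children.setdefault(symbol, Node())
            node.count += 1
    return root

root = build_apta([["a", "b"], ["a", "c"]])
```

State merging would then repeatedly pick a red-blue state pair in this tree, check their futures for similarity, and fold one into the other.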

4 Methodology

4.1 The heuristic

In order to efficiently store the futures of states, our algorithm uses the count-min sketch (CMS) data structure [count-min-sketch]. The main idea is to store a state's future with a constant memory footprint per state, at the potential cost of approximation errors. A CMS is a matrix of counters whose columns represent counts of hashed items. Since hashes can collide, multiple rows are instantiated, each with its own associated hash function. Upon retrieving the count of an element from the CMS, the element is hashed once per row, and the minimum of the resulting counters is taken as the best approximation of the true count. Fig. 1 shows a store operation on a CMS: an element, in this case the tuple (4, 3), is fed into each row's hash function, and the increased counts are highlighted in red.

(a) Before store.
(b) After store.
Figure 1: A CMS before and after storing an element. The indices the hash functions hash the element onto are shown to the left.
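A minimal count-min sketch along the lines of Fig. 1 might look as follows; the salted-hash construction and the parameter names are our own illustrative choices, not the exact implementation used in our tool:

```python
import hashlib

class CountMinSketch:
    """A count-min sketch: d rows of w counters, one hash function per row.
    Illustrative sketch only; width/depth defaults are arbitrary."""

    def __init__(self, width=16, depth=4):
        self.w, self.d = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Derive a per-row hash by salting the item with the row number.
        h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.w

    def store(self, item):
        for row in range(self.d):
            self.table[row][self._index(row, item)] += 1

    def count(self, item):
        # Minimum over rows: collisions can only inflate counters,
        # so the minimum is the tightest (never under-) estimate.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.d))

cms = CountMinSketch()
for _ in range(5):
    cms.store((4, 3))
# With only one distinct item stored, cms.count((4, 3)) is exactly 5.
```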

We use several CMSs per state, one per future depth. If, for example, we take the state shown in Fig. 2 and set the number of sketches to 3, the 3 sketches would look as follows: sketch 1 would store a 2 and a 3, sketch 2 would store a 4 and a 15, and sketch 3 would store a 3 and a 1. In addition to the normal outgoing symbols, we reserve a dedicated bin in each sketch for terminating sequences: whenever a sequence terminates in a state, we store a "sequence termination" in the last column of the sketch, and only terminations can be stored in that column.

Figure 2: An excerpt of a prefix tree.
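The per-depth bookkeeping described above can be sketched as follows. For readability, a plain `Counter` stands in for each count-min sketch, the future depth is fixed at 3, and the `END` marker plays the role of the dedicated termination bin; all names are illustrative:

```python
from collections import Counter

N_DEPTH = 3          # number of sketches per state (future depth)
END = "<end>"        # stand-in for the dedicated termination bin

def record_future(sketches, future):
    """Store the first N_DEPTH symbols of `future` into the state's
    per-depth sketches; depths past the end record a termination."""
    for depth in range(N_DEPTH):
        if depth < len(future):
            sketches[depth][future[depth]] += 1
        else:
            sketches[depth][END] += 1

state_sketches = [Counter() for _ in range(N_DEPTH)]
record_future(state_sketches, ["a", "b"])   # depth 0: a, depth 1: b, depth 2: end
```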

Whenever we want to merge a pair of states, all we have to do then is compare the corresponding sketch pairs, i.e., sketch 1 of the red state with sketch 1 of the blue state, sketch 2 of the red state with sketch 2 of the blue state, and so on. We consider the rows of the CMS as distributions, and perform a Hoeffding-bound test similar to [alergia_1994] (Eq. 1) for each corresponding pair of rows of the two sketches:

$\left| \frac{f_1(a)}{n_1} - \frac{f_2(a)}{n_2} \right| < \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) \qquad (1)$

In this equation, $f_1(a)/n_1$ is the relative frequency of element $a$ in distribution 1, $f_2(a)/n_2$ its relative frequency in distribution 2, and $\alpha$ is a hyperparameter to be set. Apart from checking whether two states can be merged, we also need a score function. In order to assign a score to a merge, we consider the rows of the CMS as vectors, and compute the sum of the cosine similarities of each corresponding vector (row) pair $u_i$ and $v_i$ (Eq. 2) between the two sketches, averaged over the number of rows $d$ of each sketch:

$\mathrm{score} = \frac{1}{d} \sum_{i=1}^{d} \frac{u_i \cdot v_i}{\lVert u_i \rVert \, \lVert v_i \rVert} \qquad (2)$

Furthermore, when performing or undoing a merge on two states, we need to update the state information. To this end, we simply treat the sketches of the individual states as matrices, which we sum on a merge.
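A hedged sketch of the compatibility test and the merge score might look as follows; it assumes the Hoeffding bound of [alergia_1994] applied per counter and a mean cosine similarity over row pairs, with function and parameter names of our own choosing:

```python
import math

def hoeffding_compatible(row1, row2, alpha=0.05):
    """Alergia-style check: compare the relative frequency of every
    counter in two count vectors against a Hoeffding bound."""
    n1, n2 = sum(row1), sum(row2)
    if n1 == 0 or n2 == 0:
        return True   # not enough evidence to reject the merge
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2))
    return all(abs(c1 / n1 - c2 / n2) < bound
               for c1, c2 in zip(row1, row2))

def cosine_score(rows1, rows2):
    """Merge score: mean cosine similarity over corresponding rows."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0
    return sum(cos(u, v) for u, v in zip(rows1, rows2)) / len(rows1)
```

Merging two states would then simply add the two sketch matrices element-wise, and undoing a merge would subtract them again.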


4.2 Streaming

In order to stream state machines we opted for an approach similar to the one in [balle_2012]. We adopt the red-blue framework and include an evidence threshold. Our streaming starts with the root node as a red node. In contrast to the normal batch-wise approach, we only create new states with direct transitions coming from red or blue states. We furthermore introduce a count threshold $t$ for states: every time an incoming sequence passes a state, we increase a counter on that state by one, and only when the counter of a state passes $t$ can that state become a blue state and be considered for a merge. Merges happen only in batches: once the batch size $b$ is reached, we perform merges until no more are possible, and then read the next batch. The streaming procedure is described in Alg. 1.

Input: Set of sequences $S$, batch size $b$, threshold $t$
Output: A hypothesis (automaton) $H$
$q_0 \gets$ root node, colored red; $c \gets 0$;
foreach $s \in S$ do
      $q \gets q_0$;
      foreach symbol $a$ of $s$ do
            $count(q) \gets count(q) + 1$;
            if a transition from $q$ with $a$ exists then
                  $q \gets \delta(q, a)$;
            else if $q$ is red or $q$ is blue then
                  create a new white node $q'$ satisfying $\delta(q, a) = q'$;
                  $q \gets q'$;
            else
                  // $q$ is a white state: do not extend further
                  break;
            if $q$ is white, its parent is red, and $count(q) \ge t$ then
                  mark $q$ blue;
      $c \gets c + 1$;
      if $c \ge b$ then
            while a merge is possible do
                  perform the highest-scoring merge;
            $c \gets 0$;
Algorithm 1
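The loop of Alg. 1 can be rendered in Python roughly as follows; `Node`, the color attributes, and `try_merges` are stand-ins of our own, and details such as the exact point at which a state is promoted to blue are our interpretation:

```python
# Hedged Python rendering of the streaming loop of Alg. 1.
class Node:
    def __init__(self, color="white"):
        self.children = {}   # symbol -> Node
        self.count = 0       # number of sequences passing this state
        self.color = color

def stream(sequences, batch_size, threshold, try_merges):
    """Feed sequences into a growing red-blue automaton; call
    try_merges() after each batch to merge until no merges remain."""
    root = Node(color="red")
    for i, seq in enumerate(sequences, start=1):
        root.count += 1
        node = root
        for symbol in seq:
            child = node.children.get(symbol)
            if child is None:
                if node.color not in ("red", "blue"):
                    break            # white state: do not extend further
                child = node.children[symbol] = Node()   # new white state
            child.count += 1
            if (node.color == "red" and child.color == "white"
                    and child.count >= threshold):
                child.color = "blue"     # enough evidence: promote
            node = child
        if i % batch_size == 0:
            try_merges()             # merge until no more merges possible

    return root

calls = []
root = stream([["a"], ["a"]], batch_size=1, threshold=2,
              try_merges=lambda: calls.append(1))
```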

5 Experiments and results

In order to evaluate our approach, we implemented our streaming procedure and our heuristic as modules in FlexFringe [flexfringe, flexfringeRepo], and compared them with a baseline, namely FlexFringe's implementation of the Alergia algorithm. The streaming procedure described in Alg. 1 is generic enough to use any merge heuristic that supports merge and undo procedures, and thus our implementation allows us to exchange the sketching and Alergia heuristics very easily. The dataset we used is the well-known PAutomaC dataset [verwer_pautomac]. We set the batch size to and the threshold to , to give the statistical tests enough evidence, and ran both the sketching with different future lengths and the Alergia algorithm. In order to have a better comparison, we augmented the Alergia algorithm with the well-known k-tails algorithm [ktails], from which we use the $k$ parameter. Since we only append to red and blue states, a larger value of $k$ would be meaningless, hence we only used and .

Our performance metric is the difference in perplexity between the original automaton and our learned automaton, as described in [verwer_pautomac]. Fig. 3(a) shows the streamed Alergia results with the two different values for $k$. It is clear that the larger value performs better. In order to check how the sketching algorithm scales with respect to the future length, we plot the perplexities in a similar fashion, this time varying only that parameter. The plots are shown in Fig. 3(b). Increasing the future length from to has a large impact, similar to increasing $k$ from to in Alergia. Increasing the value to improves some problems minimally; however, we noticed that from on, the quality of the resulting state machine starts to degrade. Also note that the sketching and Alergia have peaks at the same problems in their weaker settings, and both improve drastically on the same problems when given more information. In a next step we compare the sketching with Alergia in Fig. 3(c). We can see that the sketching performs better than Alergia on most problems, which is also evidenced by the average error of for the sketching in this setting and for Alergia. Increasing the future length to would result in an average error of in our experiment for the sketching, where it always performed equally well or minimally better than with .
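For reference, the PAutomaC-style perplexity score might be computed roughly as follows; this is a hedged reconstruction (function name and normalization are ours), and [verwer_pautomac] gives the exact definition:

```python
import math

def pautomac_perplexity(target_probs, model_probs):
    """PAutomaC-style perplexity: 2^(-sum_s P_target(s) * log2 P_model(s)),
    with both probability lists normalized over the test set.
    Assumes all model probabilities are strictly positive."""
    zt, zm = sum(target_probs), sum(model_probs)
    return 2.0 ** (-sum((pt / zt) * math.log2(pm / zm)
                        for pt, pm in zip(target_probs, model_probs)))
```

The reported error would then be the difference between the perplexity of the learned model and that of the target automaton on the same test set.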

We also compared runtime, and found almost identical times. While the sketching took in total on all problems with on our machine (Ubuntu 20.04, Intel i7@2.60GHz, 16GB RAM), Alergia took with . Increasing the future length to 3 increased the runtime to ; the $k$ parameter had little effect on runtime from to , with a runtime of at . Last but not least, it is worth noting that while Alergia's performance increased with larger $k$, doing the same with the sketching approach actually produced worse results. In order to compare the batch mode with the stream mode, we ran Alergia in batch mode and the sketching with in stream mode. The comparison is depicted in Fig. 3(d). It can be seen that the streamed version performs worse than the batched version, as is also evident from the average error of for the batch mode. However, this comes at the cost of much larger memory and runtime consumption. While the streamed version ran for less than , the batch mode took , or more than minutes. In terms of memory consumption the batched version is also much more costly, as can be seen in Table 1, where we compare the memory footprint of batch-mode Alergia and stream-mode sketching on a few selected PAutomaC scenarios. To obtain the consumption, we measured dynamic memory allocation via the Massif tool from the Valgrind toolkit.

(a) Alergia with varying .
(b) Sketching heuristic with different values.
(c) Sketching heuristic with vs. Alergia with .
(d) Sketching heuristic with in stream mode vs. Alergia with in batch mode.
Figure 3: Perplexity score errors on all experiments. The x-axis indicates the dataset’s scenarios from 1 to 48
Method Scenario 8 Scenario 9 Scenario 20 Scenario 21
Alergia batch 1GB 58.47MB 957.7MB 2.049GB
Sketching stream 95MB 17.4MB 35.53MB 56.58MB
Table 1: Batch vs. stream mode memory consumption (maximum heap size) for a few selected scenarios.

6 Discussion and limitations

First, we examine the gain in performance for Alergia with increasing $k$. In batch mode, performance slightly increases up to a value of , then it enters a plateau. Apparently, for the PAutomaC dataset, it is sufficient to know the next steps ahead. The decrease in the performance of our sketching with can possibly be explained by the fact that the algorithm will prefer non-optimal merges, since the sketches "pool" the future subtrees together for the score computation.

The advantage of our sketching approach lies in the fact that we can look further ahead than other heuristics such as Alergia while still discarding information, i.e., the states themselves. We also approximate infrequently used symbols, unlike [balle_2012], who use a modification of the Space-Saving algorithm to approximate the most frequent features only. A drawback of our sketching, though, is that with a too large alphabet our approximations can suffer too many collisions, leading to worse results. We are not yet sure why our method's results degrade with larger future lengths; we can only provide a hypothesis at this stage. In order to do k-tails with our way of streaming, pairs of white states with transitions coming out of blue states have to be compared. Since those states are usually infrequent, they possibly cannot provide much evidence, leading to incorrect results of our statistical tests. A limitation of this work is the nature of the dataset, which appears simple enough that few lookaheads and simple methods suffice to obtain good results. But our results do demonstrate that the approach works. Another limitation is the size of the datasets, which are small for streaming algorithms. In the future, we aim to test our method on different datasets such as network traffic and software logs, which are typically too large to fit into memory.

7 Conclusion

In this work we introduced a new approach for streaming state machines and showed first experimental results. We compared our sketching approach with a conventional method and pointed out advantages and disadvantages based on our results. We conclude that our method works as expected, and we expect the advantages to show more clearly on larger datasets, where deeper lookaheads into the future of a state are necessary to make good predictions. The streaming also enables the processing of very large data in a cost-effective manner. A clear drawback of our method, however, is that the sketches have a fixed size when the algorithm starts, and thus the number of potential symbols the traces may contain must be known or estimated before running it. When set too small, the results can be impaired by collisions inside the CMS.

8 Acknowledgements

This work is supported by NWO TTW VIDI project 17541 - Learning state machines from infrequent software traces (LIMIT).