Collections of sequences, also known as sequence databases, are a common data source for knowledge extraction in many domains. Consider for example DNA and protein sequences, business process execution traces, customer purchasing histories, and software execution traces. Accordingly, the task of mining frequent patterns from sequence databases, known as sequential pattern mining, is a mainstream research area in data mining. Originally, sequential pattern mining focused on extracting patterns that capture consecutive subsequences that recur in many sequences of a sequence database (Srikant and Agrawal, 1996; Zaki, 2001; Han et al, 2001). More recently, algorithms have been proposed to mine gapped sequential patterns (Ding et al, 2009; Tong et al, 2009; Wu et al, 2014), which allow gaps between two successive events of the pattern, and repetitive sequential patterns (Ding et al, 2009; Tong et al, 2009), which capture not only subsequences that recur in multiple sequences, but also subsequences that recur frequently within the same sequence.
While sequential patterns sometimes generate interesting insights, their practical application for data exploration is hindered by the fact that they often overload the analyst with a large number of patterns. To address this issue, several approaches have been proposed to describe sequence databases with smaller sets of patterns. One approach is to mine closed patterns (Yan et al, 2003; Ding et al, 2009) – patterns for which there does not exist an extension with the same support. Another approach is to mine maximal patterns (Fournier-Viger et al, 2014) – patterns for which there does not exist an extension that meets the support threshold. Yet another approach is to mine compressed patterns following the minimal description length principle (Tatti and Vreeken, 2012; Lam et al, 2014). These approaches however are limited by the fact that each pattern captures only one subsequence.
In alternative approaches, a pattern may capture multiple subsequences, including subsequences that are not observed in the sequence database but that are related to observed subsequences. In other words, the extracted patterns generalize the observed behavior. For example, episodes (Mannila et al, 1997) extend sequential patterns with parallelism by allowing a pattern to incorporate partial order relations. Harms et al (2001) proposed a technique to mine closed episodes, while Pei et al (2006) showed that episode mining techniques cannot mine arbitrary partial orders and proposed a technique to mine closed partial order patterns. Sætrom and Hetland (2003) use genetic programming to mine patterns that are expressed in the regular-expression-like Interagon Query Language (IQL), thereby allowing the patterns to generalize to a higher degree than episodes and partial order patterns by additionally allowing patterns to incorporate choice and Kleene star (i.e., repetition) constructs.
A recently proposed type of pattern that goes beyond the generalization capabilities of episodes, partial order patterns, and IQL patterns is the Local Process Model (LPM) (Tax et al, 2016b). In this approach, a pattern is a process model consisting of an arbitrary combination of sequence, parallelism, choice, and loop constructs. LPM mining iteratively expands patterns into larger candidate patterns and then evaluates the generated candidates. The approach shares common traits with the CloGSgrow algorithm (Ding et al, 2009), in that it mines gapped patterns and uses a notion of repetitive support, meaning that it counts multiple occurrences of a pattern within the same sequence. The generated patterns (LPMs) are encoded as process trees (Buijs et al, 2012), a tree-based process modeling language to specify behavior over a set of symbols. Mined LPMs may be visualized, for example, as Petri nets (Murata, 1989).
Given their properties, it is possible for a small set of LPMs to cover the behavior of a much larger set of sequential patterns. However, the original LPM mining technique (Tax et al, 2016b) still suffers from the pattern explosion problem because it is designed to extract one LPM at a time (in isolation). When applied repeatedly to a sequence database, this algorithm leads to a set of redundant LPMs.
Table 1a shows a sequence database, and Table 1b shows the nine patterns produced by the CloGSgrow algorithm (Ding et al, 2009) with a minimum support of three. In total, CloGSgrow requires 29 patterns to describe the behavior in this sequence database. Applying basic LPM mining with a minimum support of three leads to 717 patterns, two of which are shown in Figure 1. The LPM of Figure 1a (LPM (a)) expresses that is followed by , , and , where the can only occur after , and the can occur at any point after . The LPM of Figure 1b (LPM (b)) is equivalent to regular expression E(BA)*F. The semantics of LPMs will be introduced formally in Section 2. The numbers printed in the LPMs respectively indicate the number of events explained by the LPM patterns and the number of occurrences of each activity in the sequence database, e.g., 4 out of 10 occurrences of activity in the sequence database are explained by LPM (a). The four instances of LPM (a) are indicated in red in Table 1a, and the five instances of LPM (b) are indicated in blue. LPMs (a) and (b) together describe almost all behavior in the sequence database in a compact manner. While basic LPM mining with a minimum support of three results in 717 patterns, the desired output would be only the two LPMs of Figure 1. In this paper, we propose heuristics to mine a set of non-redundant LPMs either from a set of redundant LPMs or from a set of gapped sequential patterns such as those produced by CloGSgrow.
Automated process discovery techniques (van der Aalst, 2016) mine a single process model that describes the behavior of a sequence database. The Inductive Miner (Leemans et al, 2013) is a representative of this family of techniques. Figure 2 shows the process model produced by the Inductive Miner when applied to the sequence database of Table 1a. While process discovery techniques produce useful results over simple sequence databases, they create process models that are either very complex (‘spaghetti’-like) or overgeneralizing (i.e., allowing for too much behavior, such as the process model of Figure 2) when applied to real-life datasets. In this paper, we show that on real-life datasets, a set of LPMs can capture the sequence database more accurately and with lower complexity than a single process model discovered using existing automated process discovery techniques.
This paper is structured as follows. Section 2 introduces basic concepts and notation. Section 3 outlines quality criteria for LPM sets. Section 4 presents the proposed heuristics to discover sets of non-redundant LPMs. Section 5 presents an empirical evaluation of the proposed heuristics using real-life sequence databases. Finally, Section 6 discusses related work, while Section 7 draws conclusions.
In this section, we introduce notation and basic concepts related to sequence databases, process models, process discovery, and Local Process Models (LPMs).
2.1 Events, Sequences, and Sequence Databases
denotes the set of all sequences over a set and a sequence of length , with and . is the empty sequence and is the concatenation of sequences and . We denote with the projection of sequence on set , e.g., for , and , . Likewise, indicates sequence where all members of are filtered out, e.g., . is the prefix of length (with ) of sequence , for example, . A multiset (or bag) over is a function which we write as , where for we have and . The set of all multisets over is denoted .
An event denotes the occurrence of an activity. We write to denote the set of all possible activities. An event sequence (called a trace in the process mining field) is a sequence . A sequence database (called an event log in the process mining field) is a finite multiset of sequences, . For example, the sequence database consists of two occurrences of sequence and three occurrences of sequence . We lift projection and filtering of sequences to multisets of sequences, i.e., .
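The operations of Section 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`project`, `filter_out`, `prefix`, `project_db`) and the example database are assumptions introduced here, sequences are modeled as tuples of activity labels, and a sequence database is modeled as a `collections.Counter`, i.e., a multiset of sequences.

```python
from collections import Counter

def project(seq, keep):
    """Projection: keep only the events whose activity is in `keep`."""
    return tuple(e for e in seq if e in keep)

def filter_out(seq, drop):
    """Filtering: remove all events whose activity is in `drop`."""
    return tuple(e for e in seq if e not in drop)

def prefix(seq, k):
    """Prefix of length k (0 <= k <= len(seq))."""
    return tuple(seq[:k])

def project_db(db, keep):
    """Lift projection to a multiset of sequences.

    Sequences that become equal after projection are merged and
    their multiplicities are added up.
    """
    out = Counter()
    for seq, count in db.items():
        out[project(seq, keep)] += count
    return out

# Hypothetical sequence database: two occurrences of <a,b,c>
# and three occurrences of <a,c,b>.
db = Counter({("a", "b", "c"): 2, ("a", "c", "b"): 3})
```

With this representation, projecting the hypothetical database on the activities a and b collapses both distinct sequences into a single sequence with multiplicity five.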
2.2 Process Models and Process Discovery
We use Petri nets to represent process models due to their formal semantics. A Petri net is a directed bipartite graph consisting of places (depicted as circles) and transitions (depicted as rectangles), connected by arcs. A transition describes an activity, while places represent the enabling conditions of transitions. Labels of transitions indicate the type of activity that they represent. Unlabeled transitions (-transitions) represent invisible transitions (depicted as gray rectangles), which are only used for routing purposes and are not recorded in the sequence database.
Definition 1 (Labeled Petri net)
A labeled Petri net is a tuple where is a finite set of places, is a finite set of transitions such that , is a set of directed arcs, called the flow relation, and is a partial labeling function that assigns a label to a transition, or leaves it unlabeled (the -transitions).
We write and for the input and output nodes of (according to ). A state of a Petri net is defined by its marking being a multiset of places. A marking is graphically denoted by putting tokens on each place . State changes occur through transition firings. A transition is enabled (can fire) in a given marking if each input place contains at least one token. Once fires, one token is removed from each input place and one token is added to each output place , leading to a new marking . A firing of a transition leading from marking to marking is denoted as step . Steps are lifted to sequences of firing enabled transitions, written and is a firing sequence.
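The firing rule can be made concrete with a minimal sketch. The representation is an assumption made for illustration: a marking is a `collections.Counter` over place names, and a transition is given by its set of input places and its set of output places; the names `enabled` and `fire` are hypothetical.

```python
from collections import Counter

def enabled(marking, input_places):
    """A transition is enabled if every input place holds at least one token."""
    return all(marking[p] >= 1 for p in input_places)

def fire(marking, input_places, output_places):
    """Fire a transition: consume one token per input place and
    produce one token per output place, returning the new marking."""
    if not enabled(marking, input_places):
        raise ValueError("transition is not enabled in this marking")
    new_marking = Counter(marking)
    for p in input_places:
        new_marking[p] -= 1
    for p in output_places:
        new_marking[p] += 1
    return +new_marking  # unary plus drops places with zero tokens

# Hypothetical net: t1 moves the token from p1 to p2 and p3 (parallel split),
# t2 needs tokens in both p2 and p3 (parallel join) and produces one in p4.
m0 = Counter({"p1": 1})
m1 = fire(m0, {"p1"}, {"p2", "p3"})
m2 = fire(m1, {"p2", "p3"}, {"p4"})
```

A firing sequence is then simply a chain of such `fire` calls, each starting from the marking produced by the previous one.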
A partial function with domain can be lifted to sequences over using the following recursion: (1) ; (2) for any and :
Defining initial and final markings allows us to define the language accepted by a Petri net as a set of finite sequences of activities.
Definition 2 (Accepting Petri Net)
An accepting Petri net is a triplet , where is a labeled Petri net, is its initial marking, and its final marking. A sequence is a trace of an accepting Petri net if there exists a firing sequence , and .
In this paper, places that belong to the initial marking contain a token and places belonging to the final marking are marked as .
The language is the set of all its traces, i.e., , which can be of infinite size when contains loops. While we define the language for accepting Petri nets, in theory, can be defined for any process model with formal semantics. We denote the universe of process models as . For each , is defined.
A process discovery method is a function that produces a process model from a sequence database. The discovered process model should cover as much as possible the behavior observed in the sequence database (a property called fitness) while it should not allow for too much behavior that is not observed in the sequence database (called precision). For a sequence database , is the trace set of . For example, for sequence database , . For a sequence database and a process model , we say that is fitting on if . Precision is related to the behavior that is allowed by a model that was not observed in the sequence database , i.e., .
2.3 Local Process Models
Local Process Models (LPMs) (Tax et al, 2016b) are process models that describe frequent but partial behavior; i.e., they model a subset of the activities of the process seen in the sequence database. An iterative expansion procedure is used to generate a ranked collection of LPMs. This expansion procedure is usually bounded to a maximum number of expansion steps (in practice often 4), as the expansion procedure is a combinatorial problem of which the size depends on the number of activities in the sequence database as well as the maximum number of activities in the LPMs that are mined. LPMs can be represented in any process modeling notation, such as BPMN (http://www.bpmn.org/), UML Activity Diagrams (http://www.omg.org/spec/UML/2.5/), or Petri nets (Murata, 1989). In this paper we use the latter due to their formal semantics.
A process tree is a tree where leaf nodes represent activities, and non-leaf nodes represent operators that specify the allowed behavior over the activity nodes. Supported operator nodes are the sequence operator (→) that indicates that the first child is executed before the second, the exclusive choice operator (×) that indicates that exactly one of the children can be executed, the concurrency operator (∧) that indicates that every child will be executed but allows for any ordering, and the loop operator (↻), which has one child node and allows for repeated execution of this node. represents the language of , i.e., the set of sequences allowed by the model. Figure 3d shows an example process tree , with . It indicates that either activity A or D is executed first, followed by activities B and C in any order.
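To make the operator semantics concrete, the following sketch enumerates the language of a process tree. It is an illustration under stated assumptions, not the implementation used for LPM mining: trees are encoded as nested tuples with the hypothetical operator names `"seq"`, `"xor"`, `"and"`, and `"loop"`; sequence, choice, and concurrency nodes are binary; the loop operator is assumed to execute its child at least once; and loops are unrolled at most `max_loops` times so that the enumerated language stays finite.

```python
from itertools import product

def interleavings(u, v):
    """All order-preserving interleavings of two sequences."""
    if not u:
        return {v}
    if not v:
        return {u}
    return ({(u[0],) + w for w in interleavings(u[1:], v)}
            | {(v[0],) + w for w in interleavings(u, v[1:])})

def language(tree, max_loops=2):
    """Set of sequences allowed by a process tree (loops bounded)."""
    if isinstance(tree, str):            # leaf node: a single activity
        return {(tree,)}
    op, *children = tree
    langs = [language(c, max_loops) for c in children]
    if op == "seq":                      # first child, then second child
        return {u + v for u, v in product(*langs)}
    if op == "xor":                      # exactly one of the children
        return set().union(*langs)
    if op == "and":                      # both children, in any ordering
        return {w for u, v in product(*langs) for w in interleavings(u, v)}
    if op == "loop":                     # child repeated 1..max_loops times
        (child,) = langs
        result, current = set(), {()}
        for _ in range(max_loops):
            current = {u + v for u in current for v in child}
            result |= current
        return result
    raise ValueError(f"unknown operator: {op}")

# Either A or D first, followed by B and C in any order.
example_tree = ("seq", ("xor", "A", "D"), ("and", "B", "C"))
example_language = language(example_tree)
```

For the example tree, the enumerated language consists of the four sequences ABC, ACB, DBC, and DCB.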
An algorithm to generate a ranked list of LPMs via iterative expansion of candidate process trees is proposed in Tax et al (2016b). An expansion step of an LPM is performed by replacing one of the leaf nodes of the process tree by an operator node (i.e., →, ×, ∧, or ↻), where one of the child nodes is the activity of the replaced leaf node and the other is a new activity node . is the process model universe; i.e., the set of all possible process models. An LPM can be expanded in many ways, as it can be extended by replacing any one of its activity nodes, expanding it with any of the operator nodes, and with a new activity node that represents any of the activities in the sequence database. We define as the set of expansions of , and the maximum number of expansions allowed from an initial LPM; i.e., an LPM containing only one activity.
To evaluate a given LPM on a given sequence database , its sequences are first projected on the set of activities in the LPM, i.e., . The projected sequence is then segmented into -segments that fit the behavior of the LPM and -segments that do not fit the behavior of the LPM, i.e., such that and . We define to be a function that projects sequence on the LPM activities and obtains its subsequences that fit the LPM, i.e., .
Consider and trace from Figure 3. Function retrieves the set of process activities in the LPM, e.g., . Projection on the activities of the LPM gives . Figure 3e shows the segmentation of the projected sequence on the LPM, leading to . The segmentation starts with an empty non-fitting segment , followed by a fitting segment , which completes one run through the process tree. The second event in cannot be replayed on , since it only allows for one and already contains a . This results in a non-fitting segment . again represents a run through the process tree. The segmentation ends with non-fitting segment . We lift segmentation function to sequence databases, . An alignment-based (van der Aalst et al, 2012) implementation of , as well as a method to rank and select LPMs based on their support, i.e., the number of events in , is given in Tax et al (2016b).
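The segmentation idea can be sketched as follows. This is not the alignment-based implementation of Tax et al (2016b); it is a simplified stand-in that assumes the LPM is given by a finite set of accepted runs (e.g., a loop-bounded language) and uses dynamic programming to pick non-overlapping fitting segments that maximize the number of explained events. The function name `segment` and the example sequence are hypothetical.

```python
def segment(seq, lang):
    """Return the fitting segments of `seq` with respect to an LPM.

    `lang` is a finite set of tuples, each a complete run of the LPM.
    The sequence is first projected on the LPM activities; fitting
    segments are contiguous in the projected sequence.
    """
    activities = {a for run in lang for a in run}
    proj = tuple(e for e in seq if e in activities)
    n = len(proj)
    # best[i] = (number of explained events, fitting segments) for proj[i:]
    best = [(0, [])] * (n + 1)
    for i in range(n - 1, -1, -1):
        candidate = best[i + 1]          # leave event i unexplained
        for run in lang:
            k = len(run)
            if k and proj[i:i + k] == run:
                score = k + best[i + k][0]
                if score > candidate[0]:
                    candidate = (score, [run] + best[i + k][1])
        best[i] = candidate
    return best[0][1]

# Runs of a tree "A or D first, then B and C in any order".
runs = {("A", "B", "C"), ("A", "C", "B"), ("D", "B", "C"), ("D", "C", "B")}
# Hypothetical trace; X is not an LPM activity and is projected away.
trace = ("A", "X", "B", "C", "A", "D", "B", "C")
fitting = segment(trace, runs)
```

In this hypothetical trace, the projected sequence is A, B, C, A, D, B, C: the first three events form one fitting segment, the second A cannot be extended to a complete run and remains unexplained, and D, B, C forms a second fitting segment.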
3 Quality Criteria for Local Process Model Sets
We illustrate, through an example, the need for quality criteria for Local Process Model (LPM) sets that take into account the redundancy of the pattern set. Consider the sequence database shown in Table 1a and the Local Process Model set that consists of the three LPMs of Figure 4, to which, compared to Figure 1, LPM (c) is added as one example of a redundant LPM from the set of 717 LPMs. There is overlap in the activities that are described by the LPMs, e.g., LPM (a), (b), and (c) all contain a transition that is labeled , which means that there are multiple candidate patterns with which the occurrence of an instance of can be explained. Table 2 highlights the instances of the three LPMs in as found by the alignment-based support scoring approach for LPMs, indicating the events that are part of an LPM instance in bold, and indicating a single instance of the LPM pattern by . As shown earlier, LPMs (a) and (b) together explain all events except for the single -event that is indicated in red. Notice that there is no overlap between LPMs (a) and (b) in the events that they explain; together, they therefore provide a near-perfect explanation of the sequence database. While the red -event is part of a pattern instance of LPM (c), it cannot be explained by the LPM set, as the and events of the same pattern instance clash with an instance of LPM (a). One could choose to use these and events for LPM (c) instead of LPM (a); however, that would lead to two events (indicated in blue) remaining unexplained for instead of only one. It is clear that there is redundancy in LPM set , as LPM (c) does not contribute to the set of events of the sequence database that are explained by the LPMs. The degree of redundancy in a given LPM set is generally not immediately clear, since each event in the sequence database can be part of a pattern instance of multiple LPMs in an LPM set.
|Trace||LPM (a)||LPM (b)||LPM (c)|
To summarize a sequence database in the form of LPMs, it is sufficient to have each event described by only one of the LPMs. To obtain an allocation of events to LPMs that provides an optimal number of explained events, we transform the set of LPMs into a single process model by merging the places of the initial markings of each LPM in into a single place , and set as the new initial marking of the merged model. Furthermore, we merge the places of the final markings of the LPMs in into a new place , which we set as the new final marking of the merged model. We will show how this global model can be used to detect instances of the LPMs in the sequence database. Formally, given an LPM set with each LPM being represented by an accepting Petri net , with , we first transform each Petri net into where:
A single sequence may contain occurrences of multiple LPMs in the set. Therefore, we add a silent transition connecting the final place to the initial place. This allows the model to accept any concatenation of occurrences of LPMs. Furthermore, it can be the case that a sequence contains no instance of any of the LPMs. Therefore, we redefine the final marking of the merged model to its initial marking to allow it to accept the empty sequence . Formally, given an LPM set with each LPM being represented by an accepting Petri net , with , the merged global Petri net representing is a Petri net such that:
The alignments algorithm (van der Aalst et al, 2012) provides a minimal alignment between a sequence from a sequence database and a process model. An alignment between a sequence and a process model is a pairwise matching between events and activities allowed by the model. Sometimes, events cannot be matched to any of the transitions. For instance, an event occurs when not allowed according to the model (i.e., it is not enabled). In this case, the event cannot be matched to a transition firing, resulting in a so-called move in log. Other times, an activity should have been executed according to the model but is not observed in the sequence database. This results in a transition that cannot be matched to an event in the sequence database, thus resulting in a so-called move in model. When an event in the sequence database can be correctly matched to a transition firing in the process model, this is called a synchronous move. An optimal alignment (van der Aalst et al, 2012) can be computed by searching for the mapping between a process model and a sequence that minimizes the number of moves in model and moves in log and maximizes the number of synchronous moves, using the A* search algorithm.
Using alignments on the global model constructed from the LPM set and sequence database we can calculate the degree to which the LPM set explains the . The events that are explained by are the ones that are mapped to a synchronous move in the alignments, while the unexplained events correspond to moves in log. However, we want to count exact and complete observations of the LPMs, while the move-in-model option in the alignment search space allows for observations of LPM instances where activities are missing. To prevent this, we calculate alignments where we only allow synchronous moves and moves in log, and we allow moves in model only on silent transitions. Without allowing moves in model on silent transitions we would never be able to fire such transitions, as they have no corresponding events in the sequence database. Table 3 shows the alignment of the first sequence of the sequence database, , on the global Petri net that we constructed from the LPM set shown in Figure 5. The alignment starts with a synchronous move on activity , which the model can mimic by firing (enabled in the initial marking). After that, the alignment likewise can perform synchronous moves on activities , , and . Finally, to complete one instance of LPM (a), silent transition is fired to join the two parallel branches. The sequence database cannot mimic , leading to a model move, which is allowed since the transition is silent. Then, a model move is performed on silent transition , leading to the final marking, where the process model could stop moving. However, the sequence database contains another instance of an LPM. The sequence database continues with two -events; however, the process model has no way to mimic this with two consecutive firings of , so it performs a synchronous move on one of the two -events and a log move on the other one. Which -event is used for the synchronous move and which one for the log move is arbitrary; either choice leads to an optimal alignment. After that, synchronous moves on , , and bring the model to the final marking, where it can end.
The alignment of on the constructed global model allows us to lift the segmentation of from a single LPM to an LPM set , segmenting into such that and . Segmentation function lifted to LPM sets is defined as with , recognizing each event in either as part of an instance of LPM , or leaving it unexplained (i.e., the -segments). We again lift to sequence databases, . Furthermore, we use to denote the set of -segments that are assigned to LPM . Based on this segmentation, we define coverage, as the ratio of events that can be explained by one of the LPMs in an LPM set :
For example, the coverage of our example sequence database on the example LPM set of Figure 4 is , due to the one unexplained -event.
While coverage measures the share of events of explained by , it does not measure the redundancy. Escaping edges precision (Muñoz-Gama and Carmona, 2010) is a widely accepted precision measure in the field of process mining, which quantifies how much of the behavior allowed by the process model fits the behavior seen in the sequence database. Escaping edges precision is when the process model allows for all behavior over the process activities, while it is when it allows for exactly the behavior over the process activities that was seen in the sequence database. When applying escaping edges precision to and to the global model constructed from , the measure punishes the presence of unnecessary LPMs in , since each LPM adds an extra transition that can be fired from the initial place of the global model. Furthermore, it penalizes LPMs in the set that allow for too much behavior, e.g., flower-like LPMs. Escaping edges precision requires the sequence database to be completely fitting on the process model. However, by calculating the precision using instead of the sequence database itself, we are guaranteed to fulfill this requirement. Since we are interested in LPM sets that have both high coverage and high precision, we measure their quality using the F-score: the harmonic mean of coverage and precision.
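As a small worked illustration of these quality criteria, the sketch below computes coverage as the ratio of explained events to all events, and the F-score as the harmonic mean of coverage and precision. The function names and the numbers in the example are hypothetical and are not results from this paper; the precision value is assumed to come from an external escaping edges precision computation.

```python
def coverage(explained_events, total_events):
    """Share of events in the sequence database explained by the LPM set."""
    if total_events == 0:
        return 0.0
    return explained_events / total_events

def f_score(cov, prec):
    """Harmonic mean of coverage and precision."""
    if cov + prec == 0:
        return 0.0
    return 2 * cov * prec / (cov + prec)

# Hypothetical numbers: 18 of 20 events explained, precision 0.6.
cov = coverage(18, 20)
score = f_score(cov, 0.6)
```

Because the harmonic mean is dominated by the smaller of its two arguments, an LPM set scores well only if it both explains most events and allows little unobserved behavior.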
4 Local Process Model Collection Mining Approaches
A straightforward approach to post-process the output of the basic LPM mining technique into a smaller set of patterns is to select only those LPMs that actually occur in the sequence database according to the evaluation framework in Section 3. To do so, we align the global model constructed from the LPMs in an LPM set to and filter out any LPM that has no instances in the sequence database. Algorithm 1 shows the procedure for this LPM set post-processing. We will refer to this approach as the alignment-based selection of LPMs. Applying this filter to the LPM set shown in Figure 4 and the example sequence database would result in LPM (c) getting filtered out of the LPM set, as it does not have instances in .
There may exist more than one optimal alignment. For example, sequence , can be aligned such that is one instance of LPM (b), while alternatively it can be aligned such that is one instance of LPM (a), as both alignments provide an explanation for 4 out of the 7 events. When multiple optimal alignments exist, the alignment algorithm is deterministic in which optimal alignment it returns, i.e., identical sequences are always aligned to identical sequences of occurrences of LPMs such that all segments are assigned to LPM , even when there exists an alternative LPM with such that . Therefore, alignment-based selection will select only one of these LPMs to represent , thereby reducing the number of instances of , and potentially removing it if it has no instances left, thereby reducing the redundancy in the LPM set.
However, given two different sequences and () with for some prefix length , there is no guarantee that the events of and are assigned to instances of the same LPMs. To see that this can cause redundancy to remain in the LPM set, consider and , where . As shown, for there are two possible optimal alignments: as an instance of LPM (a) or as an instance of LPM (b). However, , has only one optimal alignment, which is the following, where the , with an instance of LPM (b) and an instance of LPM (c). If we were mining from a sequence database , the possible optimal alignment of to LPM (a) would cause alignment-based selection to produce a redundant set of three LPMs consisting of LPMs (a), (b), and (c), while the alignment of to LPM (b) results in an LPM set consisting of only (b) and (c). To alleviate this cause of redundancy, a greedy approach to post-process an LPM set is proposed in Algorithm 2. The intuition behind this algorithm is as follows: first, we select the LPM from the set that explains the highest number of events in the sequence database . Then, we filter out all the events from sequence database that are already explained, resulting in a new sequence database . Iteratively, we search for the LPMs that explain the highest number of events that were still unexplained (i.e., are in ), and update . We call this the greedy selection approach.
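The intuition behind greedy selection can be sketched as follows. This is a simplified illustration and not Algorithm 2 itself: it assumes every LPM is given as a finite set of accepted runs, replaces the alignment-based instance detection by exact matching on the projected sequence, and stops once no remaining LPM explains any further event. All names (`explained_positions`, `greedy_select`) and the example data are hypothetical.

```python
def explained_positions(seq, lang):
    """Indices of `seq` covered by non-overlapping LPM instances,
    chosen to maximize the number of explained events."""
    activities = {a for run in lang for a in run}
    idx = [i for i, e in enumerate(seq) if e in activities]
    proj = tuple(seq[i] for i in idx)
    n = len(proj)
    best = [(0, ())] * (n + 1)           # best[i]: (score, positions) for proj[i:]
    for i in range(n - 1, -1, -1):
        cand = best[i + 1]
        for run in sorted(lang):         # sorted for deterministic tie-breaking
            k = len(run)
            if proj[i:i + k] == run and k + best[i + k][0] > cand[0]:
                cand = (k + best[i + k][0],
                        tuple(range(i, i + k)) + best[i + k][1])
        best[i] = cand
    return [idx[j] for j in best[0][1]]

def greedy_select(db, lpms):
    """Repeatedly pick the LPM that explains the most still-unexplained
    events, then remove the events it explains from the database."""
    remaining = [tuple(s) for s in db]
    candidates = dict(lpms)
    selected = []
    while candidates:
        gains = {name: sum(len(explained_positions(s, lang)) for s in remaining)
                 for name, lang in candidates.items()}
        best = max(sorted(gains), key=gains.get)
        if gains[best] == 0:             # nothing left to explain
            break
        selected.append(best)
        lang = candidates.pop(best)
        filtered = []
        for s in remaining:
            covered = set(explained_positions(s, lang))
            filtered.append(tuple(e for i, e in enumerate(s) if i not in covered))
        remaining = filtered
    return selected

# Hypothetical LPM set and sequence database.
lpms = {
    "a": {("A", "B", "C")},
    "b": {("E", "F"), ("E", "B", "A", "F")},   # E(BA)*F unrolled at most once
    "c": {("A", "B")},
}
db = [("A", "B", "C", "E", "B", "A", "F"), ("A", "B", "C")]
chosen = greedy_select(db, lpms)
```

In this example, LPM "a" is selected first because it explains six events, LPM "b" then explains the four remaining events, and LPM "c" is dropped because it explains nothing that is still unexplained.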
Algorithm 2 removes LPMs from the LPM set without taking into account how much behavior the LPMs themselves allow for. The selection of only a small number of LPMs that all allow for many sequences over their activities may still result in low precision values. In Algorithm 3 we propose a direct approach to greedily select the combination of LPMs from the input LPM set that leads to the highest F-score according to the evaluation framework. We call this approach the greedy selection (F-score) method.
4.1 Re-mining of Selected Local Process Models
Algorithms 1-3 simply select a subset of LPMs from an initial set of LPMs ; the LPMs in the set themselves are left unchanged, i.e., . However, it can be the case that two LPMs are overlapping in the sequences that they allow for, i.e., . In such a case, even though and are both non-redundant patterns, part of the behavior allowed for by and is redundant. We refer to this type of redundancy as within-LPM-redundancy, as opposed to the between-LPM-redundancy that Algorithms 1-3 aim to address. To mitigate within-LPM-redundancy in a selected set , we propose to re-mine a process model from the set of occurrences of each LPM, by applying any existing process discovery algorithm to the set of pattern instances of an LPM. Algorithm 4 shows the re-mining procedure. Re-mining is orthogonal to the selection approaches of Algorithms 1-3, and can be used in combination with any LPM selection procedure. Although re-mining can be done with any process discovery algorithm, we use the Split Miner algorithm (Augusto et al, 2017), which has been shown to discover precise and simple process models.