Anti-Alignments – Measuring The Precision of Process Models and Event Logs

Processes are a crucial artefact in organizations, since they coordinate the execution of activities so that products and services are provided. The use of models to analyse the underlying processes is a well-known practice. However, due to the complexity and continuous evolution of their processes, organizations need an effective way of analysing the relation between processes and models. Conformance checking techniques assess the suitability of a process model in representing an underlying process, observed through a collection of real executions. One important metric in conformance checking is the precision of the model with respect to the observed executions, i.e., a characterization of the ability of the model to produce behavior unrelated to the one observed. In this paper we present the notion of anti-alignment as a concept to help unveil runs in the model that may deviate significantly from the observed behavior. Using anti-alignments, a new metric for precision is proposed. In contrast to existing metrics, anti-alignment based precision metrics satisfy most of the required axioms highlighted in a recent publication. Moreover, a complexity analysis of the problem of computing anti-alignments is provided, which sheds light on the practicability of using anti-alignments to estimate precision. Experiments are provided that witness the validity of the concepts introduced in this paper.


1 Introduction

Relating observed and modelled process behavior is the lion’s share of conformance checking [9]. Observed behavior is often recorded in the form of event logs, which store the footprints of process executions. Symmetrically, process models are representations of the underlying process, which can be automatically discovered or manually designed. With the aim of quantifying this relation, conformance checking techniques consider four quality dimensions: fitness, precision, generalization and simplicity [24]. For the first three dimensions, the alignment between a process model and an event log is of paramount importance, since it allows relating modeled and observed behavior [1].

Given a process model and a trace in the event log, an alignment provides the run in the model that most closely resembles the observed trace. Once alignments are computed, the quality dimensions can be defined on top of them [1, 20]. In a way, alignments are optimistic: although observed behavior may deviate significantly from modeled behavior, it is always assumed that the least deviations are the best explanation (from the model’s perspective) for the observed behavior.

In this paper we present a somewhat symmetric notion to alignments, denoted as anti-alignments. Given a process model and a log, an anti-alignment is a run of the model that deviates the most from any of the traces observed in the log. The motivation for anti-alignments is precisely to compensate for the optimistic view provided by alignments, so that the model is queried to return highly deviating behavior that has not been seen in the log. In contexts where the process model should adhere to a certain behavior and not leave much room for exotic possibilities (e.g., banking, healthcare), the absence of highly deviating anti-alignments may be a desired property for a process model. Using anti-alignments one can not only catch deviating behavior, but also use it to improve some of the current quality metrics considered in conformance checking. In this paper we highlight the strong relation between anti-alignments and the precision metric: a highly deviating anti-alignment may be considered as a witness for a loss in precision. Current metrics for precision lack this ability to explore the model behavior beyond what is observed in the log, and are thus considered short-sighted [2].

We cast the problem of computing anti-alignments as the satisfiability of a Boolean formula, and provide high-level techniques which can for instance compute the most deviating anti-alignment for a certain run length, or the shortest anti-alignment for a given number of deviations.

Anti-alignments are related to the completeness of the log; a log is complete if it contains all the behavior of the underlying process [28]. For incomplete logs, the alternatives for computing anti-alignments grow, making it difficult to tell the difference between behavior not observed but meant to be part of the process, and behavior not observed which is not meant to be part of the process. Since there already exist some metrics to evaluate the completeness of an event log (e.g., [36]), we assume event logs have a high level of completeness before they are used for computing anti-alignments. Notice that in the presence of an incomplete event log, anti-alignments can be used to interactively complete it: an anti-alignment that is certified by the stakeholder as valid process behavior can be appended to the event log to make it more complete.

This work is an extension of recent publications related to anti-alignments: in [10] we established for the first time the notion of anti-alignments based on the Hamming distance, and proposed a simple metric to estimate precision. Then, the work in [30] elaborated the notion of anti-alignments, heuristically computing them for the Levenshtein distance by adapting the search technique, and proposed two new notions for trace-based and log-based precision, that can be combined to estimate precision of process models. However, as was claimed recently in a survey paper advocating properties that precision metrics should have [26], it was not known whether the aforementioned metrics satisfy these properties.

The contributions of the paper with respect to our previous work are now enumerated.

  • We show how anti-alignments can be computed in an optimal way for the Levenshtein distance, without increasing the complexity class of the problem. Moreover we relate the two available distance encodings (Hamming and Levenshtein), and show the implications of using each one for anti-alignment based precision.

  • We adapt the precision metrics from [30] to not depend on a particular length defined a priori.

  • We prove the adherence of one of the new metrics proposed in this paper to most of the properties in [26].

  • A novel implementation is provided, with several improvements, which makes it able to deal with larger instances.

  • A new evaluation section is reported, which shows empirically the capabilities of the proposed technique for large and real-life instances.

The remainder of the paper is organized as follows: in the next section, a simple example is used to emphasize the importance of anti-alignments and its application to estimate precision is shown. Then in Section 3 the basic theory needed for the understanding of the paper is introduced. Section 4 provides the formal definition of anti-alignments, whilst Section 5 formalizes the encoding into SAT of the problem of computing anti-alignments. In Section 6, we define a new metric, based on anti-alignments, for estimating precision of process models. Experiments are reported in Section 7, and related work in Section 8. Section 9 concludes the paper and gives some hints for future research directions.

2 A Motivating Example

Let us use the example shown in Figure 1 to illustrate the notion of anti-alignment. The example was originally presented in [31], and in this paper we present a very abstract version of it in Figure 1(a). The modeled process describes a realistic transaction process within a banking context. The process contains all sorts of monetary checks, authority notifications, and logging mechanisms. The process is initiated when a new transaction is requested, opening a new instance in the system and registering all the components involved. The second step is to run a check on the person (or entity) at the origin of the monetary transaction. Then, the actual payment is processed differently, depending on the payment modality chosen by the sender (cash, cheque and payment). Later, the receiver is checked and the money is transferred. Finally, the process ends by registering the information, notifying the required actors and authorities, and emitting the corresponding receipt.



Figure 1: Example (adapted from [31]). (a) Initial process model, (b) Modified process model.

Assume that a log covering all three possible variants (corresponding to the three possible payment methods) of the model in Figure 1(a) is given. The three variants in this log are:

ort, cs, pcap, cr, tm, nct

ort, cs, pchp, cr, tm, nct

ort, cs, pep, cr, tm, nct

where we use the acronym for each one of the actions performed, e.g., ort stands for open and register transaction.

For this pair of model and log, most of the current metrics for precision (e.g., [2]) will rightly assess a very high precision. In fact, since no deviating anti-alignment can be obtained because every model run is in the log, the anti-alignment based precision metric from this paper will also assess a high (in our case, perfect) precision.

Now assume that we modify the model slightly, adding a loop around the alternative stages for the payment. Intuitively, this (malicious) modification in the process model may allow paying several times although only one transfer will be done. The modified high-level overview is shown in Figure 1(b). The aforementioned metric for precision will not consider this modification as a severe one: the precision of the model with respect to the log will be very similar to the one for the model in Figure 1(a).

Remarkably, this modification in the process model comes with a new highly deviating anti-alignment denoting a run of the model that contains more than one iteration of the payment:

ort, cs, pcap, pchp, pchp, pep, pcap, cr, tm, nct

Clearly, this model execution where five payments have been recorded is possible in the process of Figure 1(b). Correspondingly, the precision of this model in describing the log of only three variants will be significantly lowered in the metric proposed in this paper, since the anti-alignment produced is very different from any of the three variants recorded in the event log.

3 Preliminaries

Definition 1 ((Labeled) Petri net)

A (labeled) Petri net [21] is a tuple $N = \langle P, T, F, m_0, m_f, \Sigma, \lambda \rangle$, where $P$ is the set of places, $T$ is the set of transitions (with $P \cap T = \emptyset$), $F \subseteq (P \times T) \cup (T \times P)$ is the flow relation, $m_0$ is the initial marking, $m_f$ is the final marking, $\Sigma$ is an alphabet of actions and $\lambda : T \to \Sigma$ labels every transition by an action.

A marking is an assignment of a non-negative integer to each place. If $k$ is assigned to place $p$ by marking $m$ (denoted $m(p) = k$), we say that $p$ is marked with $k$ tokens. Given a node $x \in P \cup T$, its pre-set and post-set are denoted by ${}^\bullet x$ and $x^\bullet$ respectively.

A transition $t$ is enabled in a marking $m$ when all places in ${}^\bullet t$ are marked. When a transition $t$ is enabled, it can fire by removing a token from each place in ${}^\bullet t$ and putting a token to each place in $t^\bullet$. A marking $m'$ is reachable from $m$ if there is a sequence of firings $t_1 \dots t_n$ that transforms $m$ into $m'$, denoted by $m[t_1 \dots t_n\rangle m'$. We define the language of $N$ as the set of full runs defined by $\mathcal{L}(N) := \{\lambda(t_1) \dots \lambda(t_n) \mid m_0[t_1 \dots t_n\rangle m_f\}$. A Petri net is $k$-bounded if no reachable marking assigns more than $k$ tokens to any place. A Petri net is bounded if there exists a $k$ for which it is $k$-bounded. A Petri net is safe if it is 1-bounded. A bounded Petri net has an executable loop if it has a reachable marking $m$ and sequences of transitions $u_1 \in T^*$, $u_2 \in T^+$, $u_3 \in T^*$ such that $m_0[u_1\rangle m[u_2\rangle m[u_3\rangle m_f$.
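The token game described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class `PetriNet`, its method names, and the place names `p0` to `p6` are hypothetical, and the net is a hand-made sequential abstraction of the banking example (one place between consecutive steps), not the authors' model.

```python
from collections import Counter

class PetriNet:
    """Minimal labeled Petri net (hypothetical helper, not the authors' tool)."""

    def __init__(self, pre, post, labels, m0, mf):
        self.pre = pre        # transition -> list of input places
        self.post = post      # transition -> list of output places
        self.labels = labels  # transition -> action label
        self.m0 = +Counter(m0)  # unary plus drops places holding zero tokens
        self.mf = +Counter(mf)

    def enabled(self, marking, t):
        # A transition is enabled when every place of its pre-set is marked.
        return all(marking[p] >= 1 for p in self.pre[t])

    def fire(self, marking, t):
        # Remove one token from each input place, add one to each output place.
        new = Counter(marking)
        for p in self.pre[t]:
            new[p] -= 1
        for p in self.post[t]:
            new[p] += 1
        return +new

    def full_runs(self, max_len):
        """Yield the label sequences of firing sequences from m0 to mf
        that use at most max_len firings (depth-first exploration)."""
        stack = [(self.m0, ())]
        while stack:
            marking, run = stack.pop()
            if marking == self.mf:
                yield run
            if len(run) < max_len:
                for t in self.pre:
                    if self.enabled(marking, t):
                        stack.append((self.fire(marking, t), run + (self.labels[t],)))

# Assumed abstraction of the loop-free banking model: (action, input place, output place).
steps = [("ort", "p0", "p1"), ("cs", "p1", "p2"),
         ("pcap", "p2", "p3"), ("pchp", "p2", "p3"), ("pep", "p2", "p3"),
         ("cr", "p3", "p4"), ("tm", "p4", "p5"), ("nct", "p5", "p6")]
pre = {a: [src] for a, src, _ in steps}
post = {a: [dst] for a, _, dst in steps}
labels = {a: a for a, _, _ in steps}
net = PetriNet(pre, post, labels, {"p0": 1}, {"p6": 1})
runs = sorted(net.full_runs(max_len=10))
```

With this net, `runs` holds one full run per payment variant, which matches the three log variants of the motivating example.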

An event log is a collection of traces, where a trace may appear more than once. Formally:

Definition 2 (Event Log)

An event log $L$ (over an alphabet of actions $\Sigma$) is a multiset of traces $\sigma \in \Sigma^*$.

Process mining techniques aim at extracting from a log $L$ a process model $N$ (e.g., a Petri net) with the goal to elicit the process underlying a system $S$. $S$ is considered a language for the sake of comparison. By relating the behaviors of $L$, $\mathcal{L}(N)$ and $S$, particular concepts can be defined [8]. A log is incomplete if $S \setminus L \neq \emptyset$. A model $N$ fits log $L$ if $L \subseteq \mathcal{L}(N)$. A model is precise in describing a log $L$ if $\mathcal{L}(N) \setminus L$ is small. A model $N$ represents a generalization of log $L$ with respect to system $S$ if some behavior in $S \setminus L$ exists in $\mathcal{L}(N)$. Finally, a model $N$ is simple when it has the minimal complexity in representing $\mathcal{L}(N)$, i.e., the well-known Occam’s razor principle.

4 Anti-Alignments

The idea of anti-alignments is to seek, in the language of a model, the runs that differ considerably from all the observed traces. Hence, this is the opposite of the notion of alignments [1], which is central in process mining: for many tasks in conformance checking, like process model repair or decision point analysis, one indeed needs to find the run that is most similar to a given log trace [28]. In this paper, we focus on precision, and for this, traces which are not similar to any observed trace in the log serve as witnesses for bad precision. All these notions depend on a definition of distance between two traces (typically a model trace, i.e. a run of the model, and an observed log trace). We assume a given distance function $d : \Sigma^* \times \Sigma^* \to [0, 1]$ computable in polynomial time (we do not actually require that $d$ satisfies the usual properties of distance functions like symmetry or triangle inequality) and such that

  • for every $\gamma \in \Sigma^*$, $d(\gamma, \gamma) = 0$,

  • for every $\sigma \in \Sigma^*$, $d(\sigma, \gamma)$ converges to $1$ when $|\gamma|$ diverges to $\infty$.

Definition 3 (Anti-alignment)

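For a distance threshold $m \in [0, 1]$, an $m$-anti-alignment of a model $N$ w.r.t. a log $L$ is a full run $\gamma \in \mathcal{L}(N)$ such that $d(\gamma, L) \geq m$, where $d(\gamma, L)$ is defined as $\min_{\sigma \in L} d(\sigma, \gamma)$. (Since the function $d$ takes its values in $[0, 1]$, we define by convention $d(\gamma, \emptyset) := 1$.)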

For the following examples, we show anti-alignments w.r.t. two possible choices of distance : Levenshtein’s distance and Hamming distance.

Definition 4 (Levenshtein’s edit distance )

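Levenshtein’s edit distance $\mathit{dist}(\sigma, \gamma)$ between two traces $\sigma, \gamma \in \Sigma^*$ is based on the minimum number of deletions and insertions needed to transform $\sigma$ to $\gamma$. In order to get a normalized distance between 0 and 1, we define the normalized Levenshtein’s edit distance $d_L(\sigma, \gamma) := \frac{\mathit{dist}(\sigma, \gamma)}{|\sigma| + |\gamma|}$.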
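As an illustration, the distance of a run to a log can be computed with a small dynamic program. The sketch below makes two assumptions that should be read as such: the edit distance counts insertions and deletions only, and it is normalized by the sum of the two trace lengths. The function names are hypothetical; the traces are those of the motivating example in Section 2.

```python
def indel_distance(s, g):
    """Minimum number of insertions and deletions turning s into g
    (no substitutions), computed row by row."""
    prev = list(range(len(g) + 1))  # distance from the empty prefix of s
    for i, a in enumerate(s, 1):
        cur = [i]  # deleting the first i letters of s
        for j, b in enumerate(g, 1):
            if a == b:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def d_lev(s, g):
    """Normalized distance in [0, 1]; assumed normalization by |s| + |g|."""
    total = len(s) + len(g)
    return indel_distance(s, g) / total if total else 0.0

def dist_to_log(gamma, log):
    """d(gamma, L): distance to the closest log trace, 1 for an empty log."""
    return min((d_lev(s, gamma) for s in log), default=1.0)

log = [
    ("ort", "cs", "pcap", "cr", "tm", "nct"),
    ("ort", "cs", "pchp", "cr", "tm", "nct"),
    ("ort", "cs", "pep", "cr", "tm", "nct"),
]
# Run of the modified model (Figure 1(b)) with five payments.
looping_run = ("ort", "cs", "pcap", "pchp", "pchp", "pep", "pcap", "cr", "tm", "nct")
distance = dist_to_log(looping_run, log)
```

Under these assumptions, each log variant is a subsequence of the looping run, so four deletions suffice and `distance` evaluates to 4/16 = 0.25, whereas any run that appears in the log is at distance 0.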

Example 1
Figure 2: A process model (taken from [29]) and an event log. The full run is a -anti-alignment for Levenshtein’s distance and a -anti-alignment for Hamming distance.

Consider the Petri net and log shown in Figure 2. With Levenshtein’s distance, the full run is at distance from the log trace (two deletions and one insertion). It is at larger distance from the other log traces. Therefore, it is a -anti-alignment.

Another interesting choice is Hamming distance. It is in general less informative than Levenshtein’s distance for relating observed and modelled behavior, but it has the advantage of being very simple to compute. Variants of Hamming distance can also provide good compromises. In Sections 5 and 6, we will show how to efficiently compute anti-alignments for Hamming distance using SAT solvers.

Definition 5 (Hamming distance )

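For two traces $\sigma = \sigma_1 \dots \sigma_n$ and $\gamma = \gamma_1 \dots \gamma_n$, of same length $n$, define $d_H(\sigma, \gamma) := \frac{|\{i \in \{1, \dots, n\} \mid \sigma_i \neq \gamma_i\}|}{n}$. For $\sigma$ longer than $\gamma$, we define $d_H(\sigma, \gamma) := d_H(\sigma, \gamma \cdot w^{|\sigma| - |\gamma|})$, where $w \notin \Sigma$ is a special padding symbol ($w$ for ‘wait’); we proceed symmetrically when $\sigma$ is shorter than $\gamma$.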
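A minimal sketch of the padded Hamming distance follows. It assumes the normalization by the length of the longer trace described above; the sentinel `PAD` plays the role of the padding symbol and the function name is hypothetical.

```python
PAD = object()  # stands for the padding symbol; never equal to any action

def d_hamming(s, g):
    """Fraction of positions where s and g differ, after padding the
    shorter trace with PAD up to the length of the longer one."""
    n = max(len(s), len(g))
    if n == 0:
        return 0.0
    s = list(s) + [PAD] * (n - len(s))
    g = list(g) + [PAD] * (n - len(g))
    return sum(a != b for a, b in zip(s, g)) / n

trace = ("ort", "cs", "pcap", "cr", "tm", "nct")
looping_run = ("ort", "cs", "pcap", "pchp", "pchp", "pep", "pcap", "cr", "tm", "nct")
hamming_value = d_hamming(looping_run, trace)
```

Here the first three positions match, the next three differ, and the last four are compared with padding, so seven of ten positions mismatch.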

Lemma 1

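Observe that, for every $\sigma$ and $\gamma$ (assuming one of them at least is nonempty), $d_L(\sigma, \gamma) \leq d_H(\sigma, \gamma)$.

Proof

Assume w.l.o.g. $|\sigma| \leq |\gamma|$. Let $k := |\{i \leq |\sigma| \mid \sigma_i \neq \gamma_i\}|$. We have $d_H(\sigma, \gamma) = \frac{k + |\gamma| - |\sigma|}{|\gamma|}$. We have also $\mathit{dist}(\sigma, \gamma) \leq 2k + |\gamma| - |\sigma|$ because one way to transform $\sigma$ to $\gamma$ is to replace $\sigma_i$ by $\gamma_i$ (one deletion and one insertion) at each position where they differ ($2k$ editions), and then to insert the remaining letters of $\gamma$ ($|\gamma| - |\sigma|$ editions). It remains to see that $\frac{2k + |\gamma| - |\sigma|}{|\sigma| + |\gamma|} \leq \frac{k + |\gamma| - |\sigma|}{|\gamma|}$, which implies $d_L(\sigma, \gamma) \leq d_H(\sigma, \gamma)$. ∎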
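The last inequality of this proof can be checked by a short computation. The sketch below assumes the normalizations $d_L = \mathit{dist}/(|\sigma| + |\gamma|)$ and $d_H = (k + |\gamma| - |\sigma|)/|\gamma|$, and writes $s = |\sigma|$ and $g = |\gamma|$ with $s \leq g$ and $g > 0$; the symbols $s$ and $g$ are introduced here only for illustration.

```latex
\begin{aligned}
\frac{2k + g - s}{s + g} \le \frac{k + g - s}{g}
  &\iff g\,(2k + g - s) \le (s + g)(k + g - s) \\
  &\iff 2kg + g^2 - gs \le sk + kg + g^2 - s^2 \\
  &\iff kg - gs \le sk - s^2 \\
  &\iff (g - s)(k - s) \le 0
\end{aligned}
```

The final condition holds because $g \geq s$ by assumption and $k \leq s$, since at most $s$ positions are compared when counting $k$.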

Example 2

Consider the Petri net and log shown in Figure 2. With Hamming distance, the full run is at distance from the log trace ( and do not match with and , and is shorter than , which counts for the third mismatch). It is at larger distance from the other log traces. Therefore, it is a -anti-alignment.

Lemma 2

For every log $L$ and finite model $N$, we have:

  1. if the model has finitely many full runs, then there exists (at least) one maximal anti-alignment $\gamma$ of $N$ w.r.t. $L$, i.e. $\gamma \in \mathcal{L}(N)$ that maximizes the distance $d(\gamma, L)$;

  2. if the model has infinitely many full runs, then there exist anti-alignments $\gamma$ with $d(\gamma, L)$ arbitrarily close to $1$. Yet, there may not exist any $1$-anti-alignment, i.e. there is no guarantee that the limit is reached for any $\gamma \in \mathcal{L}(N)$.

Proof

If the model has finitely many full runs, then one of them must be a maximal anti-alignment.

Conversely, if $N$ is finite and has infinitely many runs, then there must exist arbitrarily long full runs; more formally, there exists an infinite sequence $(\gamma_i)_{i \in \mathbb{N}}$ of full runs of strictly increasing length. For every $\sigma \in L$, the sequence of $d(\sigma, \gamma_i)$ converges to $1$. Since there are finitely many $\sigma \in L$, $d(\gamma_i, L)$ converges to $1$ as well. ∎

Maximal anti-alignments will be used in Section 6 to define our precision metric. The case of models with executable loops will be discussed in Subsection 6.1.1.

Lemma 3

The problem of deciding, given a finite model $N$ and a log $L$, whether there exists a $0$-anti-alignment of $N$ w.r.t. $L$, has the same complexity as reachability for Petri nets.

Proof

This is equivalent to checking whether $\mathcal{L}(N) \neq \emptyset$, i.e. whether $m_f$ is reachable from $m_0$. ∎

By definition, a $0$-anti-alignment of $N$ w.r.t. $L$ is a full run $\gamma$ satisfying the trivial inequality $d(\gamma, L) \geq 0$. The same problem with a strict inequality is also of interest. We will need it in Section 6.1.1.

Lemma 4

The problem of deciding, given a finite model $N$ and a log $L$, whether there exists a full run $\gamma \in \mathcal{L}(N)$ satisfying $d(\gamma, L) > 0$ (or equivalently deciding if $\mathcal{L}(N) \not\subseteq L$), has the same complexity as reachability for Petri nets.

Proof

The reachability problem reduces to the existence of a full run $\gamma$ satisfying $d(\gamma, \emptyset) > 0$: indeed $d(\gamma, \emptyset) = 1$ for every $\gamma$.

Figure 3: A deterministic Petri net representing the log , for the alphabet of actions . Place is reached only by the runs which do not appear in the log.

Conversely, deciding if reduces to deciding reachability of in the synchronous product of with a deterministic Petri net which represents as a tree the log traces sharing their common prefixes, and, from the leaves, marks a sink place , as illustrated in Figure 3. Hence, every full run of , when synchronized with the Petri net representation of the log , leads to a marking of the form , and iff . ∎

The problem of reachability in Petri nets is known to be decidable, but non-elementary [12].

Yet, the complexity drops to NP if a bound is given on the length of the anti-alignment.

Lemma 5

The problem of deciding, for a Petri net $N$, a log $L$, a rational distance threshold $m$ and a bound $n \in \mathbb{N}$, if there exists an $m$-anti-alignment $\gamma$ such that $|\gamma| \leq n$, is NP-complete. We assume that $n$ is encoded in unary. (Since $n$ typically has the same order of magnitude as the length of the longest traces in the log, encoding $n$ in unary does not significantly affect the size of the problem instances.)

Proof

The problem is clearly in NP: checking that a run is a -anti-alignment of w.r.t.  takes polynomial time (remember that we consider distance functions computable in polynomial time).

For NP-hardness, we propose a reduction from the problem of reachability of a marking in a 1-safe acyclic Petri net , known to be NP-complete [25, 11] (a Petri net is acyclic if the transitive closure of its flow relation is irreflexive). Notice that, since is acyclic, each transition can fire only once; hence, the length of the firing sequences of is bounded by the number of transitions . Finally, is reachable in iff there exists a of length less than or equal to which is a -anti-alignment of (with as final marking) w.r.t. the empty log. ∎

5 SAT-encoding of Anti-Alignments

In this section, we give hints on how SAT solvers can help to find anti-alignments. We detail the construction of a SAT formula , where is a Petri net, a log, and two integers. This formula will be used in the search of anti-alignments of w.r.t.  for Hamming distance (see Section 5.3 for the encoding using the Levenshtein distance). The formula characterizes precisely the full runs of of length which differs in at least positions with every log trace in .

5.1 Coding Using Boolean Variables

The formula is coded using the following Boolean variables:

  • for , (recall that is the special symbol used to pad the log traces, see Definition 5) means that transition .

  • for , means that place is marked in marking (recall that we consider only safe nets, therefore the are Boolean variables).

  • for , , means that the th mismatch with the observed trace is at position .

The total number of variables is .

Let us decompose the formula .

  • The fact that is coded by the conjunction of the following formulas:

    • Initial marking:

    • Final marking:

    • One and only one for each :

    • The transitions are enabled when they fire:

    • Token game (for safe Petri nets):

  • Now, the constraint that deviates from the observed traces (for every , ) is coded as:

    with the correctly affected w.r.t.  and :

    and that for , the th and th mismatch correspond to different ’s (i.e. a given mismatch cannot serve twice):

5.2 Size of the Formula

In the end, the first part of the formula () is coded by a Boolean formula of size , with .

The second part of the formula (for every , ) is coded by a Boolean formula of size .

The total size for the coding of the formula is

5.3 SAT-encoding of Anti-Alignments for Levenshtein’s Edit Distance

Our SAT-encoding of anti-alignments for Levenshtein’s edit distance uses the same boolean variables as the SAT-encoding of anti-alignments for Hamming distance of the previous section, completed with variables used to encode the edit distance.

Our encoding is based on the same relations that are used by the classical dynamic programming recursive algorithm for computing the edit distance between two words and :
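For reference, a standard recurrence for an edit distance restricted to insertions and deletions (the two operations named in Definition 4) can be written as follows. This is a sketch taken from the general literature on edit distance and not necessarily the exact formulation used by the authors; $u$ and $v$ denote words, $a$ and $b$ letters, and $\epsilon$ the empty word, all of which are notation assumed here.

```latex
\begin{aligned}
\mathit{dist}(\epsilon, \epsilon) &= 0 \\
\mathit{dist}(u \cdot a, \epsilon) &= \mathit{dist}(u, \epsilon) + 1 \\
\mathit{dist}(\epsilon, v \cdot b) &= \mathit{dist}(\epsilon, v) + 1 \\
\mathit{dist}(u \cdot a, v \cdot a) &= \mathit{dist}(u, v) \\
\mathit{dist}(u \cdot a, v \cdot b) &= 1 + \min\bigl(\mathit{dist}(u, v \cdot b),\ \mathit{dist}(u \cdot a, v)\bigr)
  \quad \text{if } a \neq b
\end{aligned}
```

In this form, two different final letters cost at least one edition, and replacing one letter by another costs two (one deletion and one insertion).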

We encode this computation in a SAT formula over variables , for , and . Formula will have exactly one solution, in which each variable is true iff and differ by at least editions.

In order to test equality between the and , we use variables and , for , and , and we set their value such that is true iff , and is true iff . Hence, the test becomes in our formulas: . For readability of the formulas, we refer to this coding by . We also write similarly .

In the following, we describe the different clauses of the formula of our SAT encoding of the edit distance.

(1)
(2)
(3)
(4)
(5)
Example 3

At instants and of words and , the letters are the same, then, by (4), the distance is only higher or equal to 0 : .

However at instants and , the letters and are different. A step before, and are true because of the length of the subwords. Then, by (5), the distance at instants and is higher or equal to 2 : . The result is understandable because the edit distance costs the deletion of and the addition of to transform to .

In order to insert this encoding of Levenshtein’s edit distance into our formulas for anti-alignments, we need to compute the edit distance between the expected anti-alignment and every trace of the log, which requires variables for , , , to represent the fact that and differ by at least editions.

5.4 Solving the Formula in Practice

In practice, the coding of the formula can be done using the Boolean variables , and .

Then we need to transform the formula into conjunctive normal form (CNF) in order to pass it to the SAT solver. We use Tseytin’s transformation [27] to get a CNF formula whose size is linear in the size of the original formula. The idea of this transformation is to replace recursively the disjunctions (where the are not atoms) by the following equivalent formula:

where are fresh variables.
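The effect of such a transformation can be illustrated on a disjunction of conjunctions of literals. The sketch below uses one common one-directional variant, in which each disjunct gets a fresh selector variable that implies the disjunct; this variant preserves satisfiability but is an assumption here, and may differ from the exact formula used by the authors. Literals follow the DIMACS convention (positive or negative integers), and all function names are hypothetical.

```python
from itertools import product

def tseytin_disjunction(terms, next_var):
    """terms: list of conjunctions, each a list of DIMACS-style literals.
    Returns CNF clauses for (x_1 or ... or x_n) and (x_i -> term_i),
    together with the next unused variable index."""
    clauses, selectors = [], []
    for term in terms:
        x = next_var
        next_var += 1
        selectors.append(x)
        for lit in term:
            clauses.append([-x, lit])  # x implies each literal of its term
    clauses.append(selectors)          # at least one selector is true
    return clauses, next_var

def lit_true(lit, assignment):
    return assignment[abs(lit)] == (lit > 0)

def dnf_true(terms, assignment):
    return any(all(lit_true(l, assignment) for l in t) for t in terms)

def cnf_true(clauses, assignment):
    return all(any(lit_true(l, assignment) for l in c) for c in clauses)

# (v1 and v2) or (not v1 and v3), with fresh variables starting at 4.
terms = [[1, 2], [-1, 3]]
clauses, next_free = tseytin_disjunction(terms, next_var=4)
```

The number of clauses is one plus the total number of literals in the disjuncts, so the result stays linear in the size of the input, which is the property the text relies on.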

In the end, the SAT solver tells us if there exists a run which differs by at least editions with every observed trace . If a solution is found, we extract the run using the values assigned by the SAT solver to the Boolean variables .

6 Using Anti-Alignments to Estimate Precision

In this section we show how to use anti-alignments to estimate precision of process models. Remarkably, we show how to modify the definitions of [10, 30] so that the new metric does not depend on a predefined length. In Section 6.2 we dive into the adherence of the metric with respect to a recent proposal for properties of precision metrics [26].

6.1 Precision

Our precision metric is an adaptation of our previous versions presented in [10, 30]. It relies on anti-alignments to find the model run that is as distant as possible to the log traces. Like anti-alignments, the definition of precision is parameterized by a distance . In the examples, we will specify each time if we use Levenshtein’s edit distance (Definition 4) or Hamming distance (Definition 5).

Definition 6 (Precision)

Let be an event log and a model. We define precision as follows:

For instance, consider the model and log shown in Figure 2. With Levenshtein’s distance, the full run is a maximal anti-alignment. It is at distance to any of the log traces, and hence .

6.1.1 Handling Process Models with Loops

Notice that a model with arbitrarily long runs (i.e., a process model that contains loops) may cause the formula in Definition 6 to converge to 0. This is a natural artifact of comparing a finite language (the event log) with a possibly infinite language (the process model). Since process models in reality contain loops, the metric is adapted in this section so that it can also handle this type of model without severely penalizing the loops.

Definition 7 (Precision for Models with Loops)

Let be an event log and a model. We define -precision as follows:

with some which is a parameter of this definition.

Informally, the formula computes the anti-alignment that provides maximal distance with any trace in the log, and at the same time tries to minimize its length. The penalization for the length is parametrized over the (although, admittedly, is a parameter that should be decided a priori, in practice one can use a particular value for this parameter throughout several instances, without significantly impacting the insights obtained through this metric). Observe that is precisely the precision of Definition 6. By making Definition 7 not dependent on a predefined length, it deviates from the log-based precision metrics defined in previous work [10, 30].

Figure 4: Example from [2]

Let us now consider the model of Figure 4, and the log . Assume that . A possible anti-alignment is which is at least at Levenshtein’s distance to any of the log traces. For the value of the formula is . Another possible anti-alignment is which is at least at distance to any of the log traces. For the value of the formula is . Hence, since the anti-alignment that maximizes the second term of the formula is , the precision computed is . If instead, is set to a lower value, e.g., , the corresponding value of the formula for the anti-alignment will be the maximal one, and therefore it will be selected as the anti-alignment resulting in .

6.1.2 Computing

By incorporating the parameter in the definition of precision, now the metric can deal with models containing loops without predefining the length of the anti-alignment. In this section we show that the proposed extension is well-defined and can be computed, and provide some complexity results of the algorithms involved.

Lemma 6

For every finite model , log and , the supremum in the definition of is reached, i.e. there exists a full run such that .

Proof

Two cases have to be distinguished: if , then the supremum equals , is obviously reached by any , and ; otherwise, let and let ; we show that the supremum in the definition of becomes now a maximum over a finite set of runs, bounded by a given length that depends on and :

with . Indeed, for every strictly longer than , we have , which also shows that . Hence is considered in our , and then . ∎

Lemma 6 gives us the key for an algorithm to compute .

Algorithm 1

Algorithm for computing :

  • if , then

    • select

    • let

    • explore the reachability graph of until depth and return ;

  • else return (the model has perfect precision).
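The bounded exploration of Algorithm 1 can be sketched as follows. Several assumptions are made and should be read as such: the precision is taken to be one minus the supremum over full runs of the distance to the log divided by (1 + eps) raised to the run length, the distance is the insertion/deletion edit distance normalized by the sum of the lengths, the model is given directly as a finite transition system instead of a Petri net, and the initial language-inclusion test is replaced by a search up to a hypothetical `probe_depth`. All names are illustrative.

```python
import math

def indel_distance(s, g):
    """Insertion/deletion edit distance, computed row by row."""
    prev = list(range(len(g) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(g, 1):
            cur.append(prev[j - 1] if a == b else 1 + min(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def dist_to_log(gamma, log):
    """Normalized distance to the closest log trace (1 for an empty log)."""
    def d(s):
        total = len(s) + len(gamma)
        return indel_distance(s, gamma) / total if total else 0.0
    return min((d(s) for s in log), default=1.0)

def runs_up_to(model, depth):
    """Yield full runs of length at most depth, shortest first."""
    initial, finals, delta = model
    frontier = [(initial, ())]
    for _ in range(depth + 1):
        nxt = []
        for state, run in frontier:
            if state in finals:
                yield run
            for label, target in delta.get(state, []):
                nxt.append((target, run + (label,)))
        frontier = nxt

def precision_eps(model, log, eps, probe_depth=20):
    """Sketch of Algorithm 1 under the assumptions stated above (eps > 0)."""
    log_set = set(map(tuple, log))
    def score(run):
        return dist_to_log(run, log_set) / (1 + eps) ** len(run)
    # Stand-in for the inclusion test: look for a run outside the log.
    witness = next((r for r in runs_up_to(model, probe_depth) if r not in log_set), None)
    if witness is None:
        return 1.0
    # Longer runs score below 1/(1+eps)^length, hence below score(witness).
    bound = math.ceil(math.log(1 / score(witness)) / math.log(1 + eps))
    return 1.0 - max(score(r) for r in runs_up_to(model, bound))

log = [("a", "c"), ("a", "b", "c")]
no_loop = ("s0", {"s3"}, {"s0": [("a", "s1")], "s1": [("b", "s2"), ("c", "s3")], "s2": [("c", "s3")]})
with_loop = ("s0", {"s2"}, {"s0": [("a", "s1")], "s1": [("b", "s1"), ("c", "s2")]})
p_no_loop = precision_eps(no_loop, log, eps=0.1)
p_loop = precision_eps(with_loop, log, eps=0.1)
```

In this toy setting the loop-free model only produces the two log traces and gets precision 1, while the model with a loop on `b` gets a strictly lower value. The enumeration is exponential in general, so this is only meant to make the depth bound concrete.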

The correctness of this algorithm follows directly from Lemma 6. Its complexity resides essentially in the initial test, which corresponds to simply deciding if , whose complexity is given by the following lemma:

Lemma 7

The problem of deciding, for a finite model and a log , if , is equivalent to deciding reachability in Petri nets.

Proof

We simply observe that iff . Deciding this is equivalent to deciding reachability in Petri nets, as shown in Lemma 4. ∎

However, in practice, one would generally skip the first check and jump directly to the exploration until some depth , possibly computed from a given threshold , like the one given by the in Algorithm 1. Notice that the algorithm explores less deeply (i.e.  is smaller) when is large (close to 1), i.e.  is close to the optimal anti-alignment. We can summarize this with the following variation of Algorithm 1:

Algorithm 2

Algorithm for estimating using a threshold as input:

  • explore the reachability graph of until depth

  • if the exploration finds a full run

  • then output “

  • else output “”.

Lemma 8

For any fixed , the problem of deciding, for a finite model , a log and a rational constant , if , is NP-complete.

Proof

The proof is similar to the one of Lemma 6; here, the bound is given directly, and we have the same equality

with . This means, in order to check that , it suffices to guess a full run of length , where depends linearly on the size of the representation of (number of bits in the numerator and denominator). Then one can check in polynomial time that .

For NP-hardness, we proceed as in Lemma 5: we reduce reachability of in a 1-safe acyclic Petri net to with and . ∎

6.2 Discussion about Reference Properties for Precision

Recently, an effort to consolidate a set of desired properties for precision metrics has been proposed [26]. Five axioms are described that establish different features of a precision metric . Summarizing, the axioms proposed in [26] are:

  • : A precision metric should be a function, i.e. it should be deterministic.

  • : If a process model allows for more behavior not seen in a log than another model does, then should have a lower precision than regarding :

  • : Let be a model that allows for the behavior seen in a log , and at the same time its behavior is properly included in a model whose language is $\Sigma^*$ (called a flower model; [26] actually writes “$\mathcal{P}(\Sigma^*)$”, with $\mathcal{P}$ for powerset, but we believe this is a mistake). Then the precision of on should be strictly greater than the one for .

  • : The precision of a log on two language equivalent models should be equal:

  • : Adding fitting traces to a fitting log can only increase the precision of a given model with respect to the log:

In the aforementioned paper, it is shown that the previous version of our anti-alignment-based precision metric (from [30]) does not satisfy axiom (the satisfaction of the remaining axioms is reported as unknown in that paper). With the new version of the metric presented in this paper, we here provide proofs for these axioms, except for . But at the same time, we show that any precision metric can be adapted in order to satisfy .

Lemma 9

The metric (for any fixed ) satisfies .

Proof

Everything in our definitions is functional. ∎

Lemma 10

The metric satisfies .

Proof

Let . The definitions take the of an expression which does not depend on the model. Since , the ranges over a -larger set than the . Therefore the result cannot be smaller, and we get . ∎

Our metrics may not satisfy the strict inequality required by : they satisfy only a weaker version of with non-strict inequality, but, as observed in [26], this is then simply subsumed by . The authors of [26] precisely introduced after arguing that, in case of a flower model, a strict inequality should be required.

Nevertheless, we show in Lemma 11 that any precision metric can be modified so that it satisfies this requirement of strict inequality for flower models.

Lemma 11

Let be any precision metric. It is possible to define a metric from such that satisfies : it suffices to set the precision of the flower models to a value smaller than all the other precision values (after possibly extending the target set of the function). This guarantees and preserves all the other axioms.

Proof

The new metric satisfies by construction. Moreover, if is deterministic (), then also is. For preservation of , and , it suffices to study the different cases (separate flower model and others) to show that the equality and non-strict inequalities are preserved. ∎
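The construction of Lemma 11 can be sketched as a simple wrapper around an existing metric. This is only an illustration under stated assumptions: the names `with_flower_axiom`, `is_flower` and `FLOWER_PRECISION` are hypothetical, the base metric is assumed to take values in [0, 1], and the flower-model test is assumed to be supplied by the caller.

```python
from typing import Any, Callable

# Hypothetical sentinel: strictly below every value of a metric ranging
# over [0, 1]. This is the "extension of the target set" from Lemma 11.
FLOWER_PRECISION = -1.0


def with_flower_axiom(
    metric: Callable[[Any, Any], float],
    is_flower: Callable[[Any], bool],
) -> Callable[[Any, Any], float]:
    """Wrap `metric` so that flower models get a strictly smaller precision.

    `is_flower` is an assumed oracle deciding whether a model accepts every
    word over its alphabet; the text notes that this check is expensive.
    """

    def adapted(log: Any, model: Any) -> float:
        if is_flower(model):
            return FLOWER_PRECISION
        # Non-flower models keep the value of the original metric, so the
        # equalities and non-strict inequalities among them are unchanged.
        return metric(log, model)

    return adapted


# Toy usage: models are plain labels and the base metric is constant.
base = lambda log, model: 0.5
adapted = with_flower_axiom(base, lambda model: model == "flower")
```

The wrapper stays deterministic whenever both the base metric and the oracle are, which mirrors the first step of the proof.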

We consider that satisfying is a rather artificial issue. However, if the transformation defined in Lemma 11 really had to be implemented, then computing the precision would require deciding whether the model is a flower model, i.e. if . This is known as the universality problem, which is, in theory, highly intractable (footnote 7: this universality problem is PSPACE-complete for non-deterministic finite state automata (NFSA) [17], and here the NFSA to consider would be the reachability graph of (for -bounded), which is exponential in the size of ; hence, deciding universality for -bounded labeled Petri nets is in EXPSPACE). But again, this is very artificial: in practice it suffices to explore at a very short finite horizon to detect many non-flower models.
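The short-horizon exploration mentioned above can be sketched on a non-deterministic finite automaton. This is a minimal illustration, assuming the model is already given as an explicit automaton (for a Petri net this would be its reachability graph); the function name and the encoding of the transition relation are hypothetical.

```python
from collections import deque


def find_rejected_word(initial, finals, delta, alphabet, horizon):
    """Return a shortest word of length <= horizon that the automaton rejects.

    `delta` maps (state, label) to an iterable of successor states.
    Returns None when every word up to the horizon is accepted, which does
    NOT prove universality; it only fails to refute it.
    """
    finals = frozenset(finals)
    start = frozenset(initial)
    queue = deque([(start, ())])
    seen = {start}
    while queue:
        states, word = queue.popleft()
        if not (states & finals):
            return word  # witness: the model is not a flower model
        if len(word) == horizon:
            continue
        for label in alphabet:
            successors = frozenset(
                t for s in states for t in delta.get((s, label), ())
            )
            # Breadth-first order reaches each subset first by a shortest
            # word, so revisiting it later cannot reveal new witnesses.
            if successors not in seen:
                seen.add(successors)
                queue.append((successors, word + (label,)))
    return None


# A one-state flower automaton over {a, b}, and a model that only loops on a.
flower = {("q", "a"): {"q"}, ("q", "b"): {"q"}}
only_a = {(0, "a"): {0}}
```

On these two toy automata, the search finds no witness for `flower` within the horizon, whereas `only_a` is refuted by the one-letter word `b`.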

Lemma 12

The metric satisfies .

Proof

This trivially holds since both metrics are behaviorally defined. Also, we copied this axiom from [26], but observe that it is a simple corollary of as soon as . ∎

Lemma 13

The metrics (for any ) satisfy .

Proof

With , for every , we have , so the cannot be smaller for than for . The rest does not depend on the log. ∎
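The monotonicity arguments behind Lemmas 10 and 13 can be checked on a toy example. The sketch below is not the metric of this paper: it only computes a max-min deviation over explicitly enumerated finite sets of runs, and it assumes a Hamming-style distance in which a length difference counts as additional mismatches. All names are hypothetical.

```python
def hamming(u, v):
    """Mismatching positions, plus the length difference (assumed padding)."""
    return sum(x != y for x, y in zip(u, v)) + abs(len(u) - len(v))


def worst_case_deviation(runs, log):
    """Largest distance from any run to its closest trace in the log."""
    return max(min(hamming(run, trace) for trace in log) for run in runs)


log = {("a", "b", "c")}
small_model = {("a", "b", "c"), ("a", "c", "b")}
large_model = small_model | {("d", "d", "d")}

# Lemma 10: the max ranges over a larger set of runs, so the deviation
# cannot decrease (and a precision derived from it cannot increase).
dev_small = worst_case_deviation(small_model, log)
dev_large = worst_case_deviation(large_model, log)

# Lemma 13: adding a fitting trace can only bring each run closer to the log.
extended_log = log | {("a", "c", "b")}
dev_extended = worst_case_deviation(small_model, extended_log)
```

Here the deviation grows from 2 to 3 when the model is enlarged, and drops from 2 to 0 when the fitting trace is added to the log.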

7 Tool Support and Experiments

In this section we present the new tool implementing the results of this paper, and both a qualitative and a quantitative evaluation on state-of-the-art benchmarks from the literature. To compare the results obtained with the different distances, we denote the Levenshtein-distance-based anti-alignment precision by and the Hamming-distance-based anti-alignment precision by .

7.1 da4py: A Python Library Supporting Anti-Alignments

Several tools implement anti-alignments. Darksider, an OCaml command-line tool, has already been presented in [30]. It creates the SAT formulas and calls the solver Minisat+ [14] to get the result. The ProM software [33] also has an anti-alignment plugin that computes anti-alignments by brute force. Recently, we have created a Python library in order to make our technique more accessible: da4py (https://github.com/BoltMaud/da4py), a Python version of Darksider. Thanks to the use of the SAT library PySAT [16], da4py allows one to run different state-of-the-art SAT solvers. Moreover, this SAT library provides an implementation of the RC2 algorithm [15] for computing MaxSAT solutions, a variant that considerably improves the efficiency of computing anti-alignments. Finally, da4py is compatible with the library pm4py [7], and uses the same data objects.

Notably, in order to deal with large logs (such as the ones used in the quantitative evaluation), da4py has a variant that computes only a prefix of an anti-alignment, thus alleviating the complexity by not requiring a full run. Accordingly, the corresponding precision measure is a variant that is normalized by the length of the computed anti-alignment prefix. Furthermore, for anti-alignments based on Levenshtein's distance, another simplification is to add a threshold on the number of edits (max_d attribute) between the run and the traces, in order to compute a lower bound for the anti-alignment instead of the complete anti-alignment.
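The effect of a threshold on the number of edits can be illustrated in a few lines. This sketch does not use the da4py API: the function names are hypothetical, the parameter `max_d` merely borrows the name of the attribute mentioned above, and the standard Levenshtein distance with insertions, deletions and substitutions is assumed.

```python
def levenshtein(u, v):
    """Standard edit distance computed row by row with dynamic programming."""
    previous = list(range(len(v) + 1))
    for i, x in enumerate(u, start=1):
        current = [i]
        for j, y in enumerate(v, start=1):
            current.append(
                min(
                    previous[j] + 1,             # delete x
                    current[j - 1] + 1,          # insert y
                    previous[j - 1] + (x != y),  # substitute, or match
                )
            )
        previous = current
    return previous[-1]


def capped_distance_to_log(run, log, max_d):
    """Distance from `run` to its closest trace, truncated at `max_d`.

    The truncated value never exceeds the true distance, so it is a lower
    bound on how far the run deviates from the log.
    """
    return min(min(levenshtein(run, trace) for trace in log), max_d)


run = ("x", "y", "z", "w")
log = {("a", "b")}
exact = capped_distance_to_log(run, log, max_d=10)
bounded = capped_distance_to_log(run, log, max_d=2)
```

With a generous threshold the exact distance of 4 is returned, while `max_d=2` caps the reported deviation at 2.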

7.2 Qualitative Comparison

Table 1: An example event log.
Figure 5: The ideal model. Fitting, fairly precise and properly generalizing.
Figure 6: Most frequent trace. Precise, but not fitting or generalizing.
Figure 7: The flower model. Fitting and generalizing, but very imprecise.
Figure 8: All traces separate. Fitting, precise, but not generalizing.
Figure 9: A model with G and H in parallel.
Figure 10: A model with G and H in self-loops.
Figure 11: A model with D in a self-loop.
Figure 12: A model with all transitions in parallel.
Figure 13: A model where C and F are in a loop, but need to be executed equally often to reach the final marking.
Figure 14: Round-robin model. The outer loop can be started at any point and then exited one transition before completing the loop.

The examples are taken from page 64 of [23], and consist of the simple event log shown in Table 1 aligned with 10 different process models. The log consists of only five different traces, with various frequencies. The models in Figures 5 to 8 are four examples of models often used to show the differences between fitness, precision and generalization. The model in Figure 5 shows the “ideal” process discovery result, i.e. the model that is fitting, fairly precise and properly generalizing. The models in Figures 9 to 12 present the same set of activities with varying loop and/or parallel constructs. Two new process models, whose routing logic differs markedly from that of the previous models, are depicted in Figures 13 and 14.

Model