Striking a new balance in accuracy and simplicity with the Probabilistic Inductive Miner

09/13/2021 ∙ by Dennis Brons, et al. ∙ TU Eindhoven

Numerous process discovery techniques exist for generating process models that describe recorded executions of business processes. The models are meant to generalize executions into human-understandable modeling patterns, notably parallelism, and enable rigorous analysis of process deviations. However, well-defined models with parallelism returned by existing techniques are often too complex or generalize the recorded behavior too strongly to be trusted in a practical business context. We bridge this gap by introducing the Probabilistic Inductive Miner (PIM) based on the Inductive Miner framework. PIM compares in each step the most probable operators and structures based on frequency information in the data, which results in block-structured models with significantly higher accuracy. All design choices in PIM are based on business context requirements obtained through a user study with industrial process mining experts. PIM is evaluated quantitatively and in a novel kind of empirical study comparing users' trust in discovered model structures. The evaluations show that PIM strikes a unique trade-off between model accuracy and model complexity that is conclusively preferred by users over all state-of-the-art process discovery methods.




I Introduction

Discovering a process model from an event log is a central step in any process mining analysis in an industrial setting. The discovered model summarizes the data and aids analysts in understanding the executed process and in distinguishing main behavior and deviations.

Research in the last 20 years contributed numerous algorithms for discovering process models supporting concurrency in syntax and semantics using human-understandable modeling notations [1, 2]. The models have to be sound [3], have high fitness and precision, and be simple in structure [4, 2]; best state-of-the-art techniques according to a recent benchmark [2] strike different trade-offs and do not meet all criteria. For instance, the Split Miner [5] returns models with high fitness and precision that tend to have a more complex structure and may be unsound; the Inductive Miner [6] returns sound, block-structured models with high fitness and simpler structure, but low precision.

Industrial applications still favor directly-follows graphs (DFGs), which is anecdotally attributed to their simple semantic concepts. For industrial analysts, process models with formal modeling notations are harder to understand and reason about than DFGs once models reach a certain complexity [7]. However, the perceived simplicity of DFGs is countered by their inability to describe processes with concurrency, leading to false statistics and wrong insights regarding performance and deviating behavior [8], whereas analysts require correct information.

In this paper we (1) investigate which balance of quality properties in a process model with concurrency is preferred by analysts in an industrial setting, and (2) address the problem of identifying and developing a corresponding process discovery algorithm.

Preferred quality properties. We answered (1) through computational requirements and a user study. As process models of an event log form a pareto-front along fitness, precision, and simplicity, models with a simpler structure have lower fitness to the data [4]. A low-fitting model can be visually augmented with the “missing” behavior by visually overlaying deviating paths computed from alignments [9]. However, computing alignments is expensive [10] hindering quick interactive exploration of data. Visual Alignments [11] are an approximation of alignments for the purpose of visualization that can be computed in linear time on block-structured BPMN models with only XOR- and AND-gateways (Appendix B shows an example). As block-structured models are also easier to understand [7], we identify requirement (R0) The discovered model must be block-structured with only XOR- and AND-gateways.

To further answer (1), we conducted a Delphi user study for which we prepared 9 fragments of 3 real-life event logs. For each fragment we manually created between 2 and 8 alternative block-structured process models differing in structural complexity, fitness, and precision. We asked 6 expert process mining analysts to indicate which models they prefer as representation for each fragment (and why). Preferences and reasons were consolidated in a second round, resulting in the following requirements for process discovery in an industrial context: (R1) The algorithm must have a parameter to control model complexity to allow the analyst to include fewer/more details at the cost of fitness/precision. (R2) The algorithm must produce model structures for which there is significant evidence in the data. Specifically, parallelism, loops, and skipping of activities should only be shown when occurring so “frequently” that an analyst does not doubt the algorithm’s choice.

Algorithm development. To develop an algorithm that satisfies (R0-R2), we chose the Inductive Miner framework as it ensures block-structured models. However, all existing IM algorithms strike the wrong trade-off to satisfy (R2). Specifically, IMf [6] filters infrequent behavior using a very basic, local heuristic. The IMc algorithm [12] instead uses a probabilistic model to infer missing behavior. We chose to combine ideas from IMf and IMc to develop a probabilistic IM algorithm, called PIM, that can handle infrequent behavior.

We recall relevant preliminaries about the IM framework and IMc in Sect. II. We then introduce the Probabilistic Inductive Miner (PIM) in Sect. III. We implemented PIM in the UiPath Process Mining platform and evaluated PIM quantitatively and empirically (see Sect. IV). We found that PIM strikes a unique balance between fitness, precision, and model complexity: PIM models achieve high precision while sacrificing fitness, and PIM consistently returns models with lower complexity than other algorithms. According to our empirical evaluation, this unique balance is preferred by users. We present our concluding remarks in Sect. V.

II Background

We recall event logs, process trees, the IM framework, and the IMf and IMc algorithms. A trace is a finite sequence of activity names observed for one process execution; an event log is a multiset of traces. The directly-follows graph (DFG) of has nodes ; an edge from to , written , iff directly follows in some trace . Correspondingly, the indirectly-follows graph (IFG) has an edge iff strictly indirectly follows in some trace . The frequency and are the number of occurrences of in which directly and indirectly follow some , respectively, e.g., in , (1st, 4th ) and (all 4 ). and denote the sets of first and last activities in the traces in ; with correspondingly defined frequencies.
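The frequency notions above can be illustrated with a minimal sketch. It assumes that a log is a list of tuples of activity names and that an occurrence of b "strictly indirectly follows" a when some a occurs at least two positions earlier in the same trace; both the function name `follows_counts` and this reading of "strictly" are assumptions made for illustration, not taken from the paper.

```python
from collections import Counter


def follows_counts(log):
    """Count directly- and indirectly-follows occurrences in a log.

    log: iterable of traces, each a tuple of activity names.
    Returns Counters (direct, indirect, starts, ends); direct and indirect
    are keyed by (a, b) and count occurrences of b that follow some a.
    """
    direct, indirect = Counter(), Counter()
    starts, ends = Counter(), Counter()
    for trace in log:
        if not trace:
            continue  # empty traces contribute no relations
        starts[trace[0]] += 1
        ends[trace[-1]] += 1
        for j, b in enumerate(trace):
            if j > 0:
                direct[(trace[j - 1], b)] += 1
            # Assumption: this occurrence of b indirectly follows every
            # activity seen at least two positions earlier; each occurrence
            # of b is counted once per distinct predecessor a.
            for a in set(trace[: max(j - 1, 0)]):
                indirect[(a, b)] += 1
    return direct, indirect, starts, ends
```

For example, `follows_counts([("a", "b", "c", "b"), ("a", "c")])` counts the second `b` of the first trace as indirectly following `a`, while the first `b` only directly follows `a`.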

The following log serves as our running example: ; Fig. 1(left) visualizes ; arcs without source/target node indicate and .

A process tree (PT) is an abstract representation of a sound, block-structured workflow net [13]. A PT is a tree where each leaf node is an activity and every non-leaf node is an operator . Each sub-tree defines a block of (exclusive choice), (sequence), (interleaved parallelism), or (loop) over its children; first child of is the loop body which can be repeated after executing any of the other “redo” children. For example, log was generated from the PT in Fig. 1(right) with some deviations, e.g., .
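A process tree as described above can be represented with a small recursive data structure. The sketch below is only an illustration: the operator names `xor`, `seq`, `and`, and `loop` and the helpers `activities` and `render` are hypothetical, and the example tree in the usage note is not the tree of Fig. 1.

```python
from dataclasses import dataclass, field
from typing import List, Set, Union

# Hypothetical textual names for the four operators of a process tree.
OPERATORS = ("xor", "seq", "and", "loop")


@dataclass
class Node:
    """Operator node; for "loop" the first child is the body, the rest redo."""
    operator: str
    children: List["Tree"] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.operator not in OPERATORS:
            raise ValueError(f"unknown operator: {self.operator}")


# A tree is either a leaf (an activity name) or an operator node.
Tree = Union[str, Node]


def activities(tree: Tree) -> Set[str]:
    """Collect the activity names at the leaves of the tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(activities(child) for child in tree.children))


def render(tree: Tree) -> str:
    """Render the tree in prefix notation, e.g. seq(a, loop(b, c))."""
    if isinstance(tree, str):
        return tree
    inner = ", ".join(render(child) for child in tree.children)
    return f"{tree.operator}({inner})"
```

With `t = Node("seq", ["a", Node("loop", ["b", "c"])])`, `render(t)` gives `seq(a, loop(b, c))`: first `a`, then a loop with body `b` and redo part `c`.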

Fig. 1: Directly-follows relation of (left) and process tree (right) from which was generated with some deviations.

The Inductive Miner (IM) [14] defines a framework for recursively discovering process trees from event logs in 3 steps. (1) checks whether has a trivial structure for which a trivial solution can be found (e.g., contains just a single activity). Otherwise, (2) identifies a cut: a PT operator that best describes the relation between partitions of the activities in (e.g., a sequence or a choice over blocks of activities). If a cut is found, (3) splits according to the identified operator and partitions into , e.g., partition the set of traces (), sequentially split each trace (, ), or project each trace (), in a way that maximizes fitness, and recursively calls .

IM infrequent (IMf) [6] detects an -ary cut in by trying to top-down partition according to an , , , and operator (in this order) based on the structure of . To detect cuts in the presence of deviations, IMf filters out relations from occurring relatively less often than the strongest relation at an activity. If still no cut is found, IMf returns the flower model which fits any log over . But IMf’s top-down approach and some design choices in filtering lead to IMf too often not finding the most likely cut in the presence of slightly “too much” deviating behavior. This results in low precision on real-life data [2].

In contrast, IM incomplete (IMc) [12] detects a binary cut in and in a bottom-up fashion: it computes for any pair of activities the probability that and are related by . Then, it constructs an SMT problem to search for a partition and operator where the aggregated probabilities of all pairs is maximal. IMc always finds the most likely cut and requires no flower model, but the are computed under the assumption of incomplete (missing) behavior in and cannot filter out deviating behavior, making it inapplicable on real-life data.

Of the other state-of-the-art algorithms [2], ETM [13] uses a genetic search over PTs to maximize fitness, precision, generalization and simplicity; it often finds block-structured models with better precision and fitness than IM but at very high running times. Heuristic mining techniques [15, 16, 17] determine the most likely (frequent) relations between activities through quotients over and and then derive split/join logic by counting succeeding/preceding activities in traces, but cannot guarantee block-structured process models; restructuring into blocks [5] fails on most real-life logs.

In the following, we combine the ideas of IMc to detect cuts through bottom-up computation of the most likely operator between activities with the idea of quotients over and in logs with deviations used in heuristic miners.

III The Probabilistic Inductive Miner

We now propose a new instantiation of the IM framework (cf. Sect. II) called Probabilistic Inductive Miner (PIM) with the following pseudo-code.

function PIM()
      if  then
             return ()
      end if
end function

We first discuss our new definitions of filtering the DFG (Filtering) in Sect. III-A, FindCut in Sect. III-B, SplitLog and BaseCase in Sect. III-C. We apply PIM on our running example in Sect. III-D before discussing some implementation details in Sect. III-E.
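The recursion shared by all IM instantiations can be sketched as follows. This is not the PIM pseudo-code itself: the function `discover` and the three toy plug-ins are hypothetical stand-ins for BaseCase, FindCut, and SplitLog, and the DFG filtering step is left out. The sketch only shows how the steps compose and why the recursion terminates (each sublog has a strictly smaller alphabet).

```python
def discover(log, base_case, find_cut, split_log):
    """Generic IM-style recursion: base case, else cut, split, recurse."""
    tree = base_case(log)
    if tree is not None:
        return tree
    operator, parts = find_cut(log)
    sublogs = split_log(log, operator, parts)
    children = [discover(sub, base_case, find_cut, split_log) for sub in sublogs]
    return (operator, children)


def toy_base_case(log):
    """Return a leaf for logs with at most one distinct activity."""
    alphabet = {a for trace in log for a in trace}
    if not alphabet:
        return "tau"  # silent activity for empty behavior
    if len(alphabet) == 1:
        return next(iter(alphabet))
    return None


def toy_find_cut(log):
    """Toy cut: always split off the alphabetically first activity."""
    alphabet = sorted({a for trace in log for a in trace})
    return "seq", [{alphabet[0]}, set(alphabet[1:])]


def toy_split_log(log, operator, parts):
    """Toy split: project every trace onto each part of the cut."""
    return [[tuple(a for a in trace if a in part) for trace in log]
            for part in parts]
```

Calling `discover([("a", "b", "c")], toy_base_case, toy_find_cut, toy_split_log)` yields the nested tuple `("seq", ["a", ("seq", ["b", "c"])])`. A real instantiation would replace the toy plug-ins with the definitions of Sect. III-A to III-C.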

III-A Filtering

The IMf algorithm [6] introduced filtering infrequent edges from when no cut can be found in . IMf filters locally, by removing all outgoing edges with . This filters non-uniformly, as shown in Fig. 2(left) where edges and of same frequency are partially filtered (filtered edges in red). This impairs IMf’s ability to control model complexity (R1) in a uniform way.

Fig. 2: Filtering of Fig. 1 by IMf with (left) and our method with (right).

We chose to adopt percentile-based filtering for a more uniform control of model complexity. sorts all edges in and by their frequency, retains only the top of frequent (in)directly follows edges and discards all others; disconnected nodes are removed. Filtering Fig. 1(left) in this way removes edges uniformly as shown in Fig. 2(right). We use this method to parameterize the model complexity (R1).
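A minimal sketch of such a global, rank-based filter is given below. It assumes that edges are given as a dictionary from (source, target) pairs to frequencies, that the parameter is a fraction of the ranked edges, and that edges tied with the cutoff frequency are kept together so that equally frequent edges are treated uniformly. The names `filter_edges` and `connected_nodes` and the tie handling are assumptions, not the paper's definition.

```python
import math


def filter_edges(freq, keep_fraction):
    """Keep the most frequent edges, ranked globally over the whole graph.

    freq: dict mapping (source, target) to a frequency.
    keep_fraction: fraction in (0, 1] of ranked edges that sets the cutoff.
    Edges with the same frequency as the cutoff edge are all kept.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not freq:
        return {}
    ranked = sorted(freq.values(), reverse=True)
    k = math.ceil(keep_fraction * len(ranked))
    cutoff = ranked[k - 1]
    return {edge: f for edge, f in freq.items() if f >= cutoff}


def connected_nodes(edges):
    """Nodes that still have at least one edge after filtering."""
    return {node for edge in edges for node in edge}
```

In contrast to a per-node threshold, the cutoff here is a single frequency for the whole graph, so two edges with the same frequency are either both kept or both removed.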

III-B Cut Detection

For FindCut, we extend the basic idea of IMc [12] with principles of Heuristic Mining. First, we compute for each pair of activities a score how likely it is that and are related by in . For a given partition of the activities in , we then can compute an aggregated score , i.e., the likelihood that the activities and form two blocks related by in . We then determine the cut with maximal.

III-B1 Activity Relation Scores

The lattice in Fig. 3 illustrates the basic idea for determining the likelihood . The IMc [12] introduced the mutually exclusive conditions over and shown above the horizontal line to describe when and are related by a particular operator in the absence of deviations, e.g., and are related by if neither follows the other (indirectly). We translated these into the fuzzy conditions over how often and (in)directly follow each other shown below the horizontal line, e.g., and are related by if they rarely follow each other (indirectly).

Fig. 3: Activity Relation Lattice.

To quantify the likelihood that relates and we defined the following formulas; as for IMc [12] for we distinguish between ( directly follows in a loop) and ( indirectly follows in a loop). In addition to and we write for the number of times occurs in .

Definition 1 (activity relation scores).

In contrast to IMc [12], the scores are not probabilities with but heuristics measuring which of the fuzzy conditions shown in Fig. 3 is most likely to hold.

For , we compare how often and occur alone, i.e., and with how often they occur together, i.e., . If either only or occurs in a trace, then , and ; otherwise the second term grows and approaches . In , .

For , we adopt the directional dependency heuristic of the Structured Heuristic Miner [5]. is before in a sequence if the evidence for “a precedes b” significantly exceeds the evidence for “b precedes a” . To keep our scores within the 0 to 1 range, we normalize the difference and truncate negative values to 0. For any activities and , ; is maximal if . In , .

For , we adopt the heuristics of IMc [12] and Heuristic Miner [18]. The more equal and are, the more evidence is in for a parallel relation, and the higher is . We take the minimum of and to ensure lies between 0 and 1. In , .
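The two descriptions above (a normalized, truncated difference for sequence and a minimum of two quotients for parallelism) can be illustrated with a sketch. The formulas of Definition 1 are not reproduced here; the normalization by the sum plus one follows the well-known dependency measure of the Heuristic Miner and is an assumption, as are the function names and the choice of which counts are passed in.

```python
def sequence_score(forward, backward):
    """Evidence that a precedes b, in [0, 1).

    forward: how often b follows a; backward: how often a follows b.
    Assumed form: difference normalized by (sum + 1), negatives cut to 0.
    """
    score = (forward - backward) / (forward + backward + 1)
    return max(0.0, score)


def parallel_score(a_then_b, b_then_a):
    """Evidence for parallelism, in [0, 1].

    The more equal the two direct-follows counts, the higher the score;
    taking the smaller of the two quotients keeps the value within [0, 1].
    """
    if a_then_b == 0 or b_then_a == 0:
        return 0.0
    return min(a_then_b / b_then_a, b_then_a / a_then_b)
```

Under these assumed forms, `sequence_score(9, 0)` is 0.9 and `sequence_score(5, 5)` is 0.0, while `parallel_score(4, 4)` is 1.0 and `parallel_score(2, 4)` is 0.5.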

For , we compute two scores to distinguish indirect loop relation () from entering/exiting the redo part of the loop (). If is in the loop body and enters the redo part of a loop ( is a redo activity), then we see (entering the redo) as often as (returning from the redo to the body), i.e., their quotient in is close to 1. If exits the redo part, then the converse will be close to 1 (see Fig. 3). An indirect loop relation is likely when and indirectly follow and precede with similar frequency.

The activity scores have the property that if is directly-follows complete and (mostly) holds between and in , then for all operators . If holds, then and for . If holds, then thus for ; also (cf. Sect. II) thus . If holds, then thus and ; correspondingly and exclude and . Distinguishing from and requires counting repetitions [19] which we do when aggregating activity scores.

III-B2 Aggregated Scores for Cuts

We now aggregate the activity relation scores to sets of activities so that we can search for cuts with being maximal. We write for the bag of scores of activity pairs .

For having no deviations, IMc defined as the average of all activity relation scores in cut , with a special case for  [6]. However, deviations in may give a few pairs a biased score which could lead to a “wrong” average although for most pairs . We therefore introduce a correction term or factor to obtain a lower score when we see evidence of such wrong bias. We first present the formal definitions and provide a full example in Sect. III-D.

For , we know from Sect. III-B1 that is low iff is high ( and related by and not by ). Thus, a high average over is falsely biased towards if some pairs have a significantly lower score than other pairs. We therefore use as correction term for and the standard deviation over a set of activity relation scores.

Definition 2 (aggregate score for , ).

Let ; . is an -cut with score .
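The effect of a standard-deviation correction term can be sketched as follows. The sketch assumes that the correction is subtracted from the mean, uses the population standard deviation, and clamps the result at 0; these three choices and the name `corrected_average` are assumptions for illustration, since the formula of Definition 2 is not reproduced here.

```python
from statistics import mean, pstdev


def corrected_average(scores):
    """Mean of the pair scores, lowered by their spread.

    A cut whose pairs all agree keeps its mean; a cut where a few pairs
    score much lower than the rest is penalized by the standard deviation.
    """
    scores = list(scores)
    return max(0.0, mean(scores) - pstdev(scores))
```

For instance, four pairs scoring 0.7 each yield 0.7, whereas the scores 0.9, 0.9, 0.9, 0.1 have the same mean but a corrected score of roughly 0.35, so the uniform cut is preferred.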

For , we know from Sect. III-B1 and Def. 1 that (1) if is low then can be in or relation, and vice versa, and that (2) we cannot reliably distinguish parallel behavior from loop behavior using only binary activity relations.

Distinguishing from requires explicitly counting repetitions [19], which we achieve as follows. We write for the number of traces and for the total number of events in log . If has no repetition (no loop) then each trace in contains each at most once and . Thus, if has a loop. We thus can use as correction factor to reduce the score (presence of loops), which we bound to by .
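The counting argument can be made concrete with a small sketch. If every trace contains each activity at most once, the total number of events cannot exceed the number of traces times the number of distinct activities; more events than that imply that some trace repeats an activity. The ratio used below, its bounding by `min`, and the name `repetition_factor` are assumptions about the form of the correction factor.

```python
def repetition_factor(log):
    """Factor in (0, 1]; values below 1 indicate repetition in the log.

    log: list of traces (tuples of activity names).
    Assumed form: (#traces * #activities) / #events, capped at 1.
    """
    n_traces = len(log)
    n_events = sum(len(trace) for trace in log)
    alphabet = {a for trace in log for a in trace}
    if n_events == 0:
        return 1.0  # no events, hence no evidence of repetition
    return min(1.0, n_traces * len(alphabet) / n_events)
```

A factor below 1 guarantees that at least one trace repeats an activity; the converse does not hold, since short traces can mask repetitions elsewhere in the log.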

Definition 3 (aggregate score for ).

Let . Then is a -cut with score .

As or only indicates absence of but not presence of , we have to use the inverse to boost a -score (reinforcing presence of loops). IMc [12] computes the score for as average over and as follows. In a loop, the redo part is entered from/exited to body at , respectively. Pairs directly entering/exiting the redo part are scored using , while all other (indirect) pairs are scored using (see Fig. 3).

Definition 4 (aggregated score for ).

Let and . Then is a -cut defining sets , , and .

The scores for are . The aggregated score for is .

The correction term boosts the loop score relative to the inverse which is high when , i.e., shows many repetitions.

III-B3 Cut finding

A naïve method for FindCut computes all activity relation scores for all and then exhaustively searches for a partition with or being maximal, which costs . Heuristics in and allow pruning the search space for significantly, but are omitted due to space limitations. Note that this method always returns a cut (possibly with a low score) while IMf may not find a cut and resort to a flower model [6].
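The exhaustive search can be sketched as below. The sketch enumerates every split of the alphabet into two non-empty sets, tries both orientations (relevant for asymmetric operators), and aggregates by a plain average of the pair scores; the correction terms of Sect. III-B2 are deliberately left out. The names `binary_partitions` and `find_cut` and the callable `score(op, a, b)` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def binary_partitions(alphabet):
    """Yield every split of the alphabet into two non-empty sets, once."""
    items = sorted(alphabet)
    first, rest = items[0], items[1:]
    for r in range(len(rest)):  # the right-hand side keeps at least one item
        for extra in combinations(rest, r):
            left = {first, *extra}
            yield left, set(items) - left


def find_cut(alphabet, score, operators):
    """Return (value, operator, part1, part2) with maximal average score.

    score(op, a, b) is assumed to return the activity relation score of
    a (from part1) and b (from part2) for operator op.
    Requires at least two activities.
    """
    best = None
    for left, right in binary_partitions(alphabet):
        for part1, part2 in ((left, right), (right, left)):
            for op in operators:
                value = mean(score(op, a, b) for a in part1 for b in part2)
                if best is None or value > best[0]:
                    best = (value, op, part1, part2)
    return best
```

The number of candidate partitions grows exponentially with the number of activities, which is why the naive search is only feasible for small alphabets and why a search of this kind always returns some cut, even if its score is low.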

III-C Log Splitting, Recursion, Base Cases, Skipping

Following the IM framework, we next split the log according to the found cut into and . As may not completely fit due to deviations, we base SplitLog on IMf’s [6] method which filters from and events that deviate from . Filtering may result in empty traces , denoting that the behavior in can also be skipped. Then PIM is invoked on and , returning a left and right sub-tree under (see start of Sect. III).

We use the IM base cases, with some extensions regarding the handling of empty behavior in a (sub)log.

  • Single activity (). When the sublog contains only a single activity, that activity is returned as a leaf node of the process tree.

  • No Activities (). If the sublog does not contain any activities, a silent activity () is returned as leaf node of the process tree.

  • Skip Sublog. When the sublog contains empty traces , a skip sub-tree is returned as explained next.

IMf’s BaseCase always filters from and returns if wrt. threshold . This either forgets that some empty behavior has been seen or results in many -skips across all subtrees which lowers precision [2]. We delay generating as follows. If , we keep all empty traces in but ignore them when computing , , and . Upon log splitting, and both get all empty traces and filtering may add further ones. Only when we accumulated empty traces, i.e., the majority of the behavior in is skipping, Skip Sublog introduces the which addresses (R2).

III-D Example

Fig. 4: Cut vs

Consider with filtered DFG in Fig. 2(right) and cuts and in Fig. 4. The mean scores (as in IMc) are , because has strong relations to . However, shows that some other relation holds between and and is falsely biased towards ; this shows in . In contrast, is high for all due to or (no other activity is likely), and . Thus, and , making the more likely cut.

Fig. 5: Cut

After log splitting and applying BaseCase on , PIM finds on and of in Fig. 5. Log splitting filters from BaseCase returns . On .

Subsequently, PIM calculates and chooses cut , see Fig. 6. For the right-hand side, PIM finds with base cases for and , see Fig. 6, respectively. In and of of the left-hand side (Fig. 6) we see the effect of correction term for distinguishing and .

Consider cuts and in Fig. 6. Without correction, and . As , the correction factors yield and , and PIM returns . After log splitting, PIM finds on at the “correct” location (see Fig. 6). Applying BaseCase then results in the process tree of Fig. 1.


Fig. 6: Cuts - show effect of correction factor in Defs. 2-4.

III-E Complexity

The running time complexity of PIM depends on the size of the event log , and the size of its alphabet as follows. Per recursion step, BaseCase is executed once, constructing and filtering and takes , FindCut takes (see Sect. III-B3), and SplitLog takes , decreasing the size of the alphabet in each sublog by at least one. Thus, there are at most invocations of PIM resulting in running-time complexity of . As discussed in Sect. III-B3, heuristics over and reduce the search space for FindCut significantly as measured in our experiments.

IV Evaluation

We implemented PIM in the UiPath Process Mining platform and compared it to the state-of-the-art on a standard benchmark [2] and on additional data more similar to regular use cases encountered in industrial practice (IV-A). Additionally, we performed an empirical evaluation about end user preferences for models produced by various algorithms (IV-B).

IV-A Quantitative Evaluation

Setup. The existing benchmark [2] uses a set of non-synthetic event logs to comparatively evaluate 7 automated process discovery methods, of which Evolutionary Tree Miner (ETM) [13], IMf [6], and Split Miner (SM) [16] outperformed the other methods. We therefore compare PIM to ETM, IMf, and SM on the public benchmark event logs of [2], see Tab. I. Model accuracy is measured by fitness, precision, and their F-score; simplicity is measured by size, control-flow complexity (CFC), structuredness, and soundness.
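For reference, the F-score combines fitness and precision as their harmonic mean. The helper below is a minimal sketch; the name `f_score` and the convention of returning 0 when both inputs are 0 are assumptions.

```python
def f_score(fitness, precision):
    """Harmonic mean of fitness and precision, both assumed in [0, 1]."""
    if fitness + precision == 0:
        return 0.0  # convention chosen here to avoid division by zero
    return 2 * fitness * precision / (fitness + precision)
```

Because the harmonic mean is dominated by the smaller value, a model that trades some fitness for much higher precision can reach the same F-score as a well-fitting but imprecise model, e.g. `f_score(0.5, 1.0)` is about 0.67.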


However, the benchmark is not representative of industrial workloads: (1) most event logs were prefiltered to reduce complexity [2] while techniques in industrial practice face unfiltered logs, (2) industrial logs are larger, and (3) the benchmark lacks certain process types found in practice. We thus included 4 additional unfiltered datasets: 2 public datasets, BPIC14 and BPIC17LC (distinguishing activity life-cycles), and 2 proprietary datasets (Invoice and P2P); see Tab. I.

Log Name    Total Traces    Distinct Traces (%)    Total Events    Distinct Events
BPIC12      13,087          33.4                   262,200         36
BPIC13cp    1,487           12.3                   6,660           7
BPIC13inc   7,554           20.0                   65,533          13
BPIC14f     41,353          36.1                   369,485         9
BPIC151f    902             32.7                   21,656          70
BPIC152f    681             61.7                   24,678          82
BPIC153f    1,369           60.3                   43,786          62
BPIC154f    860             52.4                   29,403          65
BPIC155f    975             45.7                   30,030          74
BPIC17f     21,861          40.1                   714,198         41
RTFMP       150,370         0.2                    561,470         11
SEPSIS      1,050           80.6                   15,214          16
BPIC14      46,616          48.5                   466,737         39
BPIC17LC    31,509          53.0                   1,202,267       66
Invoice     24,450          1.6                    133,452         15
P2P         616,717         15.5                   5,583,650       46
TABLE I: Statistics of the public event logs, extracted from [2], followed by the additional event logs.

We ran PIM twice: (1) bounding the naïve FindCut with complexity (Sect. III-B3) to only find cuts over the 30 most frequent activities to adhere to the 4 hour computation limit set in [2] (PIM30); (2) using fast heuristics in FindCut for which no activity bound was required; in both cases to filter only the most infrequent behavior. On the 4 additional logs, we ran IMf and SM with default parameters, but omitted ETM due to its excessive running times.

Results. PIM and PIM30 discovered sound, block-structured models on all event logs while SM returned unsound models on BPIC14 and BPIC17LC.

Figure 7 shows fitness, precision, F-score, size, CFC, and running times for ETM, IMf, SM as reported in [2] and for PIM30 and PIM for all datasets; the Appendix shows individual measures. The scatter plots in Fig. 8 and Fig. 9 visualize fitness, precision, and F-score against size and CFC (normalized against the largest size/CFC measured per log).

On the benchmark event logs, PIM sacrifices fitness for increased precision (11/12 logs better than IMf, 9/12 same or better than SM, 5/12 highest precision) and F-score (10/12 better than IMf, 4/12 highest, and 5/12 scoring 2nd best close to SM) while improving size (8/12 better than IMf, 4/12 better than SM) and CFC (10/12 better than IMf, 6/12 same or better than SM). PIM30 sacrifices fitness even more and further reduces size and CFC due to activity filtering. PIM consistently shows higher fitness (12/12) and higher F-score (11/12) than PIM30 but also shows higher precision on 7/12 logs. PIM’s and PIM30’s results are more similar to those of ETM wrt. accuracy than to those of IMf and SM, but they find larger models as well as smaller models than ETM does. The scatter plots in Fig. 8 show that PIM strikes a novel balance in the pareto-front, having more results with high precision and low size/CFC (top-left corner) compared to IMf and SM while the overall F-score is comparable to all other techniques. PIM30 either finds a solution within a second (5/12 logs) or takes several hours to complete, whereas PIM completes in less than a second for all data sets (fastest for 11/12) due to the heuristics in FindCut.

Fig. 7: Fitness, precision, F-score, size, CFC, and running time for ETM, IMf, SM [2], PIM30, and PIM.

Fig. 8: Fitness, precision, F-score vs normalized size and CFC of ETM, IMf, SM, PIM30, and PIM on benchmark.

On the 4 additional datasets, SM achieves highest fitness and PIM sacrifices fitness most. PIM30 (and PIM) clearly outperforms IMf and SM in precision, size, and CFC on all (3/4) datasets, yielding F-scores comparable to IMf and SM (see Fig. 7); PIM models are slightly larger and less accurate than PIM30 models. The scatter plots in Fig. 9 reinforce the prior finding even more strongly: PIM30 and PIM reach a novel area in the pareto-front of quality criteria by striking a much better balance in precision and simplicity not reached by other techniques while retaining comparable F-scores.

Fig. 9: F-score vs normalized control-flow complexity of IMf, SM, PIM30 on additional benchmark.

Parameter sensitivity. To verify that this is not due to parameter choices, we evaluated how accuracy and size change when changing IMf’s, SM’s, and PIM’s parameter for filtering infrequent behavior on the 4 additional datasets. Increasing filtering for SM traded fitness for precision, which results in near-constant accuracy and in up to 50% fewer edges for models returned by SM and a corresponding reduction of CFC. However, SM size and CFC remained significantly higher than those of PIM with . Increasing filtering for IMf led to unpredictable changes in accuracy and size; size and complexity always remained above PIM with . Thus SM and IMf cannot reach the particular spot of simplicity vs accuracy reached by PIM. Lowering PIM’s filtering parameter allowed reducing model size and CFC down to the 2 most frequent activities; model size and complexity thereby fall off in a negative exponential curve as many infrequent edges in and that contribute to complexity appear to be “equally infrequent” and get filtered out together when lowering . This confirms that PIM’s models remain simpler also under varying filtering parameters. Moreover, both the effective filtering parameter and the option to consider only the most frequent activities in cut detection allow controlling model detail and complexity, satisfying (R1).

IV-B Empirical Evaluation

Setup. We performed an empirical evaluation to test whether the particular spot of high accuracy and simplicity taken by PIM indeed achieves R2: that the models produced by PIM only show structures for which there is significant evidence in the data. We provided participants (11 professionals, 4 researchers, 2 students) with the following. (1) A visualization of the most frequent trace variants with activities color-coded as generated by ProM’s “Explore Event Log” visualizer that could fit legibly on one A4 sheet for RTFMP (99% of the log), Invoice (94% of the log), and BPIC17cp (BPIC17LC filtered to only “complete” events, 55% of the log). (2) Automatically laid-out diagrams of BPMN translations of the respective models produced by IMf, SM, and PIM (setup as in Sect. IV-A), with algorithm anonymized and in random order. (3) Instructions to highlight with pens of two distinct colors all model structures they consider “good” (understand and have sufficient evidence in the data) or “bad” (do not understand or lack evidence). (4) A questionnaire to rank model preference and indicate their reasons. No time limit was given.

Results. Fig. 10 shows how often the model of the respective algorithm was chosen as best model per dataset. Whereas SM was the most-preferred model for RTFMP, PIM was preferred most for Invoice and BPIC17cp. We aggregated the individual participants’ highlighting of “good” and “bad” model structures into heat maps. Fig. 11 shows the heat maps for the PIM and SM models of BPIC17cp; App. A shows all heat maps.

From the heat maps and questionnaires, we observed that the participants generally trusted and understood model structures produced by PIM more than model structures produced by SM and IMf (cf. Fig. 10b vs 10d). Models by PIM were sporadically described as representing too little of the data, but never too much of the data. Of the 51 written motivations justifying the participants’ model ranking, 30 (58.8%) justified their choice for PIM due to simplicity, clarity, or readability of the models. These results together confirm the particular usefulness of PIM’s unique balance in fitness vs precision vs simplicity seen in Sect. IV-A, satisfying R2.

Fig. 10: Favorite model choices made by the participants of the questionnaire for (a) RTFMP, (b) Invoice, and (c) BPIC17cp.

Fig. 11: Markings made on the BPIC17cp models discovered by PIM (a: good, b: bad) and SM (c: good, d: bad).

V Conclusion

We combined principles of the IMc [12] algorithm of the IM framework [14] with ideas from heuristic mining [18] to design the process discovery algorithm PIM, which returns block-structured models of high simplicity and accuracy even on large, unfiltered event logs. PIM strikes a novel balance between higher precision, significantly lower model complexity, and comparable overall accuracy (F-score) compared to existing techniques, thereby reaching a new area in the pareto-front of model quality metrics. Our empirical evaluation confirms that this pareto-front area represents models that are considered user-friendly and accurate for the data, which was a central requirement raised by industrial analysts in a Delphi study prior to algorithm development. The models found by PIM are preferred over models produced by other state-of-the-art techniques; deviation analysis is then possible through model enhancement [9] such as visual alignments [11] (cf. App. B).

Threats to validity. The standard benchmark [2] only contains filtered variants of real-life event logs, changing the nature of the problem, which we offset by additional but fewer unfiltered event logs. All participants of the empirical evaluation were based in the Eindhoven area in the Netherlands, possibly introducing bias in preferences.

Future work. Algorithmically, PIM seems to inherently favor precision over fitness while IMf’s and SM’s notable quality is ensuring fitness and high fitness/precision balance, respectively. It is worth exploring whether adding further parameters to PIM allows producing fully fitting and precise models, possibly at the expense of simplicity.

However, in the qualitative evaluation, participants expressed a clear preference for simplicity and accuracy over fitness. This warrants reconsideration of the kinds of models automated process discovery techniques should produce. Process mining may need a focus shift: discovery of fully fitting models is no longer desirable if users do not find them usable.


  • [1] J. De Weerdt, M. De Backer, J. Vanthienen, and B. Baesens, “A multi-dimensional quality assessment of state-of-the-art process discovery algorithms using real-life event logs,” Information Systems, 2012.
  • [2] A. Augusto, R. Conforti, M. Dumas, M. La Rosa, F. M. Maggi, A. Marrella, M. Mecella, and A. Soo, “Automated Discovery of Process Models from Event Logs: Review and Benchmark,” IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 4, pp. 686–705, 2019.
  • [3] W. M. P. van der Aalst, Process Mining: Data Science in Action.   Springer, 2016.
  • [4] J. C. A. M. Buijs, B. F. van Dongen, and W. M. P. van der Aalst, “Quality dimensions in process discovery: The importance of fitness, precision, generalization and simplicity,” Int. J. Cooperative Inf. Syst., vol. 23, no. 1, 2014. [Online]. Available:
  • [5] A. Augusto, R. Conforti, M. Dumas, M. La Rosa, and G. Bruno, “Automated discovery of structured process models from event logs: The discover-and-structure approach,” Data and Knowledge Engineering, vol. 117, no. April, pp. 373–392, 2018. [Online]. Available:
  • [6] S. J. J. Leemans, D. Fahland, and W. M. P. van der Aalst, “Discovering block-structured process models from event logs containing infrequent behaviour,” in BPM 2013 Workshops, ser. LNBIP, vol. 171.   Springer, pp. 66–78.
  • [7] J. Mendling, H. A. Reijers, and W. M. P. van der Aalst, “Seven process modeling guidelines (7PMG),” Inf. Softw. Technol., vol. 52, no. 2, pp. 127–136, 2010. [Online]. Available:
  • [8] W. M. van der Aalst, “A practitioner’s guide to process mining: Limitations of the directly-follows graph,” Procedia Computer Science, vol. 164, pp. 321–328, 2019, CENTERIS/ProjMAN/HCist 2019. [Online]. Available:
  • [9] S. J. J. Leemans, D. Fahland, and W. M. P. van der Aalst, “Exploring processes and deviations,” in BPM 2014 Workshops, ser. LNBIP, vol. 202.   Springer, 2014, pp. 304–316.
  • [10] J. Carmona, B. F. van Dongen, A. Solti, and M. Weidlich, Conformance Checking - Relating Processes and Models.   Springer, 2018. [Online]. Available:
  • [11] B. d. Bie, “Visual Conformance Checking using BPMN,” Master thesis, Eindhoven University of Technology, 2019.
  • [12] S. J. J. Leemans, D. Fahland, and W. M. P. van der Aalst, “Discovering block-structured process models from incomplete event logs,” in Petri Nets 2014, ser. LNCS, vol. 8489.   Springer, 2014, pp. 91–110.
  • [13] J. C. Buijs, B. F. Van Dongen, and W. M. Van Der Aalst, “Quality dimensions in process discovery: The importance of fitness, precision, generalization and simplicity,” International Journal of Cooperative Information Systems, vol. 23, no. 1, 2014.
  • [14] S. J. J. Leemans, D. Fahland, and W. M. P. van der Aalst, “Discovering block-structured process models from event logs - A constructive approach,” in Petri Nets 2013, ser. LNCS, vol. 7927.   Springer, 2013, pp. 311–329.
  • [15] S. K. vanden Broucke and J. De Weerdt, “Fodina: A robust and flexible heuristic process discovery technique,” Decision Support Systems, vol. 100, pp. 109–118, 2017.
  • [16] A. Augusto, R. Conforti, M. Dumas, and M. L. Rosa, “Split miner: Discovering accurate and simple business process models from event logs,” in Proceedings - IEEE International Conference on Data Mining, ICDM, vol. 2017-Novem, 2017, pp. 1–10.
  • [17] A. Augusto, M. Dumas, and M. L. Rosa, “Automated discovery of process models with true concurrency and inclusive choices,” in ICPM 2020 Workshops, ser. LNBIP, vol. 406.   Springer, 2020, pp. 43–56. [Online]. Available:
  • [18] A. Weijters, W. M. P. van der Aalst, and A. K. A. de Medeiros, “Process Mining with the HeuristicsMiner Algorithm,” 2006.
  • [19] S. J. Leemans and D. Fahland, “Information-preserving abstractions of event data in process mining,” Knowledge and Information Systems, 2019.

Appendix A Detailed Evaluation Results

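| Log | Method | Fitness | Precision | F-score | Gen. 3-fold | Size | CFC | Structured | Sound | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| BPIC12 | IMf | .98 | .50 | .66 | .98 | 59 | 37 | yes | yes | 6.6 |
| | ETM | .44 | .82 | .57 | t/o | 67 | 16 | yes | yes | 14,400 |
| | SM | .75 | .76 | .76 | .75 | 53 | 32 | no | yes | .58 |
| | PIM | .63 | .76* | .69* | .64 | 60 | 28* | yes | yes | 9330 |
| | PIM | .78 | .93* | .85* | | 67 | 32* | yes | yes | .63 |
| BPIC13cp | IMf | .82 | 1.00 | .90 | .82 | 9 | 4 | yes | yes | .1 |
| | ETM | 1.00 | .70 | .82 | t/o | 38 | 38 | yes | yes | 14,400 |
| | SM | .94 | .97 | .96 | .94 | 12 | 7 | yes | yes | .03 |
| | PIM | .72 | .89 | .80 | .71 | 19 | 8 | yes | yes | .01 |
| | PIM | .98 | .74 | .84 | | 15 | 8 | yes | yes | .01 |
| BPIC13inc | IMf | .92 | .54 | .68 | .92 | 13 | 7 | yes | yes | 1.0 |
| | ETM | 1.00 | .51 | .68 | t/o | 32 | 144 | yes | yes | 14,400 |
| | SM | .91 | .98 | .94 | .91 | 13 | 9 | yes | yes | .23 |
| | PIM | .55 | .50 | .52 | .54 | 23 | 15 | yes | yes | .1 |
| | PIM | .62 | .93* | .74* | | 18 | 10 | yes | yes | .07 |
| BPIC14f | IMf | .89 | .64 | .74 | .89 | 31 | 18 | yes | yes | 3.4 |
| | ETM | .61 | 1.00 | .76 | t/o | 23 | 9 | yes | yes | 14,400 |
| | SM | .76 | .67 | .71 | .76 | 27 | 16 | no | yes | .59 |
| | PIM | .69 | .94* | .80* | .69 | 19* | 8* | yes | yes | .43 |
| | PIM | .74 | .69* | .71 | | 20* | 9* | yes | yes | .07 |
| BPIC151f | IMf | .97 | .57 | .71 | .96 | 164 | 108 | yes | yes | .6 |
| | ETM | .56 | .94 | .70 | t/o | 67 | 19 | yes | yes | 14,400 |
| | SM | .90 | .88 | .89 | .90 | 110 | 43 | no | yes | .48 |
| | PIM | .55 | .91* | .69 | .55 | 46* | 11* | yes | yes | 8246 |
| | PIM | .79 | .95* | .86* | | 93* | 29* | yes | yes | .15 |
| BPIC152f | IMf | .93 | .56 | .70 | .94 | 193 | 123 | yes | yes | .7 |
| | ETM | .62 | .91 | .74 | t/o | 95 | 32 | yes | yes | 14,400 |
| | SM | .77 | .90 | .83 | .77 | 122 | 41 | no | yes | .25 |
| | PIM | .53 | .91* | .67 | .53 | 44* | 10* | yes | yes | 7425 |
| | PIM | .66 | .91* | .77* | | 124* | 56* | yes | yes | .23 |
| BPIC153f | IMf | .95 | .55 | .70 | .95 | 159 | 108 | yes | yes | 1.3 |
| | ETM | .68 | .88 | .76 | t/o | 84 | 29 | yes | yes | 14,400 |
| | SM | .78 | .94 | .85 | .78 | 90 | 29 | no | yes | .36 |
| | PIM | .62 | .98* | .76* | .62 | 40* | 8* | yes | yes | 7072 |
| | PIM | .68 | .97* | .80* | | 95* | 33* | yes | yes | .19 |
| BPIC154f | IMf | .96 | .58 | .73 | .96 | 162 | 111 | yes | yes | .7 |
| | ETM | .65 | .93 | .77 | t/o | 83 | 28 | yes | yes | 14,400 |
| | SM | .73 | .91 | .81 | .73 | 96 | 31 | no | yes | .25 |
| | PIM | .59 | .98* | .74* | .60 | 40* | 7* | yes | yes | 7331 |
| | PIM | .73 | .92* | .81* | | 109* | 45* | yes | yes | .16 |
| BPIC155f | IMf | .94 | .18 | .30 | .94 | 134 | 95 | yes | yes | 1.5 |
| | ETM | .57 | .94 | .71 | t/o | 88 | 18 | yes | yes | 14,400 |
| | SM | .79 | .94 | .86 | .79 | 102 | 30 | no | yes | .27 |
| | PIM | .52 | .99* | .68* | .51 | 41* | 7* | yes | yes | 7061 |
| | PIM | .65 | .93* | .77* | | 113* | 41* | yes | yes | .19 |
| BPIC17f | IMf | .98 | .70 | .82 | .98 | 35 | 20 | yes | yes | 13.3 |
| | ETM | .76 | 1.00 | .86 | t/o | 42 | 4 | yes | yes | 14,400 |
| | SM | .95 | .85 | .90 | .95 | 32 | 17 | no | yes | 2.53 |
| | PIM | .80 | .89* | .84* | .80 | 48 | 15* | yes | yes | 8392 |
| | PIM | .86 | .99 | .92* | | 51 | 15* | yes | yes | 1.01 |
| RTFMP | IMf | .99 | .70 | .82 | .99 | 34 | 20 | yes | yes | 10.9 |
| | ETM | .99 | .92 | .95 | t/o | 57 | 32 | yes | yes | 14,400 |
| | SM | .99 | 1.00 | 1.00 | 1.00 | 22 | 16 | no | yes | 1.25 |
| | PIM | .76 | .87* | .81 | .76 | 25* | 12* | yes | yes | .011 |
| | PIM | .76 | 1.00 | .86 | | 16* | 6* | yes | yes | .01 |
| SEPSIS | IMf | .99 | .45 | .62 | .96 | 50 | 32 | yes | yes | .4 |
| | ETM | .83 | .66 | .74 | t/o | 108 | 101 | yes | yes | 14,400 |
| | SM | .73 | .86 | .79 | .73 | 31 | 20 | no | yes | .05 |
| | PIM | .77 | .63* | .69* | .77 | 36* | 19* | yes | yes | .463 |
| | PIM | .84 | .89* | .87* | | 24* | 11* | yes | yes | .03 |

TABLE II: Evaluation results on IMf, ETM, SM, and the two PIM variants; bold = best value per log; * = PIM improvement over IMf.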

Detailed results on the benchmark event logs are given in Table II. Detailed results on the additional event logs are given in Table III. Figure 12 shows the heat maps of user annotations regarding what users considered “good” (understood and trusted) and “bad” (too complex, not trusted) model structures for the RTFMP, Invoice, and BPIC17cp logs.
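As a reading aid for the F-score column, the short sketch below recomputes it from the fitness and precision columns. It assumes the F-score is the harmonic mean of fitness and precision, as is standard in the benchmark [2]; the function name `f_score` is hypothetical. Because the tables report rounded inputs, a recomputed value can differ from the printed one in the last digit.

```python
def f_score(fitness: float, precision: float) -> float:
    """Harmonic mean of fitness and precision (assumed F-score definition)."""
    if fitness + precision == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * fitness * precision / (fitness + precision)


# (fitness, precision) pairs for BPIC12, copied from Table II.
bpic12 = {
    "IMf": (0.98, 0.50),
    "ETM": (0.44, 0.82),
    "PIM (first row)": (0.63, 0.76),
}

for method, (fit, prec) in bpic12.items():
    print(f"{method}: {f_score(fit, prec):.2f}")
```

The harmonic mean penalizes imbalance: IMf on BPIC12 has near-perfect fitness, yet its low precision pulls the F-score well below the arithmetic mean of the two values.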

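| Log | Method | Fitness | Precision | F-score | Size | Edges | CFC | Structured | Sound | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| BPIC14 | IM | .90 | .61 | .73 | 76 | 120 | 56 | yes | yes | 8.27 |
| | SM | .97 | .48 | .65 | 100 | 266 | 194 | no | no | .78 |
| | PIM | .52 | .64 | .57 | 32 | 43 | 17 | yes | yes | 74.64 |
| | PIM | .57 | .45 | .5 | 66 | 92 | 11 | yes | yes | .94 |
| BPIC17LC | IM | .85 | .15 | .25 | 168 | 247 | 103 | yes | yes | 22.64 |
| | SM | .96 | .73 | .83 | 139 | 201 | 100 | no | no | 1.18 |
| | PIM | .68 | .95 | .79 | 52 | 61 | 17 | yes | yes | 9,757 |
| | PIM | .86 | .91 | .88 | 81 | 102 | 36 | yes | yes | 3.2 |
| BPIC17cp | IM | t/o | t/o | t/o | 60 | 87 | 37 | yes | yes | 4.82 |
| | SM | .92 | .986 | .95 | 51 | 69 | 21 | no | yes | 0.41 |
| | PIM | .86 | .999 | .92 | 37 | 43 | 14 | yes | yes | 21.1 |
| | PIM | .90 | .98 | .94 | 37 | 45 | 14 | yes | yes | 0.2 |
| Invoice | IM | .88 | .76 | .82 | 34 | 44 | 18 | yes | yes | .98 |
| | SM | 1.00 | .92 | .96 | 41 | 64 | 36 | no | yes | .25 |
| | PIM | .88 | .96 | .92 | 17 | 19 | 6 | yes | yes | .01 |
| | PIM | .61 | .94 | .73 | 19 | 23 | 8 | yes | yes | .01 |
| P2P | IM | .80 | .77 | .78 | 101 | 134 | 70 | yes | yes | 164.07 |
| | SM | 1.00 | .61 | .76 | 114 | 287 | 207 | no | no | 8.89 |
| | PIM | .69 | .94 | .80 | 35 | 45 | 19 | yes | yes | 3.09 |
| | PIM | .74 | .83 | .78 | 37 | 50 | 20 | yes | yes | 3.1 |

TABLE III: Evaluation results of IMf, SM, and the two PIM variants on 4 additional event logs; bold = best value per log.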

Fig. 12: Heatmaps of “good” and “bad” model structures for the RTFMP event log (a)-(c), Invoice event log (d)-(f), and BPIC17cp event log (g)-(i). Panels for each log: PIM30, IMf, SM.

Appendix B Example of Visual Alignments

Fig. 13 shows an example of a complete visual alignment [11] of BPIC17cp on the PIM model in the UiPath Process Mining platform; blue nodes and edges describe the discovered model; yellow edges are the visual alignment showing the behavior in the event log deviating from the discovered model.

Fig. 13: Visual Alignments of BPIC17cp on the PIM model in UiPath Process Mining