1 Introduction

Process Mining van der Aalst (2016) is a scientific discipline that bridges the gap between process analytics and data analysis and focuses on the analysis of event data logged during the execution of a business process. Events contain information on what was done, by whom, for whom, where, when, etc. Such event data is often readily available from information systems such as ERP, CRM, or BPM systems. Process discovery, which plays a prominent role in process mining, is the task of automatically generating a process model that accurately describes a business process based on such event data. Many process discovery techniques have been developed over the last decade (e.g. Buijs et al (2012); Goedertier et al (2009); Günther and van der Aalst (2007); Herbst (2000); Leemans et al (2013b); Solé and Carmona (2013); van Zelst et al (2015)), producing process models in various forms, such as Petri nets Murata (1989), process trees Buijs et al (2012), and BPMN models Object Management Group (2011).

Figure 1b shows an example process model from van der Aalst (2016) that describes a compensation request process. The process model consists of eight process steps (called activities): (A) register request, (B) examine thoroughly, (C) examine casually, (D) check ticket, (E) decide, (F) reinitiate request, (G) pay compensation, and (H) reject request. Figure 1a shows a small example event log consisting of six execution trails of the process model. The Inductive Miner Leemans et al (2013b) process discovery algorithm guarantees that it can rediscover the process model from an event log, provided that all pairs of activities that can directly follow each other in the process are present in the event log, i.e., the log is directly-follows complete. Since the log in Figure 1a is directly-follows complete, applying the Inductive Miner to this log results in the process model in Figure 1b, which generated the log.
However, the presence of activities that can occur spontaneously at any point in the process execution, which we will call chaotic activities, substantially impacts the quality of the process models obtained with process discovery techniques. Figure 2a contains the event log obtained from the one in Figure 1a by adding activity (X) customer calls at random points, since customers can call the call center multiple times at any point during the execution of the process. Figure 2b shows the process model discovered by the Inductive Miner Leemans et al (2013b) from the event log of Figure 2a. The process model discovered from the “clean” example log without activity X (Figure 1b) is simple, interpretable, and accurate with respect to the behavior allowed in the process. In contrast, the process model discovered from the log containing X (Figure 2b) is complex, hard to interpret, and overgeneralizes by allowing behavior that is not possible in the process. We consider X to be a so-called chaotic activity because it does not have a clear position in the process model and it complicates the discovery of the rest of the process. The reason for the decline in the quality of process models discovered from logs with chaotic activities is that the directly-follows relations, which many process discovery algorithms operate on, are affected by chaotic activities. Examples of such process discovery algorithms include the Inductive Miner Leemans et al (2013a), the Heuristics Miner Weijters and Ribeiro (2011), and Fodina vanden Broucke and De Weerdt (2017). In a sequence of activities $\langle \dots, A, C, \dots \rangle$, where $A$ was directly followed by $C$, the addition of a chaotic activity $X$ can turn the sequence into $\langle \dots, A, X, C, \dots \rangle$, thereby obfuscating the directly-follows relation between activities A and C.

In this paper, we show that existing approaches do not solve the problem of chaotic activities and we present a technique to handle the issue. This paper is structured as follows: in Section 2 we introduce basic concepts used throughout the paper. In Section 3 we propose an approach to filter out chaotic activities. In Section 4 we evaluate our technique using synthetic data where we artificially insert chaotic activities and check whether the filtering techniques can filter out the inserted chaotic activities. Additionally, Section 4 proposes a methodology to evaluate activity filtering techniques in a real-life setting where there is no ground truth knowledge on which activities are truly chaotic, and motivates this methodology by showing that its results are consistent with the synthetic evaluation on the synthetic datasets. In Section 5 the results on a collection of seventeen real-life event logs are discussed. In Section 6 we discuss how the activity filtering techniques can be used in a toggle-based approach for human-in-the-loop process discovery. In Section 7 we discuss related techniques in the domains of process discovery and the filtering of event logs. Section 8 concludes this paper and discusses several directions for future work.
2 Preliminaries
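In this section, we introduce the concepts and notation used throughout this paper.
$X = \{a_1, a_2, \dots, a_n\}$ denotes a finite set. $\mathcal{P}(X)$ denotes the power set of $X$, i.e., the set of all possible subsets of $X$. $X \setminus Y$ denotes the set of elements that are in set $X$ but not in set $Y$, e.g., $\{a,b,c\} \setminus \{a,c\} = \{b\}$. $X^*$ denotes the set of all sequences over a set $X$ and $\sigma = \langle a_1, a_2, \dots, a_n \rangle$ denotes a sequence of length $n$, with $\sigma(i) = a_i$ and $\langle\rangle$ the empty sequence. $\sigma{\upharpoonright}_Y$ is the projection of $\sigma$ on $Y$, e.g., $\langle a,b,c,a,b,c \rangle{\upharpoonright}_{\{a,c\}} = \langle a,c,a,c \rangle$. $\sigma_1 \cdot \sigma_2$ denotes the concatenation of sequences $\sigma_1$ and $\sigma_2$, e.g., $\langle a,b,c \rangle \cdot \langle d,e \rangle = \langle a,b,c,d,e \rangle$.
A partial function $f \in X \nrightarrow Y$ with domain $dom(f)$ can be lifted to sequences over $X$ using the following recursive definition: (1) $f(\langle\rangle) = \langle\rangle$; (2) for any $\sigma \in X^*$ and $x \in X$: $f(\sigma \cdot \langle x \rangle) = f(\sigma)$ if $x \notin dom(f)$, and $f(\sigma \cdot \langle x \rangle) = f(\sigma) \cdot \langle f(x) \rangle$ if $x \in dom(f)$.
A multiset (or bag) over $X$ is a function $B : X \to \mathbb{N}$ which we write as $[a_1^{w_1}, a_2^{w_2}, \dots, a_n^{w_n}]$, where for $1 \le i \le n$ we have $a_i \in X$ and $w_i \in \mathbb{N}^+$. The set of all multisets over $X$ is denoted $\mathcal{B}(X)$.
In the context of process mining, we assume the set of all process activities $\Sigma$ to be given. Event logs consist of sequences of events where each event represents a process activity.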
Definition 1 (Event, Trace, and Event Log)
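An event $e$ in an event log is the occurrence of an activity $e \in \Sigma$. We call a (non-empty) sequence of events $\sigma \in \Sigma^+$ a trace. An event log $L \in \mathcal{B}(\Sigma^+)$ is a multiset of traces.
$L = [\langle a,b,c \rangle^2, \langle b,a,c \rangle^3]$ is an example event log over process activities $\Sigma = \{a,b,c\}$, consisting of two occurrences of trace $\langle a,b,c \rangle$ and three occurrences of trace $\langle b,a,c \rangle$. $\mathit{Activities}(L)$ denotes the set of process activities that occur in $L$, e.g., $\mathit{Activities}(L) = \{a,b,c\}$. $\#(a, L)$ denotes the number of occurrences of activity $a$ in log $L$, e.g., $\#(a, L) = 5$.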
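The definitions of an event log, its activity set, and activity counts can be illustrated with a minimal Python sketch. It assumes that a trace is a tuple of activity names and that a log is a `collections.Counter` mapping traces to their multiplicity; the example log and the function names `activities` and `occurrences` are illustrative assumptions, not part of the original notation.

```python
from collections import Counter

# Hypothetical example log: a multiset of traces, stored as a Counter that
# maps each trace (a tuple of activity names) to its number of occurrences.
example_log = Counter({("a", "b", "c"): 2, ("b", "a", "c"): 3})


def activities(log):
    """Return the set of activities that occur in at least one trace."""
    return {activity for trace in log for activity in trace}


def occurrences(activity, log):
    """Count the events of `activity` over all traces, with multiplicity."""
    return sum(mult * trace.count(activity) for trace, mult in log.items())


print(sorted(activities(example_log)))
print(occurrences("a", example_log))
```

Storing the log as a multiset keeps the multiplicity of each trace variant, which every frequency-based quantity in the remainder of the paper relies on.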
A process model notation that is frequently used in the area of process mining is the Petri net. Petri nets can be automatically transformed into process model notations that are commonly used in business environments, such as BPMN and BPEL Lohmann et al (2009). A Petri net is a directed bipartite graph consisting of places (depicted as circles) and transitions (depicted as rectangles), connected by arcs. A transition describes an activity, while places represent the enabling conditions of transitions. Labels of transitions indicate the type of activity that they represent. Unlabeled transitions ($\tau$-transitions) represent invisible transitions (depicted as gray rectangles), which are only used for routing purposes and are not recorded in the event log.
Definition 2 (Labeled Petri net)
A labeled Petri net $N = \langle P, T, F, \ell \rangle$ is a tuple where $P$ is a finite set of places, $T$ is a finite set of transitions such that $P \cap T = \emptyset$, $F \subseteq (P \times T) \cup (T \times P)$ is a set of directed arcs, called the flow relation, and $\ell : T \nrightarrow \Sigma$ is a partial labeling function that assigns a label to a transition, or leaves it unlabeled (the $\tau$-transitions).
We write $\bullet n$ and $n \bullet$ for the input and output nodes of $n \in P \cup T$ (according to $F$). A state of a Petri net is defined by its marking $m \in \mathcal{B}(P)$ being a multiset of places. A marking is graphically denoted by putting $m(p)$ tokens on each place $p \in P$. State changes occur through transition firings. A transition $t$ is enabled (can fire) in a given marking $m$ if each input place $p \in \bullet t$ contains at least one token. Once $t$ fires, one token is removed from each input place $p \in \bullet t$ and one token is added to each output place $p' \in t \bullet$, leading to a new marking $m' = m - \bullet t + t \bullet$.
A firing of a transition $t$ leading from marking $m$ to marking $m'$ is denoted as step $m \xrightarrow{t} m'$. Steps are lifted to sequences of firing enabled transitions, written $m \xrightarrow{\gamma} m'$, where $\gamma \in T^*$ is a firing sequence.
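The firing rule can be sketched in a few lines of Python. The sketch assumes that markings, presets, and postsets are multisets of place names stored as `collections.Counter` objects; the helper names `is_enabled` and `fire` and the place names are hypothetical.

```python
from collections import Counter


def is_enabled(marking, preset):
    """A transition is enabled if every input place holds enough tokens."""
    return all(marking[place] >= n for place, n in preset.items())


def fire(marking, preset, postset):
    """Return the marking reached by firing a transition, or raise if disabled."""
    if not is_enabled(marking, preset):
        raise ValueError("transition is not enabled in this marking")
    new_marking = Counter(marking)
    new_marking.subtract(preset)   # consume one token per input arc
    new_marking.update(postset)    # produce one token per output arc
    return +new_marking            # unary plus drops places with zero tokens


# Hypothetical transition that splits one token into two parallel branches.
m0 = Counter({"p_start": 1})
m1 = fire(m0, Counter({"p_start": 1}), Counter({"p1": 1, "p2": 1}))
print(m1)
```

A firing sequence is obtained by applying `fire` repeatedly, threading the returned marking into the next call.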
Defining an initial and a set of final markings allows defining the language accepted by a Petri net as a set of finite sequences of activities.
Definition 3 (Accepting Petri Net)
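An accepting Petri net is a triplet $APN = (N, m_0, MF)$, where $N$ is a labeled Petri net, $m_0 \in \mathcal{B}(P)$ is its initial marking, and $MF \subseteq \mathcal{B}(P)$ is its set of possible final markings. A sequence $\sigma \in \Sigma^*$ is a trace of an accepting Petri net $APN$ if there exists a firing sequence $m_0 \xrightarrow{\gamma} m_f$ such that $m_f \in MF$, $\gamma \in T^*$ and $\ell(\gamma) = \sigma$.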
In the Petri nets that are shown in this paper, places that belong to the initial marking contain a token and places belonging to a final marking contain a bottom right label with a final marking identifier or are simply marked as in case of a single final marking.
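The language $\mathfrak{L}(APN)$ is the set of all its traces, i.e., $\mathfrak{L}(APN) = \{\ell(\gamma) \mid \gamma \in T^* \wedge \exists_{m_f \in MF}\ m_0 \xrightarrow{\gamma} m_f\}$, which can be of infinite size when $APN$ contains loops. While we define the language for accepting Petri nets, in theory, $\mathfrak{L}(M)$ can be defined for any process model $M$ with formal semantics. We denote the universe of process models as $\mathcal{M}$. For each $M \in \mathcal{M}$, $\mathfrak{L}(M) \subseteq \Sigma^+$ is defined.
A process discovery method is a function $PD : \mathcal{B}(\Sigma^+) \to \mathcal{M}$ that provides a process model for a given event log. The goal is to discover a process model that is a good description of the process from which the event log was obtained, i.e., it should allow for all the behavior that was observed in the event log (called fitness) while it should not allow for too much behavior that was not seen in the event log (called precision). For an event log $L$, $\tilde{L} = \{\sigma \in \Sigma^+ \mid L(\sigma) > 0\}$ is the trace set of $L$. For example, for log $L = [\langle a,b,c \rangle^2, \langle b,a,c \rangle^3]$, $\tilde{L} = \{\langle a,b,c \rangle, \langle b,a,c \rangle\}$. For an event log $L$ and a process model $M$, we say that $L$ is fitting on $M$ if $\tilde{L} \subseteq \mathfrak{L}(M)$. Precision is related to the behavior that is allowed by a model $M$ that was not observed in the event log $L$, i.e., $\mathfrak{L}(M) \setminus \tilde{L}$.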
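3 Information-Theoretic Approaches to Activity Filtering
We consider a chaotic activity to be an activity that can occur at any point in the process and that thereby complicates the discovery of the rest of the process by obfuscating the directly-follows relations of the event log. In this section, we propose a technique to detect chaotic activities in event logs and to filter them out from those event logs.
We extend the function $\#$ to the function $\# : \Sigma^* \times \mathcal{B}(\Sigma^+) \to \mathbb{N}$ to count the number of occurrences of a sequence $\sigma'$ in $L$:
$\#(\sigma', L) = \sum_{\sigma \in L} \left| \{\, 0 \le i \le |\sigma| - |\sigma'| \;\mid\; \forall_{1 \le j \le |\sigma'|}\ \sigma(i+j) = \sigma'(j) \,\} \right|$, where the sum ranges over the traces of $L$ with their multiplicity.
The directly-follows ratio, denoted $\mathit{dfr}(a, b, L)$, represents the ratio of the events of activity $a$ that are directly followed by an event of activity $b$ in event log $L$, i.e., $\mathit{dfr}(a, b, L) = \frac{\#(\langle a, b \rangle, L)}{\#(a, L)}$.
Likewise, the directly-precedes ratio, denoted $\mathit{dpr}(a, b, L)$, represents the ratio of the events of activity $a$ that are directly preceded by an event of activity $b$ in event log $L$, i.e., $\mathit{dpr}(a, b, L) = \frac{\#(\langle b, a \rangle, L)}{\#(a, L)}$.
$L^{\blacksquare}$ contains the traces of event log $L$ appended with an artificial end event that we represent with $\blacksquare$. For each $\sigma$ in log $L$, log $L^{\blacksquare}$ contains a trace $\sigma \cdot \langle \blacksquare \rangle$. Likewise, $L^{\blacktriangleright}$ contains the traces of event log $L$ prepended with an artificial start event $\blacktriangleright$, i.e., for each $\sigma$ in log $L$, log $L^{\blacktriangleright}$ contains a trace $\langle \blacktriangleright \rangle \cdot \sigma$. The artificial start and end events allow us to define the ratio of start and end events of an activity, e.g., $\mathit{dfr}(a, \blacksquare, L^{\blacksquare})$ and $\mathit{dpr}(a, \blacktriangleright, L^{\blacktriangleright})$ represent the ratio of events of activity $a$ that respectively occur at the end of a trace and at the beginning of a trace.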
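The two ratios, including the artificial start and end events, can be computed as in the following Python sketch. It assumes a log stored as a `Counter` of trace tuples; the marker strings `START` and `END`, the helper `neighbour_ratios`, and the example log are assumptions made for illustration. Activities that never follow (or precede) the given activity are simply omitted from the returned dictionary, which corresponds to a ratio of 0.

```python
from collections import Counter

START, END = "<start>", "<end>"  # hypothetical names for the artificial events


def neighbour_ratios(activity, log, offset):
    """Ratios of the events directly after (offset=+1) or before (offset=-1)."""
    counts = Counter()
    for trace, mult in log.items():
        padded = (START,) + trace + (END,)
        for i in range(1, len(padded) - 1):
            if padded[i] == activity:
                counts[padded[i + offset]] += mult
    total = sum(counts.values())  # equals the number of events of `activity`
    return {b: n / total for b, n in counts.items()}


def dfr(activity, log):
    """Directly-follows ratios of `activity`, including the artificial end."""
    return neighbour_ratios(activity, log, +1)


def dpr(activity, log):
    """Directly-precedes ratios of `activity`, including the artificial start."""
    return neighbour_ratios(activity, log, -1)


log = Counter({("a", "b", "c"): 2, ("b", "a", "c"): 3})
print(dfr("a", log))
print(dpr("a", log))
```

In this hypothetical log, two of the five events of `a` are followed by `b` and three by `c`, while two of them start a trace and three are preceded by `b`.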
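Assuming an arbitrary but consistent order over the set of process activities $\mathit{Activities}(L)$, $\mathit{dfr}(a, L)$ represents the vector of values $\mathit{dfr}(a, b, L^{\blacksquare})$ for all $b \in \mathit{Activities}(L) \cup \{\blacksquare\}$ and $\mathit{dpr}(a, L)$ represents the vector of values $\mathit{dpr}(a, b, L^{\blacktriangleright})$ for all $b \in \mathit{Activities}(L) \cup \{\blacktriangleright\}$. From a probabilistic point of view, we can regard $\mathit{dfr}(a, L)$ and $\mathit{dpr}(a, L)$ as the empirical estimates of the categorical distributions over, respectively, the activities directly after $a$ and directly prior to $a$, where the empirical estimates are based on $\#(a, L)$ trials.
3.1 Direct Entropy-based Activity Filtering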
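We define the entropy of an activity $a \in \mathit{Activities}(L)$ in an event log $L$ based on its directly-follows ratio vector and its directly-precedes ratio vector by using the usual definition of the entropy function $H$ for a categorical probability distribution: $H(X) = -\sum_{x \in X} x \log_2(x)$. We define the entropy of activity $a$ in log $L$ as: $H(a, L) = H(\mathit{dfr}(a, L)) + H(\mathit{dpr}(a, L))$. In case there are zero probability values in the directly-follows or directly-precedes vectors, i.e., $0 \in \mathit{dfr}(a, L) \cup \mathit{dpr}(a, L)$, the value of the corresponding summand $0 \log_2(0)$ is taken as $0$, which is consistent with the limit $\lim_{x \to 0^+} x \log_2(x) = 0$.

For example, let event log $L = [\langle a,b,c,x \rangle^{10}, \langle a,b,x,c \rangle^{10}, \langle a,x,b,c \rangle^{10}]$, then $\mathit{dfr}(a, L) = \langle 0, \frac{2}{3}, 0, \frac{1}{3}, 0 \rangle$, using the arbitrary but consistent ordering $\langle a, b, c, x, \blacksquare \rangle$, indicating that 20 out of 30 events of activity $a$ are followed by $b$ and 10 out of 30 by $x$. Likewise $\mathit{dpr}(a, L) = \langle 0, 0, 0, 0, 1 \rangle$, using the arbitrary but consistent ordering $\langle a, b, c, x, \blacktriangleright \rangle$, indicating that all events of activity $a$ are preceded by $\blacktriangleright$. This leads to $H(\mathit{dfr}(a, L)) = 0.918$, $H(\mathit{dpr}(a, L)) = 0$, and $H(a, L) = 0.918$. Furthermore, $H(b, L) = 1.837$, $H(c, L) = 1.837$, and $H(x, L) = 3.170$, showing that activity $x$ has the highest entropy of the probability distributions for preceding and succeeding activities. We conjecture that activities that are chaotic and behave randomly to a high degree have high values of $H(a, L)$.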
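The activity entropy can be made concrete with a short Python sketch. It assumes a log stored as a `Counter` of trace tuples and uses a hypothetical 30-trace log in which activity `x` floats between three otherwise fixed activities; the helper names and marker strings are illustrative assumptions.

```python
from collections import Counter
from math import log2

START, END = "<start>", "<end>"  # hypothetical artificial start and end events


def neighbour_distribution(activity, log, offset):
    """Empirical distribution over the events directly after (+1) or before (-1)."""
    counts = Counter()
    for trace, mult in log.items():
        padded = (START,) + trace + (END,)
        for i in range(1, len(padded) - 1):
            if padded[i] == activity:
                counts[padded[i + offset]] += mult
    total = sum(counts.values())
    return [n / total for n in counts.values()]


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def activity_entropy(activity, log):
    """Sum of the entropies of the follows and precedes distributions."""
    return (entropy(neighbour_distribution(activity, log, +1))
            + entropy(neighbour_distribution(activity, log, -1)))


log = Counter({
    ("a", "b", "c", "x"): 10,
    ("a", "b", "x", "c"): 10,
    ("a", "x", "b", "c"): 10,
})
for act in "abcx":
    print(act, round(activity_entropy(act, log), 3))
```

For this log the sketch yields roughly 0.918 for `a`, 1.837 for `b` and `c`, and 3.170 for `x`, so the floating activity stands out clearly.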
Algorithm 1 describes a greedy approach to iteratively filter the most randomly behaving (chaotic) activity from the event log. The algorithm takes an event log as input and produces a list of event logs, such that the first element of the list contains a version of with one activity filtered out, and each following element of the list has one additional activity filtered out compared to the previous element.
In the example event log $L$, Algorithm 1 starts by filtering out activity $x$, followed by activity $b$ or $c$. The algorithm stops when there are two activities left in the event log. The reason not to filter any more activities past this point is closely related to the aim of process discovery: uncovering relations between activities. From an event log with fewer than two activities no relations between activities can be discovered.
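The greedy procedure described above can be sketched as follows. This is not the authors' Algorithm 1 verbatim: the sketch assumes that entropies are recomputed on the filtered log after every removal, that ties are broken alphabetically, and that traces that become empty after projection are dropped. The log, the helper names, and the marker strings are hypothetical.

```python
from collections import Counter
from math import log2

START, END = "<start>", "<end>"  # hypothetical artificial start and end events


def activity_entropy(activity, log):
    """Entropy of the follows distribution plus that of the precedes distribution."""
    total_entropy = 0.0
    for offset in (+1, -1):
        counts = Counter()
        for trace, mult in log.items():
            padded = (START,) + trace + (END,)
            for i in range(1, len(padded) - 1):
                if padded[i] == activity:
                    counts[padded[i + offset]] += mult
        n = sum(counts.values())
        total_entropy -= sum(c / n * log2(c / n) for c in counts.values())
    return total_entropy


def project(log, keep):
    """Project every trace on the activities in `keep`, dropping empty traces."""
    projected = Counter()
    for trace, mult in log.items():
        reduced = tuple(e for e in trace if e in keep)
        if reduced:
            projected[reduced] += mult
    return projected


def direct_entropy_filter(log):
    """Repeatedly remove the highest-entropy activity until two remain."""
    results = []
    acts = {e for trace in log for e in trace}
    while len(acts) > 2:
        # Assumption: recompute entropies on the current (filtered) log.
        worst = max(sorted(acts), key=lambda a: activity_entropy(a, log))
        acts.remove(worst)
        log = project(log, acts)
        results.append(log)
    return results


log = Counter({
    ("a", "b", "c", "x"): 10,
    ("a", "b", "x", "c"): 10,
    ("a", "x", "b", "c"): 10,
})
filtered_logs = direct_entropy_filter(log)
print(filtered_logs[0])
```

In this hypothetical log the first filtering step removes `x`, after which all 30 traces collapse into the single variant `("a", "b", "c")`.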
3.2 The Entropy of Infrequent Activities and Laplace Smoothing
The entropy of the activities in an event log $L$, as defined above, is based on the directly-follows ratios $\mathit{dfr}$ and the directly-precedes ratios $\mathit{dpr}$ of the activities in $L$. The empirical estimates of the categorical distributions $\mathit{dfr}(a, L)$ and $\mathit{dpr}(a, L)$ become unreliable for small values of $\#(a, L)$. In the extreme case, when $\#(a, L) = 1$, $\mathit{dpr}(a, L)$ assigns an estimate of $1$ to the activity that the single event of $a$ in $L$ happens to be preceded by and contains a probability of $0$ for the other activities. Likewise, when $\#(a, L) = 1$, $\mathit{dfr}(a, L)$ assigns value $1$ to one activity and value $0$ to all others. Therefore, $\#(a, L) = 1$ leads to $H(\mathit{dfr}(a, L)) = 0$ and $H(\mathit{dpr}(a, L)) = 0$. This shows an undesirable consequence of Algorithm 1: infrequent activities are unlikely to be filtered out. In the extreme case, the activities that occur only once are the last in line to be filtered out. This effect is undesired, as very infrequent activities should not be the primary focus of the process model discovered from an event log.
We aim to mitigate this effect by applying Laplace smoothing Zhai and Lafferty (2004) to the empirical estimate of the categorical distributions over the preceding and succeeding activities. Therefore, we define a smoothed version of the directly-follows and directly-precedes ratios, $\mathit{dfr}^s(a, b, L) = \frac{\alpha + \#(\langle a, b \rangle, L)}{\alpha (|\mathit{Activities}(L)| + 1) + \#(a, L)}$ and, analogously, $\mathit{dpr}^s(a, b, L) = \frac{\alpha + \#(\langle b, a \rangle, L)}{\alpha (|\mathit{Activities}(L)| + 1) + \#(a, L)}$, with smoothing parameter $\alpha \in \mathbb{R}_{\ge 0}$. The value of $\mathit{dfr}^s(a, b, L)$ will always be between the empirical estimate $\mathit{dfr}(a, b, L)$ and the uniform probability $\frac{1}{|\mathit{Activities}(L)| + 1}$, depending on the value of $\alpha$. Similar to $\mathit{dfr}(a, L)$ and $\mathit{dpr}(a, L)$, $\mathit{dfr}^s(a, L)$ represents the vector of values $\mathit{dfr}^s(a, b, L^{\blacksquare})$ for all $b \in \mathit{Activities}(L) \cup \{\blacksquare\}$ and $\mathit{dpr}^s(a, L)$ represents the vector of values $\mathit{dpr}^s(a, b, L^{\blacktriangleright})$ for all $b \in \mathit{Activities}(L) \cup \{\blacktriangleright\}$. From a Bayesian point of view, Laplace smoothing corresponds to the expected value of the posterior distribution that consists of the categorical distribution given by $\mathit{dfr}(a, L)$ and a Dirichlet distributed prior that assigns equal probability to each of the possible next activities (including $\blacksquare$). Parameter $\alpha$ indicates the weight that is assigned to the prior belief w.r.t. the evidence that is found in the data. An alternative definition of the entropy of activity $a$ in log $L$, based on the smoothed distributions over the preceding and succeeding activities, is as follows: $H^s(a, L) = H(\mathit{dfr}^s(a, L)) + H(\mathit{dpr}^s(a, L))$. The smoothed direct entropy-based activity filter is similar to Algorithm 1, where function $H$ in line 5 of the algorithm is replaced by $H^s$. Function $H$ starts from the assumption that an activity is non-chaotic unless we see sufficient evidence in the data for its chaoticness; function $H^s$ in contrast starts from the assumption that it is chaotic, unless we see sufficient evidence in the data for its non-chaoticness.
Categorical distribution $\mathit{dfr}^s(a, L)$ consists of $|\mathit{Activities}(L)| + 1$ probability values; therefore, the maximum entropy of an activity decreases as more activities get filtered out of the event log. To keep the values of $H^s(a, L)$ comparable between iterations of the filtering algorithm, we propose to gradually increase the weight of the prior by setting weight parameter $\alpha$ to $\frac{1}{|\mathit{Activities}(L)|}$.
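The effect of smoothing on infrequent activities can be illustrated with a small Python sketch. It assumes that the smoothed ratio takes the usual additive form (alpha + count) / (alpha * K + n), where K is the number of possible neighbours (all activities plus one artificial event) and n is the number of events of the activity; the function names and the values of `alpha`, `n_outcomes`, and the counts are hypothetical.

```python
from math import log2


def smoothed_ratios(counts, n_outcomes, alpha):
    """Additively smoothed distribution over `n_outcomes` possible neighbours.

    `counts` lists the observed neighbour counts; neighbours that were never
    observed receive only the prior mass alpha.
    """
    total = sum(counts)
    denominator = alpha * n_outcomes + total
    unseen = n_outcomes - len(counts)
    observed = [(alpha + c) / denominator for c in counts]
    return observed + [alpha / denominator] * unseen


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


n_outcomes = 5  # e.g. four activities plus the artificial end event
alpha = 1 / 4   # hypothetical prior weight

rare = smoothed_ratios([1], n_outcomes, alpha)         # activity seen once
frequent = smoothed_ratios([1000], n_outcomes, alpha)  # same neighbour 1000 times

print(round(entropy(rare), 3), round(entropy(frequent), 3))
```

Without smoothing both distributions have entropy 0. With smoothing, the activity observed once keeps a high entropy because the prior dominates, while the frequent activity stays close to 0 because the evidence dominates.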
3.3 Indirect Entropy-based Activity Filtering
An alternative approach to the method proposed in Algorithm 1 is to filter out activities such that the other activities in the log become less chaotic. We define the total entropy of an event log as the sum of the entropies of the activities in the log, i.e., $H(L) = \sum_{a \in \mathit{Activities}(L)} H(a, L)$.
Algorithm 2 describes a greedy approach that iteratively filters out the activity whose removal results in the lowest total log entropy. We call this approach the indirect entropy-based activity filter, as opposed to the direct entropy-based activity filter (Algorithm 1), which selects the to-be-filtered activity directly based on the activity entropy instead of on the total log entropy after removal.
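The indirect variant can be sketched in the same style. This is a sketch under stated assumptions rather than the authors' Algorithm 2: ties are broken alphabetically, traces that become empty after projection are dropped, and the log, helper names, and marker strings are hypothetical.

```python
from collections import Counter
from math import log2

START, END = "<start>", "<end>"  # hypothetical artificial start and end events


def activity_entropy(activity, log):
    """Entropy of the follows distribution plus that of the precedes distribution."""
    total_entropy = 0.0
    for offset in (+1, -1):
        counts = Counter()
        for trace, mult in log.items():
            padded = (START,) + trace + (END,)
            for i in range(1, len(padded) - 1):
                if padded[i] == activity:
                    counts[padded[i + offset]] += mult
        n = sum(counts.values())
        total_entropy -= sum(c / n * log2(c / n) for c in counts.values())
    return total_entropy


def log_entropy(log):
    """Total log entropy: the sum of the entropies of all activities in the log."""
    acts = {e for trace in log for e in trace}
    return sum(activity_entropy(a, log) for a in acts)


def project(log, keep):
    """Project every trace on the activities in `keep`, dropping empty traces."""
    projected = Counter()
    for trace, mult in log.items():
        reduced = tuple(e for e in trace if e in keep)
        if reduced:
            projected[reduced] += mult
    return projected


def indirect_entropy_filter(log):
    """Repeatedly remove the activity whose removal minimises total log entropy."""
    results = []
    acts = {e for trace in log for e in trace}
    while len(acts) > 2:
        candidates = {a: project(log, acts - {a}) for a in sorted(acts)}
        best = min(candidates, key=lambda a: log_entropy(candidates[a]))
        acts.remove(best)
        log = candidates[best]
        results.append(log)
    return results


log = Counter({
    ("a", "b", "c", "x"): 10,
    ("a", "b", "x", "c"): 10,
    ("a", "x", "b", "c"): 10,
})
print(indirect_entropy_filter(log)[0])
```

Each iteration evaluates one candidate log per remaining activity, so the indirect filter is more expensive per step than the direct filter, which only ranks activities on the current log.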
3.4 An Indirect Entropy-based Activity Filter with Laplace Smoothing
Just like the direct entropy-based activity filter, the indirect entropy-based activity filter is sensitive to infrequent activities. To deal with this problem, the ideas of the indirect entropy-based activity filtering method and Laplace smoothing can be combined, using the following definition for smoothed log entropy:
$H^s(L) = \sum_{a \in \mathit{Activities}(L)} H^s(a, L)$.
The algorithm for indirect entropy-based activity filtering with Laplace smoothing is identical to Algorithm 2, in which function $H$ in line 5 is replaced by function $H^s$.
4 Evaluation using Synthetic Data
In this section we evaluate the activity filtering techniques using synthetic data. Figure 3 gives an overview of the evaluation methodology. First, as step (1), we generate a synthetic event log from a process model such that we know that all activities of this model are non-chaotic. We take well-known process models introduced by Maruster et al (2006), which respectively consist of 12 and 22 activities and are commonly referred to as the Maruster A12 and A22 models. The Maruster A12 and A22 models are shown respectively in Figures 4a and 5a. We generated 25 traces by simulation from Maruster A12 to form log $L_{A12}$ and generated 400 traces from Maruster A22 to form log $L_{A22}$. Then, in step (2), we artificially insert activities that we place at random positions in the log. Since we chose the positions of those activities in the log randomly, we assume those activities to be chaotic. We vary the number ($k$) of randomly-positioned activities that we insert, to assess how well the chaotic activity filtering techniques are able to deal with different numbers of randomly-positioned activities in the event log. Furthermore, we vary the frequency of the randomly-positioned activities that we insert, where we distinguish between three types of randomly-positioned activities:

Frequent randomly-positioned activities: the number of events inserted for each randomly-positioned activity is $\max_{a \in \mathit{Activities}(L)} \#(a, L)$.

Infrequent randomly-positioned activities: the number of events inserted for each randomly-positioned activity is $\min_{a \in \mathit{Activities}(L)} \#(a, L)$.

Uniform randomly-positioned activities: for each of the inserted randomly-positioned activities the frequency is chosen at random from a uniform probability distribution with minimum value $\min_{a \in \mathit{Activities}(L)} \#(a, L)$ and maximum value $\max_{a \in \mathit{Activities}(L)} \#(a, L)$.
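Step (2) can be sketched as follows. The text does not specify how random positions are drawn, so this sketch assumes that each inserted event picks a trace uniformly at random and then a position uniformly within that trace; the function name, the activity name `X`, the seed, and the tiny example log are all hypothetical.

```python
import random


def insert_random_activity(traces, name, n_events, rng):
    """Insert `n_events` events of activity `name` at random positions.

    Assumption: each event picks a trace uniformly at random and then a
    position uniformly within that trace (including its start and end).
    """
    noisy = [list(trace) for trace in traces]
    for _ in range(n_events):
        trace = rng.choice(noisy)
        trace.insert(rng.randint(0, len(trace)), name)
    return [tuple(trace) for trace in noisy]


rng = random.Random(0)  # fixed seed so that the sketch is repeatable
clean = [("a", "b", "c"), ("a", "c", "b"), ("a", "b", "c")]
noisy = insert_random_activity(clean, "X", 7, rng)
print(noisy)
```

Removing all events of `X` from the noisy log recovers the clean log, which is what provides the ground truth used in steps (3) and (4).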
In step (3) we filter out all the inserted randomly-positioned activities from the event log, by removing activities one by one using the activity filtering approaches, until all artificially inserted activities have been removed again. We then count how many of the activities that were originally in the process model were also removed during this procedure (step (4)). Using this approach, we compare the direct entropy-based activity filtering approach (with and without Laplace smoothing) with the indirect entropy-based activity filtering approach (with and without Laplace smoothing). Furthermore, we compare those activity filtering techniques with activity filtering techniques that are based on the frequency of activities, such as filtering out the activities starting from the least frequent activity (least-frequent-first), or starting from the most frequent activity (most-frequent-first). Frequency-based activity filtering techniques are the current default approach for filtering activities from event logs.
The original process models A12 and A22 can be rediscovered from the generated event logs $L_{A12}$ and $L_{A22}$ with the Inductive Miner Leemans et al (2013a) when there are no added randomly-positioned activities. Figure 4b shows the process model discovered by the Inductive Miner Leemans et al (2013a) after inserting one uniform randomly-positioned activity into $L_{A12}$. The insertion of this activity causes the Inductive Miner to create a model that overgeneralizes the behavior of the event log, as indicated by many silent transitions in the process model that allow activities to be skipped. Adding a second uniform randomly-positioned activity to $L_{A12}$ results in the Inductive Miner discovering a process model (shown in Figure 4c) that overgeneralizes even further, allowing for almost all sequences over the set of activities. Figure 5b shows the process model discovered by the Inductive Miner after inserting two uniform randomly-positioned activities into $L_{A22}$. Their addition has the effect that one of the original activities is no longer positioned at the correct place in the process model, but is instead put in parallel to the whole process, making the process model overly general, as it wrongly allows this activity to occur before activities that should precede it or after activities that should follow it. Figures 4b, 4c and 5b further motivate the need for filtering out chaotic activities.
Frequent randomly-positioned activities impact the quality of process models discovered with process discovery to a higher degree than infrequent randomly-positioned activities. Each event of a randomly-positioned activity that is inserted at a random position in the event log is placed in between two existing events in that log (or at the start or end of a trace). By inserting an event of randomly-positioned activity X in between two events of activities A and C respectively, the directly-follows relation between activities A and C gets weakened. Therefore, the impact of randomly-positioned activity X is proportional to its frequency $\#(X, L)$.
4.1 Results
Columns: U = uniform, F = frequent, I = infrequent randomly-positioned activities.

Maruster A12 (number of inserted randomly-positioned activities $k$)
Approach                    1          2          4          8         16         32         64         128
                          U  F  I    U  F  I    U  F  I    U  F  I    U  F  I    U  F  I    U  F  I    U  F  I
Direct                    0  0  0    0  0  0    0  0  0    0  0  0    0  0  0    0  0 12    4  0 12   10  1 12
Direct (α = 1/|A|)        0  0  0    0  0  0    0  0  0    0  0  0    0  0  0    0  0  0    4  0  6    6  2 12
Indirect                  0  0  0    0  0  0    0  0  0    0  0  1    1  0  1    1  0  1    2  0  1    3  1  6
Indirect (α = 1/|A|)      0  0  0    0  0  0    0  0  0    0  0  1    1  0  1    1  0  1    2  0  1    2  1 10
Least-frequent-first      9 12  0   11 12  0    6 12  0   11 12  0   11 12  0   12 12  0   12 12  0   12 12  0
Most-frequent-first      11  0 12    3  0 12    7  0 12   10  0 12   12  0 12   12  0 12   12  0 12   12  0 12

Maruster A22 (number of inserted randomly-positioned activities $k$)
Approach                    1          2          4          8         16         32         64         128
                          U  F  I    U  F  I    U  F  I    U  F  I    U  F  I    U  F  I    U  F  I    U  F  I
Direct                    0  0  0    0  0  0    0  0  0    0  0  1    0  0  0    0  0  0    0  0  0    0  0  5
Direct (α = 1/|A|)        0  0  0    0  0  0    0  0  0    0  0  1    0  0  0    0  0  0    0  0  0    0  0  5
Indirect                  0  0  0    0  0  0    0  0  0    0  0  1    0  0  1    1  0  1    1  0  1    1  0  1
Indirect (α = 1/|A|)      0  0  0    0  0  0    0  0  0    0  0  1    0  0  1    1  0  1    0  0  1    1  0  1
Least-frequent-first     16 22  0   17 22  0    6 22  0   21 22  0   19 22  0   22 22  0   22 22  0   22 22  0
Most-frequent-first       7  0 22    8  0 22   19  0 22   17  0 22   19  0 22   22  0 22   22  0 22   22  0 22
Table 1 reports the number of activities that were originally part of the synthetic process models A12 and A22 that were wrongly filtered out from $L_{A12}$ and $L_{A22}$ as an effect of removing all inserted randomly-positioned activities from these logs. If this number is 12 for Maruster A12 or 22 for Maruster A22, this indicates that all activities of the original process model needed to be filtered out before the activity filtering technique was able to remove all inserted chaotic activities. The results show that the direct filtering approach can perfectly distinguish between actual activities from the process and artificial chaotic activities for up to 32 uniform randomly-positioned activities inserted into $L_{A12}$, up to 64 frequent randomly-positioned activities, and up to 16 infrequent randomly-positioned activities. Infrequent randomly-positioned activities are the hardest type of randomly-positioned activities to filter out correctly, as their infrequency can have the effect that the probability distributions over their surrounding activities have low entropy by chance. Using Laplace smoothing with $\alpha = \frac{1}{|\mathit{Activities}(L)|}$ mitigates this effect, but does not completely solve it: for infrequent randomly-positioned activities, the number of incorrectly removed activities drops from 12 to 0 as an effect of Laplace smoothing for 32 added activities, and from 12 to 6 for 64 added activities. The indirect activity filter starts making errors at lower numbers of added randomly-positioned activities than the direct activity filter; however, it is more robust for higher numbers of added randomly-positioned activities, i.e., fewer activities get incorrectly removed for 64 and 128 added randomly-positioned activities. In contrast to direct activity filtering, Laplace smoothing does not seem to reduce the number of wrongly removed activities for indirect activity filtering. In fact, surprisingly, the number of incorrectly removed activities even increased from 6 to 10 as an effect of using Laplace smoothing for 128 infrequent randomly-positioned activities added to $L_{A12}$. The direct and indirect filtering approaches, both with and without Laplace smoothing, outperform the currently widely used approach of filtering out infrequent activities from the event log (least-frequent-first filtering). Furthermore, a second frequency-based activity filtering technique is included in the evaluation, in which the most frequent activities are removed from the event log (most-frequent-first filtering). Neither frequency-based filtering approach is able to filter out the randomly-positioned activities inserted into $L_{A12}$ and $L_{A22}$, even for small numbers of added randomly-positioned activities.
4.2 An Evaluation Methodology for Event Data without Ground Truth Information
In the real-life data evaluation that we perform in the following section, there is no ground truth knowledge on which activities of the process are chaotic. This motivates a more indirect evaluation in which we evaluate the quality of the process model discovered from the event log after filtering out activities with the proposed activity filtering techniques. In this section we propose a methodology for the evaluation of activity filtering techniques by assessing the quality of discovered process models, we apply this evaluation methodology to the Maruster A12 and Maruster A22 event logs, and we discuss the agreement between the findings of Table 1 and the quality of the discovered process models.
There are several ways to quantify the quality of a process model $M$ for an event log $L$. Ideally, a process model should allow for all behavior that was observed in the event log $L$, i.e., $\tilde{L} \setminus \mathfrak{L}(M)$ should be as small as possible, preferably empty. The fitness quality dimension covers this. Furthermore, model $M$ should not allow for too much additional behavior that was not seen in the event log, i.e., $\mathfrak{L}(M) \setminus \tilde{L}$ should be as small as possible. This aspect is called precision. For each process model that we discovered, we measure fitness and precision with respect to the filtered log. Fitness is measured using the alignment-based fitness measure Adriansyah et al (2011) and we measure precision using negative event precision Vanden Broucke et al (2013). Based on the fitness and precision results we also calculate the F-score De Weerdt et al (2011), i.e., the harmonic mean of fitness and precision.
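As a small illustration, the harmonic mean of fitness and precision can be computed as below. The function name `f_score` and the input values are hypothetical; the fitness and precision values themselves would come from the alignment-based and negative-event measures cited above, which are not reimplemented here.

```python
def f_score(fitness, precision):
    """Harmonic mean of fitness and precision (0 if both are 0)."""
    if fitness + precision == 0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)


# Hypothetical values: a perfectly fitting but imprecise model.
print(round(f_score(1.0, 0.5), 3))
```

The harmonic mean is dominated by the lower of the two values, so an overgeneralizing model with perfect fitness still receives a low F-score.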
Precision is likely to increase by filtering out one or more activities from an event log independently of which activities are removed from the log, as a result of two factors. First, precision measures are expressed in terms of the number of activities that are enabled at certain points in the process relative to the number of activities that were actually observed at these points in the process. With the log and model containing fewer activities after filtering, the number of enabled activities is likely to decrease as well. Second, activity filtering leads to a log $L'$ that contains less behavior than the original log $L$ (i.e., $\tilde{L'}$ is smaller than $\tilde{L}$), which makes it easier for process discovery methods to discover a process model with less behavior. These two factors make precision values incomparable between event logs with different numbers of activities filtered out. The degree to which the behavior of filtered log $L'$ decreases w.r.t. an unfiltered log $L$ depends on the activities that are filtered out: when very chaotic activities are filtered from $L$, the behavior decreases much more than when very structured activities are filtered from $L$. One effect of this is that too much behavior in a process model affects the precision of that model more for the log from which the non-chaotic activities are filtered out than for the log from which the chaotic activities are filtered out.
One way to measure the behavior allowed by the process model independently of which activities are filtered from the event log is to determine the average number of enabled activities when replaying the traces of the log on the model. To deal with traces of the event log that do not fit the behavior of the process model, we calculate alignments Adriansyah et al (2011) between log and model. An alignment maps each trace $\sigma$ from the event log to a sequence of markings $\langle m_0, m_1, \dots, m_n \rangle$ that are reached when replaying that trace on the model, with $m_0$ the initial marking and $m_n \in MF$, such that for each two consecutive markings $m_i, m_{i+1}$ there exists a transition $t \in T$ such that $m_i \xrightarrow{t} m_{i+1}$. Furthermore, alignments also provide the sequence of transitions $\gamma = \langle t_1, \dots, t_n \rangle$ that matches the changes in the sequence of markings, i.e., $m_0 \xrightarrow{t_1} m_1$, $m_1 \xrightarrow{t_2} m_2$, etc. For each trace $\sigma$ that fits the process model, the alignment satisfies $\ell(\gamma) = \sigma$. For unfitting traces $\sigma$, the alignment is such that $\ell(\gamma)$ is as close as possible to $\sigma$ according to some cost function. We refer to Adriansyah et al (2011) for a more exhaustive introduction of alignments. Let $\gamma'$ denote the sequence consisting of only the visible transitions in $\gamma$, and let $\langle m'_1, \dots, m'_{|\gamma'|} \rangle$ correspondingly denote the sequence of markings prior to each firing of a visible transition. Given a marking $m$ we define the nondeterminism of that marking to be the number of visible transitions that can be fired as the first next visible transition from $m$, i.e., $\mathit{nondeterminism}(m) = |\{ t \in dom(\ell) \mid \exists_{\gamma_\tau \in (T \setminus dom(\ell))^*}\ \exists_{m', m''}\ m \xrightarrow{\gamma_\tau} m' \xrightarrow{t} m'' \}|$. We define the nondeterminism of a model given a trace as the average nondeterminism of the markings $m'_1, \dots, m'_{|\gamma'|}$ and define the nondeterminism for a model and a log as the average nondeterminism over the traces of the log.
Figure 6 shows the Fscores measured for different percentages of activities filtered out from the Maruster log with different numbers of uniform chaotic activities added. Note that the line stops when further removal of activities does not lead to further improvement in Fscore. Note that on the original event log with 0 chaotic activities added the Fscore on the original log is already 1.0, resulting in no lines being drawn. With one chaotic activity added, the leastfrequentfirst filter needs to remove 75% of the activities before it ends up with Fscore 1, which can be explained by the fact that 9 out of 12 nonchaotic needed to be removed in order with the leastfrequentfirst filter to remove all uniform chaotic activities, as shown in Table 1. All entropybased activity filtering techniques remove the chaotic activity in the first filtering step, immediately leading to an Fscore of 1.0. Up until 8 added chaotic activities there is no difference between the entropybased activity filtering techniques in terms of Fscore of the resulting process models, which is consistent with the fact that all these filtering techniques were found to filter without errors for these number of inserted chaotic activities in Table 1. For 16 and 32 activities, the direct filtering methods outperform the indirect filtering methods, consistent with the fact that the indirect approach made one filtering error according to the ground truth for these numbers of added chaotic activities. Note that the leastfrequentfirst filter is outperformed by the entropybased filtering methods in terms of Fscore of the discovered models, as would be expected given the filtering results according to the ground truth.
Figure 7 shows the results in terms of nondeterminism measured for different percentages of activities filtered out from the Maruster log with various numbers of uniform chaotic activities added. The results show very clearly that when filtering out a number of activities that is identical to the number of added chaotic activities (this corresponds to 92% of the activities remaining for one added activity, 86% for two added activities, 75% for 4 added activities, 60% for 8 added activities, 43% for 16 added activities, and 27% for 32 added activities), the nondeterminism reaches a value of 1.5, which is the nondeterminism value of the model discovered from the original log without added chaotic activities. The least-frequent-first filter, however, leads to process models where many activities are enabled on average, therefore overgeneralizing the process behavior, as an effect of filtering out non-chaotic activities instead of the added chaotic activities.
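The percentages listed above follow from simple arithmetic, assuming the Maruster log has 12 original activities (the figure used in the discussion of Figure 6). The sketch below recomputes them; the function name `remaining_percentage` is chosen here for illustration only.

```python
def remaining_percentage(original, added):
    """Percentage of activities left once exactly the added activities are removed."""
    return round(100 * original / (original + added))

# 12 original activities, with 1 to 32 chaotic activities added.
percentages = {k: remaining_percentage(12, k) for k in (1, 2, 4, 8, 16, 32)}
print(percentages)
```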
5 Evaluation using Real Life Data
Name  Category  # traces  # events  # activities 
BPI’12 Van Dongen (2012)  Business  13087  164506  23 
BPI’12 resource 10939 Tax et al (2016b)  Business  49  1682  14 
Environmental permit Buijs (2014)  Business  1434  8577  27 
SEPSIS Mannhardt (2016)  Business  1050  15214  16 
Traffic Fine De Leoni and Mannhardt (2015)  Business  150370  561470  11 
Bruno Bruno et al (2013)  Human behavior  57  553  14 
CHAD 1600010 McCurdy et al (2000)  Human behavior  26  238  10 
MIT A Tapia et al (2004)  Human behavior  16  2772  27 
MIT B Tapia et al (2004)  Human behavior  17  1962  20 
Ordonez A Ordónez et al (2013)  Human behavior  15  409  12 
van Kasteren van Kasteren et al (2008)  Human behavior  23  220  7 
Cook hh102 labour Cook et al (2013)  Human behavior  18  576  18 
Cook hh102 weekend Cook et al (2013)  Human behavior  18  210  18 
Cook hh104 labour Cook et al (2013)  Human behavior  43  2100  19 
Cook hh104 weekend Cook et al (2013)  Human behavior  18  864  19 
Cook hh110 labour Cook et al (2013)  Human behavior  21  695  17 
Cook hh110 weekend Cook et al (2013)  Human behavior  6  184  14 
For the experiments on real-life event logs we do not artificially insert chaotic activities into the event logs, but instead filter directly on the activities that are present in these logs. Whether these logs contain chaotic activities that impact process discovery results is not known upfront. Therefore, we apply different activity filtering techniques to these logs and use them to filter out a varying number of activities, after which we assess the quality of the process model that is discovered from these filtered logs. Table 2 gives an overview of the real-life event logs that we use in the experiment. In total, we include five event logs from the business domain. Furthermore, we include twelve event logs that contain events of human behavior, recorded in smart home environments or through wearable devices. Mining process model descriptions of daily life is a novel application of process mining that has recently gained popularity Dimaggio et al (2016); Leotta et al (2015); Sztyler et al (2015); Tax et al (2017, 2016a). Human behavior event data are often challenging for process discovery because of the presence of highly chaotic activities, like going to the toilet. We perform the experiments with activity filtering techniques on real-life data with RapidProM van der Aalst et al (2017), which is an extension that adds process mining capabilities to the RapidMiner platform for repeatable scientific workflows.
For each event log, we apply seven different activity filtering techniques for comparison: 1) direct entropy filter without Laplace smoothing, 2) direct entropy filter with Laplace smoothing (), 3) indirect entropy filter without Laplace smoothing, 4) indirect entropy filter with Laplace smoothing (), 5) least-frequent-first filtering, 6) most-frequent-first filtering, 7) filtering the activities from the log in a random order. Recall that the activity filtering procedure stops at the point where all but two activities are filtered from the event log, because process models that contain just one activity do not communicate any information regarding the relations between activities. For each event log and for each activity filtering approach we discover a process model after each filtering step (i.e., after each removal of an activity). The process discovery step is performed with two process discovery approaches: the Inductive Miner Leemans et al (2013a), and the Inductive Miner infrequent (20%) Leemans et al (2013b).
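The stepwise procedure can be sketched as follows. This is a minimal illustration under the assumption that a log is a list of traces and a trace is a list of activity labels. The function `filtering_steps` and the toy log are hypothetical; the discovery call that would be applied to each filtered log is left out.

```python
def filtering_steps(log, removal_order, min_activities=2):
    """Yield (removed_activity, filtered_log) after each removal step.

    Stops as soon as only `min_activities` activities remain in the log.
    """
    remaining = {a for trace in log for a in trace}
    current = [list(trace) for trace in log]
    for activity in removal_order:
        if len(remaining) <= min_activities:
            break
        if activity not in remaining:
            continue
        remaining.discard(activity)
        current = [[a for a in trace if a != activity] for trace in current]
        yield activity, current

# Hypothetical toy log and removal order (as produced by any of the filters).
toy_log = [["a", "x", "b", "c"], ["a", "b", "x", "c"]]
steps = list(filtering_steps(toy_log, ["x", "c", "b", "a"]))
for removed, filtered in steps:
    # A process model would be discovered from `filtered` at this point.
    print(removed, filtered)
```

Only the removal order differs between the seven techniques, so the same loop serves all of them.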
5.0.1 Results on Business Process Event Logs
Figure 8 shows the F-score of the process models discovered with the Inductive Miner Leemans et al (2013a) and the Inductive Miner with infrequent behavior filtering Leemans et al (2013b) (20% filtering) on the five business event logs for different percentages of activities filtered out and different activity filtering techniques. The figure shows an increasing trend in F-score for all event logs when more activities are filtered from the event log. Furthermore, the line for the least-frequent-first filtering approach is below the lines of the entropy-based filtering techniques for most of the percentages of activities removed on most event logs, which shows that entropy-based filtering enables the discovery of models with higher F-score compared to simply filtering out infrequent activities. There are a few exceptions where filtering out infrequent activities outperforms the entropy-based techniques, e.g., the Inductive Miner on the BPI ’12 resource 10939 event log (around 40% of activities explained) and the traffic fines event log (around 55% of activities explained). Which of the entropy-based techniques performs best differs between event logs: for the environmental permit log the indirect filter without Laplace smoothing almost dominates the other techniques, while for the SEPSIS log the direct filter without Laplace smoothing outperforms the other techniques. Generally, it seems that the use of Laplace smoothing harms F-score, as most parts of the lines of indirect filtering with Laplace smoothing are below the lines of the indirect approach without Laplace smoothing, and similarly for the direct approach with and without Laplace smoothing. However, the detrimental effect of Laplace smoothing does not seem to be large, and in some cases, the use of Laplace smoothing in filtering increases the F-score of the discovered models.
Figure 9 shows the nondeterminism of the process models as a function of the minimum percentage of activities. The green dashed line indicates the nondeterminism of the flower model, i.e., the process model that allows for all behavior over the activities. The lines stop when further removal of activities does not lead to further improvement of nondeterminism. It is clear that the filtering mechanism of the Inductive Miner helps to discover process models that are more behaviorally constrained, as the nondeterminism values are lower for the Inductive Miner infrequent 20% compared to the Inductive Miner without filtering. However, the results show that even when already using the 20% frequency filter of the Inductive Miner infrequent, the chaotic activity filter can lead to an additional reduction of nondeterminism. Furthermore, the results on the environmental permit log and the SEPSIS log show that filtering several chaotic activities from the event log also enables the discovery of a model with low nondeterminism using the Inductive Miner without filtering. Which of the activity filtering approaches works best seems to depend on the event log: the indirect entropy-based filter leads to the models with the lowest nondeterminism on the traffic fine event log and the environmental permit event log, while the direct entropy-based filter works better for some percentages of remaining activities for the SEPSIS log and the BPI ’12 resource 10939 log.
Figures 10 and 11 show the fitness and precision values for the business process event logs at the filtering step that leads to the highest F-score while describing at least 75% of the activities of the original log. In addition to the filtering techniques shown in Figure 8, they also show the frequency-based activity filter where the most frequent activities are filtered out first, and a random baseline which iteratively picks a random activity from the event log to filter out. The error bar for the random activity filter indicates one standard error of the mean (SEM) based on eight repetitions of applying the filter. The black dotted horizontal lines indicate the fitness and precision values of the process models discovered from the original event log without filtering any activities. Note that the fitness values are only shown for the Inductive Miner infrequent 20% Leemans et al (2013b) because the Inductive Miner without infrequent behavior filter Leemans et al (2013a) provides the formal guarantee that the fitness of the discovered model is 1. Figure 10 shows that generally, the differences in fitness between the models discovered from the filtered logs are very minor, and very close to the fitness of the model discovered from the unfiltered log (i.e., the dotted line). Figure 11, however, shows that the entropy-based filtering approaches outperform filtering out activities based on frequency and filtering out random activities from the event log. The F-scores of the discovered process models are determined mostly by the precision of the models because the activity filtering impacts precision more than it impacts fitness. One exception is the BPI’12 resource 10939 log Tax et al (2016b), where the fitness decreases to below 0.75 as a result of applying one of the two frequency-based filters, while the precision increase as an effect of applying the filter is only minor.
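To make the interplay between the reported quantities concrete, the sketch below assumes that the F-score is the harmonic mean of fitness and precision, as is customary, and that the SEM is the sample standard deviation divided by the square root of the number of repetitions. The function names and the input numbers are hypothetical illustrations, not measured results.

```python
from math import sqrt
from statistics import stdev

def f_score(fitness, precision):
    """Harmonic mean of fitness and precision (assumed definition of the F-score)."""
    if fitness + precision == 0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)

def standard_error(values):
    """Standard error of the mean over repeated runs (sample standard deviation)."""
    return stdev(values) / sqrt(len(values))

# With fitness fixed at 1, the F-score moves only with precision.
print(f_score(1.0, 0.5), f_score(1.0, 0.8))

# Hypothetical precision values from eight repetitions of the random filter.
random_runs = [0.4, 0.6, 0.4, 0.6, 0.4, 0.6, 0.4, 0.6]
print(standard_error(random_runs))
```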
5.0.2 Results on Human Behavior Event Logs
Figure 12 shows the maximum F-score for different human behavior event logs as a function of the minimum percentage of activities that are remaining in the log. Again, the general pattern is that the F-score of the discovered process model decreases when the minimum percentage of events explained increases, as the process discovery task gets easier for smaller numbers of activities. The figure shows that filtering infrequent activities from the event log is dominated in terms of F-score by the entropy-based filtering techniques. As on the business process event logs, there are mixed results on which of the four configurations of the entropy-based filtering technique leads to the highest F-score: on the CHAD event log the indirect activity filter outperforms the direct activity filter when using the Inductive Miner infrequent 20%; however, the direct activity filter leads to higher F-score for the Inductive Miner when filtering more than 50% of the activities.
Figure 13 shows the nondeterminism results for the human behavior event logs. It is noticeable that the nondeterminism values of the process models that are discovered when filtering very few activities are much closer to the flower model compared to what we have seen before for the business process event logs. This is caused by human behavior event logs having much more variability in behavior compared to execution data from business processes, resulting in a much harder process discovery task. After filtering several chaotic activities, the nondeterminism drops significantly to ranges comparable to nondeterminism values seen for logs from the business process domain. This shows that the problem of chaotic activities is much more prominent in human behavior event logs than in business process event logs. The entropy-based activity filtering approaches lead to more deterministic process models compared to filtering out infrequent activities. Two clear examples of this are the MIT B log and the Ordonez A log, on which filtering out infrequent activities after several filtering steps results in a flower model (i.e., nondeterminism is identical to that of the flower model), while entropy-based activity filters enable the discovery of a model with nondeterminism close to one (i.e., very close to a sequential model) while at the same time keeping 75% of the activities in the event log.
Figure 14 shows the precision values for the human behavior logs for the filtering step that leads to the highest F-score while describing at least 50% of the activities of the original log. Similarly to what we have seen in the nondeterminism graph, removing random activities from the log and removing infrequent activities from the log result in smaller precision increases compared to the entropy-based activity filters. Furthermore, it is noticeable that removing frequent activities from the log works quite well to improve the precision of models discovered from the human behavior application domain. The reason for this is that some of the chaotic activities that are present in many of those event logs, including going to the toilet and getting a drink, also happen to be frequent. On the van Kasteren event log the indirect activity filter with Laplace smoothing leads to the largest increase in precision when mining a model with at least 50% of the activities (from to with the Inductive Miner infrequent 20%).
Order  Filtered activity (indirect entropy-based filter with Laplace smoothing)  Filtered activity (least-frequent-first filter) 
1  Use toilet  Prepare dinner 
2  Get drink  Get drink 
3  Leave house  Prepare breakfast 
4  Take shower  Take shower 
5  Go to bed  Go to bed 
6  Prepare breakfast  Leave house 
7  Prepare dinner  Use toilet 
Table 3 shows in which order activities are filtered from the van Kasteren event log by 1) the indirect entropy-based activity filter with Laplace smoothing and 2) the least-frequent-first filter. It shows that the entropy-based filter removes use toilet as the first activity, which from domain knowledge we know to be a chaotic activity, as people generally just go to the toilet whenever they need to, regardless of which other activities they have just performed. For the infrequent activity filter, use toilet would be the last choice of the activities to filter out, because it is the most frequent activity in the van Kasteren event log.
Figures (a)a and (b)b show the corresponding process models discovered with the Inductive Miner infrequent 20% from the logs filtered with the indirect activity filter with Laplace smoothing and the infrequent activity filter, respectively. The process model discovered after filtering three activities with the indirect entropy-based activity filter with Laplace smoothing is very specific about the behavior that it describes: after going to bed, either the logging ends, or prepare breakfast occurs next, followed by taking a shower. After taking a shower, there is a possibility to either go to bed again or to prepare dinner before going to bed. The process model discovered after filtering three activities with the infrequent activity filter allows for many more traces: it starts with go to bed followed by use toilet, after which any of the activities go to bed, take shower, and leave house can occur as next event or the logging can end. Furthermore, the activities leave house and take shower can occur in any order, and take shower can also be skipped.
Figure 16 shows the results on F-score for the human behavior event logs by Cook et al. Cook et al (2013). The results on the Cook event logs are in line with the results on the other human behavior event logs; however, on these event logs it is even more clear that filtering out infrequent activities leads to suboptimal process models in terms of F-score. Which of the filtering approaches results in the optimal process model in terms of F-score depends strongly on the event log and the minimum number of activities that must remain after filtering: each of the four configurations of the entropy-based filtering approach is optimal for at least one combination of log and minimum percentage of activities explained.
Figure 17 shows the results in terms of nondeterminism for the same event logs. At high percentages of activities explained, filtering infrequent activities yields much lower nondeterminism than the flower model, while further left on the graph, after filtering out more activities, the nondeterminism obtained by filtering out infrequent activities gets closer to that of the flower model. This shows that filtering out infrequent activities can even be harmful to the quality of the obtained process discovery result. The nondeterminism values obtained with the four configurations of the entropy-based filtering approach are generally close to each other, where the optimal configuration depends on the log and the number of filtered activities.
5.0.3 Aggregated Analysis Over All Event Logs
  Direct  Direct ()  Indirect  Indirect ()  Least-frequent-first 
Direct  1.0  0.2956  0.0829  0.1408  0.0504 
Direct ()  0.2956  1.0  0.0698  0.0536  0.1454 
Indirect  0.0829  0.0698  1.0  0.6852  0.0275 
Indirect ()  0.1408  0.0536  0.6852  1.0  0.0392 
Least-frequent-first  0.0504  0.1454  0.0275  0.0392  1.0 
Table 4: Kendall's tau rank correlation between five activity filtering methods, mean and standard deviation over the 17 event logs.
We have observed in Figures 9, 13, and 17 that the entropy-based activity filtering techniques perform differently on different datasets and for different numbers of activities filtered. To evaluate the overall performance of an activity filtering technique, we use the number of other filtering techniques that it can beat over all the seventeen event logs of Table 2. This metric, known as winning number, is commonly used for evaluation in the Information Retrieval (IR) field Qin et al (2010); Tax et al (2015). Formally, winning number is defined as
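$$W_i^p = \sum_{j=1}^{17} \sum_{k=1}^{7} \mathbb{1}\{N_i^p(j) < N_k^p(j)\},$$
where $j$ is the index of an event log, $i$ and $k$ are indices of activity filtering techniques, $N_i^p(j)$ is the performance of the $i$th algorithm on the $j$th event log in terms of nondeterminism where at least $p$% of activities are explained, and $\mathbb{1}\{\cdot\}$ is the indicator function
$$\mathbb{1}\{c\} = \begin{cases} 1 & \text{if } c \text{ holds}, \\ 0 & \text{otherwise}. \end{cases}$$
We define $\overline{W}_i^p = W_i^p / 17$ as the average number of other activity filtering techniques that are outperformed by filtering technique $i$ at the point where at least $p$% of activities are explained.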
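The winning number can be computed with a few lines of code. The sketch below assumes that a technique beats another on a log when it yields strictly lower nondeterminism, so ties do not count, and that the scores are taken at one fixed percentage of activities explained. The function names and all scores are hypothetical.

```python
def winning_numbers(nondeterminism):
    """Count, per technique, the (other technique, log) pairs it beats.

    `nondeterminism[technique][log]` holds one score; lower is better.
    """
    techniques = list(nondeterminism)
    return {
        i: sum(
            1
            for k in techniques
            if k != i
            for log, value in nondeterminism[i].items()
            if value < nondeterminism[k][log]
        )
        for i in techniques
    }

def average_winning_numbers(nondeterminism):
    """Winning number divided by the number of event logs."""
    n_logs = len(next(iter(nondeterminism.values())))
    return {t: w / n_logs for t, w in winning_numbers(nondeterminism).items()}

# Hypothetical nondeterminism scores at one fixed percentage of activities explained.
scores = {
    "indirect": {"log1": 1.2, "log2": 2.0},
    "direct": {"log1": 1.5, "log2": 1.8},
    "least_frequent_first": {"log1": 3.0, "log2": 2.5},
}
print(average_winning_numbers(scores))
```

With seven techniques, the average winning number of a technique lies between 0 and 6.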
  Direct  Direct ()  Indirect  Indirect ()  Least-frequent-first 
Direct  17  5  1  2  0 
Direct ()  5  17  1  1  3 
Indirect  1  1  17  17  3 
Indirect ()  2  1  17  17  3 
Least-frequent-first  0  3  3  3  17 
Table 5: Number of event logs for which we can reject the null hypothesis that the orderings of activities returned by activity filters are uncorrelated, according to the Kendall tau test.
Figure 18 shows the average winning number for different minimum percentages of activities explained and for the seven different activity filtering techniques. We observe that for higher ratios of activities explained the differences between filtering techniques are smaller than for lower ratios of activities explained. Intuitively this can be explained by the fact that for lower ratios of activities explained more activities have been filtered out from the log, so the effect of the filtering techniques is more clearly visible. The figure shows that, up until approximately 74% of activities explained, the indirect entropy-based activity filtering technique leads to the most deterministic process models averaged over all event logs included in the experiment, where it outperforms between 4 and 4.5 other filtering techniques. Between approximately 75% and 87.5% the indirect entropy-based activity filtering technique with Laplace smoothing results in the highest average winning number, although the difference with the indirect entropy-based filtering technique without Laplace smoothing seems negligible. Filtering out random activities from the event log outperforms none of the six other activity filtering techniques for most of the graph, indicating that even frequency-based filtering clearly outperforms filtering random activities.
To investigate to what degree the order in which activities are removed from the logs differs between the activity filtering techniques, we calculate Kendall’s tau ($\tau$) rank correlation for each log between the activity filtering techniques in a pairwise way. Table 4 shows the rank correlation values found between the activity filters, averaged over the 17 event logs. The indirect activity filter with Laplace smoothing and the indirect activity filter without Laplace smoothing generate orderings over the activities of a log that are strongly correlated. Between the direct activity filter with Laplace smoothing and the direct activity filter without Laplace smoothing there is only a weak correlation. All the other pairs of activity filtering techniques are uncorrelated or very weakly correlated. Using the Kendall statistic, we apply a tau test for each pair of activity filtering techniques on each event log to test the null hypothesis that the two orderings in which activities are filtered by the two activity filtering techniques are uncorrelated, using a significance level .
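As an illustration of the rank correlation, the sketch below computes Kendall's tau for the two removal orders reported for the van Kasteren log in Table 3. It assumes orderings without ties, in which case tau is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. The function `kendall_tau` is written here for illustration and does not perform the significance test.

```python
from itertools import combinations

def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same items (no ties assumed)."""
    rank_b = {item: rank for rank, item in enumerate(order_b)}
    ranks = [rank_b[item] for item in order_a]
    concordant = discordant = 0
    # combinations() preserves positional order, so x precedes y in order_a.
    for x, y in combinations(ranks, 2):
        if x < y:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(ranks) * (len(ranks) - 1) // 2
    return (concordant - discordant) / n_pairs

# Removal orders taken from Table 3 (van Kasteren log).
indirect_laplace = ["Use toilet", "Get drink", "Leave house", "Take shower",
                    "Go to bed", "Prepare breakfast", "Prepare dinner"]
least_frequent_first = ["Prepare dinner", "Get drink", "Prepare breakfast", "Take shower",
                        "Go to bed", "Leave house", "Use toilet"]

tau = kendall_tau(indirect_laplace, least_frequent_first)
print(round(tau, 4))
```

A value of 1 means identical orderings and a value of -1 means exactly reversed orderings.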
For each pair of activity filtering techniques, Table 5 shows the number of event logs for which the null hypothesis was rejected, i.e., the number of event logs for which the orders in which activities are filtered are statistically correlated. The indirect activity filters with and without Laplace smoothing create correlated orderings of activities for all seventeen event logs. For all other pairs of activity filtering techniques, the orderings in which activities are filtered are correlated for only a small number of event logs.
6 Entropy-based Toggles for Process Discovery
In the previous section we have shown that all four configurations of the entropy-based activity filtering technique lead to more deterministic process models compared to simply filtering out infrequent activities. However, the differences in determinism of the process models that are discovered after applying any of the four configurations are small and dependent on the event log to which they are applied. Furthermore, all four configurations of the activity filtering technique simply impose an ordering over the activities, but do not specify at which step the filtering should be stopped. Additionally, the proposed filtering technique ignores the semantics of activities: activities that are chaotic may still be relevant for the process, and leaving them out harms the usefulness of the discovered process model.
To address these three issues we propose to use the filtering technique as a sorting technique over the activities in combination with toggles that interactively allow the process analyst to “disable” (filter out) or “enable” activities, and then rediscover and visualize the process model according to the new settings. This approach is similar to the Inductive Visual Miner Leemans et al (2014), an interactive implementation of the Inductive Miner Leemans et al (2013b) algorithm which allows the process analyst to filter the event log interactively using a slider-based approach. The Inductive Visual Miner contains two sliders; with one of them, activities can be filtered using the least-frequent-first filter, and the user controls how many activities are filtered out by moving the slider up and down. We propose to replace this slider with a sorted list of activities and toggles, as this allows the process analyst to override the ordering of the activities that is determined by the activity filtering technique with domain knowledge. Figure 19 shows a mockup of the proposed way to use the activity filter. Activities are by default sorted using the chaotic activity filter, showing the entropy to indicate the assessed degree of chaoticness of each activity. Based on this information, the process analyst can choose to rely on the filtering technique and filter out the top of the list or to override this list with domain knowledge. Furthermore, other activity filtering techniques, such as the least-frequent-first filter, can be included as an additional column on which the activities of the process can be sorted. This allows the process analyst to control how many activities, and which activities, are filtered out of the process model, and thereby also empowers the user to prevent the removal of semantically important activities.
Furthermore, this approach allows process analysts to explore for themselves which of the filtering techniques leads to the most useful process model for the event log that they are analyzing.
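The toggle idea can be sketched as a small data structure. The sketch assumes that an entropy score per activity is already available from one of the filters. The class `ActivityToggle`, the helper functions, and the scores are hypothetical and only illustrate the interaction, not an existing tool.

```python
from dataclasses import dataclass

@dataclass
class ActivityToggle:
    activity: str
    entropy: float        # assessed degree of chaoticness (assumed given)
    enabled: bool = True  # the analyst can switch this off and on again

def build_toggles(entropy_by_activity):
    """Sort activities from most to least chaotic; all are enabled by default."""
    ranked = sorted(entropy_by_activity.items(), key=lambda kv: kv[1], reverse=True)
    return [ActivityToggle(activity, entropy) for activity, entropy in ranked]

def apply_toggles(log, toggles):
    """Keep only the events whose activity is currently enabled."""
    enabled = {t.activity for t in toggles if t.enabled}
    return [[a for a in trace if a in enabled] for trace in log]

# Hypothetical entropy scores; the analyst disables the top of the list.
toggles = build_toggles({"go to bed": 0.7, "use toilet": 2.1, "get drink": 1.9})
toggles[0].enabled = False
filtered = apply_toggles([["go to bed", "use toilet", "get drink"]], toggles)
print([t.activity for t in toggles], filtered)
```

Rediscovery of the process model would then be triggered on the filtered log each time a toggle changes.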
7 Related Work
Real-life event logs often contain all sorts of data quality issues Suriadi et al (2017), including incorrectly logged events, events that are logged in the wrong order, and events that took place without being logged. Instances of such data quality issues are often referred to as noise. Many event log filtering techniques have been proposed to address the problem of noise. Existing filtering techniques in the process mining field can be classified into four categories: 1) event filtering techniques, 2) process discovery techniques that have an integrated filtering mechanism built in, 3) trace filtering techniques, and 4) activity filtering techniques. We use these categories to discuss and structure related work.
7.1 Event filtering
Conforti et al. Conforti et al (2017) recently proposed a technique to filter out outlier events from an event log. The technique starts by building a prefix automaton of the event log, which is minimal in terms of the number of arcs in the automaton, using an Integer Linear Programming (ILP) solver. Infrequent arcs are removed from the minimal prefix automaton, and finally, the events belonging to removed arcs are filtered out from the event log.
Lu et al. Lu et al (2015) advocate the use of event mappings Lu et al (2014) to distinguish between events that are part of the mainstream behavior of a process and outlier events. Event mappings compute similar behavior and dissimilar behavior between each two executions of the process as a mapping: the similar behavior is formed by all pairs of events that are mapped to each other, whereas events that are not mapped are dissimilar behavior.
Fani Sani et al. Fani Sani et al (2017) propose the use of sequential pattern mining techniques to distinguish between events that are part of the mainstream behavior and outlier events.
All three of the event filtering techniques listed above aim to filter out outlier events from the event log, while keeping the mainstream behavior. Event filtering techniques model the frequently occurring contexts of activities and filter out the contexts of activities that occur infrequently in the log. For example, consider an activity such that 98% of its occurrences are in context , while the remaining 2% of the events of activity are in context ; then the events that occur between and will be filtered out by event filtering techniques. Note that our filtering technique is orthogonal to event filtering: it would consider activity to be non-chaotic and would not filter out anything. However, when a log contains a chaotic activity , then event filtering techniques are not able to remove all events of this chaotic activity. One of the contexts of will by chance be more frequent than other contexts, i.e., for some activity , it will hold that , even though might only be slightly more frequent. This will result in events after a being removed, while the events after an remain in the log. Applying a process discovery technique to this filtered log will then result in a process model where activity is misleadingly positioned after activity , while in fact can happen anywhere in the process. The activity filtering technique presented in this paper will instead detect that activity is chaotic, and completely remove it from the event log, preventing the misleading effect of event filtering.
7.2 Process Discovery Techniques with Integrated Filtering
Several process discovery algorithms offer integrated filtering mechanisms as part of the approach. The Inductive Miner (IM) Leemans et al (2013a) is a process discovery algorithm which first discovers a directly-follows graph from the event log, in which activities that directly follow each other in the log are connected, and from which a process model is discovered in a second step. The directly-follows relations are affected by the presence of a chaotic activity : sequence leads to false directly-follows relations between and and between and , while the directly-follows relation between and is obfuscated by . The Inductive Miner infrequent (IMf) Leemans et al (2013b) is an extension of the IM where infrequent directly-follows relations are filtered out from the set of directly-follows relations that are used to generate the process model. The filtering mechanism of IMf can help to filter out the directly-follows relations between and and between and , but it does not help to recover the obfuscated directly-follows relation between and . Instead, the activity filtering technique presented in this paper filters out the chaotic activity , leading to sequence being transformed into , thereby recovering the directly-follows relation between and .
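The effect on the directly-follows relations can be reproduced on a toy log. In the sketch below the activity names `a`, `b`, and the chaotic activity `x` are hypothetical, as are the helper functions; a log is assumed to be a list of traces and a trace a list of activity labels.

```python
from collections import Counter

def directly_follows(log):
    """Count how often each activity is directly followed by another one."""
    counts = Counter()
    for trace in log:
        counts.update(zip(trace, trace[1:]))
    return counts

def remove_activity(log, activity):
    """Activity filtering: drop every event of the given activity."""
    return [[a for a in trace if a != activity] for trace in log]

# Hypothetical log in which the chaotic activity x sometimes separates a and b.
log = [["a", "x", "b"], ["a", "b"], ["x", "a", "b"]]

before = directly_follows(log)
after = directly_follows(remove_activity(log, "x"))
print(before)  # contains the false relations (a, x) and (x, b)
print(after)   # only (a, b) remains, now with its full count
```

Filtering the infrequent pairs from `before` would remove the false relations but would leave the count of `(a, b)` at its lowered value.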
The Heuristics Miner Weijters and Ribeiro (2011) and the Fodina algorithm vanden Broucke and De Weerdt (2017), in addition to the directly-follows relation, define an eventually-follows relation between activities and allow the process analyst to filter out infrequent directly-follows and eventually-follows relations. Two activities and are in an eventually-follows relation when is eventually followed by , before the next appearance of or . The eventually-follows relation, unlike the directly-follows relation, is not impacted by the presence of chaotic activities. The Heuristics Miner Weijters and Ribeiro (2011) and Fodina vanden Broucke and De Weerdt (2017) both include filtering methods for the directly-follows and eventually-follows relations that are similar in nature to the filtering mechanism that is used in the Inductive Miner infrequent Leemans et al (2013b). However, the use of sequential orderings and parallel constructs in the mining approaches of the Heuristics Miner De Weerdt et al (2011) and Fodina vanden Broucke and De Weerdt (2017) is based on the directly-follows relations only, with the eventually-follows relations being used for the mining of long-term dependencies. Furthermore, in contrast to the Inductive Miner, the process models discovered with the Heuristics Miner Weijters and Ribeiro (2011) or Fodina vanden Broucke and De Weerdt (2017) can be unsound, i.e., they can contain deadlocks.
The ILP-miner van der Werf et al (2009) is a process discovery algorithm where a set of behavioral constraints over activities is discovered for each prefix (called the prefix-closure) of the event log, based on which a process model is discovered that satisfies these constraints using Integer Linear Programming (ILP). Van Zelst et al. van Zelst et al (2015) proposed a filtering technique for the ILP-miner where the prefix-closure of the event log is filtered prior to solving the ILP problem by removing infrequently observed prefixes. It is easy to see that a chaotic activity affects the prefix-closure that is discovered from the event log: given log consisting of two traces and , activity causes the prefix-closures of the two traces to have no overlap in states, while without activity the two traces are identical. This makes the filtering method of the prefix-closure proposed by Van Zelst et al. van Zelst et al (2015) less effective, as frequent prefixes randomly get distributed over several infrequent prefixes when chaotic activities are present. Instead, the chaotic activity filtering technique presented in this paper would remove chaotic activity , leading to traces and becoming identical after filtering, therefore leading to a simpler process model while still describing the behavior of the event log accurately.
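The loss of overlap in the prefix-closure can be shown with two hypothetical traces that differ only in where a chaotic activity `x` occurs. The traces and the helper `prefix_closure` below are illustrative assumptions, and only non-empty prefixes are considered.

```python
def prefix_closure(log):
    """Set of all non-empty prefixes of the traces in the log."""
    return {tuple(trace[:i]) for trace in log for i in range(1, len(trace) + 1)}

# Two hypothetical traces that are identical apart from the position of x.
trace_1 = ["x", "a", "b", "c"]
trace_2 = ["a", "b", "c", "x"]

shared_with_x = prefix_closure([trace_1]) & prefix_closure([trace_2])

filtered_1 = [a for a in trace_1 if a != "x"]
filtered_2 = [a for a in trace_2 if a != "x"]
shared_without_x = prefix_closure([filtered_1]) & prefix_closure([filtered_2])

print(len(shared_with_x), len(shared_without_x))
```

With `x` present each prefix is observed only once, so a frequency threshold on prefixes has nothing frequent to keep; after removing `x` every prefix is observed in both traces.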
The Fuzzy Miner Günther and van der Aalst (2007) is a process discovery algorithm that aims at mining models from flexible processes, and it discovers a process model without formal semantics. The Fuzzy Miner discovers this model by extracting the eventually-follows relation from the event log, which is not affected by chaotic activities. Similar to the Heuristics Miner Weijters and Ribeiro (2011) and Fodina vanden Broucke and De Weerdt (2017), the Fuzzy Miner allows the analyst to filter out infrequent eventually-follows relations between activities. In practice, the lack of formal semantics of the Fuzzy Miner models hinders the usability of the models, as the models are not precise on what behavior is allowed in the process under analysis.
7.3 Trace filtering
Ghionna et al. Ghionna et al (2008) proposed a technique to identify outlier traces in the event log that consists of two steps: 1) mining frequent patterns from the event log, and 2) applying MCL clustering Van Dongen (2008) on the traces, where the similarity measure for traces is defined on the number of patterns that jointly characterize the execution of the traces. Traces that are not assigned to a cluster by the MCL clustering algorithm are considered to be outlier traces and are filtered from the event log. It is easy to see that trace filtering techniques address a fundamentally different problem than chaotic activity filtering: in the event log shown in Figure (b)b there are only two traces that do not contain an instance of the chaotic activity. Therefore, even if a trace filtering technique were able to perfectly filter out the traces that contain a chaotic event, the number of remaining traces would become too small to mine a fitting and precise process model when the chaotic activity is frequent.
7.4 Activity filtering
The modus operandi for filtering activities is to simply filter out infrequent activities from the event log. The plugin ‘Filter Log using Simple Heuristics’ in the ProM process mining toolkit Van Dongen et al (2005) offers tool support for this type of filtering. The Inductive Visual Miner Leemans et al (2014) is an interactive process discovery tool that implements the Inductive Miner Leemans et al (2013b) process discovery algorithm in an interactive way: the process analyst can filter the event log using sliders and is then shown the process model that is discovered from this filtered log. One of the available sliders in the Inductive Visual Miner offers the same frequency-based activity filtering functionality. The working assumption behind filtering out infrequent activities is that when there are just a few occurrences of an activity, there is probably not enough evidence to establish its relation to other activities and to model its behavior. However, as we have shown in this paper, frequent but chaotic activities, while frequent enough to establish their relation to other activities, complicate the process discovery task by lowering the directly-follows counts between other activities in the event log. The activity filtering technique presented in this paper is able to filter out chaotic activities, thereby reconstructing the directly-follows relations between the non-chaotic activities of the event log, at the expense of losing the chaotic activities.
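The contrast between the two kinds of filtering can be sketched in a few lines of Python. The log, the threshold, and the function names `directly_follows` and `filter_infrequent` below are hypothetical. Removing X by name merely stands in for the outcome of a chaotic activity filter; the filters proposed in this paper are not reproduced here.

```python
from collections import Counter


def directly_follows(log):
    """Count how often activity b occurs immediately after activity a."""
    return Counter((a, b) for trace in log for a, b in zip(trace, trace[1:]))


def filter_infrequent(log, min_count):
    """Frequency-based filter: keep only activities with at least min_count events."""
    freq = Counter(a for trace in log for a in trace)
    return [[a for a in trace if freq[a] >= min_count] for trace in log]


# Hypothetical log: the process is A, B, C and "X" occurs once per trace
# at an arbitrary position, so "X" is as frequent as every other activity.
log = [
    ["A", "X", "B", "C"],
    ["A", "B", "X", "C"],
    ["X", "A", "B", "C"],
    ["A", "B", "C", "X"],
]

# A frequency-based filter cannot single out "X": all activities occur 4 times.
frequency_filtered = filter_infrequent(log, min_count=4)

# Directly-follows counts with "X" present and after removing "X".
df_with_x = directly_follows(log)
df_without_x = directly_follows([[a for a in t if a != "X"] for t in log])

print(df_with_x[("A", "B")], df_without_x[("A", "B")])
print(df_with_x[("B", "C")], df_without_x[("B", "C")])
```

In this toy log the frequency-based filter leaves the log unchanged, while X lowers the directly-follows counts of both A to B and B to C from 4 to 3; removing X restores them.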
8 Conclusion & Future Work
In this paper, we have shown the possible detrimental effect of the presence of chaotic activities in event logs on the quality of process models produced by process discovery techniques. We have shown through synthetic experiments that frequency-based techniques for filtering activities from event logs, which are currently the modus operandi for activity filtering in the process mining field, do not necessarily handle chaotic activities well. As shown, chaotic activities can be frequent or infrequent. We have proposed four novel techniques for filtering chaotic activities from event logs, which find their roots in information theory and Bayesian statistics. Through experiments on seventeen real-life datasets, we have shown that all four proposed activity filtering techniques outperform frequency-based filtering on real data. The indirect entropy-based activity filter has been found to be the best performing activity filter overall, averaged over all datasets used in the experiments; however, the performance of the four proposed activity filtering techniques is highly dependent on the characteristics of the event log.
Because the performance of the filtering techniques was found to be log-dependent, we propose to use the activity filtering techniques in a slider-based approach where the user can filter activities interactively and directly see the process model discovered from the filtered event log. Ultimately, only the user can decide which activities to include. In future work, we aim to construct a hybrid activity filtering technique that combines the four techniques proposed in this paper by using supervised learning techniques from the data mining field to predict the effect of removing a particular activity.
References

 van der Aalst (2016) van der Aalst WMP (2016) Process mining: data science in action. Springer
 van der Aalst et al (2017) van der Aalst WMP, Bolt A, van Zelst SJ (2017) RapidProM: Mine your processes and not just your data. In: Hofmann M, Klinkenberg R (eds) RapidMiner: Data Mining Use Cases and Business Analytics Applications, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, p To Appear.
 Adriansyah et al (2011) Adriansyah A, van Dongen BF, van der Aalst WMP (2011) Conformance checking using cost-based fitness analysis. In: Proceedings of the 15th IEEE International Enterprise Distributed Object Computing Conference (EDOC), IEEE, pp 55–64
 vanden Broucke and De Weerdt (2017) vanden Broucke SKLM, De Weerdt J (2017) Fodina: a robust and flexible heuristic process discovery technique. Decision Support Systems
 Bruno et al (2013) Bruno B, Mastrogiovanni F, Sgorbissa A, Vernazza T, Zaccaria R (2013) Analysis of human behavior recognition algorithms based on acceleration data. In: Proceedings of the IEEE International Conference on Robotics and Automation, IEEE, pp 1602–1607
 Buijs (2014) Buijs JCAM (2014) Receipt phase of an environmental permit application process (WABO), CoSeLoG project. doi:10.4121/uuid:a07386a5-7be3-4367-9535-70bc9e77dbe6
 Buijs et al (2012) Buijs JCAM, van Dongen BF, van der Aalst WMP (2012) A genetic algorithm for discovering process trees. In: Proceedings of the 2012 IEEE Congress on Evolutionary Computation, IEEE, pp 1–8
 Conforti et al (2017) Conforti R, La Rosa M, ter Hofstede AHM (2017) Filtering out infrequent behavior from business process event logs. IEEE Transactions on Knowledge and Data Engineering 29(2):300–314
 Cook et al (2013) Cook DJ, Crandall AS, Thomas BL, Krishnan NC (2013) CASAS: A smart home in a box. Computer 46(7):62–69
 De Leoni and Mannhardt (2015) De Leoni M, Mannhardt F (2015) Road traffic fine management process. doi:10.4121/uuid:270fd440-1057-4fb9-89a9-b699b47990f5
 De Weerdt et al (2011) De Weerdt J, De Backer M, Vanthienen J, Baesens B (2011) A robust F-measure for evaluating discovered process models. In: Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, pp 148–155
 Dimaggio et al (2016) Dimaggio M, Leotta F, Mecella M, Sora D (2016) Process-based habit mining: Experiments and techniques. In: Proceedings of the International IEEE Conference on Ubiquitous Intelligence & Computing, IEEE, pp 145–152
 Fani Sani et al (2017) Fani Sani M, van Zelst SJ, van der Aalst WMP (2017) Improving process discovery results by filtering outliers using conditional behavioural probabilities. In: Proceedings of the International Workshop on Business Process Intelligence, Springer
 Ghionna et al (2008) Ghionna L, Greco G, Guzzo A, Pontieri L (2008) Outlier detection techniques for process mining applications. In: International Symposium on Methodologies for Intelligent Systems, Springer, pp 150–159
 Goedertier et al (2009) Goedertier S, Martens D, Vanthienen J, Baesens B (2009) Robust process discovery with artificial negative events. Journal of Machine Learning Research 10(Jun):1305–1340
 Günther and van der Aalst (2007) Günther CW, van der Aalst WMP (2007) Fuzzy mining–adaptive process simplification based on multi-perspective metrics. In: International Conference on Business Process Management, Springer, pp 328–343
 Herbst (2000) Herbst J (2000) A machine learning approach to workflow management. In: European Conference on Machine Learning, Springer, pp 183–194
 van Kasteren et al (2008) van Kasteren T, Noulas A, Englebienne G, Kröse B (2008) Accurate activity recognition in a home setting. In: Proceedings of the 10th International Conference on Ubiquitous Computing, ACM, pp 1–9
 Leemans et al (2013a) Leemans SJJ, Fahland D, van der Aalst WMP (2013a) Discovering block-structured process models from event logs - a constructive approach. In: International Conference on Applications and Theory of Petri Nets and Concurrency, Springer, pp 311–329
 Leemans et al (2013b) Leemans SJJ, Fahland D, van der Aalst WMP (2013b) Discovering block-structured process models from event logs containing infrequent behaviour. In: International Conference on Business Process Management, Springer, pp 66–78
 Leemans et al (2014) Leemans SJJ, Fahland D, van der Aalst WMP (2014) Process and deviation exploration with inductive visual miner. In: Proceedings of the BPM Demo Track, CEUR-WS.org, vol 1295, p 46
 Leotta et al (2015) Leotta F, Mecella M, Mendling J (2015) Applying process mining to smart spaces: Perspectives and research challenges. In: International Conference on Advanced Information Systems Engineering, Springer, pp 298–304
 Lohmann et al (2009) Lohmann N, Verbeek E, Dijkman R (2009) Petri net transformations for business processes–a survey. In: Transactions on petri nets and other models of concurrency II, Springer, pp 46–63
 Lu et al (2014) Lu X, Fahland D, van der Aalst WMP (2014) Conformance checking based on partially ordered event data. In: International Conference on Business Process Management, Springer, pp 75–88
 Lu et al (2015) Lu X, Fahland D, van den Biggelaar FJHM, van der Aalst WMP (2015) Detecting deviating behaviors without models. In: Proceedings of the International Workshop on Business Process Intelligence, Springer, pp 126–139
 Mannhardt (2016) Mannhardt F (2016) Sepsis cases - event log. doi:10.4121/uuid:915d2bfb-7e84-49ad-a286-dc35f063a460
 Maruster et al (2006) Maruster L, Weijters AJMM, Aalst WMPvd, Bosch Avd (2006) A rule-based approach for process discovery: Dealing with noise and imbalance in process logs. Data Mining & Knowledge Discovery 13(1):67–87
 McCurdy et al (2000) McCurdy T, Glen G, Smith L, Lakkadi Y (2000) The national exposure research laboratory’s consolidated human activity database. Journal of Exposure Analysis and Environmental Epidemiology 10(6):566–578
 Murata (1989) Murata T (1989) Petri nets: Properties, analysis and applications. Proceedings of the IEEE 77(4):541–580
 Object Management Group (2011) Object Management Group (2011) Notation (BPMN) version 2.0. OMG Specification
 Ordónez et al (2013) Ordónez FJ, de Toledo P, Sanchis A (2013) Activity recognition using hybrid generative/discriminative models on home environments using binary sensors. Sensors 13(5):5460–5477
 Qin et al (2010) Qin T, Liu TY, Xu J, Li H (2010) LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval 13(4):346–374
 Solé and Carmona (2013) Solé M, Carmona J (2013) Region-based foldings in process discovery. IEEE Transactions on Knowledge and Data Engineering 25(1):192–205
 Suriadi et al (2017) Suriadi S, Andrews R, ter Hofstede AHM, Wynn MT (2017) Event log imperfection patterns for process mining: Towards a systematic approach to cleaning event logs. Information Systems 64:132–150
 Sztyler et al (2015) Sztyler T, Völker J, Carmona Vargas J, Meier O, Stuckenschmidt H (2015) Discovery of personal processes from labeled sensor data: An application of process mining to personalized health care. In: Proceedings of the International Workshop on Algorithms & Theories for the Analysis of Event Data, CEUR-WS.org, pp 31–46
 Tapia et al (2004) Tapia EM, Intille SS, Larson K (2004) Activity recognition in the home using simple and ubiquitous sensors. In: International Conference on Pervasive Computing, Springer, pp 158–175
 Tax et al (2015) Tax N, Bockting S, Hiemstra D (2015) A cross-benchmark comparison of 87 learning to rank methods. Information Processing & Management 51(6):757–772
 Tax et al (2016a) Tax N, Sidorova N, Haakma R, van der Aalst WMP (2016a) Event abstraction for process mining using supervised learning techniques. In: Proceedings of the SAI Intelligent Systems Conference, Springer
 Tax et al (2016b) Tax N, Sidorova N, Haakma R, van der Aalst WMP (2016b) Mining local process models. Journal of Innovation in Digital Ecosystems 3(2):183–196
 Tax et al (2017) Tax N, Sidorova N, Haakma R, van der Aalst WMP (2017) Mining process model descriptions of daily life through event abstraction. In: Intelligent Systems and Applications, Springer, p To appear.
 Van Dongen (2012) Van Dongen B (2012) BPI challenge 2012. doi:10.4121/uuid:3926db30-f712-4394-aebc-75976070e91f
 Van Dongen et al (2005) Van Dongen BF, de Medeiros AKA, Verbeek HMW, Weijters AJMM, Van Der Aalst WMP (2005) The ProM framework: A new era in process mining tool support. In: International Conference on Application and Theory of Petri Nets, Springer, pp 444–454
 Van Dongen (2008) Van Dongen S (2008) Graph clustering via a discrete uncoupling process. SIAM Journal on Matrix Analysis and Applications 30(1):121–141
 Vanden Broucke et al (2013) Vanden Broucke SKLM, De Weerdt J, Vanthienen J, Baesens B (2013) Determining process model precision and generalization with weighted artificial negative events. IEEE Transactions on Knowledge and Data Engineering
 Weijters and Ribeiro (2011) Weijters AJMM, Ribeiro JTS (2011) Flexible heuristics miner (FHM). In: Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, pp 310–317
 van der Werf et al (2009) van der Werf JMEM, van Dongen BF, Hurkens CAJ, Serebrenik A (2009) Process discovery using integer linear programming. Fundamenta Informaticae 94(3):387–412
 van Zelst et al (2015) van Zelst SJ, van Dongen BF, van der Aalst WMP (2015) Avoiding over-fitting in ILP-based process discovery. In: International Conference on Business Process Management, Springer International Publishing, pp 163–171
 Zhai and Lafferty (2004) Zhai C, Lafferty J (2004) A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems 22(2):179–214