Conformance Checking Over Stochastically Known Logs

03/14/2022
by Eli Bogdanov, et al.

With the growing number of devices, sensors and digital systems, data logs may become uncertain due to, e.g., sensor reading inaccuracies or incorrect interpretation of readings by processing programs. At times, such uncertainties can be captured stochastically, especially when using probabilistic data classification models. In this work we focus on conformance checking, which compares a process model with an event log, when event logs are stochastically known. Building on existing alignment-based conformance checking fundamentals, we mathematically define a stochastic trace model, a stochastic synchronous product, and a cost function that reflects the uncertainty of events in a log. Then, we search over the reachability graph of the stochastic synchronous product for an optimal alignment between a model and a stochastic process observation. Via structured experiments with two well-known process mining benchmarks, we explore the behavior of the suggested stochastic conformance checking approach and compare it to a standard alignment-based approach as well as to an approach that creates a lower bound on performance. We envision the proposed stochastic conformance checking approach as a viable process mining component for future analysis of stochastic event logs.



1 Introduction

Process mining relies on data that are typically stored in the form of event logs: collections of traces, where each trace is a sequence of events or activities recorded during a process realization. Process mining tasks, such as conformance checking, use event logs to achieve their goal, e.g., assessing the degree to which a process model and an event log conform, with the ultimate aim of improving the process model that generated these logs.

The fourth industrial revolution [schwab2017fourth], which is bridging our digital and physical worlds, is producing an abundance of event data from multiple sources such as social media networks [sener2018unsupervised], sensors located within smart cities (e.g., the ‘Green Wall’ project in Tel Aviv and Nanjing), medical devices, and much more. Unlike data within traditional information systems, these data may involve uncertainty due to technical reasons such as sensor inaccuracy, the use of probabilistic data classification models, data quality reduction during processing, and low-quality data capturing devices. Human-generated data may be uncertain as well, due to fake news and mediator interventions.

In this work, we focus on process mining with stochastically known (SK) event data [cohen2021uncertain], where the probability distribution functions of the event data are known. (SK event data are also denoted as ‘weakly uncertain’ event data in the process mining literature; see [pegoraro2020conformance].)

By way of motivation, consider a use-case of food preparation processes, captured in video clips that are analyzed by a pre-trained CNN to predict activity classes and their sequence within an observed video. To extract the trace of the realized process, one can use the softmax layer of the CNN to yield a discrete probability distribution of the predicted activity classes in the observed video. This probabilistic knowledge, in turn, can serve as a basis for an SK log.
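To make the notion of an SK log concrete, the following minimal Python sketch (our own illustration, not taken from the cited works) turns per-event softmax outputs of a hypothetical activity classifier into an SK trace, i.e., a sequence of events where each event carries a probability distribution over candidate activities. The activity names, the top-k truncation, and the numbers are illustrative assumptions.

# Minimal sketch: per-event softmax outputs of a hypothetical classifier -> SK trace.
from typing import Dict, List

ACTIVITIES = ["wash", "chop", "fry", "plate"]  # hypothetical activity classes

def softmax_to_sk_trace(softmax_rows: List[List[float]],
                        top_k: int = 2) -> List[Dict[str, float]]:
    """Keep the top-k activity candidates per event and renormalize them."""
    sk_trace = []
    for row in softmax_rows:
        ranked = sorted(zip(ACTIVITIES, row), key=lambda x: x[1], reverse=True)[:top_k]
        total = sum(p for _, p in ranked)
        sk_trace.append({act: p / total for act, p in ranked})
    return sk_trace

# One softmax row per detected event in the video clip (illustrative numbers).
rows = [[0.85, 0.10, 0.03, 0.02],
        [0.20, 0.70, 0.05, 0.05]]
print(softmax_to_sk_trace(rows))
# e.g., [{'wash': 0.89..., 'chop': 0.10...}, {'chop': 0.77..., 'wash': 0.22...}]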

Specifically, we develop a conformance checking algorithm over SK data. Building on existing alignment-based conformance checking fundamentals, we mathematically define a stochastic trace model, a stochastic synchronous product, and a cost function that reflects the uncertainty of events in a log. Then, we search over the reachability graph of the stochastic synchronous product for an optimal alignment between the model and the SK trace.

The main contributions of this work are:

  1. We characterize and mathematically define the building blocks for stochastic conformance checking, including a stochastic trace model and a stochastic synchronous product.

  2. We develop a novel conformance checking algorithm between a model and an SK trace.

  3. Using publicly available data sets, we evaluate the performance of stochastic conformance checking and highlight unique features of our proposed algorithm.

The rest of the paper is organized as follows. In Section 2, we develop the model, followed by a presentation of our stochastic alignment algorithm (Section 3). An empirical evaluation of the two is detailed in Section 4. Related literature is presented in Section 5, and the final section (Section 6) concludes the paper and offers directions for future research.

2 Stochastic Trace Model

Uncertain data have recently become a subject of interest among the process mining community [pegoraro2020conformance, pegoraro2019mining, pegoraro2019discovering]. Table 1 [cohen2021uncertain] presents a model/observation classification scheme that is based on the number of models present in a log and on whether the log is deterministically or stochastically known. In this work we focus on Case 5, handling a deterministically known (DK) process model and an SK trace, where the decision-maker wishes to identify a conformance measure between the process and the SK trace. While the suggested approach can be extended to solve Case 7, we leave this extension, as well as other cases, for future work.

                               Model (Data set)
                               Single process      Multiple processes
Observation (Log)              DK        SK        DK        SK
Deterministically Known (DK)    1         2         3         4
Stochastically Known (SK)       5*        6         7         8
Table 1: Eight cases according to the characteristics of the process and the observed log, from [cohen2021uncertain]. The present paper focuses on Case 5 (marked with *).

Following [cohen2021uncertain], we use DK to describe a given and known process or event log, which is the common setting in the process mining literature. An SK event log has at least one event attribute that can be characterized via a probability distribution. Table 2 illustrates an SK trace, which we use as the running example throughout the paper.

Case ID    Event ID    Activity    Timestamp
1                                  13-08-2020T12:00
1                                  13-08-2020T14:55
1                                  15-08-2020T17:39
1                                  15-08-2020T19:47
Table 2: SK data, which is aligned with Case 5 in Table 1 of [cohen2021uncertain].

We now introduce our primary notation and related definitions. We consider a finite set of activities $A$ and a Petri net $N = (P, T, F)$ with initial and final markings $m_i$ and $m_f$, respectively. The Petri net is composed of finite sets of places $P$ and transitions $T$, and of flow relations $F \subseteq (P \times T) \cup (T \times P)$, which are directed edges among places and transitions. Each transition is associated with an activity by the labeling function $\lambda: T \to A \cup \{\tau\}$, where $\tau$ is a silent activity separate from the other activities in $A$.

Differently from a DK trace, which includes a sequence of activities each with probability 1, the activities in an SK trace are associated with a probability function (e.g., the next transition may be ‘act1’ with probability $p$ or ‘act2’ with probability $1-p$). We reflect the stochastic nature of the traces using a weight function that assigns a firing probability to each transition.

Our modeling approach is inspired by a conformance checking algorithm [carmona2018conformance] (pp. 125-158) that aligns a DK trace and a model’s execution sequence such that the cost of dissimilarities is minimized. The algorithm of [carmona2018conformance] cannot be used directly with SK traces; our proposed model aims to provide this ability. In what follows, we assume prior knowledge about alignment-based conformance checking and related definitions (e.g., system net, process and trace models, and synchronous product). We refer interested readers to [carmona2018conformance] for a thorough description of relevant definitions and methods.

We start by defining a stochastic trace model.

Definition 1 (Stochastic Trace Model)

Let $A$ be a set of activities and $\sigma \in A^*$ a sequence over these activities. A stochastic trace model is a system net $SN = (N, m_i, m_f)$ with $N = (P, T, F, \lambda)$ such that $P = \{p_0, \ldots, p_{|\sigma|}\}$, $T = \{t_{i,j} \mid 1 \le i \le |\sigma|,\ 1 \le j \le k_i\}$, and $F = \{(p_{i-1}, t_{i,j}) \mid t_{i,j} \in T\} \cup \{(t_{i,j}, p_i) \mid t_{i,j} \in T\}$, where $k_i$ is the number of parallel transitions between place $p_{i-1}$ and place $p_i$. $\gamma: T \to (0,1]$ is a probability function assigning to each parallel transition a firing probability, with $\sum_{j=1}^{k_i} \gamma(t_{i,j}) = 1$ for every $i$. Additionally, let $m_i = [p_0]$ and $m_f = [p_{|\sigma|}]$.

Figure 1: Stochastic trace model illustration

Figure 1 offers a visual illustration of a stochastic trace model for our running example from Table 2, where each transition is labeled with a candidate activity of the corresponding event and parallel transitions represent alternative candidate activities of the same event. The stochastic trace model generalizes a trace model by allowing a place to have multiple incoming and outgoing edges, which lead to and from parallel transitions. Each transition has a single incoming edge from a place and a single outgoing edge to a place. Additionally, each transition is associated with a firing probability, and for every pair of consecutive places in the Petri net, the sum of the firing probabilities of their parallel transitions is 1.
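As a complement to Definition 1, the following sketch (our own illustration; the names and types are not from the paper) represents a stochastic trace model as a chain of steps, where each step holds the parallel, probability-weighted transitions between two consecutive places.

# Minimal sketch: a stochastic trace model as a chain of sets of parallel transitions.
from dataclasses import dataclass
from typing import List

@dataclass
class StochasticTransition:
    activity: str
    probability: float

# Index i holds the parallel transitions between place p_i and place p_{i+1}.
StochasticTraceModel = List[List[StochasticTransition]]

def is_valid(trace_model: StochasticTraceModel, tol: float = 1e-9) -> bool:
    """Firing probabilities of each set of parallel transitions must sum to 1."""
    return all(abs(sum(t.probability for t in step) - 1.0) <= tol
               for step in trace_model)

# A two-event SK trace: the first event is 'act1' or 'act2', the second is 'act3'.
sk_trace = [
    [StochasticTransition("act1", 0.8), StochasticTransition("act2", 0.2)],
    [StochasticTransition("act3", 1.0)],
]
assert is_valid(sk_trace)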

3 Stochastic Alignment Algorithm

A synchronous product combines process and trace models such that each pair of transitions that are labeled with the same activity is denoted a synchronous transition. Nonsynchronous transitions are represented by pairing an activity with the ‘no move’ symbol $\gg$ and are associated with a cost of 1. An optimal alignment between a trace and a model is the execution sequence of the model for which the alignment between the trace and the sequence has the lowest possible cost. De facto, this is an execution sequence of the synchronous product model that produces the lowest cost.

While deterministic traces have a single execution sequence, for SK traces a synchronous product procedure should align multiple model execution sequences with multiple trace execution sequences. We search for the optimal alignment using the reachability graph of the synchronous product. Towards this end, we need to extend the standard version of a synchronous product by including probability functions that capture the SK nature of the trace. The probability functions assign a firing probability to each synchronous move of the trace and the model. The probability of the synchronous move is equal to the probability of the same transition in the stochastic trace model as defined next.

Definition 2 (Stochastic Synchronous Product)

Let $SN^M = (N^M, m_i^M, m_f^M)$ be a process model and $SN^T = (N^T, m_i^T, m_f^T)$ a stochastic trace model. The stochastic synchronous product $SN^S = (N^S, m_i^S, m_f^S)$ is a system net such that:

  • $P^S = P^M \cup P^T$ is the set of places,

  • $T^S = T^{MM} \cup T^{TT} \cup T^{SS}$ is the set of transitions, where $\gg$ denotes the side of a transition in which either the model or the trace executes an activity and its counterpart does not, with
       $T^{MM} = T^M \times \{\gg\}$ (model moves),
       $T^{TT} = \{\gg\} \times T^T$ (log moves), and
       $T^{SS} = \{(t^M, t^T) \in T^M \times T^T \mid \lambda^M(t^M) = \lambda^T(t^T)\}$ (synchronous moves),

  • $F^S = \{(p, (t^M, t^T)) \in P^S \times T^S \mid (p, t^M) \in F^M \lor (p, t^T) \in F^T\} \cup \{((t^M, t^T), p) \in T^S \times P^S \mid (t^M, p) \in F^M \lor (t^T, p) \in F^T\}$,

  • $m_i^S = m_i^M + m_i^T$ and $m_f^S = m_f^M + m_f^T$,

  • for the labeling function it holds that $\lambda^S((t^M, t^T)) = (l_1, l_2)$, where $l_1 = \lambda^M(t^M)$ if $t^M \in T^M$ and $l_1 = \gg$ otherwise, and $l_2 = \lambda^T(t^T)$ if $t^T \in T^T$ and $l_2 = \gg$ otherwise. Finally,

  • the probability function $\gamma^S$ assigns firing probabilities to transitions of synchronous moves, where $\gamma^S((t^M, t^T)) = \gamma^T(t^T)$ for every $(t^M, t^T) \in T^{SS}$.

The stochastic synchronous product is a combination of a process model that may yield multiple execution sequences (traces) and a stochastically known trace model that is noisy. Thus, the ‘real’ deterministic trace can only be deduced probabilistically. The transitions of the stochastic synchronous product are a union of synchronous and nonsynchronous transitions. To combine a process model and a trace in a system net that represents the synchronous product, each pair of transitions that are labeled with the same activity is added as a synchronous transition. Nonsynchronous transitions, which include a process (trace) activity that cannot be matched with the same activity on the trace (model), are paired with $\gg$.

Figure 2: Stochastic synchronous product illustration

Figure 2 illustrates the stochastic synchronous product of a model and the stochastic trace of our running example, each with its own starting place. The first transition in both the model and the trace carries the same activity label, and thus a new synchronous transition is created. The original transitions of both the model and the trace are paired with the $\gg$ symbol and are added to the new net as well.
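The following sketch (a simplification of Definition 2, with names of our own choosing) enumerates the three kinds of transitions of the stochastic synchronous product — model moves, log moves, and synchronous moves — given the labeled transitions of a model and of a stochastic trace; synchronous moves carry the trace firing probability.

# Minimal sketch: building the transition set of a (stochastic) synchronous product.
# '>>' marks the "no move" side of a nonsynchronous transition.
NO_MOVE = ">>"

def synchronous_product_transitions(model_labels, trace_labels, trace_probs):
    """model_labels/trace_labels: dicts transition-id -> activity label;
    trace_probs: dict trace transition-id -> firing probability."""
    moves = []
    # Model moves: the model fires, the trace does not (cost handled later).
    for tm, a in model_labels.items():
        moves.append(((tm, NO_MOVE), (a, NO_MOVE), None))
    # Log moves: the trace fires, the model does not.
    for tt, a in trace_labels.items():
        moves.append(((NO_MOVE, tt), (NO_MOVE, a), None))
    # Synchronous moves: equally labeled pairs, carrying the trace firing probability.
    for tm, am in model_labels.items():
        for tt, at in trace_labels.items():
            if am == at:
                moves.append(((tm, tt), (am, at), trace_probs[tt]))
    return moves

model = {"m1": "act1", "m2": "act3"}
trace = {"t1_1": "act1", "t1_2": "act2", "t2": "act3"}
probs = {"t1_1": 0.8, "t1_2": 0.2, "t2": 1.0}
for move in synchronous_product_transitions(model, trace, probs):
    print(move)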

We are now ready to introduce our algorithm, S-ABCC (Stochastic Alignment-Based Conformance Checking), as a solution to the problem of finding the lowest-cost execution sequence of the synchronous product. We observe that this is equivalent to finding the shortest path over the synchronous product’s reachability graph, where the sum of costs across path edges is the total path length.

Given an initial marking $m_i^S$ of a stochastic synchronous product $N^S$, we denote the corresponding system net as $SN^S$ and its set of reachable markings as $R(SN^S)$. The reachability graph of $SN^S$, denoted by $G = (V, E)$, is a graph in which the set of nodes $V = R(SN^S)$ is the set of reachable markings and the edges correspond to firing transitions: each edge in $E$ corresponds to a transition of the stochastic synchronous product. Formally, an edge $(m, m') \in E$ exists if and only if there is a transition $t \in T^S$ whose firing transforms marking $m$ into marking $m'$. The shortest path from the initial to the final marking in $G$ corresponds to the lowest-cost execution sequence of $SN^S$. We model the transition probabilities of the SK trace in the reachability graph by assigning weights (costs) to the edges, as discussed next.
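A reachability graph of this kind can be built by exploring markings breadth-first and firing enabled transitions. The sketch below is our own illustration and assumes a 1-safe net, so markings can be represented as sets of places; the transition encoding is hypothetical.

# Minimal sketch: reachability graph of a 1-safe Petri net by breadth-first exploration.
from collections import deque
from typing import Dict, FrozenSet, List, Set, Tuple

Marking = FrozenSet[str]  # assumes 1-safe nets, as for trace and synchronous-product nets

def reachability_graph(initial: Marking,
                       transitions: Dict[str, Tuple[Set[str], Set[str]]]
                       ) -> List[Tuple[Marking, str, Marking]]:
    """transitions: name -> (input places, output places). Returns labeled edges."""
    edges, seen, queue = [], {initial}, deque([initial])
    while queue:
        m = queue.popleft()
        for name, (pre, post) in transitions.items():
            if pre <= m:                         # transition enabled in marking m
                m_next = frozenset((m - pre) | post)
                edges.append((m, name, m_next))
                if m_next not in seen:
                    seen.add(m_next)
                    queue.append(m_next)
    return edges

# Tiny example: p0 --a--> p1 --b--> p2
edges = reachability_graph(frozenset({"p0"}),
                           {"a": ({"p0"}, {"p1"}), "b": ({"p1"}, {"p2"})})
print(edges)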

Recall that $SN^S$ is the stochastic synchronous product of a model and a stochastic trace. For every synchronous move, i.e., a transition $t \in T^{SS}$ and its corresponding edge $e$ in $G$, the cost $c(e)$ is calculated via Eq. (1) as a non-linear function of the firing probability $\gamma^S(t)$ of transition $t$; the cost is 1 otherwise (for model moves and log moves, respectively).

The cost function (Eq. 1) transforms firing probabilities into costs. We use a non-linear cost function such that the cost of each edge in the reachability graph is bounded. The following property (whose proof is omitted due to space considerations) offers guarantees with respect to synchronous moves; an illustrative example of such a cost function is sketched after the property.

Property 1

The cost function (Eq. 1), $c(e)$, satisfies the following properties for synchronous moves:

  1. The cost of an edge in $G$ approaches 0 as the firing probability of its transition approaches 1,

  2. it approaches its upper bound as the firing probability of the transition approaches 0, and

  3. it decreases monotonically with the firing probability.
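Since the exact functional form of Eq. 1 is not reproduced here, the following sketch shows one possible bounded, non-linear cost function that exhibits the limit behavior stated in Property 1. It is an illustration only and should not be read as the paper's definition.

# Illustrative only: one bounded, non-linear cost function with the limit behavior of
# Property 1 (cost -> 0 as the firing probability -> 1, cost -> its bound as it -> 0).
import math

def synchronous_move_cost(firing_probability: float) -> float:
    """Example candidate: -log(p) / (1 - log(p)), which lies in [0, 1)."""
    if firing_probability >= 1.0:
        return 0.0
    neg_log = -math.log(firing_probability)
    return neg_log / (1.0 + neg_log)

def nonsynchronous_move_cost() -> float:
    return 1.0  # model moves and log moves keep the standard unit cost

for p in (1.0, 0.9, 0.5, 0.1, 1e-6):
    print(p, round(synchronous_move_cost(p), 4))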

In the deterministic setting, the cost of each edge in $G$ is either 0 or 1; thus, the deterministic setting can be seen as a special case of our setting with the firing probability of each transition set to 1. Given a stochastic synchronous product (Definition 2) and the cost function (Eq. 1), any shortest path algorithm (e.g., Dijkstra [carmona2018conformance]) can be applied to find the shortest (cheapest) path from the initial to the final marking; this path corresponds to an optimal alignment between the stochastic trace and the model. To illustrate, Figure 3 presents the reachability graph of the stochastic synchronous product in Figure 2 and the shortest path.

Figure 3: The reachability graph of the stochastic synchronous product in Figure 2. The red edges mark the optimal path after applying the Dijkstra algorithm.
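The search step itself is standard. A minimal Dijkstra sketch over a weighted reachability graph (our own code; the marking names and edge weights in the example are hypothetical, with weights as produced by a cost function like Eq. 1) could look as follows.

# Minimal sketch: Dijkstra over a weighted reachability graph.
import heapq
from itertools import count

def cheapest_path(graph, source, target):
    """graph: marking -> list of (next marking, edge cost). Returns (cost, path)."""
    tie = count()                      # tie-breaker so markings are never compared
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, next(tie), source)]
    while heap:
        d, _, node = heapq.heappop(heap)
        if node == target:             # reconstruct the cheapest path
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return d, path[::-1]
        if d > dist.get(node, float("inf")):
            continue
        for nxt, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, next(tie), nxt))
    return float("inf"), []

# Hypothetical weighted reachability graph with markings named m0..m3.
g = {"m0": [("m1", 0.1), ("m2", 1.0)], "m1": [("m3", 0.0953)], "m2": [("m3", 0.0)]}
print(cheapest_path(g, "m0", "m3"))    # -> (0.1953, ['m0', 'm1', 'm3'])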

4 Empirical evaluation

We evaluate S-ABCC against standard alignment-based conformance checking and against a lower bound on the conformance cost [pegoraro2020conformance]. We start with a description of the benchmark data sets (Section 4.1), followed by an explanation of the experiment design (Section 4.2). We report on the outcome of the empirical evaluation in Section 4.3.

4.1 The datasets

We used two publicly available real-world datasets as a baseline for our experiments: BPI 2019 and BPI 2012. The BPI 2019 data set contains over 1.5 million events for purchase orders that were collected from a large international coatings and paints company in the Netherlands. The dataset consists of over 250,000 traces relating to 42 activities performed by 627 users. The BPI 2012 dataset consists of about 262,000 events and 13,000 applications for personal loans or overdraft approvals held by a Dutch financial institute.

4.2 Data preparation and experiment design

For each of the data sets, we discovered a baseline model from 15 randomly chosen traces using the Inductive Miner (IM) algorithm and the PM4PY package.
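For reference, baseline model discovery with the Inductive Miner can be scripted with PM4PY's simplified interface roughly as follows; the file path is hypothetical, the sampling of 15 traces is omitted, and exact function signatures depend on the installed PM4PY version.

# Hedged sketch: discover a baseline Petri net with PM4PY's Inductive Miner (IM).
import pm4py

# "BPI_Challenge_2012.xes" is a hypothetical local path; in the experiments the model
# is discovered from 15 randomly chosen traces, which is not shown here.
log = pm4py.read_xes("BPI_Challenge_2012.xes")
net, initial_marking, final_marking = pm4py.discover_petri_net_inductive(log)
pm4py.view_petri_net(net, initial_marking, final_marking)  # optional inspection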

Stochastic traces were generated from traces that were not utilized for model discovery. We used 100 traces: 15 for the model discovery, while the remaining 85 were transformed into stochastic traces. The transformation procedure iterates over each trace, adding parallel transitions with random activities. Both original and added transitions are assigned a firing probability, so that a record with a single activity per event becomes a stochastic record in which each event carries several candidate activities, each with a firing probability (a sketch of this procedure appears after the parameter list below).

We control the following parameters when preparing the stochastic traces.

  • Number of parallel transitions, varied between 2 and 4. Consider, for example, two parallel transitions for a trace with three events: for each of the three events, a second parallel transition is added with an activity that is randomly chosen from the set of activities.

  • Value of the firing probability assigned to the original transition in each set of parallel transitions. This parameter is set to one of three values. Since the sum of firing probabilities across each set of parallel transitions equals 1, the leftover probability is randomly split among the other parallel transitions.

  • Portion of the trace that is uncertain, i.e., the share of events with parallel transitions. When this portion is 0, the considered trace is deterministic. We increased the parameter's value in fixed steps. For each iteration in which we increased the portion, we took all the traces from the previous iteration and randomly selected an additional share of each trace's events to be transformed into parallel transitions. The selection only included events without parallel transitions, to ensure that when the portion reaches its maximum, all of the trace events have parallel transitions.
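A sketch of the trace-to-stochastic-trace transformation under these parameters is given below. The parameter names (k for the number of parallel transitions, p_orig for the original transition's probability, delta for the uncertain portion) are ours, and the random splitting of the leftover probability is one possible reading of the description above.

# Sketch of the trace-to-stochastic-trace transformation described above.
import random
from typing import Dict, List

def to_stochastic_trace(trace: List[str], activities: List[str],
                        k: int = 2, p_orig: float = 0.8,
                        delta: float = 1.0) -> List[Dict[str, float]]:
    """Turn a fraction delta of the events into sets of k parallel transitions,
    keeping probability p_orig on the original activity."""
    uncertain = set(random.sample(range(len(trace)), round(delta * len(trace))))
    stochastic_trace = []
    for i, activity in enumerate(trace):
        if i not in uncertain:
            stochastic_trace.append({activity: 1.0})
            continue
        others = random.sample([a for a in activities if a != activity], k - 1)
        # Randomly split the leftover probability mass among the added activities.
        cuts = sorted(random.uniform(0, 1 - p_orig) for _ in range(k - 2))
        bounds = [0.0] + cuts + [1 - p_orig]
        event = {activity: p_orig}
        for a, lo, hi in zip(others, bounds[:-1], bounds[1:]):
            event[a] = hi - lo
        stochastic_trace.append(event)
    return stochastic_trace

print(to_stochastic_trace(["act1", "act2"], ["act1", "act2", "act3", "act4"], k=3))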

We note, in passing, that the stochastic traces that we generated resemble the stochastic output of neural networks for classifying activities in video clips or of sensors for identifying observed signals (for more information, refer to [cohen2021uncertain]).

4.3 Results

Figure 4 demonstrates the sensitivity of the suggested approach to the distribution of the firing probabilities, in the sense that changing the firing probability affects the average conformance cost. Specifically, the conformance cost decreases as the firing probability of the original transition increases, i.e., as we get closer to the deterministic setting, until it hits the red ‘+’ marker in Figure 4, at which the firing probability equals 1. In fact, the suggested model accommodates the deterministic setting: when the firing probability of each original transition is set to 1, the suggested model generates the same conformance cost as conventional alignment-based conformance checking.

(a) BPI 2012
(b) BPI 2019
Figure 4: Average conformance cost as a function of the firing probability of the original trace transition, where each event in the original trace included 2–4 parallel transitions. The ‘+’ marker corresponds to a deterministic setting.

Under the suggested model, the optimal alignment carries additional conformance costs compared to its deterministic counterpart due to uncertainty. In a deterministic setting, synchronous moves do not induce a cost, which makes sense since there is only a single trace path. Under an SK setting, synchronous moves are associated with a non-negative cost due to uncertainty on the trace path. The extra cost embodies the level of uncertainty for each possible trace realization. Looking at the phenomenon from a different perspective, we can say that not accounting for the uncertainty costs would lead to a situation in which as the level of uncertainty increases (e.g., by having more transitions in parallel), the number of possible trace realizations grows and thus we have a greater chance of finding a better conforming trace that is associated with lower conformance costs. This situation is undesirable unless we are seeking a lower bound on the conformance cost (see [pegoraro2020conformance]).

Figure 5 presents the conformance cost as a function of the uncertain trace portion for the BPI 2012 data set (results for BPI 2019 showed similar tendencies and are not included due to space considerations). Inspired by [pegoraro2020conformance], the original traces were modified prior to adding parallel transitions in one of four ways: 1) randomly altering the activity label for 30% of the events; 2) randomly swapping 30% of the events with either their successor or predecessor, where the first and last events in a trace were only swapped with their successor and predecessor, respectively; 3) randomly duplicating 30% of the trace events; and 4) all of the above modifications. After applying a modification, we turn back to the general preprocessing procedure of iteratively adding parallel transitions, as detailed in Section 4.2. It can be seen in Figure 5 that the conformance cost of the SK traces increases with the uncertain trace portion. On the other hand, the conformance cost of the lower bound, which does not account for probabilities, decreases with it. This occurs because a larger uncertain portion implies more possible traces and thus additional alignment opportunities, while the lower bound does not consider the realization probability of these traces. The result is that the gap in conformance costs between the lower bound and the suggested approach, which acknowledges uncertainty, increases with the uncertain trace portion.

(a) Randomly changing labels for 30% of the events
(b) Randomly swapping labels for 30% of the events
(c) Randomly duplicating 30% of the events
(d) All of the above manipulations
Figure 5: Average conformance cost as a function of the trace portion with parallel transitions, for the four preprocessing modifications, as evaluated on the BPI 2012 data set. Different markers denote different firing probability values and the lower bound.

Next, we evaluated the conformance cost of traces with different lengths. For this, the traces were sorted into groups according to their length, such that group 1 contains traces of length 0–9, group 2 contains traces of length 10–29, and so on. Following this, we randomly chose three traces from each group (a total of 15 traces) and discovered a model from these traces. Each data point in Figure 6 represents the average conformance cost of all the traces used for the evaluation, i.e., all the traces within a group excluding those used for model discovery.

(a) BPI 2012
(b) BPI 2019
Figure 6: Average conformance cost as a function of the trace length.

Figure 6 demonstrates that the conformance cost increases with the trace length (apart from the lower bound, for the same reasons explained earlier). The observed behavior follows from the fact that longer stochastic traces have a higher number of possible realizations, which may lead to a better alignment compared to shorter ones, since the number of realizations grows exponentially with the length of the trace (each uncertain event multiplies the number of realizations by its number of parallel transitions). We note that the additional cost from synchronous moves outweighs, on average, the reduced cost that may result from a better alignment.

5 Related work

Modeling uncertainty has been introduced in process mining only recently. Previous studies focused on uncertain data in the sense that some of the data are missing or incorrect and uncertainty is not quantified via any probability distribution. The common approach for dealing with such uncertainty is by preprocessing the event log either by filtering out the affected traces or by repairing existing values [suriadi2017event, wang2015cleaning, conforti2016filtering, sani2017improving, van2018filtering, conforti2018timestamp].

To the best of our knowledge, uncertainty in event logs was first introduced explicitly by [pegoraro2020conformance], who proposed a new taxonomy of uncertainty at the attribute level. At this level, the values of the event attributes are not missing or incorrect but rather appear as a set of possible values, and in some cases the likelihood of each possible value is known or can be estimated. The authors defined two types of uncertainty, namely strong uncertainty and weak uncertainty. The former relates to unknown probabilities over the possible values of the attribute, while the latter assumes complete probabilistic knowledge in the form of a probability distribution. The strong uncertainty setting has been addressed in multiple works. A conformance checking technique was proposed in [pegoraro2019mining] to compute a lower bound on the conformance cost. [pegoraro2019discovering] described a discovery technique based on uncertain logs that represent an underlying process. In [pegoraro2020efficient] and [pegoraro2020efficientspaceandtime], the authors proposed an efficient way to construct behavior graphs, which are a graphical representation of precedence relationships among events, for logs with strongly uncertain data. Using these graphs, one can discover models from logs through methods based on directly-follows relationships, such as the inductive miner [pegoraro2019discovering]. In another recent work, [bergami2021tool] suggested a technique to compute conformance cost in a setting where the discovered model is assigned probabilities while the traces in the log are deterministic. This work is the first to tackle the problem of conformance checking with SK logs.

6 Conclusion and future work

We developed a conformance checking model for a stochastically known trace in which the probability distribution functions are given. Such a setting may characterize situations in which data logs originate from sensors or probabilistic models. Differently from other conformance checking models, ours explicitly considers the probability values and at the same time accommodates standard (deterministic) alignment-based conformance checking.

When constructing S-ABCC, we defined a stochastic trace model and a stochastic synchronous product. Using the stochastic synchronous product and its set of reachable markings, we constructed the corresponding reachability graph. By formulating a bounded non-linear cost function that takes the firing probability as an input, we assigned costs to the edges of the reachability graph that correspond to transitions of the stochastic synchronous product. In a final step, we searched the graph for the shortest (cheapest) path, which represents an optimal alignment whose cost is the conformance cost. Via structured experiments with two well-known benchmarks, we analyzed the characteristics of S-ABCC and compared it to the deterministic alignment-based conformance checking approach and to a lower bound on the conformance cost. On average, the conformance cost of stochastically known traces converges to that of their deterministic counterparts as the firing probabilities approach 1. As expected, lower firing probabilities, which imply higher uncertainty, correspond to higher conformance costs for the same traces. This phenomenon is confirmed when uncertainty increases due to larger uncertain trace portions. Finally, we observed that conformance costs tend to be higher for longer stochastic traces compared to shorter ones. This occurs because, in general, longer traces may include more synchronous moves, which have non-negative costs in the stochastic setting.

This work opens up several interesting future research directions. The first is to use the suggested conformance checking approach to restore the most likely realization from SK traces. Possible applications may include improving the accuracy of machine learning classifiers and cleaning errors in datasets. Another direction is to find both upper and lower bounds on the conformance cost. Finally, it is worth exploring how different cost functions and search algorithms may affect the performance of S-ABCC.